[Solved] Beginner needs help! The article titles I scraped come out garbled. What should I do?

还要起名字呐 · Posted 2023-8-18 16:11:37


I want to scrape the news titles from my school's website, but the results come out garbled. What should I do?

from bs4 import BeautifulSoup
import requests,chardet

url="  "
req = requests.get(url).text
# req.encoding = chardet.detect(req.content)['encoding']  # detect the page encoding
soup = BeautifulSoup(req,"html.parser")
newnames = soup.findAll('a')
newtimes = soup.findAll('span',attrs={"class": "time"})
a = 1
for newname in newnames:
    for newtime in newtimes:
        if newname.string == None:
            continue
        else:
            print(f"School news item {a} title: {newname.string} published {newtime.string}")
            a+=1

The printed output is below (the titles are garbled):
School news item 1 title: å¦æ ¡ä¸»ç« published 2023-08-17
School news item 2 title: å¦æ ¡ä¸»ç« published 2023-08-17
School news item 3 title: å¦æ ¡ä¸»ç« published 2023-08-16
School news item 4 title: å¦æ ¡ä¸»ç« published 2023-08-15
School news item 5 title: å¦æ ¡ä¸»ç« published 2023-08-13
School news item 6 title: å¦æ ¡ä¸»ç« published 2023-08-13
School news item 7 title: å¦æ ¡ä¸»ç« published 2023-08-12
School news item 8 title: å¦æ ¡ä¸»ç« published 2023-08-11
School news item 9 title: å¦æ ¡ä¸»ç« published 2023-08-11
School news item 10 title: å¦æ ¡ä¸»ç« published 2023-08-10
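
The garbled titles above look like UTF-8 bytes that were decoded as ISO-8859-1 (Latin-1), which is the encoding requests falls back to for text responses whose headers declare no charset. A minimal sketch of that round trip, assuming the real link text was 学校主站 (as the later output in this thread suggests); `garble` and `repair` are hypothetical helper names:

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then decode with the wrong codec (Latin-1)."""
    return text.encode("utf-8").decode("latin-1")


def repair(mojibake: str) -> str:
    """Undo the mistake: recover the original bytes, then decode as UTF-8."""
    return mojibake.encode("latin-1").decode("utf-8")


title = "学校主站"
broken = garble(title)
print(repr(broken))    # starts with 'å', like the garbled output above
print(repair(broken))  # round-trips back to the original text
```

Repairing after the fact only works while no bytes have been lost, so it is usually safer to decode correctly in the first place.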

Best answer


陶远航

2023-8-18 18:15:38

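还要起名字呐 posted on 2023-8-18 16:45
School news item 1 title: 学校主站 published 2023-08-17
School news item 2 title: 学校主站 ...

OK, this time I added a loop, so it can scrape all of the older news as well.
Could you mark this as the best answer?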

from bs4 import BeautifulSoup
import requests

def get(url):
    # Fetch one listing page and print each news title with its publication date.
    resp = requests.get(url)
    resp.encoding = "utf-8"  # set the encoding on the response before reading .text
    soup = BeautifulSoup(resp.text, "html.parser")
    # Take the headline from each link's title attribute; skip links without one.
    newnames = [a['title'] for a in soup.find_all('a') if a.get('title')]
    newtimes = soup.find_all('span', attrs={"class": "time"})
    # Pair titles with dates by position; zip() stops at the shorter list,
    # so a length mismatch cannot raise IndexError.
    for i, (name, stamp) in enumerate(zip(newnames, newtimes), start=1):
        print("School news item", i, "title:\t", name, " published ", stamp.text)

get("https://news.wtu.edu.cn/xxxw.htm")
# Older pages are numbered from 294.htm down to 1.htm.
for j in range(294, 0, -1):
    get("https://news.wtu.edu.cn/xxxw/" + str(j) + ".htm")
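
The page loop can also be written as a small generator, which makes the visiting order easy to check without touching the network. This is only a sketch: `page_urls` is a hypothetical helper, and the count of 294 numbered pages is taken from the loop above and may change as the site adds news.

```python
from typing import Iterator

BASE = "https://news.wtu.edu.cn/xxxw"


def page_urls(last_page: int = 294) -> Iterator[str]:
    """Yield the index page first, then the numbered pages from last_page down to 1."""
    yield f"{BASE}.htm"
    for number in range(last_page, 0, -1):
        yield f"{BASE}/{number}.htm"


urls = list(page_urls())
print(urls[0])    # https://news.wtu.edu.cn/xxxw.htm
print(urls[1])    # https://news.wtu.edu.cn/xxxw/294.htm
print(urls[-1])   # https://news.wtu.edu.cn/xxxw/1.htm
print(len(urls))  # 295
```

Each URL can then be passed to get() in turn, ideally with a short pause between requests so the school's server is not hammered.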


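isdkz · Posted 2023-8-18 16:11:55

The problem you're seeing is garbled article titles. This is usually caused by a mismatch between the page's encoding and the way it is decoded. There are two ways to fix it:

1. Specify the correct page encoding: you can use the chardet library to detect the page's encoding and then apply it to the encoding attribute in requests. Uncomment the line # req.encoding = chardet.detect(req.content)['encoding'] and place it after req = requests.get(url).text. That way the page content should be decoded with the correct encoding, which avoids the garbled text.

2. Handle the garbled page content: if the text is still garbled after the correct encoding is set, you can try a parser other than html.parser, such as lxml or html5lib. Change the second argument of BeautifulSoup to "lxml" or "html5lib" and run the code again. These parsers may cope with the garbled text better.

I hope these suggestions help you solve the problem! If you have any other questions, feel free to ask.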
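
Whatever detector is used, it has to run on the raw bytes of the response (.content in requests), not on the already decoded .text string. As a rough illustration that needs no third-party packages, the sketch below reads the charset from the page's own meta tag instead of calling chardet; `decode_html` and the sample page are made up for the example.

```python
import re

# Matches both <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=gbk"> form.
META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)


def decode_html(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw page bytes using the charset the page declares for itself."""
    match = META_CHARSET.search(raw[:2048])  # the declaration sits near the top
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back rather than crash.
        return raw.decode(fallback, errors="replace")


sample = '<html><head><meta charset="gbk"></head><body>学校主站</body></html>'
page = sample.encode("gbk")
print(decode_html(page))
```

With requests, the equivalent step would be to pass response.content to a function like this, or to assign the encoding to the response object before reading .text.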

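学习编程中的Ben · Posted 2023-8-18 16:12:12

The problem you're seeing is probably garbled text caused by a page encoding mismatch. You can try the following:

1. Use the chardet library to detect the page's encoding and set that as the encoding to use. Uncomment the req.encoding line in your code and change it to:

req.encoding = chardet.detect(req.content)['encoding']

2. If step 1 does not solve the problem, try a different parser. You are currently using "html.parser", the parser from Python's standard library; you can try others such as "lxml" or "html5lib". For example, change this line:

soup = BeautifulSoup(req, "html.parser")

to:

soup = BeautifulSoup(req, "lxml")

These approaches should help you fix the garbled titles. If they don't, please give more details about the problem so that I can offer a more specific solution.

If this answer helped, please mark it as the best answer!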

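陶远航 · Posted 2023-8-18 16:12:26

The problem you're seeing is that the scraped article titles show up garbled. This is usually caused by inconsistent encodings. Try handling the encoding of the scraped page yourself, so that the content is decoded with the correct encoding.

In your code you can use the chardet library to detect the page's encoding, then decode the page content with that encoding. Here is the modified code: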

from bs4 import BeautifulSoup
import requests
import chardet

url = "your_url"
req = requests.get(url).content  # raw bytes, not yet decoded
encoding = chardet.detect(req)['encoding'] or "utf-8"  # detection can return None
text = req.decode(encoding)
soup = BeautifulSoup(text, "html.parser")
newnames = soup.findAll('a')
newtimes = soup.findAll('span', attrs={"class": "time"})
a = 1
for newname in newnames:
    for newtime in newtimes:
        if newname.string is None:
            continue
        else:
            print(f"School news item {a} title: {newname.string} published {newtime.string}")
            a += 1

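With this change the article titles should display correctly. Replace your_url with the link to the news page of the school website you want to scrape.

I hope this helps! If you have any other questions, feel free to ask.
If the problem is solved, please set a best answer.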

15760997030 · Posted 2023-8-18 16:15:21

还要起名字呐 · Posted 2023-8-18 16:21:29

That didn't work. After uncommenting the line and changing the soup statement, I get this error:
AttributeError: 'str' object has no attribute 'content'
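
The error comes from req already being a plain string: requests.get(url).text returns the decoded str, while .content and .encoding live on the response object itself. The sketch below uses a small stand-in class rather than a real request so that it runs offline; `FakeResponse` is hypothetical and only mimics those three attributes of a requests response.

```python
class FakeResponse:
    """Offline stand-in for a requests response: raw bytes plus a guessed encoding."""

    def __init__(self, content: bytes, encoding: str = "ISO-8859-1"):
        self.content = content
        self.encoding = encoding

    @property
    def text(self) -> str:
        # Like requests, decode lazily with whatever encoding is set right now.
        return self.content.decode(self.encoding, errors="replace")


resp = FakeResponse("学校主站".encode("utf-8"))

req = resp.text                 # a str: garbled, and it has no .content
print(hasattr(req, "content"))  # False

resp.encoding = "utf-8"         # fix the encoding on the response first...
print(resp.text)                # ...then read .text
```

So the order matters: keep the response object, set its encoding, and only then take .text.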

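陶远航 · Posted 2023-8-18 16:22:53

还要起名字呐 posted on 2023-8-18 16:21
That didn't work. After uncommenting the line and changing the soup statement, I get this error:
AttributeError: 'str' object has no attribute 'content'

Bold move, scraping your school's website...

Let me take a look at the code, hang on.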

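陶远航 · Posted 2023-8-18 16:24:29

还要起名字呐 posted on 2023-8-18 16:21
That didn't work. After uncommenting the line and changing the soup statement, I get this error:
AttributeError: 'str' object has no attribute 'content'

Is your school's website URL something you'd rather not share?

Try this:
response.encoding='utf-8'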

还要起名字呐 · Posted 2023-8-18 16:26:13

陶远航 posted on 2023-8-18 16:24
Is your school's website URL something you'd rather not share?

Try this:

fishc won't let me post the site link. "https:/news.wtu.edu.cn/xxxw.htm"

陶远航 · Posted 2023-8-18 16:34:10

Take a look at this:

from bs4 import BeautifulSoup
import requests

url="https://news.wtu.edu.cn/xxxw.htm"
req = requests.get(url)
req.encoding="utf-8"  # set the encoding on the response before reading .text
req=req.text
soup = BeautifulSoup(req,"html.parser")
newnames = soup.findAll('a')
newtimes = soup.findAll('span',attrs={"class": "time"})
a = 1
for newname in newnames:
    for newtime in newtimes:
        if newname.string is None:
            continue
        else:
            print(f"School news item {a} title: {newname.string} published {newtime.string}")
            a+=1


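还要起名字呐 · Posted 2023-8-18 16:36:28

isdkz posted on 2023-8-18 16:11
The problem you're seeing is garbled article titles. This is usually caused by a mismatch between the page's encoding and the way it is decoded. There are ...

I tried it and got an error: AttributeError: 'str' object has no attribute 'content'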

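还要起名字呐 · Posted 2023-8-18 16:45:33

陶远航 posted on 2023-8-18 16:34
Take a look at this:

School news item 1 title: 学校主站 published 2023-08-17
School news item 2 title: 学校主站 published 2023-08-17
School news item 3 title: 学校主站 published 2023-08-16
School news item 4 title: 学校主站 published 2023-08-15
School news item 28 title: 收藏本站 published 2023-08-16
School news item 29 title: 收藏本站 published 2023-08-15

Well, it's no longer garbled, but where are my titles?
Esteemed senior VIP member, could you take another look for me?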
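
Part of what went wrong in that output is the nested loop: it pairs every link with every date, so the first link text (学校主站, "school main site", which looks like a navigation link rather than a headline) is repeated once per date before the next link is reached. A small sketch with made-up lists (the titles and dates are placeholders, not scraped data) shows the difference between nesting the loops and pairing the lists with zip:

```python
titles = ["title A", "title B", "title C"]
dates = ["2023-08-17", "2023-08-16", "2023-08-15"]

# Nested loops: every title is combined with every date (3 x 3 = 9 pairs).
nested = [(t, d) for t in titles for d in dates]

# zip: the first title goes with the first date, and so on (3 pairs).
paired = list(zip(titles, dates))

print(len(nested), nested[:3])
print(len(paired), paired)
```

Pairing by position only gives sensible results if both lists come from the same news entries, which is why the link list also needs filtering down to the real headlines.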

陶远航 · Posted 2023-8-18 16:46:33

还要起名字呐 posted on 2023-8-18 16:45
School news item 1 title: 学校主站 published 2023-08-17
School news item 2 title: 学校主站 ...

OK

陶远航 · Posted 2023-8-18 18:01:54

还要起名字呐 posted on 2023-8-18 16:45
School news item 1 title: 学校主站 published 2023-08-17
School news item 2 title: 学校主站 ...

Sorry for the wait. Here is another version that loops to get all the titles:

from bs4 import BeautifulSoup
import requests

url="https://news.wtu.edu.cn/xxxw.htm"
resp = requests.get(url)
resp.encoding="utf-8"  # set the encoding on the response before reading .text
soup = BeautifulSoup(resp.text,"html.parser")
# Take the headline from each link's title attribute; skip links without one.
newnames = [a['title'] for a in soup.find_all('a') if a.get('title')]
print(newnames)
newtimes = soup.find_all('span',attrs={"class": "time"})
# Pair titles with dates by position; zip() stops at the shorter list.
for i, (name, stamp) in enumerate(zip(newnames, newtimes), start=1):
    print("School news item",i,"title:\t",name," published ",stamp.text)
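
The key idea in this version is to read each link's title attribute instead of its text. The same idea can be tried offline with the standard library's html.parser in place of BeautifulSoup; in the sketch below the `TitleCollector` class and the sample markup are assumptions for illustration, since the real page's HTML is not shown in this thread.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of every <a> tag that has a non-empty one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            title = dict(attrs).get("title")
            if title:  # skips links with no title or an empty one
                self.titles.append(title)


sample = (
    '<a href="/">Home</a>'
    '<a href="/n/1.htm" title="Full headline one">Full headl...</a>'
    '<a href="/n/2.htm" title="">untitled</a>'
)
collector = TitleCollector()
collector.feed(sample)
print(collector.titles)  # ['Full headline one']
```

Links without a title attribute, such as navigation entries, drop out of the list, which is what keeps them from being printed as news titles.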



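还要起名字呐 · Posted 2023-8-18 20:01:49

陶远航 posted on 2023-8-18 18:15
OK, this time I added a loop, so it can scrape all of the older news as well.
Could you mark this as the best answer?

Wow, writing newnames like that would take me three days to learn! You're really amazing! Thank you so much.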

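陶远航 · Posted 2023-8-18 20:03:30

还要起名字呐 posted on 2023-8-18 20:01
Wow, writing newnames like that would take me three days to learn! You're really amazing! Thank you so much.

Hahaha, you'll get it with more practice.
~~This is the show-off way to write it; beginners are better off writing it out the plain way~~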
