Python scraping: page text comes out garbled
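Hi all, this is my first post. I've recently been learning to write web scrapers in Python. Before scraping I inspected the page: it explicitly declares its encoding as UTF-8, and I also set the encoding to utf-8 when saving, yet the saved page is still garbled. How can I fix this?

import requests
import chardet  # used to detect the site's encoding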
from bs4 import BeautifulSoup
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
head = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64'
}
r = requests.get(url=url,headers=head).text
r_date = requests.get(url=url,headers=head).content
print(chardet.detect(r_date))
with open('./sgyy.html', 'w', encoding='UTF-8') as fp:
    fp.write(r)
soup_r = BeautifulSoup(r,'lxml')
The scraped result is:
<!DOCTYPE html>
<html lang="zh">
<head>
<script type="text/javascript" src="https://ip.ws.126.net/ipquery"></script>
<script src="/newpage/js/all.js"></script>
<meta charset="UTF-8">
<title>ãä¸å½æ¼ä1ãå ¨éå¨ço¿é èˉ»_å2ä1|å ¸ç±_èˉèˉåå¥ç½</title>

import requests
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
head = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64'
}
r = requests.get(url=url, headers=head)
r.encoding = 'utf-8'
with open('test.html', 'w', encoding='utf-8') as f:
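    f.write(r.text)

Hmm, this probably isn't garbled text. It's just how HTML represents special characters, e.g. the less-than sign is written as &lt; (which is actually <).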
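
For what it's worth, the garbled title in the question looks like UTF-8 bytes that were decoded as ISO-8859-1. requests falls back to ISO-8859-1 for text responses whose Content-Type header carries no charset, and that is probably what happened here, though the server's headers were not posted, so this is an assumption. The sketch below reproduces the effect offline; the sample title and the decode_body helper are made up for illustration and are not taken from the site.

```python
# Offline sketch: reproduce and undo the garbling without any network access.
# The sample title is a hypothetical stand-in for the real page title.
title = "《三国演义》全集在线阅读"

# Bytes as a UTF-8 server would send them.
raw = title.encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 never raises, but produces garbage,
# because every byte is mapped to its own (wrong) character.
garbled = raw.decode("iso-8859-1")

# The damage is reversible as long as the garbled text was not altered:
# turn it back into the original bytes, then decode with the right codec.
recovered = garbled.encode("iso-8859-1").decode("utf-8")


def decode_body(raw_bytes: bytes, encoding: str = "utf-8") -> str:
    """Decode response bytes explicitly instead of relying on a guess.

    Hypothetical helper; undecodable bytes are replaced, not fatal.
    """
    return raw_bytes.decode(encoding, errors="replace")


print(garbled == title)          # the wrong codec does not give the title back
print(recovered == title)        # the round trip restores it
print(decode_body(raw) == title)
```

With a real response object, the equivalent is either setting r.encoding = 'utf-8' before reading r.text, as in the reply above, or decoding r.content directly with r.content.decode('utf-8').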