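Hi all, this is my first post. I've recently been learning Python web scraping. Before crawling the page I inspected it, and the page explicitly declares its encoding as UTF-8. I also specified utf-8 as the encoding when saving the file, but the saved page is still garbled. How can I fix this?
import requests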
import chardet  # used to check the site's encoding
from bs4 import BeautifulSoup

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
head = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64'
}
# Page body as text, decoded by requests
r = requests.get(url=url, headers=head).text
# Page body as raw bytes, passed to chardet for detection
r_date = requests.get(url=url, headers=head).content
print(chardet.detect(r_date))
# Save the decoded text to a local file as UTF-8
with open('./sgyy.html', 'w', encoding='UTF-8') as fp:
    fp.write(r)
soup_r = BeautifulSoup(r, 'lxml')
The crawl result is:
<!DOCTYPE html>
<html lang="zh">
<head>
<script type="text/javascript" src="https://ip.ws.126.net/ipquery"></script>
<script src="/newpage/js/all.js"></script>
<meta charset="UTF-8">
<title>《三国演ä1‰ã€‹å…¨é›†åœ¨ço¿é˜…èˉ»_å2ä1|典籍_èˉ—èˉåå¥网</title>
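One likely cause, offered as a guess rather than a confirmed diagnosis: when the server's Content-Type header is a text type with no charset, requests falls back to ISO-8859-1 when building `.text`, even if the HTML itself declares UTF-8 in a meta tag. Writing that already mis-decoded string to a UTF-8 file then preserves the garbage. The sketch below reproduces the effect offline. The sample string `original` is an assumption made for illustration and is not fetched from the site.

```python
# Sketch: reproduce and undo the mojibake without any network access.
# `original` is an assumed sample title, not data taken from the site.
original = "《三国演义》全集在线阅读"

# The server sends the page as UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# Decoding those bytes as ISO-8859-1 never raises an error, because every
# byte value maps to some character, so the result is silently garbled.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 yields the intended text.
correct = raw_bytes.decode("utf-8")

# Text that was already mis-decoded can be recovered by reversing the step.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(correct)
print(repaired == original)
```

If this is what is happening here, the equivalent fix in the script would be to keep the response object, set `resp.encoding = 'utf-8'` (or `resp.encoding = resp.apparent_encoding`) before reading `resp.text`, or to decode `resp.content` as UTF-8 directly. That also avoids requesting the page twice.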