capf526526 posted on 2021-7-20 10:14:54

Garbled text when scraping a page with Python

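Hi all, this is my first post. I've recently been learning web scraping with Python. Before scraping, I inspected the page, and it explicitly declares its encoding as UTF-8. I also set the encoding to utf-8 when saving the file, but the saved page still comes out garbled. How can I fix this?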
import requests
import chardet   # used to detect the site's encoding
from bs4 import BeautifulSoup
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
head = {
    'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64'
}


r = requests.get(url=url,headers=head).text
r_date = requests.get(url=url,headers=head).content
print(chardet.detect(r_date))
with open('./sgyy.html','w',encoding='UTF-8') as fp:
    fp.write(r)
soup_r = BeautifulSoup(r,'lxml')

The scraped result is:
<!DOCTYPE html>
<html lang="zh">
<head>
        <script type="text/javascript" src="https://ip.ws.126.net/ipquery"></script>
    <script src="/newpage/js/all.js"></script>
    <meta charset="UTF-8">
    <title>《三国演ä1‰ã€‹å…¨é›†åœ¨ço¿é˜…èˉ»_å2ä1|典籍_èˉ—èˉåå¥网</title>

suchocolate posted on 2021-7-20 16:42:32

import requests

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
head = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64'
}
r = requests.get(url=url, headers=head)
r.encoding = 'utf-8'  # override the encoding requests chose from the response headers before reading r.text
with open('test.html', 'w', encoding='utf-8') as f:
    f.write(r.text)
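
Why setting r.encoding probably helps: when the Content-Type header is a text type with no charset, requests falls back to ISO-8859-1, so r.text decodes the UTF-8 bytes with the wrong codec. That this server omits the charset is my assumption; I haven't checked its headers. Below is a small offline sketch of the effect. The title string is only an illustrative stand-in, not the exact page title.

```python
# Offline sketch: reproduce and repair the garbling without any network access.
# `title` is an assumed sample string used only for illustration.
title = '《三国演义》全集在线阅读'

raw = title.encode('utf-8')           # the bytes a UTF-8 page would contain
garbled = raw.decode('iso-8859-1')    # decoding them with the wrong codec
repaired = garbled.encode('iso-8859-1').decode('utf-8')  # undo the wrong decode

print(garbled == title)    # False: the text no longer matches
print(repaired == title)   # True: the round trip restores the original
```

With a real response, another option is r.encoding = r.apparent_encoding, which lets requests guess the encoding from the body instead of hard-coding 'utf-8'.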

学渣李某人 posted on 2021-7-20 21:25:13

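Hmm, I don't think this is garbled text. It's just the way HTML represents special characters; for example, the less-than sign is written as &lt; (which stands for <).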
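
One way to check which case applies is a quick sketch like the one below. It assumes the standard-library html module; the sample strings are made up for illustration. HTML entities start with & and are turned back into characters by html.unescape, while text that was decoded with the wrong codec contains no entities and passes through unchanged.

```python
import html

# Text containing real HTML entities is converted back to plain characters.
entity_text = '1 &lt; 2 &amp;&amp; 3 &gt; 2'
decoded = html.unescape(entity_text)
print(decoded)  # 1 < 2 && 3 > 2

# Wrongly decoded UTF-8 (illustrative sample) contains no entities,
# so html.unescape leaves it exactly as it was.
garbled = '三国演义'.encode('utf-8').decode('iso-8859-1')
print(html.unescape(garbled) == garbled)  # True
```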