Posted on 2021-6-21 00:02:38
Last edited by fc5igm on 2021-6-21 00:16
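import urllib.request as urr
import chardet as chd

def get_code(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36 Edg/91.0.864.48'}):
    """Download the page at url and return the encoding chardet detects."""
    # Send a browser-like User-Agent so the site does not reject the request
    request = urr.Request(url=url, headers=headers)
    response = urr.urlopen(request).read()
    # chardet guesses the encoding from the raw bytes
    coding = chd.detect(response)['encoding']
    # GBK is a superset of GB2312, so decode with GBK to cover more characters
    if coding == 'GB2312':
        coding = 'GBK'
    return coding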
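You can use this function to get a web page's encoding.
The version below prints the page in a more readable form.
It requires bs4 and lxml to be installed.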
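In case it is not obvious why get_code swaps GB2312 for GBK: GBK covers every GB2312 character plus many more, so decoding with GBK is the safer choice. Here is a small sketch using only the standard library. The sample string and the helper name `normalize` are made up for illustration; they are not part of the function above.

```python
# Hypothetical sample text: '體' is a traditional character that GBK can
# encode but plain GB2312 cannot.
raw = '史记 體'.encode('gbk')

def normalize(coding):
    """Map GB2312 to its superset GBK, the same swap get_code performs."""
    return 'GBK' if coding == 'GB2312' else coding

# Decoding strictly as GB2312 fails on the bytes of '體'
try:
    raw.decode('GB2312')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Decoding with the normalized name succeeds
text = raw.decode(normalize('GB2312'))
print(strict_ok, text)
```

So if chardet reports GB2312 for a page that actually contains a few GBK-only characters, the swap keeps the decode from failing or dropping them.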
import urllib.request as ulr
from bs4 import BeautifulSoup as BS
import chardet as chd

# Browser-like User-Agent so the site does not reject the request
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36 Edg/91.0.864.48'}

def get_code(url, headers=header):
    """Download the page at url and return the encoding chardet detects."""
    # The module is imported as ulr here, so use ulr (not urr) throughout
    request = ulr.Request(url=url, headers=headers)
    response = ulr.urlopen(request).read()
    coding = chd.detect(response)['encoding']
    # GBK is a superset of GB2312, so decode with GBK to cover more characters
    if coding == 'GB2312':
        coding = 'GBK'
    return coding

url = 'https://www.shicimingju.com/book/shiji.html'
opener = ulr.build_opener()
request = ulr.Request(url, headers=header)
# get_code(url) fetches the page a second time just to detect its encoding;
# 'ignore' drops any bytes that cannot be decoded
response = opener.open(request).read().decode(get_code(url), 'ignore')
# Parse with the lxml parser and print the HTML indented for readability
soup = BS(response, 'lxml')
print(soup.prettify())
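One thing to be aware of: the script above downloads the page twice, once inside get_code and once for the actual content. A possible way around that is to detect the encoding from the bytes you already have. The sketch below is only a suggestion; `decode_page` and its `fallback` parameter are names I made up, and `detect` is expected to be `chardet.detect` or anything that returns a dict with an 'encoding' key.

```python
def decode_page(raw, detect, fallback='utf-8'):
    """Decode already-downloaded bytes using an encoding detected from them.

    raw      -- the page content as bytes
    detect   -- a callable such as chardet.detect (assumed to return a dict
                with an 'encoding' key, which may be None)
    fallback -- hypothetical default used when detection returns None
    """
    coding = detect(raw)['encoding'] or fallback
    # Same GB2312 -> GBK swap as get_code
    if coding.upper() == 'GB2312':
        coding = 'GBK'
    # 'ignore' drops undecodable bytes, as in the original script
    return raw.decode(coding, 'ignore')

# Intended usage with the names from the script above (not run here):
#   raw = opener.open(request).read()
#   soup = BS(decode_page(raw, chd.detect), 'lxml')
```

This keeps a single request per page, and the fallback avoids an error when chardet cannot make a guess.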