This post was last edited by 591821661 on 2017-3-10 14:44
I happen to be learning web scraping lately too. I'd suggest using chardet to detect the page's encoding and see if that helps.
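For example:
import urllib.request
import chardet

url = 'http://www.23us.com/html/0/298/1964116.html'  # the page in question
response = urllib.request.urlopen(url)
html = response.read()  # read the body only once; a second read() returns b''
a = chardet.detect(html)  # dict that includes the guessed 'encoding'
encode = a['encoding']
# GBK is a superset of GB2312, so decode with GBK to cover the extra characters
if encode == 'GB2312':
    encode = 'GBK'
html = html.decode(encode)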
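To show why the GB2312 to GBK swap matters, here is a small sketch using only the standard library. `normalize_encoding` is a helper name I made up for illustration (it is not part of chardet), and the UTF-8 fallback for an empty guess is my own assumption:

```python
from typing import Optional


def normalize_encoding(detected: Optional[str], default: str = "utf-8") -> str:
    """Turn a detector's guess into a codec name to decode with (hypothetical helper)."""
    if not detected:
        # The detector may report no encoding at all, e.g. for empty input.
        return default
    if detected.upper() == "GB2312":
        # GBK is a superset of GB2312, so it also handles the extra characters.
        return "GBK"
    return detected


# A character that exists in GBK but not in GB2312.
sample = "镕".encode("gbk")

try:
    sample.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

# Decoding with the normalized name succeeds.
text = sample.decode(normalize_encoding("GB2312"))
```

Here decoding with 'gb2312' fails while 'GBK' succeeds, which is the situation the `if encode == 'GB2312'` check guards against.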
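Addendum: I did take a look at the encoding of this page for you, http://www.23us.com/html/0/298/1964116.html, and its encoding is GBK! The page source declares it in the Content-Type meta tag:
<!DOCTYPE html><head>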
<title>斗破苍穹-正文 第六百一十五章 幽海纳戒-顶点小说</title>
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
<meta name="keywords" content="斗破苍穹,天蚕土豆,正文 第六百一十五章 幽海纳戒,在线阅读" />
<meta name="description" content="顶点小说整理斗破苍穹全集无弹窗在线阅读,当前章节:正文 第六百一十五章 幽海纳戒" />
<meta name="mobile-agent" content="format=html5;url=http://m.23us.com/html//0/298/1964116.html">
<meta http-equiv="Cache-Control" content="no-transform" />
<meta http-equiv="Cache-Control" content="no-siteapp" />
<link rel="stylesheet" href="/themes/xiaoshuo/style.css" type="text/css"/>
<script language="javascript" type="text/javascript" src="/scripts/xiaoshuo.js"></script>
<script type="text/javascript">var preview_page = "1964113.html",next_page = "1964119.html",index_page = "/html//0/298/",article_id = "298",chapter_id = "1964116";function jumpPage(event){var evt =event?event:window.event;if(evt.keyCode==37) location=preview_page;if (evt.keyCode==39) location=next_page;if (evt.keyCode==13) location=index_page;}document.onkeydown=jumpPage;</script>
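One way to cross-check chardet's guess is to read the charset the page declares for itself, like the Content-Type meta tag above. A rough sketch, assuming the declaration sits within the first couple of kilobytes of the page; `declared_charset` and its `limit` parameter are names I made up, and the regex is a simplification rather than a full HTML parser:

```python
import re
from typing import Optional

# Matches charset=... inside a <meta> tag, with or without quotes.
_META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(raw: bytes, limit: int = 2048) -> Optional[str]:
    """Return the charset named in a <meta> tag, or None if none is found."""
    match = _META_CHARSET.search(raw[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


head = b'<meta http-equiv="Content-Type" content="text/html; charset=gbk" />'
print(declared_charset(head))  # prints: gbk
```

If this agrees with what chardet reports (after mapping GB2312 to GBK), you can be fairly confident about which codec to pass to `decode`.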