The code is as follows:
#coding=utf-8
import requests
url = 'http://www.xiaohuar.com/2014.html'
header = {'user-agent':'Mozilla/5.0'}
r = requests.get(url,headers=header,timeout=30)
html = r.text
print type(html)
print '\n'
print html[:1000]
print '\n\n'
print r.encoding
print r.apparent_encoding
Output:
<type 'unicode'>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<title>2015Äê´óѧУ»¨ÅÅÐаñ100Ç¿</title>
<meta http-equiv="Cache-Control" content="no-transform" />
<meta http-equiv="Cache-Control" content="no-siteapp" />
<meta name="keywords" content="´óѧУ»¨,¸ßУУ»¨,У»¨ÅÅÐÐ,У»¨Íø,У»¨">
<meta name="description" content="2015Öйú´óѧУ»¨ÅÅÐаñǰ100Ç¿£¬¿´¿´ÄãѧУµÄУ»¨ÈËÆøÓжà°ô£¬¿ìÀ´¸øÄãϲ»¶µÄУ»¨Í¶ÉÏһƱ°É£¬Öйú´óѧУ»¨ÅÅÐаñ TOP100ÐÂÏʳö¯£¬2015ÄêÈ«¹úУ»¨ÅÅÐаñ£¬±¾Õ¾ÆÀÑ¡±¾Õ¾Í¨¹ýͶƱ½«ÆÀÑ¡³ö¡¶2015ÄêУ»¨ÅÅÐаñ¡·">
<link rel="stylesheet" type="text/css" />
<script type="text/javascript" src="http://www.xiaohuar.com/skin/default/js/jquery-1.4.2.min.js"></script>
<SCRIPT type=text/javascript src="http://www.xiaohuar.com/skin/default/js
ISO-8859-1
GB2312
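The meta tag and r.apparent_encoding both indicate that the HTML is encoded as GB2312 (while r.encoding reports ISO-8859-1), so I changed the code as follows: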
#coding=utf-8
import requests
url = 'http://www.xiaohuar.com/2014.html'
header = {'user-agent':'Mozilla/5.0'}
r = requests.get(url,headers=header,timeout=30)
html = r.text.encode('GB2312')
print type(html)
print '\n'
print html[:1000]
Error:
UnicodeEncodeError: 'gb2312' codec can't encode character u'\xc4' in position 259: illegal multibyte sequence
My question: how can this be fixed, and why does encoding with GB2312 not work?
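A likely explanation, sketched here without any network access: since r.encoding is ISO-8859-1, r.text has already decoded the GB2312 bytes as ISO-8859-1, one byte per character. Calling .encode('GB2312') on that string then tries to encode characters such as u'\xc4', which GB2312 does not contain. The snippet below reproduces this with the single character 年 from the page title; the sample character and the variable names are illustrative assumptions, not part of the original code.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Bytes as the server would send them: the character 年 (U+5E74) in GB2312.
raw = u'\u5e74'.encode('gb2312')      # b'\xc4\xea'

# What r.text effectively does when r.encoding is ISO-8859-1:
# every byte becomes one character, giving the mojibake u'\xc4\xea'.
wrong = raw.decode('iso-8859-1')

# Encoding the mojibake as GB2312 fails, because u'\xc4' is not in GB2312.
try:
    wrong.encode('gb2312')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Undo the wrong decode to recover the original bytes,
# then decode them with the charset the page actually uses.
fixed = wrong.encode('iso-8859-1').decode('gb2312')

print(encode_failed, fixed == u'\u5e74')
```

With requests itself, the same result should be obtainable by setting r.encoding = r.apparent_encoding (or r.encoding = 'gb2312') before reading r.text, or by decoding the raw bytes with r.content.decode('gb2312'). If the page contains characters outside GB2312, the superset codec gbk may be needed instead.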