Crawler cannot print the page text or write it to a file

SylarPu · Posted on 2017-12-12 15:40:21


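Below is the code, which crawls a website. The page can be parsed with bs4, but the print and write operations fail.
I suspect it is an encoding problem on my own computer. Could an expert help explain?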

# -*- coding: utf-8 -*-
import requests,zlib,gzip
# from pdfkit import *
import pdfkit
from io import StringIO
from bs4 import BeautifulSoup
import sys
# sys.setdefaultencoding("utf-8")
URL="https://daily.zhihu.com"
def GetUrl(url):
    header={
        "Accept-Encoding":"gzip, deflate",
        "Accept-Language":"zh-CN,zh;q=0.8",
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3831.602 Safari/537.36"
    }
    a=requests.get(url,headers=header)
    if a.status_code ==200:
        bs=BeautifulSoup(a.text,"lxml")
        bs.prettify()
        return bs
# def Download(path,)
bs=GetUrl(URL)
title=bs.find_all("a",class_="link-button")
for i in title:
    print(i)
    uid=i["href"]
    img=i.find("img")["src"]
    name=i.find("span").text
    # ir=requests.get(img)
    # open('text.png',"wb").write(ir.content)
    break
print(URL+uid)
proce=GetUrl(URL+uid)
a=open("text.html","w")
a.write(proce.text)


This is the error:

<a class="link-button" href="/story/9660557"><img class="preview-image" src="https://pic2.zhimg.com/v2-6be190072ce0664f1549accc74bc51c5.jpg"/><span class="title">伤口愈合这事，你以为简单吧？可是我花了博士四年都还没搞懂</span></a>
https://daily.zhihu.com/story/9660557
Traceback (most recent call last):
  File "D:/OFFICE/learing-program/fishc/chapion/4-4/zhihu.py", line 43, in <module>
    a.write(proce.text)
UnicodeEncodeError: 'gbk' codec can't encode character '\xf6' in position 2050: illegal multibyte sequence
utf-8
Process finished with exit code 1

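The traceback points at the file write, not at the download or the bs4 parsing. When `open("text.html","w")` is called without an `encoding` argument, Python falls back to the platform's default text encoding, which on a Chinese-locale Windows machine is typically GBK (cp936), and GBK has no byte sequence for `'\xf6'` (ö). A minimal sketch that reproduces the same failure without any network access, assuming the made-up string `sample` stands in for the page text:

```python
import locale

# Hypothetical stand-in for the downloaded page text; it contains '\xf6' (ö),
# the character named in the traceback.
sample = "K\xf6ln"

# The encoding that open(..., "w") usually falls back to when none is given.
# The value is platform dependent (cp936/GBK on a Chinese-locale Windows install).
print(locale.getpreferredencoding(False))

try:
    sample.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError as exc:
    # Same kind of error as in the traceback: 'gbk' codec can't encode '\xf6'.
    gbk_ok = False
    print(exc)

# UTF-8 can represent every Unicode character, so this encode succeeds.
utf8_bytes = sample.encode("utf-8")
```

This would also explain why `print(i)` worked for the Chinese title: those characters exist in GBK, while the later page happens to contain one that does not.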

SylarPu · Posted on 2017-12-12 15:48:58

@wei_Y Tagging an expert to take a look at my question.

chakyam · Posted on 2017-12-12 20:32:56

a=open("text.html","w",encoding='utf-8')
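Passing `encoding='utf-8'` makes the output file independent of the Windows default encoding. A slightly fuller sketch of the same fix, assuming two hypothetical helpers `save_text` and `load_text` and a temporary file path (none of these are in the original script); the `with` block also closes the file, which the original code never does:

```python
import os
import tempfile


def save_text(path, text):
    """Write text to path as UTF-8, regardless of the OS default encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_text(path):
    """Read the file back with the same explicit encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Made-up page text mixing '\xf6' (not encodable in GBK) with Chinese characters.
page = "<p>K\xf6ln 伤口愈合</p>"

# A temporary directory is used here only to keep the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "text.html")
save_text(path, page)
roundtrip = load_text(path)
```

In the script from the question, the last two lines would become something like `save_text("text.html", proce.text)`. If `print` fails the same way in a GBK console, `sys.stdout.reconfigure(encoding="utf-8")` (Python 3.7+) is one option to try, although whether the characters then display correctly depends on the terminal.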
