import time
import codecs
import requests
import lxml.html

# codecs.open returns a file object that encodes each written string as UTF-8.
with codecs.open('movies.txt', 'w', 'utf-8') as f:
    # User-Agent header sent with every request.
    myheaders = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"}
    url_tpl = 'https://movie.douban.com/top250?start={}&filter='

    # 10 pages, 25 movies per page.
    for page in range(10):
        print('Fetching page {}'.format(page + 1))
        start = page * 25
        url = url_tpl.format(start)

        http_response = requests.get(url, headers=myheaders)
        http_response.encoding = 'utf-8'
        html = lxml.html.fromstring(http_response.text)
        # Each <li> matched here is one movie entry.
        movies = html.xpath('//*[@id="content"]/div/div[1]/ol/li')
        for movie in movies:
            movie_text = str(movie.text_content())
            # Collapse each entry onto a single line.
            clean_movie_text = movie_text.replace('\n', "")
            print(clean_movie_text, file=f)

        # Wait 5 seconds before requesting the next page.
        time.sleep(5)
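Why does the file have to be opened with codecs.open? I tried removing codecs and got an error. I searched Baidu and read that codecs is for encoding conversion, but I still don't really understand it.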
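Using codecs.open saves you the tedious manual transcoding. The file object it returns encodes each string as UTF-8 when you write it, so the text ends up in the file as UTF-8 without you having to decode it and then encode it back to UTF-8 yourself in between.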
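To make the point above concrete, here is a minimal sketch, assuming Python 3. It is not taken from the original post: the sample string, the temporary file names, and the helper read_bytes are made up for illustration. It shows that codecs.open and the built-in open with encoding='utf-8' write the same bytes, and it demonstrates one likely reason that simply deleting "codecs." fails: the third positional argument of the built-in open is buffering, not the encoding. The exact error in the question was not posted, so this is only a guess at the cause.

```python
import codecs
import os
import tempfile

# Sample text containing non-ASCII characters (illustrative only).
text = '肖申克的救赎 The Shawshank Redemption'

tmp_dir = tempfile.mkdtemp()
path_codecs = os.path.join(tmp_dir, 'codecs.txt')
path_builtin = os.path.join(tmp_dir, 'builtin.txt')

# codecs.open accepts the encoding as its third positional argument.
with codecs.open(path_codecs, 'w', 'utf-8') as f:
    f.write(text + '\n')

# The built-in open needs the encoding as a keyword argument.
# newline='\n' disables newline translation so both files match on Windows too.
with open(path_builtin, 'w', encoding='utf-8', newline='\n') as f:
    f.write(text + '\n')


def read_bytes(path):
    """Return the raw bytes stored in the file (hypothetical helper)."""
    with open(path, 'rb') as f:
        return f.read()


same_bytes = read_bytes(path_codecs) == read_bytes(path_builtin)
print('Identical bytes on disk:', same_bytes)

# Dropping only the "codecs." prefix puts 'utf-8' in the buffering slot,
# which must be an integer, so Python 3 raises TypeError.
try:
    open(path_builtin, 'w', 'utf-8')
    positional_error = None
except TypeError as exc:
    positional_error = exc
print('open(path, "w", "utf-8") raised:', type(positional_error).__name__)
```

If this is indeed what happened, replacing codecs.open('movies.txt','w','utf-8') with open('movies.txt','w',encoding='utf-8') in the script should let it run without the codecs import.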