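This thread is for learning web crawling. I have written a first version of a crawler that collects the URLs of European and American movies on 电影天堂 (ygdy8.net). So far it extracts the movie titles and page URLs from the first listing page and writes them to a CSV file.
My question: there seems to be an encoding problem with the BeautifulSoup object. The page is encoded as gb2312 and I passed gbk, but the content still comes out garbled. Could someone help me fix this?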
    # -*- coding: utf-8 -*-
    import csv
    import re

    import requests
    from bs4 import BeautifulSoup


    def getHtml(url):
        # Fetch the page and return its text re-encoded as UTF-8 bytes.
        res = requests.get(url)
        html = res.text.encode('utf-8', errors='ignore')
        return html


    def getPageUrl(html):
        # Parse the page and return the <a> tags whose href points to a movie page.
        bs0bj = BeautifulSoup(html, from_encoding='gbk')
        # The dot before "html" is escaped so that it only matches a literal dot.
        reg = re.compile(r'/html/gndy/\w{4}/\d{8}/\d{4,10}\.html')
        pages = bs0bj.findAll('a', {'href': reg})
        return pages


    if __name__ == '__main__':
        url = 'http://www.ygdy8.net/html/gndy/oumei/index.html'
        html = getHtml(url)
        pages = getPageUrl(html)
        csvFile = open('I:/编程学习/spider/movie.csv', 'w+')
        sheet = csv.writer(csvFile)
        # Header row: movie title, URL of the description and download page.
        sheet.writerow(('电影名称', '电影介绍及下载网页网址'))
        # Assuming the matched href values already start with /html/gndy/,
        # only the site root is prepended to avoid repeating that part of the path.
        preurl = 'http://www.ygdy8.net'
        for page in pages:
            sheet.writerow((page.get_text(), preurl + page['href']))
        csvFile.close()
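On the encoding question, one likely cause is in getHtml: the response is turned into text, re-encoded as UTF-8, and then BeautifulSoup is told to read those UTF-8 bytes as gbk. The sketch below reproduces that chain with the standard library only, so it runs without network access. The function names are made up for illustration, and the ISO-8859-1 step is an assumption: it is what requests falls back to for text responses whose headers declare no charset, which may or may not be the case for this site.

```python
def round_trip_like_post(raw: bytes) -> str:
    """Mimic the decode/encode chain in the posted script (hypothetical helper)."""
    # Assumed fallback: the body is decoded as ISO-8859-1 instead of GB2312.
    wrongly_decoded = raw.decode('iso-8859-1')
    # getHtml then re-encodes that text as UTF-8 ...
    utf8_bytes = wrongly_decoded.encode('utf-8', errors='ignore')
    # ... and from_encoding='gbk' asks for the UTF-8 bytes to be read as GBK.
    return utf8_bytes.decode('gbk', errors='replace')


def decode_once(raw: bytes) -> str:
    """Decode the untouched bytes a single time (hypothetical helper)."""
    # gb18030 is a superset of gb2312 and gbk, so it covers pages declared as either.
    return raw.decode('gb18030')


# Sample bytes standing in for a response body from a GB2312 page.
raw = '欧美电影'.encode('gb2312')

garbled = round_trip_like_post(raw)
clean = decode_once(raw)

print(garbled == '欧美电影')  # the round trip does not give the text back
print(clean == '欧美电影')    # a single decode does
```

If this is indeed the cause, two changes to the script should have the same effect as decode_once: either pass the raw bytes (res.content) to BeautifulSoup together with from_encoding='gb18030', or set res.encoding = 'gb18030' before reading res.text and pass that string to BeautifulSoup without from_encoding.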
The output after running is as follows:
C:\Users\chennan\AppData\Local\Programs\Python\Python35-32\python.exe I:\编程学习\spider\.idea\dytt8-movie.py
C:\Users\chennan\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 20 of the file I:\编程学习\spider\.idea\dytt8-movie.py. To get rid of this warning, change code that looks like this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "html.parser")
markup_type=markup_type))
Process finished with exit code 0
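Apart from the parser warning, the CSV step is another place where Chinese titles can end up garbled, because open() without an encoding argument uses the platform default. The following is a minimal sketch of that step, not a drop-in replacement: write_movie_rows is a hypothetical helper, the sample title and href are made-up data, and the English header names are placeholders. It builds absolute URLs with urljoin instead of string concatenation and writes to an in-memory buffer so it can be run as is.

```python
import csv
import io
from urllib.parse import urljoin


def write_movie_rows(out, base_url, links):
    """Write one (title, absolute URL) row per link.

    `links` is any iterable of (title, href) pairs, for example
    (tag.get_text(), tag['href']) taken from the parsed page.
    """
    writer = csv.writer(out)
    writer.writerow(('title', 'url'))
    for title, href in links:
        # urljoin resolves a root-relative href against the listing page URL.
        writer.writerow((title.strip(), urljoin(base_url, href)))


# In-memory demo with made-up sample data.
buf = io.StringIO(newline='')
write_movie_rows(
    buf,
    'http://www.ygdy8.net/html/gndy/oumei/index.html',
    [('示例电影', '/html/gndy/dyzz/20170101/12345.html')],
)
print(buf.getvalue())
```

For a real file, the same function can be given a handle opened as open(path, 'w', newline='', encoding='utf-8-sig'). The newline='' argument avoids the blank lines csv output otherwise gets on Windows, and the utf-8-sig byte order mark generally lets Excel recognise the file as UTF-8.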