Last edited by myqf123 on 2021-12-28 16:01
import urllib.request
import urllib.parse
import re
from bs4 import BeautifulSoup
import random
import time

def main():
    ## keyword = input("Enter a keyword: ")
    keyword = urllib.parse.urlencode({"word": '猪八戒'})

    url = "http://baike.baidu.com/search/word?%s" % keyword
    iplist = ['113.28.90.67:9480']
    proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})
    opener = urllib.request.build_opener(proxy_support)
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')]
    urllib.request.install_opener(opener)

    response = urllib.request.urlopen(url)
    html = response.read().decode('UTF-8')
    soup = BeautifulSoup(html, "html.parser")
    for each in soup.find_all(href=re.compile("view")):
        content = ''.join([each.text])
        url2 = ''.join(["http://baike.baidu.com", each["href"]])

        response2 = urllib.request.urlopen(url2)
        html2 = response2.read().decode('UTF-8')
        soup2 = BeautifulSoup(html2, "html.parser")  # parse the fetched page
        if soup2:
            content = ''.join([content, soup2.h2.text])
            content = ''.join([content, " -> ", url2])
            print(content)
            time.sleep(3)

if __name__ == "__main__":
    main()
Every run fails with: UnicodeEncodeError: 'ascii' codec can't encode characters in position 58-61: ordinal not in range(128)
Where is the problem?
Last edited by 伏惜寒 on 2021-12-30 10:31
Problem 1: the urllib module is fairly dated and not recommended these days. If you want to learn web scraping, use the requests library instead.
Problem 2: urllib.request.urlopen(url2) cannot handle URLs containing Chinese characters. These are the URLs your code produces; you can see the second one contains Chinese, which urlopen cannot send (the HTTP request line must be ASCII), so the Chinese part has to be percent-encoded first:
http://baike.baidu.com/search/word?word=%E7%8C%AA%E5%85%AB%E6%88%92
http://baike.baidu.com/wikicategory/view?categoryName=恐龙大全
Encode it with urllib.parse.quote("恐龙大全").
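A minimal sketch of that encoding step (standard library only; note that quote() leaves only "/" unencoded by default, so when encoding a whole path-plus-query string the URL delimiters have to be whitelisted with the safe parameter):

from urllib.parse import quote

print(quote("恐龙大全"))
# -> %E6%81%90%E9%BE%99%E5%A4%A7%E5%85%A8

# "?" and "=" in a full path-plus-query would also get encoded unless
# they are explicitly marked safe:
print(quote("/wikicategory/view?categoryName=恐龙大全", safe="/?=&"))
# -> /wikicategory/view?categoryName=%E6%81%90%E9%BE%99%E5%A4%A7%E5%85%A8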
Problem 3: the information extraction further down is also wrong, but since I don't know what information you actually want, I didn't change it; tinker with it yourself.
import urllib.request
import urllib.parse
import re
from bs4 import BeautifulSoup
import random
import time

def main():
    ## keyword = input("Enter a keyword: ")
    keyword = urllib.parse.urlencode({"word": '猪八戒'})

    url = "http://baike.baidu.com/search/word?%s" % keyword
    iplist = ['113.28.90.67:9480']
    proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})
    opener = urllib.request.build_opener(proxy_support)
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')]
    urllib.request.install_opener(opener)
    print(url)
    response = urllib.request.urlopen(url)
    html = response.read().decode('UTF-8')
    soup = BeautifulSoup(html, "html.parser")
    for each in soup.find_all(href=re.compile("view")):
        content = ''.join([each.text])
        # percent-encode the Chinese characters; keep the URL delimiters
        # ("/?=&") unencoded so the path and query string survive intact
        href = urllib.parse.quote(each["href"], safe="/?=&")
        # urllib.request.urlopen only accepts the URL once the Chinese
        # characters have been encoded
        url2 = ''.join(["http://baike.baidu.com", href])
        print(url2)
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read().decode('UTF-8')
        soup2 = BeautifulSoup(html2, "html.parser")  # parse the fetched page
        if soup2:
            # soup2 is the returned page; content is "恐龙百科"
            # the ''.join([content, soup2.h2.text]) call is written wrong;
            # I don't know what information you want, so I left the rest unchanged
            content = ''.join([content, soup2.h2.text])
            content = ''.join([content, " -> ", url2])
            print(content)
            time.sleep(3)

if __name__ == "__main__":
    main()
Note: I recommend learning the lxml and requests modules; those two handle most ordinary scraping jobs and are much easier than the modules you are using.
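For comparison, a minimal sketch of the same search with requests + lxml (both assumed installed via pip; requests percent-encodes non-ASCII query parameters itself, which sidesteps the UnicodeEncodeError entirely):

import requests
from lxml import etree

# requests builds and percent-encodes the query string for you
resp = requests.get("http://baike.baidu.com/search/word",
                    params={"word": "猪八戒"},
                    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'})
resp.encoding = 'utf-8'
print(resp.url)  # the fully encoded URL that was actually requested

# lxml's XPath replaces the BeautifulSoup + re.compile("view") filter:
tree = etree.HTML(resp.text)
for href in tree.xpath('//a[contains(@href, "view")]/@href'):
    print(href)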