Last edited by 梦想一事无成 on 2017-10-18 12:23
This code scrapes the links on the Baidu Baike page for 胡歌 (Hu Ge). I want to then crawl the outlinks of each of those links, and keep looping like that.
The problem: my function chaozhaourl(keyname) takes a decoded keyname as its argument, but the links it scrapes are still percent-encoded.
I'm a beginner and barely know how to describe what I'm after.
Could someone look over my code and give me some pointers?
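The mismatch is that Baidu Baike hrefs come back percent-encoded, while chaozhaourl() expects a readable keyword; urllib.parse.unquote converts one to the other. A minimal stdlib-only sketch (the extract_item_links helper and the sample HTML are made up for illustration, not part of the original code):

```python
import re
from urllib.parse import unquote

def extract_item_links(html):
    # Grab every href that starts with /item/, like the original regex does
    hrefs = re.findall(r'href="(/item/[^"]+)"', html)
    # Decode the percent-encoded part back to the human-readable keyword
    return [unquote(h) for h in hrefs]

sample = '<a href="/item/%E8%83%A1%E6%AD%8C">胡歌</a>'
print(extract_item_links(sample))  # ['/item/胡歌']
```

The decoded tail after '/item/' is exactly what chaozhaourl() can re-encode with quote() on the next round.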
import urllib.request
import urllib.parse
import random
import bs4
import re

# Note: the original list was missing commas between the strings, so the
# four user agents were silently concatenated into one long string.
headerss = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; Maxthon/3.0',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; QIHU 360EE',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3637.220 Safari/537.36',
]

def agent():
    # Install a global opener that sends a randomly chosen User-Agent
    thisagent = random.choice(headerss)
    print(thisagent)
    headers = ('User-Agent', thisagent)
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    urllib.request.install_opener(opener)

def chaozhaourl(keyname):
    agent()
    # Percent-encode the keyword for the URL (quote lives in urllib.parse)
    key = urllib.parse.quote(keyname)
    url = 'https://baike.baidu.com/item/' + key
    data = urllib.request.urlopen(url).read().decode('utf-8')
    bsobj = bs4.BeautifulSoup(data, 'html.parser')
    for link in bsobj.find('div', {'class': 'body-wrapper feature large-feature starLarge'}).findAll('a', href=re.compile('^(/item/)(.*)')):
        if 'href' in link.attrs:
            # Decode the percent-encoded href back to readable text,
            # so it can be fed to chaozhaourl() again
            print(urllib.parse.unquote(link.attrs['href']))

chaozhaourl('胡歌')
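To "keep looping" over each page's outlinks without re-fetching pages or recursing forever, one common approach is a breadth-first crawl with a visited set. This is only a sketch, not the poster's code: get_outlinks is a hypothetical callback (in practice it would wrap the scraping logic above and return the encoded /item/ hrefs), and the decode step mirrors the unquote fix:

```python
from collections import deque
from urllib.parse import unquote

def crawl(start_key, get_outlinks, max_pages=10):
    # Breadth-first crawl: get_outlinks(key) returns encoded '/item/...' hrefs.
    # The visited set prevents re-fetching and infinite loops.
    visited = set()
    queue = deque([start_key])
    order = []
    while queue and len(order) < max_pages:
        key = queue.popleft()
        if key in visited:
            continue
        visited.add(key)
        order.append(key)
        for href in get_outlinks(key):
            # Strip the '/item/' prefix and decode, so the next fetch
            # can re-encode the keyword just as chaozhaourl() does
            name = unquote(href[len('/item/'):])
            if name not in visited:
                queue.append(name)
    return order

# Offline demo with a fake link graph instead of real network fetches
fake = {'胡歌': ['/item/A', '/item/B'], 'A': ['/item/B'], 'B': []}
print(crawl('胡歌', lambda k: fake.get(k, [])))  # ['胡歌', 'A', 'B']
```

With real fetching plugged in, max_pages (or a depth limit) keeps the crawl from walking the whole site.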