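Uh, so here's the situation: today I picked a joke site to practice scraping on. Scraping the usernames and jokes on the current page works fine, but fetching the jokes from the next page goes wrong: it just keeps repeating the content of the first page. Could the experts here give me some pointers? Much appreciated T^T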
import urllib.request
import re


def url_open(url):
    # Fetch a page with a browser-like User-Agent and return it as text
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36')
    dakai = urllib.request.urlopen(req)
    html = dakai.read().decode('utf-8')
    return html


def get_page(html):
    # Collect the part of every link that follows the site root
    link = r'<a href="http://www.fanjian.net/(.+)">'
    find_link = re.findall(link, html)
    link_url = []
    for each in find_link:
        link_url.append(each)
    return link_url


def get_duan(html):
    # Extract usernames and joke texts, then print them in pairs
    user = r'target="_blank" title="(.*?)" class="fc-gblue"'
    find_user = re.findall(user, html)
    cont = r'<div class="joke-list-txt">(.+)</div>'
    find_cont = re.findall(cont, html)

    x = 1
    for content in find_cont:
        content = content.replace("\n", "")
        name = "content" + str(x)
        exec(name + '=content')
        x += 1

    y = 1
    for user in find_user:
        name = "content" + str(y)
        print(user + ':')
        exec("print(" + name + ")")
        print("\n")
        y += 1


if __name__ == '__main__':
    url = 'http://www.fanjian.net/duanzi'
    urllist = get_page(url_open(url))

    for i in urllist:
        get_duan(url_open(url))
This post was last edited by ooxx7788 on 2017-4-26 09:50
Actually, you don't need the first part at all. The get_page step earlier in your code does nothing useful.
if __name__ == '__main__':
    for i in range(1, 10):
        url = 'http://www.fanjian.net/duanzi-' + str(i)
        # urllist = get_page(url_open(url))
        # print(urllist)
        # for i in urllist:
        #     print(i)
        get_duan(url_open(url))
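Just change the last part to this and it will work. The reason the earlier change seemed to have no effect is that the loop had to repeat the first page many times before it ever reached the second page.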
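If it helps, here is a rough sketch of the same idea with the page loop and the parsing separated, so the paging can be checked without hitting the site. The URL pattern `duanzi-<n>` and the two regexes come from this thread; the names `page_urls`, `parse_duan`, `crawl` and the `fetch` parameter are made up for illustration. The non-greedy `(.+?)` with `re.S` is my own assumption about the page markup (one joke per `div`, no nested `div` inside it), and `zip` stands in for the `exec` bookkeeping.

```python
import re
import urllib.request

BASE_URL = 'http://www.fanjian.net/duanzi'  # from the original post
USER_RE = re.compile(r'target="_blank" title="(.*?)" class="fc-gblue"')
# Assumption: non-greedy + DOTALL so one match cannot swallow several jokes.
CONT_RE = re.compile(r'<div class="joke-list-txt">(.+?)</div>', re.S)


def url_open(url):
    """Fetch a page and return it as text (same approach as the original post)."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode('utf-8')


def page_urls(first, last):
    """Build '...duanzi-1', '...duanzi-2', ... following the pattern in the reply."""
    return ['{}-{}'.format(BASE_URL, i) for i in range(first, last + 1)]


def parse_duan(html):
    """Return (user, joke) pairs; zip replaces the exec-based variable names."""
    users = USER_RE.findall(html)
    jokes = [c.replace('\n', '').strip() for c in CONT_RE.findall(html)]
    return list(zip(users, jokes))


def crawl(first, last, fetch=url_open):
    """Yield (url, pairs) for each page; `fetch` can be swapped for offline tests."""
    for url in page_urls(first, last):
        # The URL that gets fetched changes on every iteration, which is
        # exactly what the loop in the question was missing.
        yield url, parse_duan(fetch(url))
```

To print pages 1 through 9 the way the reply does, something like `for url, pairs in crawl(1, 9):` followed by `for user, joke in pairs: print(user + ':'); print(joke)` should do it.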