Last edited by chunguang on 2018-8-6 15:20
How do I crawl data spread across multiple pages?
The URL of my first page is http://www.echinatobacco.com/html/site27/ynzlyns/index.html
The second page is http://www.echinatobacco.com/html/site27/ynzlyns/index_2.html
The third page is http://www.echinatobacco.com/html/site27/ynzlyns/index_3.html
........
I can't see an obvious pattern here. Any help would be appreciated.
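Judging from the three URLs above, there does appear to be a pattern: page 1 has no suffix, and every page n >= 2 uses an `index_n.html` suffix. A minimal sketch of a URL builder under that assumption (the `page_url` name is mine, not from the site):

```python
def page_url(page):
    # Page 1 is the bare index.html; later pages append _<page> (assumed
    # from the three example URLs in the question).
    base = 'http://www.echinatobacco.com/html/site27/ynzlyns/index'
    if page == 1:
        return base + '.html'
    return base + '_%d.html' % page

print(page_url(1))  # http://www.echinatobacco.com/html/site27/ynzlyns/index.html
print(page_url(3))  # http://www.echinatobacco.com/html/site27/ynzlyns/index_3.html
```

Whether this holds for all pages is worth spot-checking by opening a few of the later page numbers in a browser.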
import requests
import re
from multiprocessing import Pool
from requests.exceptions import RequestException

def get_one_page(url):
    # Send a desktop browser user-agent so the site serves the normal page
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def parse_one_page(html):
    # Each list item looks like: <li>...blank">title</a></em><span>date</span></li>
    pattern = re.compile('<li>.*?blank">(.*?)</a></em><span>(.*?)</span></li>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'title': item[0],
            'time': item[1]
        }

def main(page):
    url = 'http://www.echinatobacco.com/html/site27/ynzlyns/index' + '_' + str(page) + '.html'
    html = get_one_page(url)
    if html:  # skip pages that failed to download
        for item in parse_one_page(html):
            print(item)

if __name__ == '__main__':
    pool = Pool()
    pool.map(main, [i + 2 for i in range(53)])  # pages 2..54; page 1 is not covered here
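To see what the regex in `parse_one_page` actually captures, here is a small standalone demo against a hypothetical snippet of the list markup (the HTML string is mine, made up to match the pattern, not copied from the site):

```python
import re

# Hypothetical list item mimicking the markup the regex targets
html = ('<li><em><a href="/a.html" target="_blank">Sample title</a></em>'
        '<span>2018-08-06</span></li>')

pattern = re.compile('<li>.*?blank">(.*?)</a></em><span>(.*?)</span></li>', re.S)
print(re.findall(pattern, html))  # [('Sample title', '2018-08-06')]
```

Each match is a (title, date) tuple, which the generator then wraps into a dict.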
import requests
import re
from multiprocessing import Pool
from requests.exceptions import RequestException

def get_one_page(url):
    # Send a desktop browser user-agent so the site serves the normal page
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def parse_one_page(html):
    # Each list item looks like: <li>...blank">title</a></em><span>date</span></li>
    pattern = re.compile('<li>.*?blank">(.*?)</a></em><span>(.*?)</span></li>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'title': item[0],
            'time': item[1]
        }

def main(page):
    if page != 1:  # page 1 does not use the _<page> suffix
        url = 'http://www.echinatobacco.com/html/site27/ynzlyns/index' + '_' + str(page) + '.html'
    else:
        url = 'http://www.echinatobacco.com/html/site27/ynzlyns/index.html'
    html = get_one_page(url)
    if html:  # skip pages that failed to download
        for item in parse_one_page(html):
            print(item)

if __name__ == '__main__':
    pool = Pool()
    # Start from 1 so the page == 1 branch is actually reached
    # (assuming 54 pages in total, as in the original 2..54 range)
    pool.map(main, [i + 1 for i in range(54)])
Try this? It looks like I forgot to set the url for the page == 1 case before.