
Posted on 2019-3-13 09:41:25
Last edited by wiselin on 2019-3-13 09:43
# -*- coding: GBK -*-
import scrapy


class NumSpider(scrapy.Spider):
    name = 'num'
    allowed_domains = ['www.gdfc.org.cn']

    def start_requests(self):
        # Request headers copied verbatim from the browser
        header = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'www.gdfc.org.cn',
            'If-Modified-Since': 'Tue, 12 Mar 2019 01:32:25 GMT',
            'If-None-Match': '"5c870c29-29217"',
            'Referer': 'http://www.gdfc.org.cn/sjfx/tjzb10_50.html',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER',
        }
        urls = ['http://www.gdfc.org.cn/sjfx/tjzb10_200.html']
        for url in urls:
            yield scrapy.Request(url=url, headers=header, callback=self.parse)

    def parse(self, response):
        pass
The request headers above were copied verbatim from the browser. With SplashRequest the page crawls fine, but with a plain Request I get the problem described at the start.
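One possible culprit (an assumption, not confirmed by the post): copying the browser's cache-validation headers, `If-Modified-Since` and `If-None-Match`, can make the server answer `304 Not Modified` with an empty body, which a plain `scrapy.Request` cannot parse, while Splash issues its own fresh request. A minimal sketch of stripping those fields before reusing the headers:

```python
# Abbreviated copy of the browser headers from the spider above; the two
# conditional-request fields are the suspected (assumed) cause of the issue.
browser_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'If-Modified-Since': 'Tue, 12 Mar 2019 01:32:25 GMT',
    'If-None-Match': '"5c870c29-29217"',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) ... LBBROWSER',
}

# Drop the cache-validation headers so the server returns a full 200
# response instead of a body-less 304.
CONDITIONAL = {'If-Modified-Since', 'If-None-Match'}
clean_headers = {k: v for k, v in browser_headers.items() if k not in CONDITIONAL}
```

`clean_headers` can then be passed as `headers=clean_headers` to `scrapy.Request` in `start_requests` in place of the full browser copy.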