This is the proxy IP code, already tested and working:
```python
import urllib.request

proxy_support = urllib.request.ProxyHandler({'http': '183.56.177.130:808'})
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36')]
urllib.request.install_opener(opener)
response = urllib.request.urlopen(start_urls)
html = response.read().decode('utf-8')
```
Then the spider I wrote got banned from the site for crawling too frequently... 囧
For now I just want to use the proxy IP and keep crawling that site. Below is the spider code I have working so far:
```python
# -*- coding: utf-8 -*-
import scrapy
from lagou.items import LagouItem
from scrapy.http import Request


class LagouSpiderSpider(scrapy.Spider):
    name = "lagou_spider"
    allowed_domains = ["www.lagou.com"]
    start_urls = ['https://www.lagou.com/zhaopin/houduankaifa/']

    def parse(self, response):
        sites = response.xpath('//*[@id="s_position_list"]/ul/li/div[1]')
        for site in sites:
            item = LagouItem()
            item['position_name'] = site.xpath('div[1]/div[1]/a/h2/text()').extract()
            item['addr'] = site.xpath('div[1]/div[1]/a/span/em/text()').extract()
            item['company_name'] = site.xpath('div[2]/div[1]/a/text()').extract()
            item['salary'] = site.xpath('div[1]/div[2]/div[1]/span/text()').extract()
            item['experience'] = site.xpath('div[1]/div[2]/div[1]/text()').extract()[2].rstrip()
            item['url'] = site.xpath('div[1]/div[1]/a/@href').extract()
            for url in response.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[1]/a/@href').extract():
                yield Request(url, meta={'item': item}, callback=self.parse2)

        # follow the pagination links
        urls = response.xpath('//*[@id="order"]/li/div[4]/a[2]/@href').extract()
        for li in urls:
            yield Request(li, callback=self.parse)

    def parse2(self, response):
        item = response.meta['item']
        item['desc'] = response.xpath('//*[@id="job_detail"]/dd[2]/div/p').extract()
        yield item
```
What I can't figure out is where the proxy IP code should go when combining the two... it seems to throw errors no matter where I put it. Any guidance would be appreciated.
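A side note on why pasting the urllib snippet into the spider fails everywhere: Scrapy has its own downloader and never goes through a urllib opener, so the `ProxyHandler` is simply ignored. In Scrapy a proxy is attached per request via `request.meta['proxy']`. A minimal sketch, reusing the proxy address from the snippet above; the helper name `with_proxy` is illustrative, not a Scrapy API:

```python
# Scrapy never consults a urllib opener; its built-in HttpProxyMiddleware
# reads the proxy from request.meta['proxy'] instead.
# PROXY below is the tested address from the first snippet.
PROXY = 'http://183.56.177.130:808'


def with_proxy(meta=None):
    """Return a request meta dict that also carries the proxy.

    Illustrative helper (not part of Scrapy) to avoid repeating the
    proxy key in every Request() call.
    """
    meta = dict(meta or {})
    meta['proxy'] = PROXY
    return meta

# Inside the spider this would be used like:
#   yield Request(url, meta=with_proxy({'item': item}), callback=self.parse2)
```

This works without any extra middleware, since Scrapy's default `HttpProxyMiddleware` already honors `meta['proxy']`; the middleware approach in the reply below is just a cleaner place to centralize it.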
http://brucedone.com/archives/88
Use this as a reference; I've tested it myself and it works.
You only need to modify the proxy-server part in it:
```python
def process_request(self, request, spider):
    ua = random.choice(settings.get('USER_AGENT_LIST'))
    spider.logger.info(msg='now entering download midware')
    if ua:
        request.headers['User-Agent'] = ua
        # Add desired logging message here.
        spider.logger.info(
            u'User-Agent is : {} {}'.format(request.headers.get('User-Agent'), request)
        )
```
Just add your proxy IP in there and it will work.
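To make the placement concrete, here is a minimal sketch of such a downloader middleware with the proxy from the first snippet added in. The `USER_AGENT_LIST` setting and the module path in `DOWNLOADER_MIDDLEWARES` are assumptions; adjust them to your own project layout:

```python
import random

# Assumption: the tested proxy from the first snippet; swap in your own.
PROXY = 'http://183.56.177.130:808'


class ProxyMiddleware(object):
    """Downloader middleware: sets a random User-Agent and a proxy."""

    def process_request(self, request, spider):
        # USER_AGENT_LIST is assumed to be a list of UA strings in settings.py
        ua = random.choice(spider.settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers['User-Agent'] = ua
        # This single line is where the proxy goes:
        request.meta['proxy'] = PROXY

# Enable it in settings.py (the module path is an assumption,
# match it to where you put this class):
# DOWNLOADER_MIDDLEWARES = {
#     'lagou.middlewares.ProxyMiddleware': 543,
# }
```

With this enabled, every request the spider yields is routed through the proxy, so nothing in the spider code itself needs to change.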