I've been working through a book recently, and one crawler example in it just won't work. It scrapes Lianjia's second-hand housing listings for Huangpu, Shanghai: https://sh.lianjia.com/ershoufang/huangpu/
The spider returns:
2022-03-03 20:44:40 [houseSpider] INFO: Spider opened: houseSpider
2022-03-03 20:44:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-03 20:44:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg1/> (referer: None)
2022-03-03 20:44:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg1/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:44 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg2/> (referer: None)
2022-03-03 20:44:44 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg2/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:48 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg3/> (referer: None)
2022-03-03 20:44:49 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg3/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:51 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg4/> (referer: None)
2022-03-03 20:44:51 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg4/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:55 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg5/> (referer: None)
2022-03-03 20:44:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg5/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:59 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg6/> (referer: None)
2022-03-03 20:44:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg6/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:04 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg7/> (referer: None)
2022-03-03 20:45:04 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg7/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:07 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg8/> (referer: None)
2022-03-03 20:45:07 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg8/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:10 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg9/> (referer: None)
2022-03-03 20:45:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg9/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:15 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg10/> (referer: None)
2022-03-03 20:45:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg10/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:15 [scrapy.core.engine] INFO: Closing spider (finished)
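Before touching any settings, look at the URL inside those Crawled (403) lines: http://https//sh,lianjia.com/... has a doubled scheme and a comma where the dot after "sh" should be, so the requests never hit the real site correctly (the reply at the end spots this too). Separately, if you ever want to see what a 403 response body actually says instead of letting HttpErrorMiddleware silently drop it, a spider can whitelist the status. A minimal sketch (the spider name is hypothetical, for illustration only):

import scrapy

class DebugSpider(scrapy.Spider):
    name = 'debugSpider'  # hypothetical, not part of the original project
    handle_httpstatus_list = [403]  # pass 403 responses through to parse instead of dropping them
    start_urls = ['https://sh.lianjia.com/ershoufang/huangpu/pg1/']

    def parse(self, response):
        if response.status == 403:
            # Log (or save response.text) to see whether it's a captcha or block page.
            self.logger.warning("Got 403 from %s", response.url)
            return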
I've already added a User-Agent in settings.py, turned off ROBOTSTXT_OBEY, and disabled COOKIES_ENABLED,
and added DOWNLOAD_DELAY = 3 as well, but it still returns nothing but 403s.
houseSpider.py, items.py, and pipelines.py were copied straight from the book, so they should be fine.
The indentation in the code is correct; the editor just stripped it when I pasted it here.
settings.py:
BOT_NAME = 'houseScrapy'

SPIDER_MODULES = ['houseScrapy.spiders']
NEWSPIDER_MODULE = 'houseScrapy.spiders'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'

DOWNLOADER_MIDDLEWARES = {
    'houseScrapy.middlewares.HousescrapyDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'houseScrapy.pipelines.HousescrapyPipeline': 300,
}

COOKIES_ENABLED = False
DOWNLOAD_DELAY = 3
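One thing missing from that paste: ROBOTSTXT_OBEY = False does not actually appear, even though the post says it was turned off. And if 403s were to continue after the URL fix below, Lianjia may check more than the User-Agent string. A hedged sketch of extra settings (the header values are illustrative guesses, not known requirements of the site):

# settings.py -- assumed additions, not from the book
ROBOTSTXT_OBEY = False  # the post says this was set, but it isn't in the paste above

# Browser-like default headers; some sites reject requests that carry only a UA.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Referer': 'https://sh.lianjia.com/',
}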
items.py:
import scrapy

class HousescrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    pricePerSqM2 = scrapy.Field()
    community = scrapy.Field()
    address = scrapy.Field()
    houseType = scrapy.Field()
    area = scrapy.Field()
    orientation = scrapy.Field()
    repairStatus = scrapy.Field()
    floor = scrapy.Field()
    noticeNum = scrapy.Field()
    mortgage = scrapy.Field()
    transProperty = scrapy.Field()
houseSpider.py:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from houseScrapy.items import HousescrapyItem

class HousespiderSpider(scrapy.Spider):
    name = 'houseSpider'
    allowed_domains = ['sh,lianjia.com/ershoufang/']
    start_urls = ['http://https://sh,lianjia.com/ershoufang/huangpu/pg{}/'.format(i) for i in range(1, 11)]

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import codecs
import json
import os

class HousescrapyPipeline(object):
    def __init__(self):
        filename = 'house.json'
        if os.path.exists(filename):
            os.remove(filename)
        self.file = codecs.open(filename, 'w', encoding="utf-8")

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def close_spider(self, spider):  # Scrapy calls close_spider on pipelines; spider_closed is never invoked here
        self.file.close()
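As an aside, this pipeline hand-rolls what Scrapy's feed exports already provide. Assuming a reasonably recent Scrapy (the overwrite flag needs version 2.4 or later), the same one-JSON-object-per-line file could come from settings alone, with no pipeline class. A sketch, not the book's approach:

# settings.py -- replaces the custom pipeline for this use case
FEEDS = {
    'house.json': {
        'format': 'jsonlines',  # one JSON object per line, like the pipeline above
        'encoding': 'utf8',
        'overwrite': True,      # replaces the os.remove() step in __init__
    },
}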
I also checked Lianjia's robots.txt:
User-agent: Baiduspider
Allow: *?utm_source=office*
User-agent: *
sitemap: https://sh.lianjia.com/sitemap/sh_index.xml
Disallow: /rs
Disallow: /huxingtu/p
Disallow: /user
Disallow: /login
Disallow: /userinfo
Disallow: *?project_name=*
Disallow: *?id=*
Disallow: *?house_codes=*
Disallow: *?sug=*
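Worth noting: none of those Disallow rules match /ershoufang/huangpu/ listing pages, so even with ROBOTSTXT_OBEY left at its default these URLs would be crawlable. The 403s are coming from the request itself, not from robots.txt.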
Lastly, a screenshot of the crawler project folder (image in the original post):
Your URL is written wrong; here is the corrected houseSpider.py:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from houseScrapy.items import HousescrapyItem

class HousespiderSpider(scrapy.Spider):
    name = 'houseSpider'
    allowed_domains = ['sh.lianjia.com']  # sh, changed to sh. -- and domain only: Scrapy ignores allowed_domains entries that contain a path
    start_urls = ['https://sh.lianjia.com/ershoufang/huangpu/pg{}/'.format(i) for i in range(1, 11)]  # http://https://sh, changed to https://sh.

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
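Note that parse in both listings stops right after creating the item. A minimal sketch of how the loop might continue (the inner XPath expressions are guesses at Lianjia's list-page markup to illustrate the pattern, not selectors taken from the book):

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
            # Hypothetical selectors -- verify each one against the live page first.
            item['title'] = house.xpath(".//div[@class='title']/a/text()").get()
            item['community'] = house.xpath(".//div[@class='positionInfo']/a[1]/text()").get()
            item['price'] = house.xpath(".//div[@class='totalPrice']/span/text()").get()
            item['pricePerSqM2'] = house.xpath(".//div[@class='unitPrice']/span/text()").get()
            yield item  # hand the populated item to the pipeline

The remaining HousescrapyItem fields (houseType, area, orientation, and so on) would follow the same pattern once the real markup is confirmed.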