鱼C论坛 (FishC Forum)

Views: 1437 | Replies: 2

[Solved] Scrapy returns 403 when crawling Lianjia second-hand housing listings

Posted on 2022-3-3 21:05:10

I've recently been working through a book, and one of its crawler examples just won't work for me. It scrapes second-hand housing listings for Huangpu, Shanghai from Lianjia: https://sh.lianjia.com/ershoufang/huangpu/

The spider's output is:
2022-03-03 20:44:40 [houseSpider] INFO: Spider opened: houseSpider
2022-03-03 20:44:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-03 20:44:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg1/> (referer: None)
2022-03-03 20:44:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg1/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:44 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg2/> (referer: None)
2022-03-03 20:44:44 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg2/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:48 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg3/> (referer: None)
2022-03-03 20:44:49 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg3/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:51 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg4/> (referer: None)
2022-03-03 20:44:51 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg4/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:55 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg5/> (referer: None)
2022-03-03 20:44:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg5/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:59 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg6/> (referer: None)
2022-03-03 20:44:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg6/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:04 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg7/> (referer: None)
2022-03-03 20:45:04 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg7/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:07 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg8/> (referer: None)
2022-03-03 20:45:07 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg8/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:10 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg9/> (referer: None)
2022-03-03 20:45:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg9/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:15 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg10/> (referer: None)
2022-03-03 20:45:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg10/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:15 [scrapy.core.engine] INFO: Closing spider (finished)
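(A side note on the "Ignoring response" lines: the httperror spider middleware drops non-2xx responses before they ever reach parse. If you want to inspect the 403 response body while debugging, Scrapy's standard HTTPERROR_ALLOWED_CODES setting lets those responses through to the callback; a minimal sketch for settings.py:)

# Debugging sketch: let 403 responses reach the spider callback
# instead of being dropped by scrapy.spidermiddlewares.httperror.
HTTPERROR_ALLOWED_CODES = [403]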



I've already added a USER_AGENT in settings.py, turned off ROBOTSTXT_OBEY and COOKIES_ENABLED,
and added a delay with DOWNLOAD_DELAY = 3, but I still get nothing but 403s.

Spider.py, items.py, and pipelines are all copied straight from the book, so they should be fine.

The indentation in my code is fine; the forum editor just stripped it when I pasted.
settings.py (the relevant part):
BOT_NAME = 'houseScrapy'

SPIDER_MODULES = ['houseScrapy.spiders']
NEWSPIDER_MODULE = 'houseScrapy.spiders'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'

DOWNLOADER_MIDDLEWARES = {
   'houseScrapy.middlewares.HousescrapyDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
   'houseScrapy.pipelines.HousescrapyPipeline': 300,
}
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 3
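(The ROBOTSTXT_OBEY change mentioned above doesn't show up in this paste; presumably it's just the standard setting somewhere in settings.py:)

ROBOTSTXT_OBEY = False  # assumed from the description above; not visible in the paste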

items.py:
import scrapy


class HousescrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    pricePerSqM2 = scrapy.Field()
    community = scrapy.Field()
    address = scrapy.Field()
    houseType = scrapy.Field()
    area = scrapy.Field()
    orientation = scrapy.Field()
    repairStatus = scrapy.Field()
    floor = scrapy.Field()
    noticeNum = scrapy.Field()
    mortgage = scrapy.Field()
    transProperty = scrapy.Field()
houseSpider.py:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from houseScrapy.items import HousescrapyItem

class HousespiderSpider(scrapy.Spider):
    name = 'houseSpider'
    allowed_domains = ['sh,lianjia.com/ershoufang/']
    start_urls = ['http://https://sh,lianjia.com/ershoufang/huangpu/pg{}/'.format(i) for i in range(1, 11)]

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
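(The paste cuts off here. The loop body would go on to fill in the item fields and yield the item, roughly like this sketch; the selectors below are hypothetical, since the book's exact XPath expressions aren't shown:)

            # Hypothetical continuation of the for-loop body; the real
            # selectors come from the book. .get() returns the first match.
            item['title'] = house.xpath(".//div[@class='title']/a/text()").get()
            item['price'] = house.xpath(".//div[@class='totalPrice']/span/text()").get()
            yield item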
pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import codecs
import json
import os


class HousescrapyPipeline(object):
    def __init__(self):
        filename = 'house.json'
        if os.path.exists(filename):
            os.remove(filename)
        self.file = codecs.open(filename, 'w', encoding="utf-8")

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def spider_closed(self, spider):
        self.file.close()
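(One thing worth checking separately once the 403s are fixed: Scrapy calls open_spider/close_spider on item pipelines automatically, but a method named spider_closed only runs if you connect it to the spider_closed signal. If house.json ever ends up unclosed, renaming the method should be enough; a minimal sketch:)

    def close_spider(self, spider):
        # Called automatically by Scrapy when the spider finishes.
        self.file.close()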


I also checked Lianjia's robots.txt (nothing there disallows /ershoufang/):
User-agent: Baiduspider
Allow: *?utm_source=office*

User-agent: *
sitemap: https://sh.lianjia.com/sitemap/sh_index.xml
Disallow: /rs
Disallow: /huxingtu/p
Disallow: /user
Disallow: /login
Disallow: /userinfo
Disallow: *?project_name=*
Disallow: *?id=*
Disallow: *?house_codes=*
Disallow: *?sug=*
Finally, here are screenshots of the crawler project folder:

[screenshot: project folder (login required on the original forum to view the full image)]

[screenshot: project folder (login required on the original forum to view the full image)]


isdkz | Posted on 2022-3-3 23:47:46 | Best Answer
Your URL is written wrong:

houseSpider.py:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from houseScrapy.items import HousescrapyItem

class HousespiderSpider(scrapy.Spider):
    name = 'houseSpider'
    allowed_domains = ['sh.lianjia.com/ershoufang/']                  # change sh, to sh.
    start_urls = ['https://sh.lianjia.com/ershoufang/huangpu/pg{}/'.format(i) for i in range(1, 11)]       # change http://https://sh, to https://sh.

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
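A quick way to sanity-check the corrected URL is the Scrapy shell (a sketch; the site could still block non-browser traffic for other reasons):

# From the project directory:
#     scrapy shell "https://sh.lianjia.com/ershoufang/huangpu/pg1/"
# Then, inside the shell:
response.status                                             # expect 200
len(response.xpath("//ul[@class='sellListContent']/li"))    # listings found on the page

Note also that allowed_domains normally holds bare domains, e.g. 'sh.lianjia.com'; Scrapy's offsite filtering matches the domain only, so a path there does nothing useful.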

Original poster | Posted on 2022-3-4 10:30:24
isdkz posted on 2022-3-3 23:47:
Your URL is written wrong,

houseSpider.py

Thank you so much!!!
