鱼C论坛 (FishC Forum)

Views: 1268 | Replies: 2

[Solved] Scrapy returns 403 when crawling Lianjia second-hand housing listings

Posted on 2022-3-3 21:05:10

I've been working through a book recently, and there's one crawler example in it I just can't get to work. It scrapes second-hand housing listings for Shanghai's Huangpu district from Lianjia: https://sh.lianjia.com/ershoufang/huangpu/

What the spider returns is:
2022-03-03 20:44:40 [houseSpider] INFO: Spider opened: houseSpider
2022-03-03 20:44:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-03 20:44:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg1/> (referer: None)
2022-03-03 20:44:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg1/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:44 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg2/> (referer: None)
2022-03-03 20:44:44 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg2/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:48 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg3/> (referer: None)
2022-03-03 20:44:49 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg3/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:51 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg4/> (referer: None)
2022-03-03 20:44:51 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg4/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:55 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg5/> (referer: None)
2022-03-03 20:44:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg5/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:59 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg6/> (referer: None)
2022-03-03 20:44:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg6/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:04 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg7/> (referer: None)
2022-03-03 20:45:04 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg7/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:07 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg8/> (referer: None)
2022-03-03 20:45:07 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg8/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:10 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg9/> (referer: None)
2022-03-03 20:45:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg9/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:15 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg10/> (referer: None)
2022-03-03 20:45:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg10/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:15 [scrapy.core.engine] INFO: Closing spider (finished)


I've already added a USER_AGENT in settings.py, disabled ROBOTSTXT_OBEY and COOKIES_ENABLED,
and added a DOWNLOAD_DELAY = 3 as well, but it still only returns 403.

The spider, items.py and pipelines.py are all copied straight from the book, so they should be fine.

The indentation in my code is correct; it just got stripped when I pasted it here.
settings.py:
BOT_NAME = 'houseScrapy'

SPIDER_MODULES = ['houseScrapy.spiders']
NEWSPIDER_MODULE = 'houseScrapy.spiders'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'

DOWNLOADER_MIDDLEWARES = {
   'houseScrapy.middlewares.HousescrapyDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
   'houseScrapy.pipelines.HousescrapyPipeline': 300,
}
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 3
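(The ROBOTSTXT_OBEY line mentioned above also got eaten by the paste; in my settings.py it is simply:)

# not shown in the paste above, but present in the file as described
ROBOTSTXT_OBEY = False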
items.py:
import scrapy


class HousescrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    pricePerSqM2 = scrapy.Field()
    community = scrapy.Field()
    address = scrapy.Field()
    houseType = scrapy.Field()
    area = scrapy.Field()
    orientation = scrapy.Field()
    repairStatus = scrapy.Field()
    floor = scrapy.Field()
    noticeNum = scrapy.Field()
    mortgage = scrapy.Field()
    transProperty = scrapy.Field()
houseSpider.py:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from houseScrapy.items import HousescrapyItem

class HousespiderSpider(scrapy.Spider):
    name = 'houseSpider'
    allowed_domains = ['sh,lianjia.com/ershoufang/']
    start_urls = ['http://https://sh,lianjia.com/ershoufang/huangpu/pg{}/'.format(i) for i in range(1, 11)]

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
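# (The paste cuts off here. Purely as an illustration -- these selectors are
# guesses, not necessarily the book's -- the loop presumably goes on to fill
# the item's fields and yield it, along these lines:)
            item['title'] = house.xpath("./div[@class='title']/a/text()").get()
            item['price'] = house.xpath(".//div[@class='totalPrice']/span/text()").get()
            yield item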
pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import codecs
import json
import os


class HousescrapyPipeline(object):
    def __init__(self):
        filename = 'house.json'
        if os.path.exists(filename):
            os.remove(filename)  # start with a fresh file on every run
        self.file = codecs.open(filename, 'w', encoding="utf-8")

    def process_item(self, item, spider):
        # write each item as one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):  # Scrapy calls close_spider() on pipelines, not spider_closed()
        self.file.close()
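(If everything works, house.json should end up with one JSON object per line, so reading it back is just a couple of lines -- a quick sketch:)

import json

with open('house.json', encoding='utf-8') as f:
    houses = [json.loads(line) for line in f]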

I also checked Lianjia's robots.txt (none of the Disallow rules match /ershoufang/, so robots.txt shouldn't be the issue):
User-agent: Baiduspider
Allow: *?utm_source=office*

User-agent: *
sitemap: https://sh.lianjia.com/sitemap/sh_index.xml
Disallow: /rs
Disallow: /huxingtu/p
Disallow: /user
Disallow: /login
Disallow: /userinfo
Disallow: *?project_name=*
Disallow: *?id=*
Disallow: *?house_codes=*
Disallow: *?sug=*
Finally, screenshots of the crawler project folder:

[attached images, visible to logged-in users only]


Posted on 2022-3-3 23:47:46 | Best answer
You've written the URL wrong:

houseSpider.py:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from houseScrapy.items import HousescrapyItem

class HousespiderSpider(scrapy.Spider):
    name = 'houseSpider'
    allowed_domains = ['sh.lianjia.com/ershoufang/']                  # sh, changed to sh.
    start_urls = ['https://sh.lianjia.com/ershoufang/huangpu/pg{}/'.format(i) for i in range(1, 11)]       # http://https://sh, changed to https://sh.

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
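By the way, this isn't what causes the 403, but allowed_domains is meant to hold bare domain names, not URLs with paths. It makes no difference in this spider because only the start_urls are fetched, but for any follow-up requests the safe form is:

    allowed_domains = ['sh.lianjia.com']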
OP | Posted on 2022-3-4 10:30:24
isdkz posted on 2022-3-3 23:47:
You've written the URL wrong,

houseSpider.py

Thank you so much!!!
