鱼C论坛 (FishC Forum)

Views: 1437 | Replies: 2

[Solved] Scrapy returns 403 when crawling Lianjia second-hand housing listings

Posted on 2022-3-3 21:05:10

I've recently been working through a book, and one of its crawler examples just won't work for me. It scrapes second-hand housing listings for Huangpu, Shanghai from Lianjia: https://sh.lianjia.com/ershoufang/huangpu/

The spider's output is:
2022-03-03 20:44:40 [houseSpider] INFO: Spider opened: houseSpider
2022-03-03 20:44:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-03 20:44:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg1/> (referer: None)
2022-03-03 20:44:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg1/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:44 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg2/> (referer: None)
2022-03-03 20:44:44 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg2/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:48 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg3/> (referer: None)
2022-03-03 20:44:49 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg3/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:51 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg4/> (referer: None)
2022-03-03 20:44:51 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg4/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:55 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg5/> (referer: None)
2022-03-03 20:44:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg5/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:59 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg6/> (referer: None)
2022-03-03 20:44:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg6/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:04 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg7/> (referer: None)
2022-03-03 20:45:04 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg7/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:07 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg8/> (referer: None)
2022-03-03 20:45:07 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg8/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:10 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg9/> (referer: None)
2022-03-03 20:45:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg9/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:15 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg10/> (referer: None)
2022-03-03 20:45:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg10/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:15 [scrapy.core.engine] INFO: Closing spider (finished)
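(A side note on the "Ignoring response" lines: the httperror spider middleware drops non-2xx responses before they ever reach parse. If you want to inspect the 403 response body while debugging, Scrapy's standard HTTPERROR_ALLOWED_CODES setting lets those responses through to the callback; a minimal sketch for settings.py:)

# Debugging sketch: let 403 responses reach the spider callback
# instead of being dropped by scrapy.spidermiddlewares.httperror.
HTTPERROR_ALLOWED_CODES = [403]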



I've already added a USER_AGENT in settings.py, turned off ROBOTSTXT_OBEY and COOKIES_ENABLED,
and added a delay with DOWNLOAD_DELAY = 3, but I still get nothing but 403s.

Spider.py, items.py, and pipelines are all copied straight from the book, so they should be fine.

The indentation in my code is fine; the forum editor just stripped it when I pasted.
settings.py (the relevant part):
BOT_NAME = 'houseScrapy'

SPIDER_MODULES = ['houseScrapy.spiders']
NEWSPIDER_MODULE = 'houseScrapy.spiders'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'

DOWNLOADER_MIDDLEWARES = {
   'houseScrapy.middlewares.HousescrapyDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
   'houseScrapy.pipelines.HousescrapyPipeline': 300,
}
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 3
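(The ROBOTSTXT_OBEY change mentioned above doesn't show up in this paste; presumably it's just the standard setting somewhere in settings.py:)

ROBOTSTXT_OBEY = False  # assumed from the description above; not visible in the paste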

items.py:
import scrapy


class HousescrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    pricePerSqM2 = scrapy.Field()
    community = scrapy.Field()
    address = scrapy.Field()
    houseType = scrapy.Field()
    area = scrapy.Field()
    orientation = scrapy.Field()
    repairStatus = scrapy.Field()
    floor = scrapy.Field()
    noticeNum = scrapy.Field()
    mortgage = scrapy.Field()
    transProperty = scrapy.Field()
houseSpider.py:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from houseScrapy.items import HousescrapyItem

class HousespiderSpider(scrapy.Spider):
    name = 'houseSpider'
    allowed_domains = ['sh,lianjia.com/ershoufang/']
    start_urls = ['http://https://sh,lianjia.com/ershoufang/huangpu/pg{}/'.format(i) for i in range(1, 11)]

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
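(The paste cuts off here. The loop body would go on to fill in the item fields and yield the item, roughly like this sketch; the selectors below are hypothetical, since the book's exact XPath expressions aren't shown:)

            # Hypothetical continuation of the for-loop body; the real
            # selectors come from the book. .get() returns the first match.
            item['title'] = house.xpath(".//div[@class='title']/a/text()").get()
            item['price'] = house.xpath(".//div[@class='totalPrice']/span/text()").get()
            yield item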
pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import codecs
import json
import os


class HousescrapyPipeline(object):
    def __init__(self):
        filename = 'house.json'
        if os.path.exists(filename):
            os.remove(filename)
        self.file = codecs.open(filename, 'w', encoding="utf-8")

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def spider_closed(self, spider):
        self.file.close()
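(One thing worth checking separately once the 403s are fixed: Scrapy calls open_spider/close_spider on item pipelines automatically, but a method named spider_closed only runs if you connect it to the spider_closed signal. If house.json ever ends up unclosed, renaming the method should be enough; a minimal sketch:)

    def close_spider(self, spider):
        # Called automatically by Scrapy when the spider finishes.
        self.file.close()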


I also checked Lianjia's robots.txt (nothing there disallows /ershoufang/):
User-agent: Baiduspider
Allow: *?utm_source=office*

User-agent: *
sitemap: https://sh.lianjia.com/sitemap/sh_index.xml
Disallow: /rs
Disallow: /huxingtu/p
Disallow: /user
Disallow: /login
Disallow: /userinfo
Disallow: *?project_name=*
Disallow: *?id=*
Disallow: *?house_codes=*
Disallow: *?sug=*
Finally, here are screenshots of the crawler project folder:

[screenshot: project folder (login required on the original forum to view the full image)]

[screenshot: project folder (login required on the original forum to view the full image)]


isdkz | Posted on 2022-3-3 23:47:46 | Best Answer
Your URL is written wrong:

houseSpider.py:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from houseScrapy.items import HousescrapyItem

class HousespiderSpider(scrapy.Spider):
    name = 'houseSpider'
    allowed_domains = ['sh.lianjia.com/ershoufang/']                  # change sh, to sh.
    start_urls = ['https://sh.lianjia.com/ershoufang/huangpu/pg{}/'.format(i) for i in range(1, 11)]       # change http://https://sh, to https://sh.

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
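A quick way to sanity-check the corrected URL is the Scrapy shell (a sketch; the site could still block non-browser traffic for other reasons):

# From the project directory:
#     scrapy shell "https://sh.lianjia.com/ershoufang/huangpu/pg1/"
# Then, inside the shell:
response.status                                             # expect 200
len(response.xpath("//ul[@class='sellListContent']/li"))    # listings found on the page

Note also that allowed_domains normally holds bare domains, e.g. 'sh.lianjia.com'; Scrapy's offsite filtering matches the domain only, so a path there does nothing useful.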

Original poster | Posted on 2022-3-4 10:30:24
isdkz posted on 2022-3-3 23:47:
Your URL is written wrong,

houseSpider.py

Thank you so much!!!
