I've been working through a book lately and can't get one of its crawler examples to work. It scrapes second-hand housing listings for Huangpu, Shanghai from Lianjia: https://sh.lianjia.com/ershoufang/huangpu/
All the spider returns is:
2022-03-03 20:44:40 [houseSpider] INFO: Spider opened: houseSpider
2022-03-03 20:44:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-03 20:44:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg1/> (referer: None)
2022-03-03 20:44:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg1/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:44 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg2/> (referer: None)
2022-03-03 20:44:44 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg2/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:48 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg3/> (referer: None)
2022-03-03 20:44:49 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg3/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:51 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg4/> (referer: None)
2022-03-03 20:44:51 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg4/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:55 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg5/> (referer: None)
2022-03-03 20:44:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg5/>: HTTP status code is not handled or not allowed
2022-03-03 20:44:59 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg6/> (referer: None)
2022-03-03 20:44:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg6/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:04 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg7/> (referer: None)
2022-03-03 20:45:04 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg7/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:07 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg8/> (referer: None)
2022-03-03 20:45:07 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg8/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:10 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg9/> (referer: None)
2022-03-03 20:45:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg9/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:15 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://https//sh,lianjia.com/ershoufang/huangpu/pg10/> (referer: None)
2022-03-03 20:45:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://https//sh,lianjia.com/ershoufang/huangpu/pg10/>: HTTP status code is not handled or not allowed
2022-03-03 20:45:15 [scrapy.core.engine] INFO: Closing spider (finished)
I've already added a USER_AGENT in settings.py, turned off ROBOTSTXT_OBEY and COOKIES_ENABLED,
and even added DOWNLOAD_DELAY = 3, but I still get nothing but 403s.
houseSpider.py, items.py and pipelines.py are copied straight from the book, so they should be fine.
The indentation in my code is correct; the forum editor just stripped it when I pasted.
settings.py:
BOT_NAME = 'houseScrapy'
SPIDER_MODULES = ['houseScrapy.spiders']
NEWSPIDER_MODULE = 'houseScrapy.spiders'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'houseScrapy.middlewares.HousescrapyDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'houseScrapy.pipelines.HousescrapyPipeline': 300,
}
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 3
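A quick way to rule out the settings file simply not being loaded is to log the resolved values from inside the spider; self.settings is available once the spider is bound to the crawler:

    def parse(self, response):
        self.logger.info("UA in effect: %s", self.settings.get('USER_AGENT'))
        self.logger.info("ROBOTSTXT_OBEY: %s", self.settings.getbool('ROBOTSTXT_OBEY'))

These lines just print whatever Scrapy actually resolved for those settings.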
items.py:
import scrapy

class HousescrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    pricePerSqM2 = scrapy.Field()
    community = scrapy.Field()
    address = scrapy.Field()
    houseType = scrapy.Field()
    area = scrapy.Field()
    orientation = scrapy.Field()
    repairStatus = scrapy.Field()
    floor = scrapy.Field()
    noticeNum = scrapy.Field()
    mortgage = scrapy.Field()
    transProperty = scrapy.Field()
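For reference, each scrapy.Field() declared above behaves like a dict key on the item, and the names must match exactly:

    item = HousescrapyItem()
    item['title'] = 'some title'   # fine: 'title' is declared above
    item['Title'] = 'oops'         # KeyError: HousescrapyItem does not support field: Title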
houseSpider.py:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from houseScrapy.items import HousescrapyItem

class HousespiderSpider(scrapy.Spider):
    name = 'houseSpider'
    allowed_domains = ['sh,lianjia.com/ershoufang/']
    start_urls = ['http://https://sh,lianjia.com/ershoufang/huangpu/pg{}/'.format(i) for i in range(1, 11)]

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import codecs
import json
import os

class HousescrapyPipeline(object):
    def __init__(self):
        filename = 'house.json'
        if os.path.exists(filename):
            os.remove(filename)
        self.file = codecs.open(filename, 'w', encoding="utf-8")

    def process_item(self, item, spider):
        # one JSON object per line
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    # the hook Scrapy calls automatically is close_spider;
    # a method named spider_closed would never be invoked here
    def close_spider(self, spider):
        self.file.close()
I also checked Lianjia's robots.txt:
User-agent: Baiduspider
Allow: *?utm_source=office*
User-agent: *
sitemap: https://sh.lianjia.com/sitemap/sh_index.xml
Disallow: /rs
Disallow: /huxingtu/p
Disallow: /user
Disallow: /login
Disallow: /userinfo
Disallow: *?project_name=*
Disallow: *?id=*
Disallow: *?house_codes=*
Disallow: *?sug=*
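Nothing in there disallows /ershoufang/, which a quick check with Python's standard urllib.robotparser confirms (just a sanity check, using the generic 'User-agent: *' section):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://sh.lianjia.com/robots.txt")
    rp.read()
    print(rp.can_fetch("*", "https://sh.lianjia.com/ershoufang/huangpu/pg1/"))  # expect True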
Finally, a screenshot of the project folder:
Your url is written wrong:
houseSpider.py:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from houseScrapy.items import HousescrapyItem

class HousespiderSpider(scrapy.Spider):
    name = 'houseSpider'
    allowed_domains = ['sh.lianjia.com']  # change sh, to sh. -- and allowed_domains takes bare domains, no path
    start_urls = ['https://sh.lianjia.com/ershoufang/huangpu/pg{}/'.format(i) for i in range(1, 11)]  # change http://https://sh, to https://sh.

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
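You can verify the fix without re-running the whole crawl: fetch a single page with scrapy shell "https://sh.lianjia.com/ershoufang/huangpu/pg1/" and check response.status first.

Also, your pasted parse() cuts off right after creating the item. Here is a minimal sketch of how it might continue; the field XPaths below are guesses at Lianjia's markup, so check them in scrapy shell before trusting them:

    def parse(self, response):
        houses = response.xpath("//ul[@class='sellListContent']/li/div[@class='info clear']")
        for house in houses:
            item = HousescrapyItem()
            # hypothetical selectors -- verify against the live page
            item['title'] = house.xpath(".//div[@class='title']/a/text()").get()
            item['price'] = house.xpath(".//div[@class='totalPrice']/span/text()").get()
            yield item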