鱼C论坛


58 Rental Listings Spider

Viewed 803 times | 2017-3-23 15:46 | Category: Study notes

rent_spider.py
import re

import scrapy

from rent_58.items import Rent58Item


class Rent58Spider(scrapy.Spider):
    name = "Rent58"
    allowed_domains = ['58.com']
    start_urls = ['http://sz.58.com/shangsha/zufang/0/j1/?PGTID=0d300008-017c-9fb0-8e13-ca6fcc905ad8&ClickID=4']

    def parse(self, response):
        items = []
        # Each <li> under ul.listUl is one rental listing.
        sites = response.xpath('//div[@class="listBox"]/ul[@class="listUl"]/li')
        for site in sites:
            item = Rent58Item()
            try:
                title = site.xpath('div[@class="des"]/h2/a/text()').extract()
                link = site.xpath('div[@class="des"]/h2/a/@href').extract()
                room = site.xpath('div[@class="des"]/p[@class="room"]/text()').extract()
                info = site.xpath('div[@class="des"]/p[@class="add"]/a/text()').extract()
                name = site.xpath('div[@class="des"]/p[@class="geren"]/text()').extract()
                money = site.xpath('div[@class="listliright"]/div[@class="money"]/b/text()').extract()
                time = site.xpath('div[@class="listliright"]/div[@class="sendTime"]/text()').extract()
                # re.sub(r'\s+', '', ...) strips all whitespace from the text fields.
                item['title'] = re.sub(r'\s+', '', title.pop())
                item['link'] = link.pop()
                item['room'] = re.sub(r'\s+', '', room.pop())
                item['info'] = info.pop()
                item['name'] = re.sub(r'\s+', '', name.pop(1))
                item['money'] = money.pop()
                item['time'] = re.sub(r'\s+', '', time.pop())
            except IndexError:
                # An XPath that matched nothing makes pop() fail;
                # the item is then kept with only the fields set so far.
                pass
            items.append(item)
        return items
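
The pop() plus re.sub pattern in parse() raises IndexError as soon as one XPath matches nothing, which cuts off every field after it. A minimal sketch of a helper that does the same whitespace cleanup but falls back to a default value instead. The name clean_field and its parameters are hypothetical (not part of Scrapy or of the original spider), and the sample strings are made up. Unlike pop(), it does not modify the list.

```python
import re


def clean_field(values, index=-1, default=''):
    """Return values[index] with all whitespace removed, or default if missing.

    values is the list returned by .extract(); index=-1 picks the last
    element, the same one list.pop() would return.
    """
    try:
        raw = values[index]
    except IndexError:
        return default
    return re.sub(r'\s+', '', raw)


if __name__ == '__main__':
    # Made-up sample values, only to show the behaviour.
    print(clean_field(['\n   2室1厅   \n']))
    print(repr(clean_field([])))
```

With such a helper, a line like item['room'] = clean_field(room) would fill the field with an empty string when the listing has no room text, and the remaining fields would still be processed.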

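pipelines.py (the important part: it automatically saves the items as JSON, one object per line)
import codecs
import json


class Rent58Pipeline(object):
    def __init__(self):
        # codecs.open handles the UTF-8 encoding of everything written.
        self.file = codecs.open('item.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False writes Chinese text as-is instead of \uXXXX escapes.
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Close the output file when the crawl ends.
        self.file.close()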
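
To illustrate what ensure_ascii=False changes, and how the resulting file can be read back, here is a small standalone sketch. The record is made-up sample data and load_json_lines is a hypothetical helper name. It assumes the file holds one JSON object per line, as the pipeline above writes it.

```python
import json

# Made-up sample record, only for illustration.
record = {'title': '整租', 'money': '2500'}

escaped = json.dumps(record)                       # default: non-ASCII becomes \uXXXX
readable = json.dumps(record, ensure_ascii=False)  # what the pipeline writes


def load_json_lines(text):
    """Parse text that holds one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == '__main__':
    print(escaped)
    print(readable)
    print(load_json_lines(readable + '\n' + readable + '\n'))
```

Both forms decode to the same data; the difference is only how readable item.json is when opened in an editor.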

settings.py: uncomment the ITEM_PIPELINES setting so that the pipeline is enabled.
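
After uncommenting, the setting would look roughly like the fragment below. This is a sketch that assumes the project package is rent_58 (as in the spider's import) and that the pipeline class lives in pipelines.py; the value 300 is the priority that Scrapy's generated settings template uses, and pipelines with lower numbers run first.

```python
# settings.py (fragment)
# Assumed dotted path: package rent_58, module pipelines, class Rent58Pipeline.
ITEM_PIPELINES = {
    'rent_58.pipelines.Rent58Pipeline': 300,
}
```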

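items.py
import scrapy


class Rent58Item(scrapy.Item):
    title = scrapy.Field()   # listing title
    link = scrapy.Field()    # URL of the listing page
    room = scrapy.Field()    # text of p.room
    info = scrapy.Field()    # link text inside p.add
    name = scrapy.Field()    # text of p.geren
    money = scrapy.Field()   # text of div.money
    time = scrapy.Field()    # text of div.sendTime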

