Without further ado, straight to the code.
1. Create the Scrapy project:
- cd F:\编程\Python\Scrapy
- scrapy startproject mkz
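For reference, scrapy startproject generates roughly this skeleton (the exact files vary a little between Scrapy versions); the two spiders below go under mkz\spiders\:
- mkz/
-     scrapy.cfg
-     mkz/
-         __init__.py
-         items.py
-         middlewares.py
-         pipelines.py
-         settings.py
-         spiders/
-             __init__.py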
2. Write the chapter spider, which collects every chapter's title and URL:
- # in spiders\chapters.py
- import scrapy
- class ChapterSpider(scrapy.Spider):
-     name = "chapter"
-     start_urls = ["https://www.mkzhan.com/211692/"]
-     def parse(self, response: scrapy.http.Response):
-         for chapter in response.css("a.j-chapter-link"):
-             # There are other tags nested around the title text,
-             # so //text() is needed to collect every text node
-             title = chapter.xpath("..//text()").extract()
-             if not title:  # extract() returns a list, never None
-                 self.log("No title found!")
-                 continue
-             # The extracted pieces include whitespace like "\n        ",
-             # so strip() each one first, then drop the empty strings
-             title = [t.strip() for t in title]
-             yield {
-                 "title": [t for t in title if t != ""][0],
-                 "url": response.urljoin(chapter.css("::attr(data-hreflink)").extract_first()),
-             }
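If the selectors ever stop matching (the site can change its markup at any time), they are easy to check interactively with scrapy shell; the a.j-chapter-link and data-hreflink names here are simply the ones used in the spider above:
- scrapy shell "https://www.mkzhan.com/211692/"
- >>> response.css("a.j-chapter-link")[:3]
- >>> response.css("a.j-chapter-link::attr(data-hreflink)").extract_first()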
3. Run the chapter spider and export the results to JSON:
- scrapy crawl chapter -o ch.json
An excerpt of the scraped ch.json:
- [
- {"title": "\u7b2c553\u8bdd \u6838\u5fc3\u533a2", "url": "https://www.mkzhan.com/211692/903557.html"},
- {"title": "\u7b2c552\u8bdd \u6838\u5fc3\u533a1", "url": "https://www.mkzhan.com/211692/903553.html"},
- {"title": "\u7b2c551\u8bdd \u9053\u6b492", "url": "https://www.mkzhan.com/211692/902374.html"},
- {"title": "\u7b2c550\u8bdd \u9053\u6b491", "url": "https://www.mkzhan.com/211692/902375.html"},
- {"title": "\u7b2c549\u8bdd \u82cf\u91922", "url": "https://www.mkzhan.com/211692/901306.html"},
- …
- ]
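The \uXXXX sequences are just ASCII-escaped Chinese (the first title decodes to 第553话 核心区2). If you would rather have readable UTF-8 in the output file, Scrapy's built-in FEED_EXPORT_ENCODING setting does it:
- # in mkz\settings.py
- FEED_EXPORT_ENCODING = "utf-8"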
4. Write the image spider, which reads ch.json and downloads every page of every chapter:
- # in spiders\images.py
- import scrapy
- import json
- import os
- class ImageSpider(scrapy.Spider):
-     name = "images"
-     def start_requests(self):
-         with open("ch.json", 'r') as f:
-             chapters_list = json.load(f)
-         for chapter in chapters_list:
-             # Carry the chapter title along in scrapy.Request's meta dict;
-             # see the Scrapy docs for the details of meta
-             yield scrapy.Request(chapter["url"], callback=self.parse, meta={"title": chapter["title"]})
-     def img_parse(self, response):
-         # Write the raw image bytes to the path chosen in parse()
-         with open(response.meta["path"], 'wb') as f:
-             f.write(response.body)
-     def parse(self, response):
-         title = response.meta["title"]
-         img_tags = response.xpath('//div[@class="rd-article__pic hide"]')
-         page_ids = []
-         image_urls = {}
-         for tag in img_tags:
-             # Sort the pages by their data-page_id attribute
-             page_id = int(tag.xpath('./@data-page_id').extract_first())
-             page_ids.append(page_id)
-             image_urls[page_id] = response.urljoin(tag.xpath('./img/@data-src').extract_first())
-         page_ids.sort()
-         # Create the chapter's folder; makedirs with exist_ok=True also
-         # creates images\ itself and ignores folders that already exist
-         os.makedirs(os.path.join("images", title), exist_ok=True)
-         for page_id in page_ids:
-             url = image_urls[page_id]
-             # File layout: images\<title>\<page_id>.jpg
-             # naming the files by page_id keeps the downloaded pages in order
-             yield scrapy.Request(url, callback=self.img_parse,
-                                  meta={"path": os.path.join("images", title, str(page_id) + ".jpg")})
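Run it the same way as the first spider, from the project root and after step 3 has produced ch.json:
- scrapy crawl images
One caveat (my own note, not from the original post): naming files str(page_id) + ".jpg" sorts lexicographically in most file browsers, so 10.jpg lands before 2.jpg; zero-padding the name, e.g. str(page_id).zfill(3) + ".jpg", would keep the pages in true reading order.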
5. Scraping results:
[screenshots of the downloaded images, attached in the original post]