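Scraping comics with Scrapy, Python discussion, Programming Languages section, FishC Forum

lhgzbxhz posted on 2020-7-8 15:59:58

Scraping comics with Scrapy

This post was last edited by lhgzbxhz on 2020-7-8 16:01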

Without further ado, here's the code.
1.
cd F:\编程\Python\Scrapy
scrapy startproject mkz

2.
# in spiders\chapters.py
import scrapy


class ChapterSpider(scrapy.Spider):
    name = "chapter"
    start_urls = ["https://www.mkzhan.com/211692/"]

    def parse(self, response: scrapy.http.Response):
        for chapter in response.css("a.j-chapter-link"):
            # The <a> tag contains other tags,
            # so use //text() to collect all of its text
            title = chapter.xpath(".//text()").extract()
            if not title:
                self.log("None!")
                continue
            # The extracted text includes pieces such as "\n" and " ",
            # so strip() each piece first, then keep those with t != ""
            title = [t.strip() for t in title]
            title = [t for t in title if t != ""]
            yield {
                # The original expression was lost here; joining the
                # remaining pieces with a space is an assumption
                "title": " ".join(title),
                "url": response.urljoin(chapter.css("::attr(data-hreflink)").extract_first()),
            }
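
The strip-and-filter step described in the comments above can be tried on its own, without Scrapy. This is only a rough sketch: `clean_title` is a hypothetical helper name, the sample list merely imitates what `.extract()` might return for one link, and joining the pieces with a space is my assumption.

```python
def clean_title(pieces):
    """Strip every text node and drop the ones that end up empty."""
    stripped = [p.strip() for p in pieces]
    return " ".join(p for p in stripped if p != "")


# Made-up sample imitating the text nodes found under one <a> tag
sample = ["\n      ", "第553话 核心区2", "\n   ", " "]
print(clean_title(sample))
```

If every piece is whitespace, the helper returns an empty string, which is easy to check for before yielding an item.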

3.
scrapy crawl chapter -o ch.json
An excerpt of the resulting ch.json:
[
{"title": "\u7b2c553\u8bdd \u6838\u5fc3\u533a2", "url": "https://www.mkzhan.com/211692/903557.html"},
{"title": "\u7b2c552\u8bdd \u6838\u5fc3\u533a1", "url": "https://www.mkzhan.com/211692/903553.html"},
{"title": "\u7b2c551\u8bdd \u9053\u6b492", "url": "https://www.mkzhan.com/211692/902374.html"},
{"title": "\u7b2c550\u8bdd \u9053\u6b491", "url": "https://www.mkzhan.com/211692/902375.html"},
{"title": "\u7b2c549\u8bdd \u82cf\u91922", "url": "https://www.mkzhan.com/211692/901306.html"},
…
]
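
The `\uXXXX` sequences are just JSON escapes for the Chinese titles, so nothing is lost. A quick check with the standard `json` module, using the first entry of the excerpt above (the variable names are mine):

```python
import json

# One entry copied from the ch.json excerpt; the raw string keeps the
# backslashes so that json.loads (not Python) decodes the escapes
line = r'{"title": "\u7b2c553\u8bdd \u6838\u5fc3\u533a2", "url": "https://www.mkzhan.com/211692/903557.html"}'

chapter = json.loads(line)
print(chapter["title"])  # 第553话 核心区2
```

If you would rather see readable text in the file itself, Scrapy's `FEED_EXPORT_ENCODING` setting can be set to "utf-8" in settings.py.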

4.
# in spiders\images.py
import json
import os

import scrapy


class ImageSpider(scrapy.Spider):
    name = "images"

    def start_requests(self):
        with open("ch.json", "r", encoding="utf-8") as f:
            chapters_list = json.load(f)
        for chapter in chapters_list:
            # This uses the meta attribute of the Request object to pass the
            # title along to the callback; search online for the details
            yield scrapy.Request(chapter["url"], callback=self.parse, meta={"title": chapter["title"]})

    def img_parse(self, response):
        # Write the downloaded image bytes to the path chosen in parse()
        with open(response.meta["path"], "wb") as f:
            f.write(response.body)

    def parse(self, response):
        title = response.meta["title"]
        img_tags = response.xpath('//div[@class="rd-article__pic hide"]')
        page_ids = []
        image_urls = {}
        for tag in img_tags:
            # Sort by data-page_id
            page_id = int(tag.xpath('./@data-page_id').extract_first())
            page_ids.append(page_id)
            image_urls[page_id] = response.urljoin(tag.xpath('./img/@data-src').extract_first())
        page_ids.sort()
        # Create the folder (nothing happens if it already exists)
        folder = os.path.join(".", "images", title)
        os.makedirs(folder, exist_ok=True)

        for page_id in page_ids:
            url = image_urls[page_id]
            # File name format: ".\images\%title%\%page_id%.jpg"
            # Naming files by page_id is what keeps the downloaded images in order
            yield scrapy.Request(url, callback=self.img_parse, meta={"path": os.path.join(folder, str(page_id) + ".jpg")})
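
The ordering step in parse can be tried without running the spider. A minimal sketch, assuming made-up page ids and URLs and a hypothetical helper name `plan_downloads`: keep a dict from page_id to URL, walk the ids in sorted order, and build one file path per page.

```python
import os


def plan_downloads(title, image_urls):
    """Return (url, path) pairs ordered by page id.

    image_urls maps page_id (int) -> image URL.
    """
    folder = os.path.join("images", title)
    return [
        (image_urls[page_id], os.path.join(folder, str(page_id) + ".jpg"))
        for page_id in sorted(image_urls)
    ]


# Made-up data: ids arrive out of order, as they may in the page source
pages = {
    3: "https://example.com/c.jpg",
    1: "https://example.com/a.jpg",
    2: "https://example.com/b.jpg",
}
for url, path in plan_downloads("demo", pages):
    print(path, "<-", url)
```

Keying the dict by the integer page id means sorting is numeric, so page 10 comes after page 2 when the requests are generated.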

5.
scrapy crawl images
Result:
https://img-blog.csdnimg.cn/20200707100315969.jpg
https://img-blog.csdnimg.cn/20200707100428764.png

lhgzbxhz posted on 2020-7-17 17:33:54

Why has nobody replied yet~

兢兢 posted on 2020-7-17 22:45:49

If you don't hide the content, the thread sinks fast.
