To avoid awkward chats, I scraped over a thousand memes with Python!

叼辣条闯世界 · posted on 2021-9-11 20:59:54


Let me first list the libraries this scraper needs:

import requests
from lxml import etree
import threading
from queue import Queue
import re


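Install whichever ones you are missing (requests and lxml are third-party packages; the rest ship with Python).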

Scraping target
The site we are scraping in this walkthrough is 斗图吧 (doutub). Its address is:

https://www.doutub.com/

Take a close look: wow, there are 26 pages of memes. Time to take off!

First, let's look at how the URL changes from one page to the next.

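# Page 1
https://www.doutub.com/img_lists/new/1

# Page 2
https://www.doutub.com/img_lists/new/2

# Page 3
https://www.doutub.com/img_lists/new/3

Seeing a pattern like this, how could you not be a little pleased? Only the trailing number changes, and it is simply the page number.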

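With the page URLs sorted out, the next thing to figure out is the URL of each individual meme.

Clever as you are, you will spot them easily in the page source, right? We just need to extract these links with XPath.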
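
To make the extraction step concrete, here is a minimal sketch of pulling (URL, name) pairs out of a list container. The sample markup is made up to mirror the structure that the XPath expressions in this post assume, and `extract_images` is a hypothetical helper name. It uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) instead of lxml so it runs without extra installs; the actual code in this post uses `lxml.etree` on the real page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup mirroring the layout the XPath in this post assumes.
SAMPLE = """
<html><body>
  <div class="expression-list clearfix">
    <div><a href="/img/1"><img src="https://example.com/a.jpg"/><span>first meme</span></a></div>
    <div><a href="/img/2"><img src="https://example.com/b.gif"/><span>second meme</span></a></div>
  </div>
</body></html>
"""


def extract_images(markup):
    """Return (url, name) pairs for every image inside the list container."""
    root = ET.fromstring(markup)
    pairs = []
    # Find the container first, then walk div -> a -> img/span inside it.
    for block in root.findall(".//div[@class='expression-list clearfix']"):
        for link in block.findall("./div/a"):
            img = link.find("img")
            span = link.find("span")
            if img is not None and span is not None:
                pairs.append((img.get("src"), span.text))
    return pairs


pairs = extract_images(SAMPLE)
for url, name in pairs:
    print(name, url)
```

Reading the URL and the name from the same `<a>` element keeps each pair aligned even if one entry happens to lack a caption.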

Implementing the producer
First, we create two queues: one stores the URL of each page, the other stores the image links.

The code looks like this:

page_queue = Queue()  # page URLs
img_queue = Queue()   # image URLs
for page in range(1, 27):
    url = f'https://www.doutub.com/img_lists/new/{page}'
    page_queue.put(url)


The code above puts the URL of every page into page_queue.

Next, we create a class that puts the image URLs into img_queue.

The code looks like this:

from queue import Empty  # raised by get_nowait() when the queue has no items


class ImageParse(threading.Thread):
    def __init__(self, page_queue, img_queue):
        super(ImageParse, self).__init__()
        self.page_queue = page_queue
        self.img_queue = img_queue
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
        }

    def run(self):
        while True:
            # get_nowait() avoids the race between an empty() check and a blocking get()
            try:
                url = self.page_queue.get_nowait()
            except Empty:
                break
            self.parse_img(url)

    def parse_img(self, url):
        # The timeout keeps a stalled request from hanging the thread
        response = requests.get(url, headers=self.headers, timeout=10).content.decode('utf-8')
        html = etree.HTML(response)
        # Container(s) that hold the memes listed on this page
        img_lists = html.xpath('//div[@class="expression-list clearfix"]')
        for img_list in img_lists:
            img_urls = img_list.xpath('./div/a/img/@src')
            img_names = img_list.xpath('./div/a/span/text()')
            # Pair every image URL with its caption and hand it to the consumers
            for img_url, img_name in zip(img_urls, img_names):
                self.img_queue.put((img_url, img_name))
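
One common way to shut down a producer/consumer pipeline without polling `empty()` is to push sentinel values through the queues. The sketch below is not the approach used in this post; it only illustrates the pattern with fake work in place of network requests, and the names `STOP`, `producer`, `consumer`, and `run_pipeline` are hypothetical.

```python
import threading
from queue import Queue

STOP = object()  # sentinel: tells a worker that no more items are coming


def producer(page_queue, img_queue):
    """Take pages until the sentinel arrives; each page yields two fake images."""
    while True:
        page = page_queue.get()
        if page is STOP:
            break
        for i in range(2):
            img_queue.put(f"{page}-img{i}")


def consumer(img_queue, results):
    """Take images until the sentinel arrives; 'downloading' is just recording."""
    while True:
        item = img_queue.get()
        if item is STOP:
            break
        results.append(item)  # list.append is safe to call from several threads in CPython


def run_pipeline(pages, n_producers=2, n_consumers=3):
    page_queue, img_queue, results = Queue(), Queue(), []
    for page in pages:
        page_queue.put(page)
    # One sentinel per producer, queued after all real pages.
    for _ in range(n_producers):
        page_queue.put(STOP)

    producers = [threading.Thread(target=producer, args=(page_queue, img_queue))
                 for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer, args=(img_queue, results))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()

    # Only after every producer has finished is it safe to tell consumers to stop.
    for t in producers:
        t.join()
    for _ in range(n_consumers):
        img_queue.put(STOP)
    for t in consumers:
        t.join()
    return results


results = run_pipeline(["p1", "p2", "p3"])
print(len(results))
```

Because every worker blocks on `get()` until it receives either real work or its own sentinel, no thread exits early and none is left waiting forever.
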

Implementing the consumer
The consumer is actually simple: it just keeps taking image URLs from img_queue and requesting them, and exits once both queues are empty.

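from queue import Empty  # raised by get(timeout=...) when nothing arrives in time


class DownLoad(threading.Thread):
    def __init__(self, page_queue, img_queue):
        super(DownLoad, self).__init__()
        self.page_queue = page_queue
        self.img_queue = img_queue
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
        }

    def run(self):
        while True:
            try:
                # Wait a few seconds instead of blocking forever on an empty queue
                img_url, filename = self.img_queue.get(timeout=5)
            except Empty:
                # No image arrived and no pages are left: assume the producers are done
                if self.page_queue.empty():
                    break
                continue
            fix = img_url.split('.')[-1]  # file extension taken from the URL
            # Strip punctuation that is awkward or illegal in file names
            name = re.sub(r'[?？.，。！!*\\/|]', '', filename)
            data = requests.get(img_url, headers=self.headers, timeout=10).content
            print('Downloading ' + filename)
            # The ../image/ directory must already exist
            with open('../image/' + name + '.' + fix, 'wb') as f:
                f.write(data)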
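
Taking the extension with `split('.')[-1]` works when the URL ends in something like `.jpg`, but it would misbehave if a URL carried a query string or had no extension at all. Below is a slightly more defensive sketch of the naming step; `build_filename` and its `default_ext` parameter are hypothetical names introduced for illustration, and the character set being stripped is an assumption rather than a rule from the site.

```python
import os
import re
from urllib.parse import urlparse


def build_filename(img_url, title, default_ext=".jpg"):
    """Combine a cleaned-up title with the extension taken from the URL path."""
    path = urlparse(img_url).path  # drops any ?query or #fragment part
    ext = os.path.splitext(path)[1] or default_ext
    # Remove characters Windows forbids in file names, common punctuation, and whitespace.
    name = re.sub(r'[\\/:*?"<>|？。，！!.,\s]+', '', title)
    return (name or 'untitled') + ext


print(build_filename("https://example.com/a/b.gif?x=1", "what? no!"))
print(build_filename("https://example.com/img", "???"))
```

Falling back to a placeholder name avoids writing a file called just `.jpg` when a caption consists only of punctuation.
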

Finally, start the two kinds of threads we created:

threads = []
for x in range(5):
    t1 = ImageParse(page_queue, img_queue)
    t1.start()
    t2 = DownLoad(page_queue, img_queue)
    t2.start()
    threads += [t1, t2]
# Wait for every thread, not just the last pair
for t in threads:
    t.join()


Final result

A total of 1269 images were scraped.

From now on, who can out-meme you? That's it? A scraper is all it takes!


暴躁老头 · posted on 2021-10-17 21:23:07

天天天蓝88 · posted on 2021-10-18 07:34:29

Bro's got this

伏惜寒 · posted on 2021-10-18 12:29:28

Learning

cccxx · posted on 2021-10-18 15:56:40

奇迹男子 · posted on 2021-10-18 19:00:03

lingoo1980 · posted on 2021-10-18 19:37:04

Notice: the author has been banned or deleted; content automatically hidden

ForGot_227 · posted on 2021-10-18 20:54:50

Learning, learning!

嘉岳呀 · posted on 2021-10-22 20:07:40

xiwenfei · posted on 2021-11-1 19:26:54

hornwong · posted on 2021-11-1 19:56:49

tianyamingyue · posted on 2021-11-1 20:16:14

Let me take a look

consummee · posted on 2021-11-2 11:48:17

Taking a look

N1ghtdev0 · posted on 2021-12-27 22:03:58

anheidarktp · posted on 2021-12-28 10:03:34

ThreeCat · posted on 2021-12-28 10:15:33

小单车 · posted on 2021-12-28 17:49:17

做最好的自己520 · posted on 2021-12-28 18:34:55

Still learning, advice welcome

fang232629 · posted on 2021-12-28 21:43:10

Learning, learning

crissqiang · posted on 2021-12-30 09:21:48

Having a look
