[Share] Scraping manga with a multithreaded crawler

六小鸭 · Posted on 2020-4-14 09:03:17

Last edited by 六小鸭 on 2020-4-14 09:11

This isn't my original work, and I can no longer find the original link.
Here's the code:

import requests
from urllib import parse
from bs4 import BeautifulSoup
import threading
import os
import sys

_name=input('Enter the manga you want to read: ')
try:
    os.mkdir('./{}'.format(_name))
except FileExistsError:
    print('A folder with that name already exists; the program cannot continue!')
    sys.exit()

# Search the site for the title
name_=parse.urlencode({'keyword':_name})
url='https://www.mkzhan.com/search/?{}'.format(name_)
html=requests.get(url=url)
content=html.text
soup=BeautifulSoup(content,'lxml')
list1=soup.select('div.common-comic-item')
names=[]
hrefs=[]
keywords=[]
for str1 in list1:
    names.append(str1.select('p.comic__title>a')[0].get_text())    # matched manga title
    hrefs.append(str1.select('p.comic__title>a')[0]['href'])       # manga URL path
    keywords.append(str1.select('p.comic-feature')[0].get_text())  # manga tagline
print('Matching results:')
for i in range(len(names)):
    print('【{}】-{} {}'.format(i+1,names[i],keywords[i]))
i=int(input('Enter the number of the manga you want to read: '))
print('You chose {}'.format(names[i-1]))

# Collect the chapter links and titles from the manga page
url1='https://www.mkzhan.com'+hrefs[i-1]  # manga page
html1=requests.get(url=url1)
content1=html1.text
soup1=BeautifulSoup(content1,'lxml')
str2=soup1.select('ul.chapter__list-box.clearfix.hide')[0]
list2=str2.select('li>a')
name1=[]
href1=[]
for str3 in list2:
    href1.append(str3['data-hreflink'])    # link to one chapter
    name1.append(str3.get_text().strip())  # chapter title, whitespace stripped

def Downlad(href1,path):
    # Download every image of one chapter into the folder `path`
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400'}
    url2='https://www.mkzhan.com'+href1
    html2=requests.get(url=url2,headers=headers)
    content2=html2.text
    soup2=BeautifulSoup(content2,'lxml')
    list_1=soup2.select('div.rd-article__pic.hide>img.lazy-read')  # all images in the chapter
    urls=[]
    for str_1 in list_1:
        urls.append(str_1['data-src'])
    for i in range(len(urls)):
        url=urls[i]
        content3=requests.get(url=url,headers=headers)
        with open(file=path+'/{}.jpg'.format(i+1),mode='wb') as f:
            f.write(content3.content)
    return True

lock=threading.Lock()  # guards the two shared lists below

def Main_Downlad(href1:list,name1:list):
    while True:
        # Check and pop both lists under one lock, so a link is never
        # paired with another chapter's title and no thread pops an empty list.
        with lock:
            if len(href1)==0:
                break
            href=href1.pop()
            name=name1.pop()
        try:
            path='./{}/{}'.format(_name,name)
            os.mkdir(path=path)
            if Downlad(href, path):
                print('Thread {} finished downloading chapter {}'.format(threading.current_thread().name,name))
        except Exception as e:
            # Report the failure instead of silently skipping the chapter
            print('Chapter {} failed: {}'.format(name,e))

threading_1=[]
for i in range(30):
    threading1=threading.Thread(target=Main_Downlad,args=(href1,name1,))
    threading1.start()
    threading_1.append(threading1)
for i in threading_1:
    i.join()
print('Current thread is {}'.format(threading.current_thread().name))


The downloads are really fast.
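The speed comes from downloading many chapters in parallel threads. Here is a minimal sketch of the same worker pattern built on `queue.Queue`, which hands each job to exactly one thread. `run_workers` and `handle` are hypothetical names that do not appear in the posted script, and the handler here is a stand-in that does no network I/O.

```python
import queue
import threading

def run_workers(jobs, handle, num_threads=4):
    """Process every item of `jobs` by calling handle(job) from worker threads.

    Returns a list of (job, exception) pairs for jobs whose handler raised.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    errors = []

    def worker():
        while True:
            try:
                # get_nowait() either returns one job or raises queue.Empty,
                # so there is no gap between "is it empty?" and "take one".
                job = pending.get_nowait()
            except queue.Empty:
                return
            try:
                handle(job)
            except Exception as exc:
                # Record the failure rather than silently dropping the job.
                errors.append((job, exc))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors

if __name__ == "__main__":
    # Stand-in jobs: (chapter link, chapter title) pairs, both made up.
    chapters = [("/ch/1", "Chapter 1"), ("/ch/2", "Chapter 2"), ("/ch/3", "Chapter 3")]
    finished = []
    failed = run_workers(chapters, lambda job: finished.append(job[1]), num_threads=2)
    print(sorted(finished), failed)
```

Keeping each link and title together as one tuple means a thread can never pick up a link with the wrong title.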
Remember to rate this post @一个账号 @编程鱼C

zedi · Posted on 2020-4-14 09:09:28

Not bad, this reminds me of myself back in the day.

编程鱼C · Posted on 2020-4-14 09:12:38

编程鱼C · Posted on 2020-4-14 09:13:47

Your reply is under review, so I can't see it.

乘号 · Posted on 2020-4-14 09:51:24

斗破苍穹???!!!

乘号 · Posted on 2020-4-14 10:31:20

Why does 斗破苍穹 stop at chapter 840?

六小鸭 · Posted on 2020-4-14 10:38:03

乘号 posted on 2020-4-14 10:31
Why does 斗破苍穹 stop at chapter 840?

It's probably finished?

六小鸭 · Posted on 2020-4-14 10:40:21

乘号 posted on 2020-4-14 10:31
Why does 斗破苍穹 stop at chapter 840?

There are 840 chapters in total.

乘号 · Posted on 2020-4-14 10:51:49

六小鸭 posted on 2020-4-14 10:40
There are 840 chapters in total.

No way, how is that possible?

六小鸭 · Posted on 2020-4-14 11:28:52

乘号 posted on 2020-4-14 10:51
No way, how is that possible?

Really.
That's what I got when I scraped 漫客栈 (mkzhan.com).

六小鸭 · Posted on 2020-4-14 11:29:51

乘号 posted on 2020-4-14 10:51
No way, how is that possible?

Why is nobody looking at such a good crawler?

丶小小少年 · Posted on 2020-4-15 17:21:47

Why can't I get it to work?

六小鸭 · Posted on 2020-4-15 17:24:03

丶小小少年 posted on 2020-4-15 17:21
Why can't I get it to work?

You haven't installed the third-party libraries.
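A quick way to see which libraries are missing, as a rough sketch: the script imports `requests` and `bs4`, and passes `'lxml'` to BeautifulSoup as the parser. `missing_packages` is a hypothetical helper, not part of the posted script.

```python
import importlib.util

# Import names the script depends on. BeautifulSoup is imported as "bs4".
REQUIRED = ["requests", "bs4", "lxml"]

def missing_packages(names):
    """Return the import names from `names` that Python cannot find."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

On pip, the three packages are named `requests`, `beautifulsoup4` and `lxml`.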

lpfight · Posted on 2020-6-18 15:18:52

Why is there a problem when I enter the manga number?
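The thread doesn't say what went wrong here. One possibility, going only by the posted code: the script passes the typed text straight to `int()` and uses the result as a list index, so non-numeric input raises `ValueError`, a number past the end raises `IndexError`, and `0` silently selects the last result. A minimal sketch of a stricter check follows; `parse_choice` is a hypothetical helper, not part of the posted script.

```python
def parse_choice(text, count):
    """Turn a 1-based menu choice into a 0-based index.

    Returns None when `text` is not an integer between 1 and `count`.
    """
    try:
        number = int(text.strip())
    except ValueError:
        return None
    if 1 <= number <= count:
        return number - 1
    return None

if __name__ == "__main__":
    # Stand-in for the search results; the real script fills `names` itself.
    names = ["Title A", "Title B", "Title C"]
    for typed in ["2", "0", "9", "abc"]:
        print(repr(typed), "->", parse_choice(typed, len(names)))
```

In the script, the prompt could be repeated until `parse_choice` returns something other than `None`.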

小小怪士兵 · Posted on 2020-10-7 19:36:48

Traceback (most recent call last):
  File "E:/Python小程序/漫画爬取.py", line 20, in <module>
    soup=BeautifulSoup(content,'lxml')
  File "D:\电脑软件\Python38\lib\site-packages\bs4\__init__.py", line 243, in __init__
    raise FeatureNotFound(
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Experts, how do I fix this?

nnzp · Posted on 2020-10-8 02:05:53

小小怪士兵 posted on 2020-10-7 19:36
Traceback (most recent call last):
File "E:/Python小程序/漫画爬取.py", line 20, in
soup=Bea ...

Install an HTML parser:

pip3 install lxml
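If installing lxml isn't an option, BeautifulSoup also accepts Python's built-in `'html.parser'`. Here is a small sketch that picks whichever is available. `pick_parser` is a hypothetical helper, and the script's selectors have not been checked against `html.parser`, which can treat malformed HTML differently from lxml.

```python
import importlib.util

def pick_parser():
    """Return "lxml" when it is installed, otherwise the built-in "html.parser"."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"

if __name__ == "__main__":
    # Intended use in the script: BeautifulSoup(content, pick_parser())
    print("Using parser:", pick_parser())
```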

