[Crawler] First time using multithreading, Python Discussion, Programming Languages, 鱼C论坛 (FishC Forum)

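昨非 posted on 2021-4-17 23:38:57

[Crawler] First time using multithreading

Today I spent two or three hours learning about multithreading.
I followed an example closely and managed to get something working.
Task: download the full text of the novel 赘婿 and save it as a txt file.
Note: this is a first attempt, and saving the whole novel into a single txt file produces a file big enough to choke Notepad.
For reference only.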
"""
First attempt at a multithreaded crawler:
crawl every chapter of the novel 赘婿, parse it, and save it as a txt file
Target url: http://www.xbiquge.la/0/885/
@author: 昨非
"""

from queue import Empty, Queue
from threading import Lock, Thread

import requests
from fake_useragent import UserAgent
from lxml import etree

headers = {
    "User-Agent": UserAgent().random
}

# Serializes appends to the output file across parser threads
write_lock = Lock()


# Crawler class
class GetInfo(Thread):
    def __init__(self, url_queue, html_queue):
        super().__init__()
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        while True:
            # get_nowait() avoids the hang that empty() followed by a blocking
            # get() can cause when another thread takes the last item in between
            try:
                url = self.url_queue.get_nowait()
            except Empty:
                break
            try:
                response = requests.get(url, headers=headers, timeout=10)
            except requests.RequestException as err:
                print(f'skipped {url}: {err}')
                continue
            if response.status_code == 200:
                response.encoding = 'utf-8'  # this step is crucial
                self.html_queue.put(response.text)


# Parser class
class ParseInfo(Thread):
    def __init__(self, html_queue):
        super().__init__()
        self.html_queue = html_queue

    def run(self):
        while True:
            try:
                html = self.html_queue.get_nowait()
            except Empty:
                break
            e = etree.HTML(html)
            chapter_names = e.xpath('//div[@class = "bookname"]/h1/text()')
            chapter_contents = e.xpath('//div[@id = "content"]/text()')
            # Drop non-breaking spaces and blank lines, one paragraph per line
            lines = [s.replace('\xa0', '').strip() for s in chapter_contents]
            txt = '\n'.join(line for line in lines if line)
            for chapter_name in chapter_names:
                # Chapters are appended in whatever order the threads finish
                with write_lock, open('赘婿.txt', 'a', encoding='utf-8') as f:
                    f.write(chapter_name + '\n' + txt + '\n')


if __name__ == '__main__':
    # Container for urls
    url_queue = Queue()
    # Container for page content
    html_queue = Queue()

    first_url = 'http://www.xbiquge.la/0/885/'
    response = requests.get(first_url, headers=headers, timeout=10)
    e = etree.HTML(response.content.decode('utf-8'))  # decode the bytes to a string
    urls = e.xpath('//div[@class="box_con"]/div[@id="list"]/dl/dd/a/@href')
    for url in urls:
        chapter_url = 'http://www.xbiquge.la' + url
        url_queue.put(chapter_url)

    # Create the crawler threads
    crawl_list = []
    for i in range(0, 100):
        crawl1 = GetInfo(url_queue, html_queue)
        crawl_list.append(crawl1)
        crawl1.start()

    for crawl in crawl_list:
        crawl.join()

    # Create the parser threads
    parse_list = []
    for i in range(0, 100):
        parse = ParseInfo(html_queue)
        parse_list.append(parse)
        parse.start()
    for parse in parse_list:
        parse.join()

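So here is the problem: this is also my first time using queues, so how do I deal with the chapters coming out of order?
(This is how the example I copied did it, and this is as far as I could take it. Pointers are welcome; experts, please go easy on me.)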

昨非 posted on 2021-4-21 17:38:36

The ordering problem is solved with a priority queue; the updated code is below:
"""
First attempt at a multithreaded crawler:
crawl every chapter of the novel 赘婿, parse it, and save it as a txt file
Target url: http://www.xbiquge.la/0/885/
@author: 昨非
"""

from queue import Empty, PriorityQueue
from threading import Thread

import requests
from fake_useragent import UserAgent
from lxml import etree

headers = {
    "User-Agent": UserAgent().random
}


# Crawler class
class GetInfo(Thread):
    def __init__(self, url_queue, html_queue):
        super().__init__()
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        while True:
            try:
                # Each item is an (index, url) pair
                index, url = self.url_queue.get_nowait()
            except Empty:
                break
            try:
                response = requests.get(url, headers=headers, timeout=10)
            except requests.RequestException as err:
                print(f'skipped {url}: {err}')
                continue
            if response.status_code == 200:
                response.encoding = 'utf-8'  # this step is crucial
                # Keep the index so the parser can restore chapter order
                self.html_queue.put((index, response.text))


# Parser class
class ParseInfo(Thread):
    def __init__(self, html_queue):
        super().__init__()
        self.html_queue = html_queue

    def run(self):
        while True:
            try:
                # The priority queue hands back the lowest index first
                index, html = self.html_queue.get_nowait()
            except Empty:
                break
            e = etree.HTML(html)
            chapter_names = e.xpath('//div[@class = "bookname"]/h1/text()')
            chapter_contents = e.xpath('//div[@id = "content"]/text()')
            # Drop non-breaking spaces and blank lines, one paragraph per line
            lines = [s.replace('\xa0', '').strip() for s in chapter_contents]
            txt = '\n'.join(line for line in lines if line)
            for chapter_name in chapter_names:
                with open('赘婿.txt', 'a', encoding='utf-8') as f:
                    f.write(chapter_name + '\n' + txt + '\n')


if __name__ == '__main__':
    # Container for (index, url) pairs
    url_queue = PriorityQueue()
    # Container for (index, html) pairs
    html_queue = PriorityQueue()

    first_url = 'http://www.xbiquge.la/0/885/'
    response = requests.get(first_url, headers=headers, timeout=10)
    e = etree.HTML(response.content.decode('utf-8'))  # decode the bytes to a string
    urls = e.xpath('//div[@class="box_con"]/div[@id="list"]/dl/dd/a/@href')
    for i, url in enumerate(urls):
        chapter_url = 'http://www.xbiquge.la' + url
        url_queue.put((i, chapter_url))

    # Create the crawler threads
    crawl_list = []
    for i in range(0, 100):
        crawl1 = GetInfo(url_queue, html_queue)
        crawl_list.append(crawl1)
        crawl1.start()

    for crawl in crawl_list:
        crawl.join()

    # Every page has been downloaded by now. Use a single parser thread:
    # with several parsers the file writes would race and the chapters
    # could end up out of order again.
    parse = ParseInfo(html_queue)
    parse.start()
    parse.join()

昨非 posted on 2021-4-17 23:51:33

The experience was decent.

sinaop posted on 2021-4-18 07:56:05

[emoticon]

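wp231957 posted on 2021-4-18 09:12:36

Is a novel still readable when its chapters are out of order?

昨非 posted on 2021-4-18 09:32:21

wp231957 posted on 2021-4-18 09:12
Is a novel still readable when its chapters are out of order?

No, it isn't. I was just trying things out.
A quick search suggests that out-of-order results are a classic multithreading problem.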

昨非 posted on 2021-4-18 14:02:35

Found a way forward:
use a priority queue and attach a sequence number to each item.
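
A minimal sketch of the sequence-number idea, not the poster's final code. It assumes a hypothetical `fake_download` helper that stands in for `requests.get(url).text`, so it runs without network access. Each URL is tagged with its index, worker threads finish in arbitrary order, and a `PriorityQueue` returns the results lowest index first once all workers have joined.

```python
import random
import threading
import time
from queue import Empty, PriorityQueue, Queue


def fake_download(url):
    """Hypothetical stand-in for requests.get(url).text."""
    time.sleep(random.uniform(0, 0.01))  # simulate uneven network latency
    return f"<html>{url}</html>"


def worker(url_queue, html_queue):
    while True:
        try:
            index, url = url_queue.get_nowait()
        except Empty:
            return  # no work left
        # Keep the index next to the page so order can be restored later
        html_queue.put((index, fake_download(url)))


def crawl_in_order(urls, n_threads=8):
    url_queue = Queue()
    for index, url in enumerate(urls):
        url_queue.put((index, url))

    html_queue = PriorityQueue()
    threads = [
        threading.Thread(target=worker, args=(url_queue, html_queue))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Only this thread touches the queue now, so empty() is reliable here
    ordered = []
    while not html_queue.empty():
        index, html = html_queue.get()
        ordered.append(html)
    return ordered


pages = crawl_in_order([f"chapter-{n}" for n in range(20)])
```

The indices are unique, so the tuples are always ordered by index and the page text is never compared. The ordering only holds because a single consumer drains the queue after every worker has finished.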

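Daniel_Zhang posted on 2021-4-20 00:17:22

Last edited by Daniel_Zhang on 2021-4-20 00:19

I have an idea: say you have 1000 chapters. Split them into 50 batches of 20 threads each, then run a for loop 50 times.

Start 20 threads on each pass.

Save each chapter as its own txt file.

Use a thread lock so that nothing further runs until all 20 threads have finished.

If you want the txt files merged, write a separate function that merges them once the downloads are done, ordered by file name or something similar.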
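
The batch idea above could be sketched roughly as follows. This is an illustration under assumptions rather than a tested crawler: `fake_download` is a hypothetical stand-in for fetching and parsing one chapter, the file naming scheme is invented for the example, and `Thread.join()` is used in place of an explicit lock to wait for each batch.

```python
import tempfile
import threading
from pathlib import Path


def fake_download(index):
    """Hypothetical stand-in for fetching and parsing chapter `index`."""
    return f"Chapter {index}\ntext of chapter {index}\n"


def save_chapter(chapter_dir, index):
    # Zero-padded names make alphabetical order match chapter order
    path = chapter_dir / f"{index:05d}.txt"
    path.write_text(fake_download(index), encoding="utf-8")


def download_in_batches(chapter_dir, n_chapters, batch_size=20):
    for start in range(0, n_chapters, batch_size):
        stop = min(start + batch_size, n_chapters)
        batch = [
            threading.Thread(target=save_chapter, args=(chapter_dir, i))
            for i in range(start, stop)
        ]
        for t in batch:
            t.start()
        for t in batch:
            t.join()  # wait for the whole batch before starting the next


def merge_chapters(chapter_dir, merged_path):
    with open(merged_path, "w", encoding="utf-8") as merged:
        for path in sorted(chapter_dir.glob("*.txt")):
            merged.write(path.read_text(encoding="utf-8"))


work_dir = Path(tempfile.mkdtemp())
chapter_dir = work_dir / "chapters"
chapter_dir.mkdir()
download_in_batches(chapter_dir, n_chapters=45, batch_size=20)
merged_path = work_dir / "novel.txt"
merge_chapters(chapter_dir, merged_path)
```

Because every thread writes its own file, the threads never contend for one output file, and the final order depends only on the file names.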

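昨非 posted on 2021-4-20 00:23:57

Daniel_Zhang posted on 2021-4-20 00:17
I have an idea: say you have 1000 chapters. Split them into 50 batches of 20 threads each, then run a for loop 50 times.

Each time st ...

Hmm, it's late and I don't feel like thinking. I'll come back to this later.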

小小の白 posted on 2021-4-21 19:53:15

[emoticon]

°蓝鲤歌蓝 posted on 2021-4-24 21:05:58

Use the map function from concurrent.futures, or use coroutines.
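
A minimal sketch of the `map` suggestion, assuming it refers to `ThreadPoolExecutor.map` in the standard library's `concurrent.futures` module. `fake_download` is a hypothetical stand-in for the real page request. `Executor.map` yields results in the order of its inputs even when the calls finish out of order, so no index bookkeeping is needed.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Hypothetical stand-in for requests.get(url).text."""
    time.sleep(random.uniform(0, 0.01))  # simulate uneven network latency
    return f"<html>{url}</html>"


urls = [f"chapter-{n}" for n in range(20)]

# Up to 8 downloads run at once; results come back in input order
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fake_download, urls))
```

The pages can then be parsed and written in a plain loop on the main thread, which keeps the chapters in order.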

昨非 posted on 2021-4-24 21:10:17

°蓝鲤歌蓝 posted on 2021-4-24 21:05
Use the map function from concurrent.futures, or use coroutines.

I don't follow, let me look it up first.
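
For the coroutine option mentioned in the reply quoted above, here is a rough sketch using the standard library's `asyncio`. It makes no real requests: `fake_download` is a hypothetical coroutine that only sleeps, and a real crawler would need an async HTTP client, because `requests` blocks the event loop. `asyncio.gather` returns its results in the order the awaitables were passed in.

```python
import asyncio
import random


async def fake_download(url):
    """Hypothetical stand-in for an async HTTP request."""
    await asyncio.sleep(random.uniform(0, 0.01))  # simulate network latency
    return f"<html>{url}</html>"


async def crawl(urls, limit=8):
    # The semaphore caps how many downloads are in flight at once
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fake_download(url)

    # gather() keeps the results in the same order as the inputs
    return await asyncio.gather(*(bounded(url) for url in urls))


pages = asyncio.run(crawl([f"chapter-{n}" for n in range(20)]))
```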
