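[Solved] File-write problem in an async crawler

specail · posted 2021-12-14 11:38:44

I think the async part itself is written correctly. The files are now being created, but the write step seems to be where it goes wrong: nothing is written to the created files, so they stay empty, and an error is raised at write time. Any pointers would be appreciated. Source code attached.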

import requests
from bs4 import BeautifulSoup
import aiohttp
import aiofiles
import asyncio
import bs4
import re
from lxml import html

def getHtml(url, headers):
    try:
        # headers must be passed by keyword; the second positional argument is params
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        print("Failed to fetch the page!", e)
        return ""

def getChater(html):
    domain = "https://www.bbiquge.net/book_84680/"
    resultls = {}
    gcbs = BeautifulSoup(html, "html.parser")
    for dd in gcbs.find('div', class_="zjbox").dl:  # tag that holds the chapter list
        if isinstance(dd, bs4.element.Tag):
            if dd.name == "dd" and dd.string is not None:
                #print(dd.a.string,dd.a.get("href"))
                resultls[dd.a.string] = domain + dd.a.get("href")  # chapter title -> link
    return resultls

async def getContent(url, filename):  # called with one entry of the title -> link dict
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            html = await resp.text()
            # bs = BeautifulSoup(html,"html.parser").find("div",id="content")
            # content = bs.find("div",id="content")
            # BUG (the subject of this thread): the built-in open() is not an async context manager
            async with open(filename+".txt", "w", encoding="utf-8") as f:
                await f.write(BeautifulSoup(html,"html.parser").find("div",id="content").text.replace('\xa0'*4,'\n\n '))  # clean up the spacing characters and tidy the layout
            print("done!"+filename)

async def main():
    url = "https://www.bbiquge.net/book_84680/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36 Edg/96.0.1054.53"
    }
    html = getHtml(url, headers)
    # get the chapter titles and links
    resultls = getChater(html)
    # download the chapter contents asynchronously
    tasks = []
    for item in resultls.items():
        tasks.append(asyncio.create_task(getContent(item[1], item[0])))
    await asyncio.wait(tasks)

if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(main())


Best answer


wp231957

2021-12-14 11:38:45

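specail posted on 2021-12-15 08:46

Python is synchronous by nature and isn't well suited to asynchronous work (I don't really understand async either).

So I'm curious about your motivation: is this just for practice? If not, a plain synchronous crawl would do the job.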
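
For context on the sync-versus-async question: asyncio does not make Python run code in parallel, but it does let many I/O waits overlap in a single thread, and waiting on the network is where a crawler spends most of its time. A rough sketch of the difference, using `asyncio.sleep` as a stand-in for network latency (the function names and the 0.2 s delay are made up for illustration, not taken from the thread):

```python
import asyncio
import time


async def fake_fetch(chapter: int, delay: float = 0.2) -> str:
    """Stand-in for one network request: waits, then returns a string."""
    await asyncio.sleep(delay)
    return f"chapter {chapter}"


async def fetch_sequentially(n: int) -> list[str]:
    # Each wait has to finish before the next one starts.
    return [await fake_fetch(i) for i in range(n)]


async def fetch_concurrently(n: int) -> list[str]:
    # All waits overlap; gather returns results in argument order.
    return await asyncio.gather(*(fake_fetch(i) for i in range(n)))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and report how long it took."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, t_seq = timed(fetch_sequentially(5))
    _, t_con = timed(fetch_concurrently(5))
    print(f"sequential: {t_seq:.2f}s, concurrent: {t_con:.2f}s")
```

The sequential version takes roughly the sum of the delays, the concurrent one roughly the longest single delay. For a handful of pages a synchronous crawl is indeed fine; the gap only matters as the chapter count grows.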


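specail · posted 2021-12-16 11:59:30

Switching the open call to aiofiles.open fixed it.
A new problem, though: further into the crawl there seems to be an encoding issue and some pages can't be decoded. I suspect the aiohttp library.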
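
The encoding error in later chapters is not necessarily an aiohttp bug. One possible cause (an assumption, since the thread never shows the traceback) is that when the server does not declare a charset, `resp.text()` has to fall back on a guess or a default, and that can be wrong for some pages. Note that the synchronous `getHtml` above sidesteps this with `r.apparent_encoding`. A minimal sketch of a workaround is to read the raw bytes and decode them with explicit fallbacks. The helper name `decode_page` and the candidate list `("utf-8", "gb18030")` are illustrative choices, not something confirmed for this site:

```python
def decode_page(raw: bytes, encodings: tuple[str, ...] = ("utf-8", "gb18030")) -> str:
    """Try each candidate encoding strictly, then fall back to a lossy decode."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going and substitute a replacement character
    # for every byte sequence that cannot be decoded.
    return raw.decode(encodings[-1], errors="replace")


if __name__ == "__main__":
    gbk_bytes = "第一章".encode("gbk")
    print(decode_page(gbk_bytes))
```

In `getContent` this would mean replacing `await resp.text()` with `decode_page(await resp.read())`. If one known encoding is enough, aiohttp's `resp.text()` also accepts `encoding=` and `errors=` arguments.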

specail · posted 2021-12-14 15:29:05

Bump.

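specail · posted 2021-12-15 08:46:01

specail · posted 2021-12-15 10:07:06

wp231957 posted on 2021-12-15 10:04
Python is synchronous by nature and isn't well suited to asynchronous work (I don't really understand async either).

So I'm curious about your motivation: is this just for practice ...

Async crawling is more efficient, and in many cases you need to combine coroutines with multithreading. I found the problem myself: I had imported the package and then forgot to use it.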

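z5560636 · posted 2021-12-15 10:12:39

specail posted on 2021-12-15 10:07
Async crawling is more efficient, and in many cases you need to combine coroutines with multithreading. I found the problem myself: I had imported the package and then forgot to use it ...

I had assumed the problem was asynchronous writes all going to the same file. I was trying to recall some material I read a while ago: Python currently doesn't support file locks.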
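
On the same-file scenario: the posted code writes one file per chapter, so it does not arise here. If all chapters did go into one file, coroutines running in a single event loop could be serialized with `asyncio.Lock`, with no OS-level file lock involved. (The standard library does expose platform-specific file locking, `fcntl.flock` on Unix and `msvcrt.locking` on Windows, but no cross-platform wrapper.) A minimal sketch, with hypothetical names `append_chapter` and `write_book` and `asyncio.sleep` standing in for the download:

```python
import asyncio
import os
import tempfile


async def append_chapter(path: str, lock: asyncio.Lock, title: str, body: str) -> None:
    await asyncio.sleep(0.01)  # stand-in for downloading the chapter
    async with lock:  # only one coroutine at a time gets past this line
        with open(path, "a", encoding="utf-8") as f:
            f.write(f"# {title}\n")
            await asyncio.sleep(0)  # yields to the loop; the others still wait on the lock
            f.write(body + "\n")


async def write_book(path: str, chapters: dict[str, str]) -> None:
    lock = asyncio.Lock()
    await asyncio.gather(
        *(append_chapter(path, lock, t, b) for t, b in chapters.items())
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        book = os.path.join(folder, "book.txt")
        asyncio.run(write_book(book, {"one": "first body", "two": "second body"}))
        with open(book, encoding="utf-8") as f:
            print(f.read())
```

The lock keeps each title next to its body, but chapters land in whatever order their downloads finish, so a real crawler would still need to collect the results and write them in chapter order.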
