[Solved] A small problem with my web scraper

代码小白liu · Posted 2021-8-1 20:19:08

The source code is as follows:

import requests
from lxml import etree
import os
url = "https://www.ibswtan.com/0/425/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}
response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding
html_text = response.text
html_tree = etree.HTML(html_text)
div_list = html_tree.xpath('//div[@id="list"]/dl/dd')
url_list = []
if not os.path.exists("./斗破苍穹"):
    os.mkdir("./斗破苍穹")
for dd in div_list:
    new_url = dd.xpath("./a/@href")[0]
    url_list.append(new_url)
for i in url_list:
    download_url = "https://www.ibswtan.com/0/425/" + i
    # print(download_url)
    r = requests.get(download_url, headers=headers)
    r.encoding = r.apparent_encoding
    detal_html = r.text
    detal_tree = etree.HTML(detal_html)
    book_name = detal_tree.xpath('//div[@class="bookname"]/h1/text()')[0]
    text = detal_tree.xpath('//div[@id="content"]/text()')[0]
    file_path = "./斗破苍穹/" + book_name + ".txt"
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(book_name, '下载完毕')

Running the scraper then raises an error.

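At the point where it fails, I checked the page titled 【土豆强力推荐——《》网游】, and its title markup is the same as on the other pages. Scraping that page on its own does return data, so why do I get an index out of range error?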

Best answer

大马强

2021-8-1 22:45:53

import requests
from lxml import etree
import os
import time
url = "https://www.ibswtan.com/0/425/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}
# Fetch the table of contents and collect the relative chapter links
response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding
html_text = response.text
html_tree = etree.HTML(html_text)
div_list = html_tree.xpath('//div[@id="list"]/dl/dd')
url_list = []
# Create the output directory only if it does not exist yet
if not os.path.exists("./斗破苍穹"):
    os.mkdir("./斗破苍穹")
for dd in div_list:
    new_url = dd.xpath("./a/@href")[0]
    url_list.append(new_url)
for i in url_list:
    download_url = "https://www.ibswtan.com/0/425/" + i
    print(download_url)
    time.sleep(1)  # pausing for 1 second between requests is enough
    r = requests.get(download_url, headers=headers)
    r.encoding = r.apparent_encoding
    detal_html = r.text
    detal_tree = etree.HTML(detal_html)
    book_name = detal_tree.xpath(
        '//div[@class="bookname"]/h1/text()')[0]
    text = detal_tree.xpath('//div[@id="content"]/text()')[0]
    file_path = "./斗破苍穹/" + book_name + ".txt"
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(book_name, '下载完毕')

The requests are probably being sent too frequently. Sleeping for 1 second between them should fix it.
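If a fixed one-second pause is not always enough, one possible variation is to retry a page whose parse comes back empty, waiting a little longer each time. The sketch below is only an illustration under assumptions: `fetch_until_parsed`, `fetch`, `parse`, and the `sleep` parameter are hypothetical names that do not appear in the thread, and whether the site responds well to retries is not confirmed.

```python
import time


def fetch_until_parsed(fetch, parse, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch() then parse() until parse returns a non-empty list.

    fetch: hypothetical zero-argument callable returning the page HTML
    parse: hypothetical callable mapping that HTML to a list of matches
    sleep: injectable so the waiting can be replaced in tests
    """
    for attempt in range(retries):
        matches = parse(fetch())
        if matches:
            return matches
        # Back off before the next attempt: delay, 2 * delay, 4 * delay, ...
        if attempt < retries - 1:
            sleep(delay * 2 ** attempt)
    # Every attempt produced an empty list; let the caller decide what to do
    return []
```

In the loop from the thread, `fetch` could wrap `requests.get(download_url, headers=headers).text` and `parse` could wrap the `bookname` XPath query, so that a chapter is skipped or logged when the result is still empty instead of crashing on `[0]`.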

wp231957 · Posted 2021-8-1 21:43:05

If even [0] is out of range, the list is clearly empty.
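To make this concrete: an XPath query that selects nodes returns a list, and indexing `[0]` on an empty list raises `IndexError`. A minimal sketch follows, in which plain Python lists stand in for `xpath()` results and `first_or_none` is a hypothetical helper, not something from the original code.

```python
def first_or_none(matches):
    """Return the first match, or None when the list is empty."""
    return matches[0] if matches else None


# Plain lists stand in for the results of tree.xpath(...)
found = ["Chapter 1"]
missing = []

# Indexing the empty list reproduces the error from the question
try:
    missing[0]
except IndexError as exc:
    error_message = str(exc)

title = first_or_none(found)
no_title = first_or_none(missing)
```

Checking for `None` before building the file name turns a crash into a case the loop can handle, for example by skipping or retrying that chapter.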

大马强 · Posted 2021-8-1 22:23:03

I changed your code a little. It runs without errors, but some chapters still cannot be scraped.

大马强 · Posted 2021-8-1 22:24:25

Every run ends up missing a few chapters.

大马强 · Posted 2021-8-1 22:42:55

I figured it out: it looks like you are being blocked by the site's anti-scraping measures.
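One way to confirm a suspicion like this is to inspect the response before parsing it. The sketch below is an assumption-laden illustration: `looks_blocked` is a hypothetical helper, and the thread does not say how this site signals a blocked request, so both the status-code check and the `id="content"` marker are guesses based on the XPath used in the question.

```python
def looks_blocked(status_code, html, marker='id="content"'):
    """Heuristic: treat a response as blocked if it is not a 200
    or if the element the scraper expects is absent from the HTML."""
    if status_code != 200:
        return True
    return marker not in html


# Hand-written values stand in for r.status_code and r.text
blocked_by_status = looks_blocked(403, "")
blocked_by_body = looks_blocked(200, "<html><body>Too many requests</body></html>")
normal_page = looks_blocked(200, '<div id="content">chapter text</div>')
```

With `requests`, the inputs would be `r.status_code` and `r.text`. Printing the first few hundred characters of `r.text` for a failing chapter is a quick manual version of the same check.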

suchocolate · Posted 2021-8-1 22:46:22

It runs and produces output normally on my end.
