代码小白liu posted on 2021-8-1 20:19:08

A small problem with my web scraper

My source code:

import requests
from lxml import etree
import os

url = "https://www.ibswtan.com/0/425/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}

response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding
html_text = response.text
html_tree = etree.HTML(html_text)
div_list = html_tree.xpath('//div[@id="list"]/dl/dd')
url_list = []
if not os.path.exists("./斗破苍穹"):
    os.mkdir("./斗破苍穹")
for dd in div_list:
    new_url = dd.xpath("./a/@href")[0]
    url_list.append(new_url)

for i in url_list:
    download_url = "https://www.ibswtan.com/0/425/" + i
    # print(download_url)
    r = requests.get(download_url, headers=headers)
    r.encoding = r.apparent_encoding
    detal_html = r.text
    detal_tree = etree.HTML(detal_html)
    book_name = detal_tree.xpath('//div[@class="bookname"]/h1/text()')[0]
    text = detal_tree.xpath('//div[@id="content"]/text()')
    file_path = "./斗破苍穹/" + book_name + ".txt"
    with open(file_path, "w", encoding="utf-8") as f:
      f.write("\n".join(text))
      print(book_name,'下载完毕')




The error after running the scraper:
四章完毕 下载完毕
萧炎,我们的主角的图片哦~ 下载完毕
Traceback (most recent call last):
File "E:/推土机/py-lzy/Requests-bs4-xpath/爬虫-笔趣阁-斗破苍穹.py", line 29, in <module>
    book_name = detal_tree.xpath('//div[@class="bookname"]/h1/text()')[0]
IndexError: list index out of range

At the point where it failed, I checked the page 【土豆强力推荐——《》网游】 and its title markup looks the same as the other chapters; scraping that page on its own does return data, so why is the index out of range?
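One way to see why that page fails is to dump the raw response whenever the expected markup is missing, instead of indexing blindly. A minimal debugging sketch; the HTML string here is a made-up stand-in for whatever the server actually returned (e.g. an anti-bot page):

```python
from lxml import etree

# stand-in for a response that lacks the expected chapter markup,
# e.g. an anti-bot page returned instead of the real chapter
html = "<html><body><p>Access denied</p></body></html>"

tree = etree.HTML(html)
book_name = tree.xpath('//div[@class="bookname"]/h1/text()')

if not book_name:
    # the expected <div class="bookname"> is missing: inspect the raw page
    print("empty xpath result, page starts with:", html[:40])
```

Printing the start of the page usually makes it obvious whether the site served the chapter or a block page.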

wp231957 posted on 2021-8-1 21:43:05

If even the first index is out of range, it is obviously an empty list.

大马强 posted on 2021-8-1 22:23:03

I changed your code a little and it runs fine, but some chapters just cannot be scraped.

大马强 posted on 2021-8-1 22:24:25

https://static01.imgkr.com/temp/7dcb822078634ffc80df8a8b81e8d71a.jpg
It always skips a few chapters.

大马强 posted on 2021-8-1 22:42:55

I figured it out: you seem to be getting blocked by the site's anti-scraping measures.

大马强 posted on 2021-8-1 22:45:53

https://static01.imgkr.com/temp/59d1d460f1bb48d39116b26b3640fe34.jpg
import requests
from lxml import etree
import os
import time


url = "https://www.ibswtan.com/0/425/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}

response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding
html_text = response.text
html_tree = etree.HTML(html_text)
div_list = html_tree.xpath('//div[@id="list"]/dl/dd')
url_list = []
if not os.path.exists("./斗破苍穹"):
    os.mkdir("./斗破苍穹")

for dd in div_list:
    new_url = dd.xpath("./a/@href")[0]
    url_list.append(new_url)

for i in url_list:
    download_url = "https://www.ibswtan.com/0/425/" + i
    print(download_url)
    time.sleep(1)  # sleeping 1 second between requests is enough
    r = requests.get(download_url, headers=headers)
    r.encoding = r.apparent_encoding
    detal_html = r.text
    detal_tree = etree.HTML(detal_html)

    book_name = detal_tree.xpath('//div[@class="bookname"]/h1/text()')[0]
    text = detal_tree.xpath('//div[@id="content"]/text()')
    file_path = "./斗破苍穹/" + book_name + ".txt"
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("\n".join(text))
        print(book_name, '下载完毕')

The requests were probably too frequent; sleeping 1 second between them should fix it.
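A slightly more robust variant of the same idea is to retry a failed request with an increasing delay rather than relying on a single fixed sleep. A sketch with a hypothetical helper `fetch_with_retry`; the retry count and delays are arbitrary choices, not values from the thread:

```python
import time
import requests

def fetch_with_retry(url, headers, retries=3, delay=1.0):
    """Hypothetical helper: retry a GET request, backing off a little
    more on each attempt so the server is less likely to block us."""
    for attempt in range(retries):
        r = requests.get(url, headers=headers, timeout=10)
        r.encoding = r.apparent_encoding
        if r.status_code == 200:
            return r.text
        # failed attempt: wait longer each time before retrying
        time.sleep(delay * (attempt + 1))
    return None  # give up after the last attempt
```

Each chapter download in the loop above could then call `fetch_with_retry(download_url, headers)` and skip the chapter (or log it) when the result is `None`, instead of crashing on the anti-bot page.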

suchocolate posted on 2021-8-1 22:46:22

It runs and produces normal output on my machine.