The source code is as follows:
import requests
from lxml import etree
import os

url = "https://www.ibswtan.com/0/425/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}

# Fetch the table of contents and collect the chapter links.
response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding
html_text = response.text
html_tree = etree.HTML(html_text)
div_list = html_tree.xpath('//div[@id="list"]/dl/dd')
url_list = []
if not os.path.exists("./斗破苍穹"):
    os.mkdir("./斗破苍穹")
for dd in div_list:
    new_url = dd.xpath("./a/@href")[0]
    url_list.append(new_url)

# Download each chapter and save it to its own text file.
for i in url_list:
    download_url = "https://www.ibswtan.com/0/425/" + i
    # print(download_url)
    r = requests.get(download_url, headers=headers)
    r.encoding = r.apparent_encoding
    detal_html = r.text
    detal_tree = etree.HTML(detal_html)
    book_name = detal_tree.xpath('//div[@class="bookname"]/h1/text()')[0]
    text = detal_tree.xpath('//div[@id="content"]/text()')[0]
    file_path = "./斗破苍穹/" + book_name + ".txt"
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(book_name, '下载完毕')
Running the crawler fails with the following error (下载完毕 means "download finished"):
四章完毕 下载完毕
萧炎,我们的主角的图片哦~ 下载完毕
Traceback (most recent call last):
  File "E:/推土机/py-lzy/Requests-bs4-xpath/爬虫-笔趣阁-斗破苍穹.py", line 29, in <module>
    book_name = detal_tree.xpath('//div[@class="bookname"]/h1/text()')[0]
IndexError: list index out of range
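At the point where it fails, I checked the page 【土豆强力推荐——《》网游】, and its title markup is the same as on the other pages. Scraping that page on its own returns the data without any problem, so why is the index out of range here?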
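For reference, `xpath()` returns a plain Python list, and that list is empty when nothing in the page matches, so indexing it with `[0]` raises exactly this `IndexError`. Below is a minimal sketch of guarding the lookup. The helper name `first_or_default` and the two sample lists are made up for illustration and only stand in for real XPath results.

```python
def first_or_default(results, default=None):
    """Return the first item of an XPath result list, or `default` if it is empty."""
    return results[0] if results else default


# Made-up stand-ins for what xpath() might return on two kinds of page.
title_found = ["Chapter 1"]  # the title div exists
title_missing = []           # no matching node, e.g. an error or throttle page

print(first_or_default(title_found))                # Chapter 1
print(first_or_default(title_missing, "untitled"))  # untitled

try:
    title_missing[0]  # what the original `[0]` does on an empty list
except IndexError as exc:
    print("IndexError:", exc)  # IndexError: list index out of range
```

In the crawler, checking the result this way and printing `download_url` when the title is missing would show which response came back without the expected markup.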
import requests
from lxml import etree
import os
import time

url = "https://www.ibswtan.com/0/425/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}

# Fetch the table of contents and collect the chapter links.
response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding
html_text = response.text
html_tree = etree.HTML(html_text)
div_list = html_tree.xpath('//div[@id="list"]/dl/dd')
url_list = []
# Create the output directory if it does not exist yet.
os.makedirs("./斗破苍穹", exist_ok=True)
for dd in div_list:
    new_url = dd.xpath("./a/@href")[0]
    url_list.append(new_url)

# Download each chapter, pausing between requests.
for i in url_list:
    download_url = "https://www.ibswtan.com/0/425/" + i
    print(download_url)
    time.sleep(1)  # sleeping for 1 second is enough
    r = requests.get(download_url, headers=headers)
    r.encoding = r.apparent_encoding
    detal_html = r.text
    detal_tree = etree.HTML(detal_html)
    book_name = detal_tree.xpath(
        '//div[@class="bookname"]/h1/text()')[0]
    text = detal_tree.xpath('//div[@id="content"]/text()')[0]
    file_path = "./斗破苍穹/" + book_name + ".txt"
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(book_name, '下载完毕')
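The requests are probably being sent too frequently; sleeping for 1 second between them should be enough. If the server throttles a request, it presumably returns a page without the `bookname` div, so the XPath result is an empty list and `[0]` raises the IndexError.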
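If a fixed pause turns out not to be enough, one option is to retry a page when it comes back without the expected content, waiting a little longer each time. The following is only a sketch under that assumption: `fetch_with_retry`, `fetch`, and `is_valid` are hypothetical names that do not appear in the code above, and the download function is passed in as a parameter so the retry logic can be tried without touching the network.

```python
import time


def fetch_with_retry(fetch, url, is_valid, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times until is_valid(page) is true.

    Waits delay * attempt seconds between attempts (1s, 2s, ... by default).
    Returns the page text, or None if every attempt produced an invalid page.
    """
    for attempt in range(1, retries + 1):
        page = fetch(url)
        if is_valid(page):
            return page
        if attempt < retries:
            sleep(delay * attempt)  # back off a bit more after each failure
    return None


# Stub download function for demonstration: fails twice, then succeeds.
responses = iter(["", "", '<div class="bookname"><h1>Chapter 1</h1></div>'])
page = fetch_with_retry(
    fetch=lambda u: next(responses),
    url="chapter.html",
    is_valid=lambda text: 'class="bookname"' in text,
    delay=0.01,  # short delay so the demo finishes quickly
)
print(page is not None)  # True
```

In the crawler above, `fetch` could wrap the existing `requests.get(download_url, headers=headers)` call and return `r.text`, and a `None` result could be logged and skipped instead of crashing the whole run.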