Baidu Tieba crawler example: it keeps looping on the first page and never turns the page. What is wrong with the loop body?

我爱l两条柴 · posted on 2021-9-7 19:45:58

import requests
from lxml import etree


class Tieba(object):
    def __init__(self, name):
        # 1. url / headers
        self.url = 'https://tieba.baidu.com/f?ie=utf-8&kw={}&fr=search'.format(name)
        # headers
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}

    def get_data(self, url):
        # send the request and return the page source
        response = requests.get(self.url, self.headers)
        return response.content

    def parse_data(self, data):  # parse
        html = etree.HTML(data)
        el_list = html.xpath('//*[@id="thread_list"]/li/div/div[2]/div[1]/div[1]/a|//*[@id="thread_top_list"]/li/div/div[2]/div/div[1]/a')
        # print(len(el_list))  # check how many results come back; if there are none,
        # the XPath may be wrong, or the server may have commented out the markup for
        # modern browsers, in which case try an older User-Agent in the headers
        # Or the page wraps content in markers such as <!-- 引入百度统计 -->, which can be
        # stripped with data = data.decode().replace('<!--', '').replace('-->', '')
        data_list = []  # list of title + link dicts
        for el in el_list:  # el is a child of html and is itself an Element, so XPath works on it too
            title_link = {}
            title_link['title'] = el.xpath('./text()')[0]
            title_link['link'] = 'http://tieba.baidu.com' + el.xpath('./@href')[0]
            data_list.append(title_link)
        # This page is done, so move on to the next one. The next-page link is
        # relative to the current page, so it differs from page to page.
        try:
            # the last page has no "下一页" (next page) link, so indexing fails there
            next_url = 'https:' + html.xpath('//a[contains(text(),"下一页")]/@href')[0]
        except IndexError:
            # //a[@class ="next pagination-item"]/@href works in the browser,
            # but in Python it does not return the next-page link
            next_url = None
        return data_list, next_url

    def save_data(self, data_list):
        for data in data_list:
            print(data)

    def run(self):
        next_url = self.url
        while True:
            data = self.get_data(next_url)  # the first request uses the home-page URL
            data_list, next_url = self.parse_data(data)  # later next-page links come from parsing the current page
            # extract (the data and the next-page URL)
            self.save_data(data_list)
            print(next_url)
            if next_url is None:
                break


if __name__ == '__main__':
    tieba = Tieba('李毅')
    tieba.run()
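
The commented-out hint in parse_data (stripping the comment markers before parsing) can be tried offline. This is only a rough sketch: it assumes the server wraps the thread list in an HTML comment, the sample markup and the names LinkCollector, strip_comment_markers and find_links are made up for illustration, and the standard-library parser stands in for lxml.

```python
from html.parser import HTMLParser

# Hypothetical sample: the thread link is hidden inside an HTML comment.
SAMPLE = '<div><!-- <ul id="thread_list"><li><a href="/p/1">first thread</a></li></ul> --></div>'


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


def strip_comment_markers(text):
    """Remove the comment delimiters so the wrapped markup is parsed as HTML."""
    return text.replace('<!--', '').replace('-->', '')


def find_links(text):
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    return collector.links


print(find_links(SAMPLE))                         # the comment hides the link
print(find_links(strip_comment_markers(SAMPLE)))  # the link is now visible
```

If the real page behaves the same way, an XPath query that returns nothing on the raw response may start matching once the markers are removed.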

suchocolate · posted on 2021-9-11 20:35:44

import requests
from lxml import etree


class Tieba():
    def __init__(self, name):
        self.name = name
        self.base_url = 'https://tieba.baidu.com/'
        self.headers = {'user-agent': 'Mozilla', 'host': 'tieba.baidu.com'}

    def run(self):
        num = int(input('Please enter the number of pages you want to download: '))
        for x in range(num):
            # Build each page URL directly: pn advances in steps of 50
            url = f'https://tieba.baidu.com/f?ie=utf-8&kw={self.name}&pn={x * 50}'
            r = requests.get(url, headers=self.headers)
            html = etree.HTML(r.text)
            # The class value includes a trailing space
            tits = html.xpath('//a[@class="j_th_tit "]/@title')
            hrefs = html.xpath('//a[@class="j_th_tit "]/@href')
            print(f'Page {x + 1} titles and links:')
            for k, v in zip(tits, hrefs):
                print(k, f'{self.base_url}{v}')
            print('=' * 100)


if __name__ == '__main__':
    t = Tieba('李毅')
    t.run()
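
The reply avoids the next-page link altogether and steps the pn query parameter in multiples of 50. Below is a small sketch of just that URL construction. The names page_url and PAGE_SIZE are illustrative, and the step of 50 is taken from the reply, not from any documented API.

```python
from urllib.parse import urlencode

PAGE_SIZE = 50  # assumption taken from the reply: pn grows by 50 per page


def page_url(name, page_index):
    """Build the list-page URL for a zero-based page index."""
    query = urlencode({'ie': 'utf-8', 'kw': name, 'pn': page_index * PAGE_SIZE})
    return 'https://tieba.baidu.com/f?' + query


for i in range(3):
    print(page_url('李毅', i))
```

Using urlencode also percent-encodes the forum name, which the f-string in the reply leaves to requests.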

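我爱l两条柴 · posted on 2021-9-12 10:35:29

suchocolate posted on 2021-9-11 20:35

Thanks.
In the end I deleted the `next_url=` part and added `self.url=next_url` below it, and that fixed it.

But I have been thinking about it for days and still cannot work out what exactly was wrong with the run() I copied from the tutorial.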
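
One likely explanation, judging only from the code posted above: run() itself looks fine, and the problem is in get_data(). It accepts a url argument but then calls requests.get(self.url, self.headers), so every iteration fetches the first page again regardless of what next_url holds. That would also explain why assigning self.url = next_url made it work. Separately, the second positional parameter of requests.get is params, not headers, so the User-Agent is probably being sent as query parameters and not as a request header. The looping behaviour can be reproduced without network access in the sketch below, where FAKE_SITE, BuggyCrawler and FixedCrawler are made-up stand-ins for the real site and class.

```python
# Hypothetical three-page site: each URL maps to (items, next_url).
FAKE_SITE = {
    'page1': (['a', 'b'], 'page2'),
    'page2': (['c'], 'page3'),
    'page3': (['d'], None),
}


class BuggyCrawler:
    def __init__(self, start_url):
        self.url = start_url

    def get_data(self, url):
        # Bug: the url argument is ignored; self.url is fetched every time.
        return FAKE_SITE[self.url]

    def run(self, max_pages=4):
        collected = []
        next_url = self.url
        # max_pages caps the loop so the buggy version still terminates.
        for _ in range(max_pages):
            items, next_url = self.get_data(next_url)
            collected.extend(items)
            if next_url is None:
                break
        return collected


class FixedCrawler(BuggyCrawler):
    def get_data(self, url):
        # Fix: fetch the URL that was actually passed in.
        return FAKE_SITE[url]


print(BuggyCrawler('page1').run())  # first page repeated until the cap
print(FixedCrawler('page1').run())  # each page fetched once
```

In the original class the equivalent change would presumably be response = requests.get(url, headers=self.headers) inside get_data(), leaving run() as it was.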
