[Solved] Help with a web scraper

潺陵大地 · Posted 2022-4-15 16:08:39

import requests
from bs4 import BeautifulSoup
import re

def gethttpText(url):  # fetch the page
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        print("Function gethttpText failed!")

def parserPage(itl, html):  # parse the page
    try:
        soup = BeautifulSoup(html, "html.parser")
        find_shuju = soup.find('div', attrs={'class': "subjectbox"})  # the div with class="subjectbox" holds the latest topics
        find_a = find_shuju.find_all('a')  # every <a> tag inside that div
        for each in find_a:  # one <a> per topic
            ls = re.findall(r'<strong>.*?</strong>', each.get('tip'))  # regex over tip: the match is the topic title
            ls_name = ls[0].split('>')[1].split("<")[0]  # two splits to strip the surrounding tags
            zz = re.findall(r'作者:.*?\(', each.get('tip'))  # regex over tip: find the author
            zz_name = zz[0].split(':')[1].split('(')[0]  # two splits to isolate the name
            itl.append([ls_name, zz_name, each.get("href")])  # append to the itl result list
    except Exception:
        print("Function parserPage failed!")

def printList(itl):  # print to the screen
    list_top = "{:4}\t{:32}\t{:20}\t{:32}"  # row format
    print(list_top.format("No.", "Title", "Author", "Link"))  # header row
    count = 0
    try:
        for each in itl:
            count += 1
            print(list_top.format(count, each[0], each[1], each[2]))  # one row per topic
    except Exception:
        print("Function printList failed!")

def main():
    try:
        url = 'http://bbs.lwhfishing.com/forum.php'
        html = gethttpText(url)
        itl = []
        parserPage(itl, html)
        printList(itl)
    except Exception:
        print("Function main failed!")

if __name__ == "__main__":
    main()

The content I want to scrape looks roughly like this:

<td valign="top" class="category_l2">
<div class="subjectbox">
<h4>最新主题</h4>
 <ul class="category_newlist">
 <li><a href="forum.php?mod=viewthread&tid=1149&extra=" tip="标题: [原创] 浮钓拉巧嘴 作者: 野钓迷 (4 小时前) 查看/回复: 8/4" onmouseover="showTip(this)" target="_blank">[原创] 浮钓拉巧嘴</a></li>
 <li><a href="forum.php?mod=viewthread&tid=1147&extra=" tip="标题: 【原创首发】“大蓝鲫•腥世界”——转战大潮 作者: 子鱼365 (前天 16:38) 查看/回复: 66/15" onmouseover="showTip(this)" target="_blank">【原创首发】“大蓝鲫•腥世界” ...</a></li>
 <li><a href="forum.php?mod=viewthread&tid=1146&extra=" tip="标题: [原创] 三移其窝终有获 作者: 野钓迷 (前天 05:47) 查看/回复: 114/28" onmouseover="showTip(this)" target="_blank">[原创] 三移其窝终有获</a></li>
 <li><a href="forum.php?mod=viewthread&tid=1145&extra=" tip="标题: 【原创首发】大蓝鲫美极鲜收获满满 作者: 坚守誓言 (3 天前) 查看/回复: 144/25" onmouseover="showTip(this)" target="_blank">【原创首发】大蓝鲫美极鲜收获满 ...</a></li>
 <li><a href="forum.php?mod=viewthread&tid=1144&extra=" tip="标题: [原创] 洸河南钓小鱼 作者: 野钓迷 (3 天前) 查看/回复: 229/33" onmouseover="showTip(this)" target="_blank">[原创] 洸河南钓小鱼</a></li>
 <li><a href="forum.php?mod=viewthread&tid=1143&extra=" tip="标题: 【原创首发】废塘钓到狠东西 作者: 潺陵大地 (4 天前) 查看/回复: 134/19" onmouseover="showTip(this)" target="_blank">【原创首发】废塘钓到狠东西 ...</a></li>
 <li><a href="forum.php?mod=viewthread&tid=1142&extra=" tip="标题: 【原创首发】钓鱼日记：4.6大风无鱼挖菜 作者: 念念不忘 (4 天前) 查看/回复: 117/18" onmouseover="showTip(this)" target="_blank">【原创首发】钓鱼日记：4.6大风 ...</a></li>
 <li><a href="forum.php?mod=viewthread&tid=1141&extra=" tip="标题: 【原创首发】书法习作，请雅正 作者: 江海道人 (4 天前) 查看/回复: 135/16" onmouseover="showTip(this)" target="_blank">【原创首发】书法习作，请雅正 ...</a></li>
 <li><a href="forum.php?mod=viewthread&tid=1140&extra=" tip="标题: 【原创首发】小坐钓小河 作者: 野钓手 (4 天前) 查看/回复: 157/18" onmouseover="showTip(this)" target="_blank">【原创首发】小坐钓小河 ...</a></li>
 <li><a href="forum.php?mod=viewthread&tid=1139&extra=" tip="标题: 【原创首发】让我一次钓个够...... 作者: 野钓手 (4 天前) 查看/回复: 142/15" onmouseover="showTip(this)" target="_blank">【原创首发】让我一次钓个够.... ...</a></li>
 </ul>
 </div>
</td>

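Two questions:
1. My code for extracting the title between the <strong> tags and the name after "作者:" is cumbersome. There must be a simpler way, but all I could manage was to drill down one level at a time and then pull the text out with a regex. The page has many more <strong> elements further down, so I can't simply run a regex over the whole page. After a lot of thought, searching level by level is the best I could come up with. Is there a simpler approach?
2. The output of printList(itl) is not aligned. Is there a good way to make the columns line up?

Thanks!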

Best answer

isdkz · Posted 2022-4-15 18:10:32

Last edited by isdkz on 2022-4-16 08:23

Question 1:

The value of the tip attribute can itself be parsed with BeautifulSoup, so you can keep using it instead of switching to regexes.
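If a second parsing pass feels heavy, another option is a single regex with named groups applied to the tip text. The sketch below is only an illustration: `TIP_PATTERN` and `parse_tip` are hypothetical names, and the pattern assumes the tip follows the "标题: ... 作者: ... (time)" layout seen in the sample above. Any inline tags are replaced with spaces first, so it should not matter whether the attribute carries markup.

```python
import re

# Assumed layout, taken from the sample: "标题: <title> 作者: <author> (<time>) ..."
TIP_PATTERN = re.compile(
    r"标题:\s*(?P<title>.*?)\s*作者:\s*(?P<author>.*?)\s*\((?P<when>[^)]*)\)"
)

def parse_tip(tip):
    """Return (title, author) from a tip string, or None if it does not match."""
    text = re.sub(r"<[^>]+>", " ", tip)  # replace any inline tags with spaces
    match = TIP_PATTERN.search(text)
    if match is None:
        return None
    return match.group("title"), match.group("author")

tip = "标题: [原创] 浮钓拉巧嘴 作者: 野钓迷 (4 小时前) 查看/回复: 8/4"
print(parse_tip(tip))
```

Returning None on a mismatch lets the caller skip an entry whose tip has a different layout, rather than failing on an index error as the chained split calls would.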

Question 2:

For neater output you can use the prettytable library. Install it first with the following command.

pip install prettytable -i https://mirrors.aliyun.com/pypi/simple

Here is your code with both changes applied:

import requests
from bs4 import BeautifulSoup
from prettytable import PrettyTable

def gethttpText(url):  # fetch the page
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        print("Function gethttpText failed!")

def parserPage(itl, html):  # parse the page
    try:
        soup = BeautifulSoup(html, "html.parser")
        find_shuju = soup.find('div', attrs={'class': "subjectbox"})  # the div with class="subjectbox" holds the latest topics
        find_a = find_shuju.find_all('a')  # every <a> tag inside that div
        for each in find_a:  # one <a> per topic
            tip_soup = BeautifulSoup(each['tip'], "html.parser")  # parse the tip attribute as HTML
            ls_name = tip_soup.find('strong').text  # title
            # author: the text node right after the first <br>, between ':' and '('
            zz_name = tip_soup.find('br').next_element.split(':')[1].split('(')[0].strip()
            itl.append([ls_name, zz_name, each.get("href")])  # append to the itl result list
    except Exception:
        print("Function parserPage failed!")

def main():
    try:
        url = 'http://bbs.lwhfishing.com/forum.php'
        html = gethttpText(url)
        itl = []
        parserPage(itl, html)
        table = PrettyTable(field_names=("No.", "Title", "Author", "Link"))
        for i, j in enumerate(itl, start=1):  # number rows from 1, like the original counter
            table.add_row([i] + j)
        print(table)
    except Exception:
        print("Function main failed!")

if __name__ == "__main__":
    main()
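For anyone who would rather not add a dependency, the misalignment can also be handled with the standard library. Format specs such as `{:32}` pad by character count, while Chinese characters usually occupy two terminal columns, so rows with different amounts of Chinese text drift apart. The sketch below pads by display width instead. The helper names (`display_width`, `pad`, `format_rows`) are made up for this example, and it treats characters of ambiguous width as one column, which may not match every terminal font.

```python
import unicodedata

def display_width(text):
    """Count terminal columns: wide and full-width characters take two."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

def pad(text, width):
    """Left-align text in a field that is `width` terminal columns wide."""
    return text + " " * max(0, width - display_width(text))

def format_rows(rows, headers):
    """Return the header and rows as lines whose columns share a display width."""
    table = [list(headers)] + [[str(cell) for cell in row] for row in rows]
    widths = [
        max(display_width(row[i]) for row in table)
        for i in range(len(headers))
    ]
    return [
        "  ".join(pad(cell, w) for cell, w in zip(row, widths))
        for row in table
    ]

rows = [[1, "[原创] 浮钓拉巧嘴", "野钓迷"], [2, "[原创] 三移其窝终有获", "野钓迷"]]
for line in format_rows(rows, ["No.", "Title", "Author"]):
    print(line)
```

Column widths are computed from the data, so nothing has to be hard-coded the way 32 and 20 are in the original format string.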

代码小白liu · Posted 2022-4-15 17:23:14

You could probably use XPath.
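To give a rough idea of what the XPath route looks like, here is a minimal sketch using the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. `SAMPLE` is a trimmed, hand-made copy of the markup from the question, with the ampersands escaped so that an XML parser accepts it, and `extract_links` is a hypothetical helper name. Real forum pages are rarely well-formed XML, so in practice a lenient HTML parser such as lxml, whose xpath() method accepts full XPath 1.0 expressions, would likely be the better fit.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed sample (assumption: the real page would need an HTML parser).
SAMPLE = """
<div class="subjectbox">
  <ul class="category_newlist">
    <li><a href="forum.php?mod=viewthread&amp;tid=1149&amp;extra=">[原创] 浮钓拉巧嘴</a></li>
    <li><a href="forum.php?mod=viewthread&amp;tid=1146&amp;extra=">[原创] 三移其窝终有获</a></li>
  </ul>
</div>
"""

def extract_links(markup):
    """Return (link text, href) for every topic link in the latest-topics list."""
    root = ET.fromstring(markup)
    # Path expression: each <a> directly inside an <li> of the category_newlist <ul>
    anchors = root.findall(".//ul[@class='category_newlist']/li/a")
    return [(a.text, a.get("href")) for a in anchors]

for text, href in extract_links(SAMPLE):
    print(text, href)
```

A single path expression replaces the find / find_all chain, which is the main appeal of XPath for this kind of nested lookup.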

潺陵大地 · Posted 2022-4-16 09:19:59

代码小白liu wrote on 2022-4-15 17:23:
You could probably use XPath.

I haven't learned XPath yet.

潺陵大地 · Posted 2022-4-18 15:24:16

isdkz wrote on 2022-4-15 18:10:
Question 1:

The value of the tip attribute can itself be parsed with BeautifulSoup

Thanks
