[Solved] Could an expert help me analyze this?

smyiuo_11 · posted 2022-6-17 23:00:59


import os
import re
import requests

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'
    }
    if not os.path.exists('./tupian'):
        os.mkdir('./tupian')
    # Generic URL template
    url = 'https://www.libvio.me/type/1-%d.html'
    for pageNum in range(1, 3):
        new_url = format(url % pageNum)

        # Fetch the page at this URL with a general-purpose crawler
        page_text = requests.get(url=new_url, headers=headers).text
        page_text.encode('utf-8')
        # print(page_text)

        # Parse the page and extract the target content with a focused crawler
        ex = '<div class="stui-vodlist__box"><a .*?href="(.*?)"'
        href_list = re.findall(ex, page_text, re.S)
        for href in href_list:
            # Build the full movie URL
            href = 'https://www.libvio.me/' + href
            print(href)
            ex = '<div class="stui-vodlist__box"><a .*?title="(.*?)"'
            name_list = re.findall(ex, page_text, re.S)
            for name in name_list:
                name: str
                print(href, name)

Output:
课程/爬虫测试2.py
https://www.libvio.me//detail/101056.html
https://www.libvio.me//detail/101056.html 夺命剑
https://www.libvio.me//detail/101056.html 喜欢妳是你
https://www.libvio.me//detail/101056.html 肉罢不能
https://www.libvio.me//detail/101056.html 红色火箭
https://www.libvio.me//detail/101056.html 教场
https://www.libvio.me//detail/101056.html 我的见鬼女友
https://www.libvio.me//detail/101056.html 我的爸爸
https://www.libvio.me//detail/101056.html RRR
https://www.libvio.me//detail/101056.html 唐顿庄园2
https://www.libvio.me//detail/101056.html 一个明星的诞生
https://www.libvio.me//detail/101056.html 迎雨咆哮
https://www.libvio.me//detail/101056.html 妖兽都市
https://www.libvio.me//detail/101055.html
https://www.libvio.me//detail/101055.html 夺命剑
https://www.libvio.me//detail/101055.html 喜欢妳是你
https://www.libvio.me//detail/101055.html 肉罢不能
https://www.libvio.me//detail/101055.html 红色火箭
https://www.libvio.me//detail/101055.html 教场
https://www.libvio.me//detail/101055.html 我的见鬼女友

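Why does this happen? I get both the movie links and the titles, but every link is repeated 13 times and the titles do not match the links. How should I change the code? Could someone help me work through it?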

Best answer


suchocolate

2022-6-18 19:27:53

In your code the href is not refreshed

#!/usr/bin/env python3
import requests
import re


def main():
    headers = {'user-agent': 'firefox'}
    baseurl = 'https://www.libvio.me'
    for i in range(1, 11):
        url = f'https://www.libvio.me/type/1-{i}.html'
        r = requests.get(url, headers=headers)
        # Capture href and title from the same tag so they stay paired;
        # [1:] skips the first match, which is not a movie.
        alist = re.findall(r'<a href="(.*?)" title="(.*?)">', r.text)[1:]
        for a in alist:
            print(baseurl + a[0], a[1])


if __name__ == '__main__':
    main()
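
To see why the original script prints every link next to every title, here is a minimal offline sketch. The HTML string, the class name `thumb`, and the helper names `nested_pairs` and `matched_pairs` are made up for illustration; the real page markup may differ. It compares the nested-loop pairing from the question with walking the two lists in step.

```python
import re

# Made-up stand-in for one listing page; the real markup may differ.
SAMPLE_HTML = (
    '<div class="stui-vodlist__box"><a class="thumb" href="/detail/1.html" title="Movie A"></a></div>'
    '<div class="stui-vodlist__box"><a class="thumb" href="/detail/2.html" title="Movie B"></a></div>'
    '<div class="stui-vodlist__box"><a class="thumb" href="/detail/3.html" title="Movie C"></a></div>'
)

# The two patterns from the question.
HREF_RE = r'<div class="stui-vodlist__box"><a .*?href="(.*?)"'
TITLE_RE = r'<div class="stui-vodlist__box"><a .*?title="(.*?)"'


def nested_pairs(html):
    """Reproduce the question's logic: every href is combined with every title."""
    hrefs = re.findall(HREF_RE, html, re.S)
    titles = re.findall(TITLE_RE, html, re.S)
    return [(h, t) for h in hrefs for t in titles]


def matched_pairs(html):
    """Walk both lists in step so each href keeps its own title."""
    hrefs = re.findall(HREF_RE, html, re.S)
    titles = re.findall(TITLE_RE, html, re.S)
    return list(zip(hrefs, titles))


if __name__ == "__main__":
    # 3 hrefs x 3 titles: the nested loops produce every combination.
    print(len(nested_pairs(SAMPLE_HTML)))
    for href, title in matched_pairs(SAMPLE_HTML):
        print(href, title)
```

With the nested loops the number of printed pairs grows as hrefs times titles, which matches the repeated links in the question's output. Capturing both groups in a single pattern, as the answer above does, avoids building two separate lists in the first place.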


wlinwei · posted 2022-6-18 17:32:48

import requests
from bs4 import BeautifulSoup

for i in range(1, 5):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'}
    page = requests.get("https://www.libvio.me/type/1-{}.html".format(i), headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser', from_encoding='utf-8')
    a = soup.find_all('h4', class_='title text-overflow',)
    for j in a:
        print('https://www.libvio.me/{}'.format(j.a.attrs['href']),
              j.get_text())


I'm not sure what is wrong with yours, but you can use this one as a reference.
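
A side note on the URLs: the output in the question shows `https://www.libvio.me//detail/101056.html` with a doubled slash, because a root ending in `/` was concatenated with an href that already starts with `/`. One possible way to avoid that, sketched here with the standard library's `urljoin`, is below. `BASE_URL` and `absolute_url` are illustrative names, not part of any post in this thread.

```python
from urllib.parse import urljoin

# Site root taken from the thread; the trailing slash is deliberate.
BASE_URL = 'https://www.libvio.me/'


def absolute_url(href):
    """Resolve a page-relative or root-relative href against the site root."""
    return urljoin(BASE_URL, href)


if __name__ == "__main__":
    # Works whether or not the href starts with a slash.
    print(absolute_url('/detail/101056.html'))
    print(absolute_url('detail/101056.html'))
```

Both calls resolve to the same address, so the caller does not need to know whether the scraped href has a leading slash.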

changzhou · posted 2022-6-18 20:15:28

#!/usr/bin/env python3

import requests
import re


def main():
    headers = {'user-agent': 'firefox'}
    baseurl = 'https://www.libvio.me'
    for i in range(1, 11):
        url = f'https://www.libvio.me/type/1-{i}.html'
        r = requests.get(url, headers=headers)
        alist = re.findall(r'<a href="(.*?)" title="(.*?)">', r.text)[1:]
        for a in alist:
            print(baseurl + a[0], a[1])


if __name__ == '__main__':
    main()

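smyiuo_11 · posted 2022-6-19 11:17:39

suchocolate posted 2022-6-18 19:27
In your code the href is not refreshed

The gap between code written by a beginner and by an expert really is huge. Yours has a clear structure and a simple line of thought, very nice. The only thing I don't quite understand is the trailing [1:] in alist = re.findall(r'<a href="(.*?)" title="(.*?)">', r.text)[1:]. Could you explain it?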

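suchocolate · posted 2022-6-19 11:24:09

smyiuo_11 posted 2022-6-19 11:17
The gap between code written by a beginner and by an expert really is huge. Yours has a clear structure and a simple line of thought, very nice. The only thing is alist = re.findall(r'', r ...

The page has 13 a elements that match the structure above; the first one is not a movie, so it is skipped.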
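
As a small illustration of that slice, the sketch below runs the same pattern over a made-up fragment in which a navigation link precedes the movie links. The HTML and the "Home" link are assumptions for the example; on the real page the first match may be something else.

```python
import re

# Hypothetical fragment: one non-movie link followed by two movie links.
SAMPLE_HTML = (
    '<a href="/" title="Home">Home</a>'
    '<a href="/detail/1.html" title="Movie A">Movie A</a>'
    '<a href="/detail/2.html" title="Movie B">Movie B</a>'
)

PATTERN = r'<a href="(.*?)" title="(.*?)">'

# With two capture groups, findall returns a list of (href, title) tuples.
all_links = re.findall(PATTERN, SAMPLE_HTML)

# [1:] keeps everything except the first match.
movie_links = all_links[1:]

if __name__ == "__main__":
    for href, title in movie_links:
        print(href, title)
```

Skipping by position is brittle if the page layout changes; filtering on the href (for example keeping only those that start with `/detail/`) would be one alternative, though that prefix is only inferred from the output shown in the question.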
