Can an expert help me analyze this?, Python Discussion, Programming Languages, FishC Forum

smyiuo_11 posted on 2022-6-17 23:00:59

Can an expert help me analyze this?

import os
import re
import requests

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'
    }
    if not os.path.exists('./tupian'):
        os.mkdir('./tupian')
    # Generic URL template
    url = 'https://www.libvio.me/type/1-%d.html'
    for pageNum in range(1, 3):
        new_url = format(url % pageNum)

        # General crawl: fetch the page behind this URL
        page_text = requests.get(url=new_url, headers=headers).text
        page_text.encode('utf-8')
        # print(page_text)

        # Focused crawl: parse the page and extract the content of interest
        ex = '<div class="stui-vodlist__box"><a .*?href="(.*?)"'
        href_list = re.findall(ex, page_text, re.S)
        for href in href_list:
            # Build the full movie URL
            href = 'https://www.libvio.me/' + href
            print(href)
            ex = '<div class="stui-vodlist__box"><a .*?title="(.*?)"'
            name_list = re.findall(ex, page_text, re.S)
            for name in name_list:
                name: str
                print(href, name)
Output:
课程/爬虫测试2.py
https://www.libvio.me//detail/101056.html
https://www.libvio.me//detail/101056.html 夺命剑
https://www.libvio.me//detail/101056.html 喜欢妳是你
https://www.libvio.me//detail/101056.html 肉罢不能
https://www.libvio.me//detail/101056.html 红色火箭
https://www.libvio.me//detail/101056.html 教场
https://www.libvio.me//detail/101056.html 我的见鬼女友
https://www.libvio.me//detail/101056.html 我的爸爸
https://www.libvio.me//detail/101056.html RRR
https://www.libvio.me//detail/101056.html 唐顿庄园2
https://www.libvio.me//detail/101056.html 一个明星的诞生
https://www.libvio.me//detail/101056.html 迎雨咆哮
https://www.libvio.me//detail/101056.html 妖兽都市
https://www.libvio.me//detail/101055.html
https://www.libvio.me//detail/101055.html 夺命剑
https://www.libvio.me//detail/101055.html 喜欢妳是你
https://www.libvio.me//detail/101055.html 肉罢不能
https://www.libvio.me//detail/101055.html 红色火箭
https://www.libvio.me//detail/101055.html 教场
https://www.libvio.me//detail/101055.html 我的见鬼女友

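Why does this happen? I get both the movie link addresses and the names, but every link is repeated 13 times and the names don't match the links. How should I change the code? Any help from an expert would be appreciated.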

wlinwei posted on 2022-6-18 17:32:48

import requests
from bs4 import BeautifulSoup

for i in range(1, 5):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'}
    page = requests.get("https://www.libvio.me/type/1-{}.html".format(i), headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser', from_encoding='utf-8')
    # Find every <h4 class="title text-overflow">; each one is expected to wrap the movie link
    a = soup.find_all('h4', class_='title text-overflow')
    for j in a:
        # The hrefs in the output above already start with '/', so the base has no trailing slash
        print('https://www.libvio.me{}'.format(j.a.attrs['href']),
              j.get_text())

I'm not sure what is wrong with yours, but you can use this one as a reference.
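
A side note on joining the base URL and the href: the output in the first post shows `https://www.libvio.me//detail/...`, which suggests the href values already begin with a slash. Below is a minimal sketch of one way to avoid the doubled slash, using `urljoin` from the standard library. The helper name `full_url` and the constant `BASE_URL` are made up for illustration; the sample href is copied from that output.

```python
from urllib.parse import urljoin

# Base address of the site, as used elsewhere in this thread.
BASE_URL = 'https://www.libvio.me/'


def full_url(href):
    # urljoin resolves href against the base, so a leading slash
    # in href does not produce a doubled '//' in the path.
    return urljoin(BASE_URL, href)


print(full_url('/detail/101056.html'))
```

This works whether or not the href starts with a slash, so the join does not depend on how the page happens to write its links.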

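suchocolate posted on 2022-6-18 19:27:53

In your code, href is not refreshed: the inner loop over name_list runs again for every href, so each link is printed with every name.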
#!/usr/bin/env python3

import requests
import re


def main():
    headers = {'user-agent': 'firefox'}
    baseurl = 'https://www.libvio.me'
    for i in range(1, 11):
        url = f'https://www.libvio.me/type/1-{i}.html'
        r = requests.get(url, headers=headers)
        # Two capture groups, so findall returns a list of (href, title) tuples
        alist = re.findall(r'<a href="(.*?)" title="(.*?)">', r.text)
        for a in alist:
            # a[0] is the href, a[1] is the title
            print(baseurl + a[0], a[1])


if __name__ == '__main__':
    main()
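
To see why the nested loops in the first post repeat every link, here is a small offline sketch. It reuses the two patterns from that post on a made-up HTML snippet (the snippet, `SAMPLE_HTML`, and the function names are assumptions for illustration, not the real page). Nesting the loops yields every href combined with every name, while `zip` pairs them one to one.

```python
import re

# Made-up HTML that imitates the structure the original patterns match.
SAMPLE_HTML = '''
<div class="stui-vodlist__box"><a class="thumb" href="/detail/1.html" title="Movie A"></a></div>
<div class="stui-vodlist__box"><a class="thumb" href="/detail/2.html" title="Movie B"></a></div>
'''

HREF_EX = '<div class="stui-vodlist__box"><a .*?href="(.*?)"'
TITLE_EX = '<div class="stui-vodlist__box"><a .*?title="(.*?)"'


def nested_pairs(html):
    # Same shape as the original code: for each href, loop over all names.
    hrefs = re.findall(HREF_EX, html, re.S)
    names = re.findall(TITLE_EX, html, re.S)
    return [(h, n) for h in hrefs for n in names]


def zipped_pairs(html):
    # Pair the i-th href with the i-th name instead.
    hrefs = re.findall(HREF_EX, html, re.S)
    names = re.findall(TITLE_EX, html, re.S)
    return list(zip(hrefs, names))


print(len(nested_pairs(SAMPLE_HTML)))  # 2 hrefs x 2 names
print(zipped_pairs(SAMPLE_HTML))
```

Pairing with `zip` only lines up correctly when both lists have the same length and order, which is one reason capturing href and title in a single pattern, as in the code above, is the safer choice.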

changzhou posted on 2022-6-18 20:15:28

#!/usr/bin/env python3

import requests
import re


def main():
    headers = {'user-agent': 'firefox'}
    baseurl = 'https://www.libvio.me'
    for i in range(1, 11):
        url = f'https://www.libvio.me/type/1-{i}.html'
        r = requests.get(url, headers=headers)
        # Two capture groups, so findall returns a list of (href, title) tuples
        alist = re.findall(r'<a href="(.*?)" title="(.*?)">', r.text)
        for a in alist:
            # a[0] is the href, a[1] is the title
            print(baseurl + a[0], a[1])


if __name__ == '__main__':
    main()

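smyiuo_11 posted on 2022-6-19 11:17:39

suchocolate posted on 2022-6-18 19:27
In your code, href is not refreshed

The gap between code written by a beginner and by an expert really is huge. Yours has a clear structure and a concise approach, very nice. The only part I don't quite understand is the last bit, alist = re.findall(r'<a href="(.*?)" title="(.*?)">', r.text). Could you explain it?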

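suchocolate posted on 2022-6-19 11:24:09

smyiuo_11 posted on 2022-6-19 11:17
The gap between code written by a beginner and by an expert really is huge. Yours has a clear structure and a concise approach, very nice. The only part is alist = re.findall(r'', r ...

The page has 13 a elements that match the structure above. The first one is not a movie, so skip it.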
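
A rough sketch of what that line does, run on a made-up snippet rather than the real page (`SAMPLE_HTML`, the "Home" link, and the function name `extract_movies` are assumptions for illustration). With two capture groups, `re.findall` returns a list of `(href, title)` tuples, and slicing with `[1:]` is one way to skip the first match.

```python
import re

# Made-up snippet: the first matching <a> stands in for a non-movie link.
SAMPLE_HTML = (
    '<a href="/" title="Home">Home</a>'
    '<a href="/detail/1.html" title="Movie A">Movie A</a>'
    '<a href="/detail/2.html" title="Movie B">Movie B</a>'
)

BASE_URL = 'https://www.libvio.me'


def extract_movies(html):
    # Two capture groups -> findall returns a list of (href, title) tuples.
    alist = re.findall(r'<a href="(.*?)" title="(.*?)">', html)
    # Skip the first match, assumed here not to be a movie.
    return [(BASE_URL + href, title) for href, title in alist[1:]]


for link, title in extract_movies(SAMPLE_HTML):
    print(link, title)
```

Because each tuple already holds the href and the title of the same a element, the two values stay matched without a second loop.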
