Could some expert help me analyze this?
import os
import re

import requests

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'
    }
    if not os.path.exists('./tupian'):
        os.mkdir('./tupian')
    url = 'https://www.libvio.me/type/1-%d.html'
    # set up a generic url template
    for pageNum in range(1, 3):
        new_url = format(url % pageNum)
        # general crawl: fetch the page at this url
        page_text = requests.get(url=new_url, headers=headers).text
        page_text.encode('utf-8')
        # print(page_text)
        # focused crawl: parse and extract the content from the page
        ex = '<div class="stui-vodlist__box"><a .*?href="(.*?)"'
        href_list = re.findall(ex, page_text, re.S)
        for href in href_list:
            # build the full movie url
            href = 'https://www.libvio.me/' + href
            print(href)
            ex = '<div class="stui-vodlist__box"><a .*?title="(.*?)"'
            name_list = re.findall(ex, page_text, re.S)
            for name in name_list:
                name: str
                print(href, name)
Run output:
课程/爬虫测试2.py
https://www.libvio.me//detail/101056.html
https://www.libvio.me//detail/101056.html 夺命剑
https://www.libvio.me//detail/101056.html 喜欢妳是你
https://www.libvio.me//detail/101056.html 肉罢不能
https://www.libvio.me//detail/101056.html 红色火箭
https://www.libvio.me//detail/101056.html 教场
https://www.libvio.me//detail/101056.html 我的见鬼女友
https://www.libvio.me//detail/101056.html 我的爸爸
https://www.libvio.me//detail/101056.html RRR
https://www.libvio.me//detail/101056.html 唐顿庄园2
https://www.libvio.me//detail/101056.html 一个明星的诞生
https://www.libvio.me//detail/101056.html 迎雨咆哮
https://www.libvio.me//detail/101056.html 妖兽都市
https://www.libvio.me//detail/101055.html
https://www.libvio.me//detail/101055.html 夺命剑
https://www.libvio.me//detail/101055.html 喜欢妳是你
https://www.libvio.me//detail/101055.html 肉罢不能
https://www.libvio.me//detail/101055.html 红色火箭
https://www.libvio.me//detail/101055.html 教场
https://www.libvio.me//detail/101055.html 我的见鬼女友
Why does this happen? I got both the movie links and the names, but every link is repeated 13 times and the names don't correspond to the links. How should I fix it? Please help me figure it out.
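The repetition comes from the nesting: the inner `name_list` loop prints every title on the page for each single href. One minimal fix that keeps the original regex approach is to run both `findall` calls once per page and `zip` the two lists, so each link pairs with the title at the same index. A sketch, demonstrated on a hypothetical inline snippet (the real fix assumes hrefs and titles appear in the same order on the page):

```python
import re

# Hypothetical snippet mimicking the listing markup, used instead of a
# live request so the pairing logic is easy to verify.
page_text = (
    '<div class="stui-vodlist__box"><a class="v" href="/detail/101056.html" title="夺命剑">'
    '<div class="stui-vodlist__box"><a class="v" href="/detail/101055.html" title="教场">'
)

# Run each findall once per page, not once per href.
hrefs = re.findall(r'<div class="stui-vodlist__box"><a .*?href="(.*?)"', page_text, re.S)
names = re.findall(r'<div class="stui-vodlist__box"><a .*?title="(.*?)"', page_text, re.S)

# zip pairs index 0 with index 0, index 1 with index 1, and so on.
pairs = [('https://www.libvio.me' + h, n) for h, n in zip(hrefs, names)]
for link, name in pairs:
    print(link, name)
```

This prints each link exactly once, next to its own title, instead of repeating every link with the full list of names.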
import requests
from bs4 import BeautifulSoup

for i in range(1, 5):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'}
    page = requests.get("https://www.libvio.me/type/1-{}.html".format(i), headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser', from_encoding='utf-8')
    a = soup.find_all('h4', class_='title text-overflow')
    for j in a:
        print('https://www.libvio.me/{}'.format(j.a.attrs['href']),
              j.get_text())

I'm not sure what's wrong with yours, but you can use this as a reference.
The href in your code isn't being refreshed.
#!/usr/bin/env python3
import requests
import re


def main():
    headers = {'user-agent': 'firefox'}
    baseurl = 'https://www.libvio.me'
    for i in range(1, 11):
        url = f'https://www.libvio.me/type/1-{i}.html'
        r = requests.get(url, headers=headers)
        alist = re.findall(r'<a href="(.*?)" title="(.*?)">', r.text)
        for a in alist:
            # each match is a (href, title) tuple
            print(baseurl + a[0], a[1])


if __name__ == '__main__':
    main()
suchocolate posted on 2022-6-18 19:27:
The href in your code isn't being refreshed.
The gap between code written by a beginner and an expert really is huge. Your version has a clear structure and a concise approach, very nice. The only part I'm not clear on is the last bit, alist = re.findall(r'<a href="(.*?)" title="(.*?)">', r.text). Could you explain?

smyiuo_11 posted on 2022-6-19 11:17:
The gap between code written by a beginner and an expert really is huge. Your version has a clear structure and a concise approach, very nice. The only part I'm not clear on is alist = re.findall(r'', r ...
The page has 13 a elements matching that structure; the first one isn't a movie, so skip it.
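That skip can be done with a slice on the findall result. A sketch on a hypothetical snippet where the first matching a element is a site link rather than a movie (the "/label/new.html" entry is invented for illustration):

```python
import re

# Hypothetical page fragment: the first <a href=... title=...> match
# is a navigation link, not a movie, per the explanation above.
html = ('<a href="/label/new.html" title="最新">'
        '<a href="/detail/101056.html" title="夺命剑">'
        '<a href="/detail/101055.html" title="教场">')

alist = re.findall(r'<a href="(.*?)" title="(.*?)">', html)
# findall with two groups returns a list of (href, title) tuples;
# alist[1:] drops the first, non-movie match.
for href, title in alist[1:]:
    print('https://www.libvio.me' + href, title)
```

The two-group pattern is what makes each element of alist a tuple, which is why the print unpacks it into href and title.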