Web scraping: scraping the Douban TOP250 (BeautifulSoup & lxml), Python Discussion, Programming Languages Section, FishC Forum

pegasustiger posted on 2023-7-16 19:10:46

Here to learn.

totoro12366 posted on 2023-7-17 10:34:07

Here to learn.

Gaizi posted on 2023-9-30 10:24:58

kuxiao posted on 2023-9-30 18:34:58

Awesome, keep it up!

froeverX posted on 2023-10-1 09:39:29

Taking a look to learn.

mwy1024 posted on 2023-10-1 11:29:02


一位小白 posted on 2024-7-6 19:45:07

Impressive.

Alextssui posted on 2024-8-15 23:52:01

111

ryan2836 posted on 2024-8-16 07:01:07

Here to learn.

小肥狼haoran posted on 2024-8-16 17:56:35

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from lxml import etree

browser = webdriver.Chrome()

def first(nodes, default=''):
    # xpath() always returns a list, so take the first item or fall back to a default
    return nodes[0] if nodes else default

def main():
    browser.get("https://movie.douban.com/top250?start=0&filter=")
    for page in range(10):
        browser.execute_script('window.scroll(0,document.documentElement.scrollHeight)')  # scroll to the bottom of the page
        html = browser.page_source  # .page_source returns the page's HTML source
        tree = etree.HTML(html)
        for div in tree.xpath('//ol[@class="grid_view"]/li/div'):
            # The positional indexes ([1], [2], ...) below assume Douban's usual markup for this list
            title1 = first(div.xpath('.//div[@class="hd"]/a/span[1]/text()'))  # title part 1
            title2 = first(div.xpath('.//div[@class="hd"]/a/span[2]/text()'))  # title part 2
            title3 = first(div.xpath('.//div[@class="hd"]/a/span[3]/text()'), '/ no tag')  # title part 3, missing for some films
            title3 = title3.replace('\xa0/\xa0', '/')  # drop the extra \xa0/\xa0 separator
            info1 = div.xpath('.//div[@class="bd"]/p//text()')  # director, year, country, genre, and quote text
            # print(info1)
            info = '+'.join(info1).replace('\n', '').replace(' ', '')
            director = info.split("主")[0].strip()  # split on "主" (as in 主演, "starring") and keep index 0, the director
            year_country_type = info.split("+")  # split info on "+" into [director and cast, year/country/genre, quote pieces]
            # print(len(year_country_type))  # compare the list lengths
            if len(year_country_type) > 2:  # with a quote the list has 5 items
                year_country_type = year_country_type[1]  # index 1 holds year, country, genre
            else:  # without a quote the list has 2 items
                year_country_type = year_country_type[-1]  # index -1 holds year, country, genre
            # print(year_country_type)  # tested: Parasite has no quote and still yields the right year, country, genre
            star = first(div.xpath('.//div[@class="star"]/span[1]/@class'))  # class name encodes the star rating
            if star == 'rating5-t':  # map the class keyword to a nicer label
                star = '5 stars ★★★★★'
            elif star == 'rating45-t':
                star = '4.5 stars ★★★★⭐️'
            else:
                star = '4 stars ★★★★'  # in testing none of the 250 films was below 4 stars, so the check stops here
            score = first(div.xpath('.//div[@class="star"]/span[2]/text()')) + '/10'  # score out of 10
            man_score = first(div.xpath('.//div[@class="star"]/span[4]/text()'))  # number of ratings
            # films without a quote get a placeholder instead
            quote = first(div.xpath('.//div[@class="bd"]/p/span/text()'), 'No description')
            print(title1, title2, title3, director, year_country_type, star, score, man_score, quote)
        if page < 9:  # the last page has no next link
            # find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
            browser.find_element(By.XPATH, '//span[@class="next"]').click()  # click the "next" span
            time.sleep(1)  # brief pause so the next page can load

main()
time.sleep(3)
browser.quit()  # quit the browser; browser.close() would only close the current tab
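What a coincidence: I was learning to scrape this same page a while ago, except I did it with Selenium browser automation. You need to set up the browser driver environment beforehand.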
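
For anyone adapting the script above, the star-rating branch and the director / year / country / genre split can also be tried on plain strings, without launching the browser. This is only a minimal sketch: it assumes the class names follow the `rating5-t` / `rating45-t` pattern seen above and that year, country, and genre are separated by `\xa0/\xa0`. The helper names `star_label` and `parse_info`, and the sample strings, are hypothetical and written for illustration; they are not part of the original script.

```python
import re

def star_label(rating_class):
    """Turn a class such as 'rating45-t' into a label such as '4.5 stars ★★★★½'."""
    # Assumed pattern: 'rating' + one digit + optional '5' (half star) + '-t'
    match = re.fullmatch(r'rating(\d)(5?)-t', rating_class)
    if match is None:
        return 'unrated'
    whole = int(match.group(1))
    half = match.group(2) == '5'
    number = f'{whole}.5' if half else str(whole)
    return f'{number} stars ' + '★' * whole + ('½' if half else '')

def parse_info(people_line, facts_line):
    """Split the two text lines of a film's info paragraph into named fields."""
    # Everything before '主演' (starring) is the director part
    director = people_line.split('主演')[0].replace('导演:', '').strip()
    # Assumed separator between year, country and genres: non-breaking spaces around '/'
    parts = facts_line.strip().split('\xa0/\xa0')
    year, country, genres = (parts + ['', '', ''])[:3]  # pad if a field is missing
    return {
        'director': director,
        'year': year.strip(),
        'country': country.strip(),
        'genres': genres.split(),
    }

# Sample strings made up to resemble one list entry
sample_people = '导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /...'
sample_facts = '1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情'

print(star_label('rating45-t'))
print(parse_info(sample_people, sample_facts))
```

Using a pattern instead of a fixed if/elif chain means a class such as `rating35-t` would still get a sensible label if the list ever included a film below 4 stars.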

小肥狼haoran posted on 2024-8-16 18:01:41

Attaching a screenshot of part of the scraped data.

xiaohe.he posted on 2024-9-2 16:17:35

Learning.

1354025705 posted on 2025-5-13 10:31:55

And me...

liyanlong posted on 2025-5-22 15:03:20

Checking in.
