basketmn posted on 2021-8-9 14:04:32

A question about a web scraper

This post was last edited by basketmn on 2021-8-9 14:06

import requests
import re
from lxml import etree
url='https://www.qiushibaike.com/video/'
headers={'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
}
response=requests.get(url=url,headers=headers)
result=etree.HTML(response.text)
#tupian=re.findall(r'<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>',response.text,re.S)
tupian=result.xpath('//video[@controls="controls"]/source/@src')
print(tupian)
for img_tupian in tupian:
        video_url='https:'+img_tupian
        shipin=requests.get(url=video_url,headers=headers)
        print(shipin)
        with open('.\','wb') as f:
                f.write(shipin.content)

Hi everyone, I think this is being blocked by anti-scraping: it returns <Response [200]>. How do I fix this?

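2012277033 posted on 2021-8-9 14:18:52

You're not being blocked. <Response [200]> means the request succeeded; 200 is the HTTP status code. The only problem in your code is the open() call. You can change it to with open('./'+img_tupian.split('/')[-1],'wb')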
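To spell out why the original open('.\','wb') fails and what the suggested replacement does, here is a minimal sketch. The helper names `filename_from_src` and `is_valid_python` and the `example.com` link are made up for illustration; the real src values are presumably protocol-relative links of the same shape, since the code prepends 'https:' to them.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_src(src: str) -> str:
    """Return the last path segment of a (possibly protocol-relative) URL."""
    # urlsplit understands '//host/path' and separates out any '?query' part,
    # which a plain src.split('/')[-1] would leave in the file name.
    return posixpath.basename(urlsplit(src).path)


def is_valid_python(source: str) -> bool:
    """Report whether a snippet of source code parses."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


# Hypothetical src value, shaped like the links the XPath query returns.
example_src = "//example.com/videos/abc123.mp4"
print(filename_from_src(example_src))

# In '.\' the backslash escapes the closing quote, so the string literal
# never ends where intended and the line does not parse.
print(is_valid_python(r"open('.\','wb')"))

# Building the path from './' plus a file name parses fine.
print(is_valid_python("open('./' + 'abc123.mp4', 'wb')"))
```

Besides the syntax problem, '.\' names a directory rather than a file, so each video needs its own file name, which is what the split('/')[-1] suggestion provides.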

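basketmn posted on 2021-8-9 14:31:35

This post was last edited by basketmn on 2021-8-9 14:39

2012277033 posted on 2021-8-9 14:18
You're not being blocked. <Response [200]> means the request succeeded; 200 is the HTTP status code. The only problem in your code is the open() call. You can change it to

I clearly haven't learned file paths properly. I need to go over file storage again.
I changed it, but still nothing gets saved!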

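basketmn posted on 2021-8-9 14:41:02

2012277033 posted on 2021-8-9 14:18
You're not being blocked. <Response [200]> means the request succeeded; 200 is the HTTP status code. The only problem in your code is the open() call. You can change it to

Thanks a lot! It works now.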

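2012277033 posted on 2021-8-9 14:42:17

basketmn posted on 2021-8-9 14:31
I clearly haven't learned file paths properly. I need to go over file storage again.
I changed it, but still nothing gets saved!

It runs fine on my machine. Check whether your folder has write permission.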
import requests
import re
from lxml import etree
url='https://www.qiushibaike.com/video/'
headers={'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
}
response=requests.get(url=url,headers=headers)
result=etree.HTML(response.text)
#tupian=re.findall(r'<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>',response.text,re.S)
tupian=result.xpath('//video[@controls="controls"]/source/@src')
print(tupian)
for img_tupian in tupian:
    video_url='https:'+img_tupian
    shipin=requests.get(url=video_url,headers=headers)
    print(shipin)
    with open('./'+img_tupian.split('/')[-1],'wb') as f:
        f.write(shipin.content)
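On the write-permission suggestion in this reply, one rough way to check is to try creating a throwaway file in the target folder. This is only a sketch; `can_write_to` is a hypothetical helper name, not something from the thread.

```python
import tempfile


def can_write_to(directory: str) -> bool:
    """Try to create a temporary file in `directory` and report success."""
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory as well as a permission error.
        return False


# Check the current working directory, where the script saves the videos.
print(can_write_to("."))
```

If this prints False for the folder the script runs in, the downloads cannot be saved there no matter what the request returns.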

大马强 posted on 2021-8-9 14:49:27

Your code is fine; the problem is the save path.
import requests
from lxml import etree
url = 'https://www.qiushibaike.com/video/'
headers = {'User-Agent':
         'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
         }
response = requests.get(url=url, headers=headers)
result = etree.HTML(response.text)
#tupian=re.findall(r'<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>',response.text,re.S)
tupian = result.xpath('//video[@controls="controls"]/source/@src')
print(tupian)
for img_tupian in tupian:
    video_url = 'https:'+img_tupian
    shipin = requests.get(url=video_url, headers=headers)
    file_name = img_tupian.split("/")[-1]
    with open(f"./video/{file_name}", 'wb') as f:
        f.write(shipin.content)
        print(f"{file_name} downloaded!")
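One caveat with saving into ./video/: open() does not create missing folders, so the loop above raises FileNotFoundError unless the video folder already exists. A small sketch of a helper that creates the folder first (`save_bytes` is a hypothetical name, and the demo writes into a temporary directory rather than the real ./video/):

```python
import tempfile
from pathlib import Path


def save_bytes(directory: str, file_name: str, data: bytes) -> Path:
    """Write `data` to directory/file_name, creating the directory if needed."""
    target_dir = Path(directory)
    # exist_ok=True makes this a no-op when the folder is already there.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / file_name
    target.write_bytes(data)
    return target


# Demo with placeholder bytes in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(f"{tmp}/video", "demo.mp4", b"not a real video")
    print(saved.name, saved.stat().st_size)
```

In the loop above, the with open(...) block could then be replaced by save_bytes("./video", file_name, shipin.content).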

大马强 posted on 2021-8-9 14:50:08

https://static01.imgkr.com/temp/c7babf5625c34962bee16cb865a01147.jpg