[Solved] RAR archive downloaded with a Python crawler reports as corrupted when extracting

Posted 2023-2-4 23:25:01 | 50 fish-coin bounty

When I click the link directly in a browser, the downloaded file is complete and opens fine, but the same file downloaded by my crawler is corrupted and won't open. I'd appreciate any pointers.
import requests
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70",
         'Host':'sc.chinaz.com'}
xiazai = requests.get(url='https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar',headers=headers).content
with open("name_list.rar", "wb") as fp:
    fp.write(xiazai)
Posted 2023-2-4 23:25:02 | Best answer
Change the Host:
import requests
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70",
         'Host':'downsc.chinaz.net'}        # the Host was wrong: this file is served from downsc.chinaz.net, not sc.chinaz.com
xiazai = requests.get(url='https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar',headers=headers).content
with open("name_list.rar", "wb") as fp:
    fp.write(xiazai)
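
To catch this class of problem before saving, it helps to inspect the response first. A minimal sketch (the checks are generic, not specific to this site):

import requests

resp = requests.get('https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar',
                    headers={'User-Agent': 'Mozilla/5.0'})
resp.raise_for_status()                    # fail loudly on 4xx/5xx instead of silently saving an error page
print(resp.headers.get('Content-Type'))   # an HTML content type here usually means you got an error/block page
print(resp.content[:4])                   # a real RAR file starts with the magic bytes b'Rar!'
with open('name_list.rar', 'wb') as fp:
    fp.write(resp.content)

If the first four bytes are not b'Rar!', the server sent something other than the archive, which is exactly what a wrong Host header tends to cause.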
OP | Posted 2023-2-5 11:30:45

Thanks, that solved it, and I've marked your reply as the best answer. A few more questions, if I may:
1. How did you dig out the Host of the download link? I could only work out the Host of the download page.
2. How do you decide which request headers a page needs besides User-Agent, such as Host and Referer? I only just noticed that this case doesn't actually need the Host header at all, and without it nothing can go wrong.
3. If the data I need is JSON returned by an Ajax request, how do I extract fields from the JSON I get back from a POST? Do I still use XPath? A GET request is useless here and returns no page elements.
Thanks again!
Posted 2023-2-5 11:57:05

uupppo posted on 2023-2-5 11:30:
Thanks, that solved it, and I've marked your reply as the best answer. A few more questions, if I may:
1. How did you dig out the Host of the download link? I could ...

1. Host is the hostname, and the hostname comes from the URL you request: in https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar, for example, the hostname is the downsc.chinaz.net part.

Servers usually check the hostname, and if it is wrong they won't give you the correct response.
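
If you'd rather extract it programmatically than by eye, the standard library can do it; a small sketch using urllib.parse, nothing site-specific:

from urllib.parse import urlparse

url = 'https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar'
print(urlparse(url).netloc)   # -> 'downsc.chinaz.net', i.e. the correct Host value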

2. You work out a server's header-based anti-scraping checks by trial and error: usually add User-Agent first and see what happens, then add Referer. Host is optional, because requests automatically sends the correct Host for you; but if you do set it yourself, it must not be wrong:

>>> import requests
>>> resp = requests.get('http://httpbin.org/get')
>>> resp.json()
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.28.1', 'X-Amzn-Trace-Id': 'Root=1-63df2755-269ff79b6fae5b353ac7125c'}, 'origin': '120.235.189.192', 'url': 'http://httpbin.org/get'}
>>>


As you can see, even if your request headers don't include Host, the correct Host is added automatically.

3. Once you have JSON data, you can use the json() method of the requests response object, as in the snippet above. For an Ajax request, you have to analyze that request and replicate both its parameters and its request method exactly.
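
The parsed JSON is then just nested Python dicts and lists, so you index into it rather than using XPath. A generic sketch (the endpoint and keys below are made up for illustration; use whatever your own response actually contains):

import requests

resp = requests.post('https://example.com/api/search',    # hypothetical Ajax endpoint
                     json={'keyword': 'test', 'page': 0})
data = resp.json()                          # a dict/list structure, not an HTML tree
for record in data['result']['records']:    # hypothetical keys
    print(record['title'], record['date'])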
OP | Posted 2023-2-5 19:43:03

isdkz posted on 2023-2-5 11:57:
1. Host is the hostname, and the hostname comes from the URL you request: in https://downsc.chinaz.net/Files/DownLoad/moban/20 ...

I mostly understand the first two questions now, but I'm still not clear on the third. Here is a concrete case:
import requests
import json
from lxml import html
name = input("Enter a keyword: ")
a = input("Enter the search start date, e.g. 2022-10-10: ")
startTime = a+' 00:00:00'
b = input("Enter the search end date, e.g. 2022-12-10: ")
endTime = b+' 23:59:59'
print("Searching, please wait...")
url = 'https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70',
    'Cookie':'cookie_www=36802747; __jsluid_s=712d8591293852446a2d196d57a069a2; Hm_lvt_3b83938a8721dadef0b185225769572a=1674978329,1675256803; Hm_lpvt_3b83938a8721dadef0b185225769572a=1675256803',
    'Host': 'www.cqggzy.com',
    'Referer': 'https://www.cqggzy.com/jyxx/transaction_detail.html'}
# note the r prefix: the \" escapes inside "sort" must reach the server literally for the payload to be valid JSON
data = r'{"token":"","pn":0,"rn":9999,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"001","sort":"{\"istop\":\"0\",\"ordernum\":\"0\",\"webdate\":\"0\",\"rowid\":\"0\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"004","notEqual":null,"equalList":null,"notEqualList":["014001018","004002005","014001015","014005014","014008011"],"isLike":true,"likeType":2},{"fieldName":"titlenew","equal":"'+name+'","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":0}],"time":[{"fieldName":"webdate","startTime":"'+startTime+'","endTime":"'+endTime+'"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}'
response = requests.post(url=url,data=data.encode('utf-8'),headers=headers)
print(response)
#etree=html.etree
#etree.HTML(response)
#print(etree)
data_id = response.json()
The original page is https://www.cqggzy.com/jyxx/transaction_detail.html; the data comes back as JSON from an Ajax request, and the content I need is indeed inside data_id, but I don't know how to process the JSON. Until now I've always used XPath on page elements fetched with GET requests; this is the first time I've run into a POST request, and I can't just feed the response into etree.HTML the way I do with GET.
I'm fairly new to this, so I hope I've explained it clearly.
Posted 2023-2-5 20:04:54

uupppo posted on 2023-2-5 19:43:
I mostly understand the first two questions now, but I'm still not clear on the third.
Here is a concrete case

Just treat data_id as a dict and read the keys you need:
import requests
import json
name = input("Enter a keyword: ")
a = input("Enter the search start date, e.g. 2022-10-10: ")
startTime = a+' 00:00:00'
b = input("Enter the search end date, e.g. 2022-12-10: ")
endTime = b+' 23:59:59'
print("Searching, please wait...")
url = 'https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew'
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'https://www.cqggzy.com/jyxx/transaction_detail.html'}
# raw string: the \" escapes inside "sort" stay literal, so the payload is valid JSON
data = r'{"token":"","pn":0,"rn":9999,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"001","sort":"{\"istop\":\"0\",\"ordernum\":\"0\",\"webdate\":\"0\",\"rowid\":\"0\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"004","notEqual":null,"equalList":null,"notEqualList":["014001018","004002005","014001015","014005014","014008011"],"isLike":true,"likeType":2},{"fieldName":"titlenew","equal":"'+name+'","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":0}],"time":[{"fieldName":"webdate","startTime":"'+startTime+'","endTime":"'+endTime+'"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}'
response = requests.post(url=url,data=data.encode('utf-8'),headers=headers)
data_id = response.json()
print(json.dumps(data_id, ensure_ascii=False, indent=2))
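
From there it's ordinary dict and list access rather than XPath. The exact key names depend on what this endpoint returns, which isn't shown here, so the ones below are purely illustrative; check them against the printed JSON first:

records = data_id.get('result', {}).get('records', [])   # hypothetical keys
for r in records:
    print(r.get('title'), r.get('webdate'))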
Posted 2023-2-5 20:11:06

Next time you have a new question, it's better to open a separate help thread; that way more people will see it and can think it through or learn from it.

Even if a question asked in the comments gets solved, it ends up buried there. Whether or not you offer a bounty doesn't matter; just open a help thread so more people can benefit.

