FishC Forum

Views: 3772 | Replies: 6

[Solved] RAR archive downloaded with a Python crawler is corrupted when extracted

Posted 2023-2-4 23:25:01
Bounty: 50 fish coins
Hi all — when I click the link in a browser, the downloaded file is complete and opens fine, but the file my crawler downloads is corrupted and won't open. Any pointers would be appreciated.
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70",
           'Host': 'sc.chinaz.com'}
xiazai = requests.get(url='https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar', headers=headers).content
with open("name_list.rar", "wb") as fp:
    fp.write(xiazai)
Xiaojiayu's latest courses -> https://ilovefishc.com
Posted 2023-2-4 23:25:02    Best answer
Change the Host.
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70",
           'Host': 'downsc.chinaz.net'}        # the Host was wrong
xiazai = requests.get(url='https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar', headers=headers).content
with open("name_list.rar", "wb") as fp:
    fp.write(xiazai)
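As a quick sanity check before extracting, you can verify that the downloaded bytes actually look like a RAR archive rather than an HTML error page — a minimal sketch, with made-up sample byte strings:

```python
# A valid RAR archive begins with the magic bytes b"Rar!"; a download that the
# server rejected is typically an HTML error page instead.
def looks_like_rar(data: bytes) -> bool:
    return data[:4] == b"Rar!"

print(looks_like_rar(b"Rar!\x1a\x07\x00..."))                      # True
print(looks_like_rar(b"<html><body>403 Forbidden</body></html>"))  # False
```

If the check fails, print the first few hundred bytes — the error page usually says why the server refused the request.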
OP | Posted 2023-2-5 11:30:45

Thanks, that solved it — I've marked yours as the best answer. A few follow-up questions:
1. How did you work out the host of the download link? I could only find the host of the download page.
2. When requesting a page, how do you tell whether headers beyond User-Agent — such as Host and Referer — are needed? I just realized this example actually works without the Host header, and then nothing errors out.
3. If the data I need is stored as JSON behind an Ajax request, how do I extract values from the JSON returned by the POST request — do I still use XPath? A GET request doesn't work here and gets no page elements.
Thanks again!

Posted 2023-2-5 11:57:05

uupppo, 2023-2-5 11:30:
Thanks, that solved it — I've marked yours as the best answer. A few follow-up questions: 1. How did you work out the host ...

1. Host is the host name, and the host name is determined by the URL you request (in https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar, the host name is downsc.chinaz.net).

Servers usually check the host name and won't give you a correct response if it's wrong.

2. You discover a server's header-based anti-scraping checks by trial and error: first try adding User-Agent, then add Referer. Host is optional, because requests automatically sends the correct Host for you — but if you do set it yourself, it must not be wrong.
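One way to avoid a wrong Host entirely is to derive it from the request URL itself — a small standard-library sketch, where the headers dict is just illustrative:

```python
from urllib.parse import urlparse

# The Host header must match the netloc of the URL being requested,
# so compute it from the URL instead of typing it by hand.
url = "https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar"
host = urlparse(url).netloc
headers = {"User-Agent": "Mozilla/5.0", "Host": host}
print(host)  # downsc.chinaz.net
```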

>>> import requests
>>> resp = requests.get('http://httpbin.org/get')
>>> resp.json()
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.28.1', 'X-Amzn-Trace-Id': 'Root=1-63df2755-269ff79b6fae5b353ac7125c'}, 'origin': '120.235.189.192', 'url': 'http://httpbin.org/get'}
>>>


As you can see, even if your request headers don't include Host, the correct Host is added automatically.

3. Once you have JSON data, use the json() method of the requests response object — see my usage above. For an Ajax request, you have to analyze that request and replicate its parameters and request method exactly.
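To make the distinction concrete: resp.json() essentially runs json.loads on the response text, producing plain Python dicts and lists — so there is no HTML tree for XPath to walk. The JSON below is a made-up example:

```python
import json

# What resp.json() gives you: ordinary dicts and lists, indexed with []
# like any other Python data — no lxml or XPath involved.
text = '{"status": 0, "data": [{"name": "demo", "id": 1}]}'
parsed = json.loads(text)
print(parsed["data"][0]["name"])  # demo
```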

OP | Posted 2023-2-5 19:43:03

isdkz, 2023-2-5 11:57:
1. Host is the host name, and the host name is determined by the URL you request (e.g. https://downsc.chinaz.net/Files/DownLoad/moban/20 ...

I mostly understand the first two answers, but I'm still not clear on the third.
Here is my case:
import requests
import json
from lxml import html

name = input("Enter a keyword: ")
a = input("Enter a start date, e.g. 2022-10-10: ")
startTime = a + ' 00:00:00'
b = input("Enter an end date, e.g. 2022-12-10: ")
endTime = b + ' 23:59:59'
print("Searching, please wait...")
url = 'https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70',
    'Cookie':'cookie_www=36802747; __jsluid_s=712d8591293852446a2d196d57a069a2; Hm_lvt_3b83938a8721dadef0b185225769572a=1674978329,1675256803; Hm_lpvt_3b83938a8721dadef0b185225769572a=1675256803',
    'Host': 'www.cqggzy.com',
    'Referer': 'https://www.cqggzy.com/jyxx/transaction_detail.html'}
# note: \\" below, so the payload keeps the escaped quotes inside the nested "sort" string
data = '{"token":"","pn":0,"rn":9999,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"001","sort":"{\\"istop\\":\\"0\\",\\"ordernum\\":\\"0\\",\\"webdate\\":\\"0\\",\\"rowid\\":\\"0\\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"004","notEqual":null,"equalList":null,"notEqualList":["014001018","004002005","014001015","014005014","014008011"],"isLike":true,"likeType":2},{"fieldName":"titlenew","equal":"'+name+'","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":0}],"time":[{"fieldName":"webdate","startTime":"'+startTime+'","endTime":"'+endTime+'"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}'
response = requests.post(url=url, data=data.encode('utf-8'), headers=headers)
print(response)
#etree = html.etree
#etree.HTML(response)
#print(etree)
data_id = response.json()

The original page is https://www.cqggzy.com/jyxx/transaction_detail.html. The Ajax request returns JSON, and the content I need is in data_id, but I don't know how to process the JSON. So far I've only ever used XPath on page elements fetched with GET; this is my first POST request, and I can't pass the response through etree.HTML the way I would with a GET.
I've only just started, so I hope I've explained this clearly.

Posted 2023-2-5 20:04:54

uupppo, 2023-2-5 19:43:
I mostly understand the first two answers, but I'm still not clear on the third. Here is my case ...

Just treat data_id as a dictionary and read values by key:

import requests
import json

name = input("Enter a keyword: ")
a = input("Enter a start date, e.g. 2022-10-10: ")
startTime = a + ' 00:00:00'
b = input("Enter an end date, e.g. 2022-12-10: ")
endTime = b + ' 23:59:59'
print("Searching, please wait...")
url = 'https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew'
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'https://www.cqggzy.com/jyxx/transaction_detail.html'}
# raw strings (r'...'), so the \" around the nested "sort" object survive into the JSON payload
data = r'{"token":"","pn":0,"rn":9999,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"001","sort":"{\"istop\":\"0\",\"ordernum\":\"0\",\"webdate\":\"0\",\"rowid\":\"0\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"004","notEqual":null,"equalList":null,"notEqualList":["014001018","004002005","014001015","014005014","014008011"],"isLike":true,"likeType":2},{"fieldName":"titlenew","equal":"' + name + r'","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":0}],"time":[{"fieldName":"webdate","startTime":"' + startTime + r'","endTime":"' + endTime + r'"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}'
response = requests.post(url=url, data=data.encode('utf-8'), headers=headers)
data_id = response.json()
print(json.dumps(data_id, ensure_ascii=False, indent=2))
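For example, walking the parsed result as nested dicts and lists might look like this — the key names "result", "records", "title", and "webdate" are placeholders for illustration; inspect the real response (e.g. with the json.dumps call above) to learn the actual keys:

```python
# Placeholder structure standing in for data_id; the real keys depend on
# what the site actually returns.
data_id = {
    "result": {
        "totalcount": 2,
        "records": [
            {"title": "Notice one", "webdate": "2022-10-11"},
            {"title": "Notice two", "webdate": "2022-11-02"},
        ],
    }
}
# .get() with defaults avoids a KeyError if a level is missing.
for record in data_id.get("result", {}).get("records", []):
    print(record.get("title"), record.get("webdate"))
```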

Posted 2023-2-5 20:11:06

Next time you have a new question, please open a new help thread — that way more people will see it and can think it through or learn from it.

Even if your question gets answered down here in the replies, it will be buried in the comments. A bounty doesn't matter either way; just open a help thread so more people can benefit.

Rating: uupppo gave +5 honor, +5 fish coins, +3 contribution ("Thanks")

