Python: a scraped .rar archive reports as corrupted when extracting
Hi all. If I click the link and download directly, the file is complete and opens fine, but the file my crawler saves is corrupted and won't open. Any pointers would be appreciated.

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70",
           'Host': 'sc.chinaz.com'}
xiazai = requests.get(url='https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar', headers=headers).content
with open("name_list.rar", "wb") as fp:
    fp.write(xiazai)

Change the Host header.
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70",
           'Host': 'downsc.chinaz.net'}  # the Host was wrong
xiazai = requests.get(url='https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar', headers=headers).content
with open("name_list.rar", "wb") as fp:
    fp.write(xiazai)

isdkz posted on 2023-2-4 23:25
Change the Host header.
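The correct Host here is simply the domain part of the URL being requested; a quick way to extract it, using only the standard library:

```python
from urllib.parse import urlparse

url = 'https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar'
host = urlparse(url).netloc  # netloc is the host portion of the URL
print(host)  # downsc.chinaz.net
```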
Thanks, that solved it; I've set it as the best answer. A few more questions, if I may:
1. How did you dig out the Host of the download link? I could only work out the Host of the download page.
2. How can you tell when a page needs request headers beyond User-Agent, such as Host and Referer? I've only just realised that this case doesn't actually need the Host header at all; leaving it out avoids the error entirely.
3. If the data I need is stored as JSON behind an Ajax request, how do I extract the data from the JSON I get after sending a POST request? Do I still use XPath? A plain GET request doesn't work here, I can't get the page elements that way.
Thanks again!

uupppo posted on 2023-2-5 11:30
Thanks, that solved it; I've set it as the best answer. A few more questions, if I ...
1. Host is the host name, and the host name comes from the URL you request (for https://downsc.chinaz.net/Files/DownLoad/moban/202209/zppt10948.rar, the host name is the domain part, downsc.chinaz.net).
Servers usually check the host name, and if it is wrong they won't give you the correct response.
2. Which request headers a server checks as an anti-scraping measure is found by trial and error: add User-Agent first and see, then add Referer if needed. Host is optional, because requests automatically sends the correct Host for you; but if you do add it yourself, it must not be wrong.
>>> import requests
>>> resp = requests.get('http://httpbin.org/get')
>>> resp.json()
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.28.1', 'X-Amzn-Trace-Id': 'Root=1-63df2755-269ff79b6fae5b353ac7125c'}, 'origin': '120.235.189.192', 'url': 'http://httpbin.org/get'}
>>>
As you can see, even if your request headers don't include Host, the correct Host is added automatically.
3. Once you have the JSON data, use the json() method that the requests response object provides (see my usage above). If it is an Ajax request, you have to analyse that Ajax request and reproduce both its parameters and its HTTP method exactly.
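To make point 3 concrete: once the response is parsed with json(), you are working with ordinary dicts and lists, not HTML, so there is no XPath involved. A minimal offline sketch (the JSON structure below is invented for illustration):

```python
import json

# A stand-in for the kind of JSON an Ajax endpoint might return;
# requests' resp.json() gives you the same kind of dict directly.
raw = '{"result": {"totalcount": 2, "records": [{"title": "A"}, {"title": "B"}]}}'
data = json.loads(raw)

# Extracting fields is plain dict/list indexing:
titles = [record['title'] for record in data['result']['records']]
print(titles)  # ['A', 'B']
```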
isdkz posted on 2023-2-5 11:57
1. Host is the host name, and the host name comes from the URL you request (for https://downsc.chinaz.net/Files/DownLoad/moban/20 ...
I more or less understand the first two now, but the third is still unclear to me.
Here is a concrete case:
import requests
import json
from lxml import html
name = input("Enter a keyword: ")
a = input("Enter the search start date, e.g. 2022-10-10: ")
startTime = a+' 00:00:00'
b = input("Enter the search end date, e.g. 2022-12-10: ")
endTime = b+' 23:59:59'
print("Searching, please wait...")
url = 'https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70',
'Cookie':'cookie_www=36802747; __jsluid_s=712d8591293852446a2d196d57a069a2; Hm_lvt_3b83938a8721dadef0b185225769572a=1674978329,1675256803; Hm_lpvt_3b83938a8721dadef0b185225769572a=1675256803',
'Host': 'www.cqggzy.com',
'Referer': 'https://www.cqggzy.com/jyxx/transaction_detail.html'}
data = '{"token":"","pn":0,"rn":9999,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"001","sort":"{\\"istop\\":\\"0\\",\\"ordernum\\":\\"0\\",\\"webdate\\":\\"0\\",\\"rowid\\":\\"0\\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"004","notEqual":null,"equalList":null,"notEqualList":["014001018","004002005","014001015","014005014","014008011"],"isLike":true,"likeType":2},{"fieldName":"titlenew","equal":"'+name+'","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":0}],"time":[{"fieldName":"webdate","startTime":"'+startTime+'","endTime":"'+endTime+'"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}'
response = requests.post(url=url,data=data.encode('utf-8'),headers=headers)
print(response)
#etree=html.etree
#etree.HTML(response)
#print(etree)
data_id = response.json()
The original page is https://www.cqggzy.com/jyxx/transaction_detail.html; the JSON comes from an Ajax request, and the content I need is indeed inside data_id, but I don't know how to process the JSON. Everything I've scraped so far used GET requests and XPath on the page elements; this is the first time I've met a POST request, and I can't just import etree.HTML and parse the response the way I do with GET pages.
I'm still quite new to this, so I hope I've explained it clearly.

uupppo posted on 2023-2-5 19:43
I more or less understand the first two now, but the third is still unclear to me.
Here is a concrete case ...
Just treat data_id as a dict and look up its keys:
import requests
import json
name = input("Enter a keyword: ")
a = input("Enter the search start date, e.g. 2022-10-10: ")
startTime = a+' 00:00:00'
b = input("Enter the search end date, e.g. 2022-12-10: ")
endTime = b+' 23:59:59'
print("Searching, please wait...")
url = 'https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew'
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'https://www.cqggzy.com/jyxx/transaction_detail.html'}
data = r'{"token":"","pn":0,"rn":9999,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"001","sort":"{\"istop\":\"0\",\"ordernum\":\"0\",\"webdate\":\"0\",\"rowid\":\"0\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"004","notEqual":null,"equalList":null,"notEqualList":["014001018","004002005","014001015","014005014","014008011"],"isLike":true,"likeType":2},{"fieldName":"titlenew","equal":"'+name+'","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":0}],"time":[{"fieldName":"webdate","startTime":"'+startTime+'","endTime":"'+endTime+'"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}'
response = requests.post(url=url,data=data.encode('utf-8'),headers=headers)
data_id = response.json()
print(json.dumps(data_id, ensure_ascii=False, indent=2))

Next time you have a new question, please open a separate help thread instead of asking in the comments; that way more people will see it and can think about it or learn from it. Even if a question gets solved in the comments, it ends up buried there. A bounty doesn't matter either way; just open a help thread, so that more people can benefit.
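Assuming the cqggzy response follows the common pattern of a result object wrapping a list of records, pulling out fields would look like the sketch below. The key names 'result', 'records', 'titlenew' and 'webdate' are guesses based on the request payload; print the JSON first and adjust them to whatever the real response uses.

```python
# A stand-in for what data_id might look like after response.json();
# the key names here are assumptions, not the site's confirmed schema.
data_id = {"result": {"totalcount": 1,
                      "records": [{"titlenew": "some notice", "webdate": "2022-11-01"}]}}

# .get() with a default avoids a KeyError if a key turns out not to exist
for record in data_id.get("result", {}).get("records", []):
    print(record.get("titlenew"), record.get("webdate"))
```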