FishC Forum (鱼C论坛)

Views: 2782 | Replies: 12

[Solved] Beginner practicing web scraping, but the code raises an error; could someone please take a look and offer some pointers?

Posted 2023-1-24 23:28:35
Bounty: 50 Fish coins
import requests
import json
name=input("Enter a keyword: ")
a=input("Enter the search start date, e.g. 2022-10-10: ")
startTime=a+' 00:00:00'
b=input("Enter the search end date, e.g. 2022-12-10: ")
endTime=b+' 23:59:59'
print("Searching, please wait...")
url='https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.61'}
data={
        "token": "",
        "pn": "0",
        "rn": "999",
        "sdt": "",
        "edt": "",
        "wd": "",
        "inc_wd": "",
        "exc_wd": "",
        "fields": "",
        "cnum": "001",
        "sort": "{\"istop\":\"0\",\"ordernum\":\"0\",\"webdate\":\"0\",\"rowid\":\"0\"}",
        "ssort": "",
        "cl": "10000",
        "terminal": "",
        "condition": [
            {
                "fieldName": "categorynum",
                "equal": "004",
                "notEqual": "null",
                "equalList": "null",
                "notEqualList": [
                    "014001018",
                    "004002005",
                    "014001015",
                    "014005014",
                    "014008011"
                ],
                "isLike": "true",
                "likeType": "2"
            },
            {
                "fieldName": "titlenew",
                "equal": name,
                "notEqual": "null",
                "equalList": "null",
                "notEqualList": "null",
                "isLike": "true",
                "likeType": "0"
            }
        ],
        "time": [
            {
                "fieldName": "webdate",
                "startTime": startTime,
                "endTime": endTime
            }
        ],
        "highlights": "",
        "statistics": "null",
        "unionCondition": [],
        "accuracy": "",
        "noParticiple": "1",
        "searchRange": "null",
        "noWd": "true"
}
response=requests.post(url=url,data=data,headers=headers)
id=response.json()
filename=startTime+'至'+endTime+'.json'
fp=open(filename,'w',encoding='utf-8')
json.dump(id,fp=fp,ensure_ascii=False)
print("over!!!")

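Best answer
2023-1-24 23:28:36
uupppo posted on 2023-1-25 19:29
I have already revised it once following your suggestions and done all I can; could you all please take another look?

Sorry, I forgot to mention one thing: do not add that colon at the end of data.
Also, if your name is in Chinese, sending it directly raises an error; you need to add one more entry to headers and encode data as UTF-8.
And your filename contains special characters such as colons, which also raises an error (Windows does not allow colons in file names). You have to replace the colons with some other character; I used a hyphen, but you can use something else.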
import requests
import json
name = input("Enter a keyword: ")
a = input("Enter the search start date, e.g. 2022-10-10: ")
startTime = a + ' 00:00:00'
b = input("Enter the search end date, e.g. 2022-12-10: ")
endTime = b + ' 23:59:59'
print("Searching, please wait...")
url = 'https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.61',
    'Content-Type': 'text/text; charset=utf-8'
}
# Raw string (r'...') so the \" escapes inside "sort" are sent to the server unchanged
data = r'{"token":"","pn":0,"rn":20,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"001","sort":"{\"istop\":\"0\",\"ordernum\":\"0\",\"webdate\":\"0\",\"rowid\":\"0\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"004","notEqual":null,"equalList":null,"notEqualList":["014001018","004002005","014001015","014005014","014008011"],"isLike":true,"likeType":2},{"fieldName":"titlenew","equal":"'+name+'","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":0}],"time":[{"fieldName":"webdate","startTime":"'+startTime+'","endTime":"'+endTime+'"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}'
# Encode the body as UTF-8 so that a Chinese keyword can be sent
response = requests.post(url=url, data=data.encode('utf-8'), headers=headers)
data_id = response.json()
# 至 means "to"; colons are replaced because they are not allowed in Windows file names
filename = f"{startTime}至{endTime}.json".replace(":", "-")
with open(filename, 'w', encoding='utf-8') as fp:
    json.dump(data_id, fp=fp, ensure_ascii=False)
print("over!!!")
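The string concatenation in the answer works for ordinary keywords, but a keyword containing a double quote or a backslash would break the JSON text. Below is a minimal sketch of an alternative that builds the same body with json.dumps and then encodes it as UTF-8. It assumes the endpoint accepts any equivalent JSON text, which was not verified against the site; build_payload and safe_filename are hypothetical helper names, not something from the thread.

```python
import json


def build_payload(name, start_time, end_time):
    """Return the request body as UTF-8 bytes (hypothetical helper)."""
    body = {
        "token": "", "pn": 0, "rn": 20, "sdt": "", "edt": "", "wd": "",
        "inc_wd": "", "exc_wd": "", "fields": "", "cnum": "001",
        # "sort" is itself JSON text stored inside a string value
        "sort": json.dumps(
            {"istop": "0", "ordernum": "0", "webdate": "0", "rowid": "0"},
            separators=(",", ":"),
        ),
        "ssort": "", "cl": 10000, "terminal": "",
        "condition": [
            {"fieldName": "categorynum", "equal": "004", "notEqual": None,
             "equalList": None,
             "notEqualList": ["014001018", "004002005", "014001015",
                              "014005014", "014008011"],
             "isLike": True, "likeType": 2},
            {"fieldName": "titlenew", "equal": name, "notEqual": None,
             "equalList": None, "notEqualList": None,
             "isLike": True, "likeType": 0},
        ],
        "time": [{"fieldName": "webdate", "startTime": start_time,
                  "endTime": end_time}],
        "highlights": "", "statistics": None, "unionCondition": [],
        "accuracy": "", "noParticiple": "1", "searchRange": None,
        "noWd": True,
    }
    # ensure_ascii=False keeps Chinese characters readable; json.dumps escapes
    # quotes and backslashes in the keyword automatically
    text = json.dumps(body, ensure_ascii=False, separators=(",", ":"))
    return text.encode("utf-8")


def safe_filename(start_time, end_time):
    """Build the output file name with colons replaced by hyphens."""
    return f"{start_time}至{end_time}.json".replace(":", "-")
```

With these helpers the request line would become requests.post(url=url, data=build_payload(name, startTime, endTime), headers=headers), with the rest of the script unchanged.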
Want to know what 小甲鱼 has been up to lately? Visit -> ilovefishc.com

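Posted 2023-1-24 23:53:54
https://www.cqggzy.com/interface ... /getFullTextDataNew
This URL can't even be opened, so how would you scrape it?

Debug it first; once you print what comes back, the problem is obvious at a glance.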
response=requests.post(url=url,data=data,headers=headers)
print(response.content)
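The thread does not show the traceback, so the following is only a sketch of the debugging step suggested above: summarize the status code, content type, and the start of the body before calling response.json(). summarize_response is a hypothetical helper, and the SimpleNamespace object is a stand-in used so the example runs without network access.

```python
from types import SimpleNamespace


def summarize_response(response, limit=200):
    """Describe a response before trying to parse it as JSON."""
    content_type = response.headers.get("Content-Type", "")
    preview = response.text[:limit]
    return (f"status={response.status_code} "
            f"content-type={content_type!r} body={preview!r}")


# Stand-in object for demonstration only; with requests you would pass
# the object returned by requests.post(...)
fake = SimpleNamespace(
    status_code=500,
    headers={"Content-Type": "text/html"},
    text="<html>error</html>",
)
print(summarize_response(fake))
```

If the body turns out to be an HTML error page or an empty string, response.json() cannot succeed, and the request itself (URL, headers, body) is what needs fixing.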

OP | Posted 2023-1-25 00:38:53
ba21 posted on 2023-1-24 23:53
https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew
This URL can't even be opened, so how would you scrape ...

This really is the URL of the POST request. I'm a beginner asking for guidance.
[Screenshot attached]

OP | Posted 2023-1-25 00:41:47
ba21 posted on 2023-1-24 23:53
https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew
This URL can't even be opened, so how would you scrape ...

And this request really does return the response data I need.
[Screenshot attached]

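Posted 2023-1-25 11:22:16
This one is a real trap: the data it sends is one raw string, not a JSON object (a dict on the Python side). I didn't notice at first glance and only saw it after digging through the JS.
Code below: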
import requests
import json
name = input("Enter a keyword: ")
a = input("Enter the search start date, e.g. 2022-10-10: ")
startTime = a + ' 00:00:00'
b = input("Enter the search end date, e.g. 2022-12-10: ")
endTime = b + ' 23:59:59'
print("Searching, please wait...")
url = 'https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.61',
    'Cookie': "cookie_www=19398923;__jsluid_s=c909e6e6b4e5bd96fcf32a454560ae15;Hm_lvt_3b83938a8721dadef0b185225769572a=1674614068;Hm_lpvt_3b83938a8721dadef0b185225769572a=1674614110",
    'Host': 'www.cqggzy.com',
    'Origin': 'https://www.cqggzy.com',
    'Referer': 'https://www.cqggzy.com/xxhz/transaction_detail.html'}
data = '{"token":"","pn":0,"rn":20,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"001","sort":"{\"istop\":\"0\",\"ordernum\":\"0\",\"webdate\":\"0\",\"rowid\":\"0\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"014","notEqual":null,"equalList":null,"notEqualList":["014001018","004002005","014001015","014005014","014008011"],"isLike":true,"likeType":2}],"time":[{"fieldName":"webdate","startTime":"' + \
    startTime + '","endTime":"' + endTime + '"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}'
response = requests.post(url=url, data=data, headers=headers)
data_id = response.json()
filename = f"{startTime}至{endTime}.json".replace(":", "-")
fp = open(filename, 'w', encoding='utf-8')
json.dump(data_id, fp=fp, ensure_ascii=False)
print("over!!!")
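To illustrate the point that the body is a string rather than a dict: when a dict is passed as data=, requests form-encodes it into key=value pairs, whereas this endpoint (according to the post above) expects JSON text. The sketch below uses only the standard library to show how the two encodings differ for a small, flat, made-up payload; nested values such as "condition" have no faithful form-encoded equivalent at all.

```python
import json
from urllib.parse import urlencode

# A tiny flat payload used only for illustration
payload = {"pn": "0", "cnum": "001"}

# Form encoding: what a dict passed as data= turns into
as_form = urlencode(payload)
# JSON text: what the site's own JavaScript sends, per the post above
as_json = json.dumps(payload, separators=(",", ":"))

print(as_form)
print(as_json)
```

Sending a pre-built string (or bytes) as data= makes requests transmit it as-is, which is why the later posts build the body as a string.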

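Posted 2023-1-25 16:44:37
鱼cpython学习者 posted on 2023-1-25 11:22
This one is a real trap: the data it sends is one raw string, not a JSON object. I didn't notice at first glance and only saw it after digging through the JS.
Code below

So it's solved then? Congratulations.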

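OP | Posted 2023-1-25 18:22:59
鱼cpython学习者 posted on 2023-1-25 11:22
This one is a real trap: the data it sends is one raw string, not a JSON object. I didn't notice at first glance and only saw it after digging through the JS.
Code below

I copied your code, but it still raises an error. I wondered whether the string needs a colon at the end, but after adding it I still get an error.
[Screenshot attached]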

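Posted 2023-1-25 19:04:31
uupppo posted on 2023-1-25 18:22
I copied your code, but it still raises an error. I wondered whether the string needs a colon at the end, but after adding it I still get an error.

I just tried it: changing data = '...' to data = r'...' fixes it.
Strange, I don't know why it ran earlier and no longer does.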
import requests
import json
name = input("Enter a keyword: ")
a = input("Enter the search start date, e.g. 2022-10-10: ")
startTime = a + ' 00:00:00'
b = input("Enter the search end date, e.g. 2022-12-10: ")
endTime = b + ' 23:59:59'
print("Searching, please wait...")
url = 'https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.61',
    'Cookie': "cookie_www=19398923; __jsluid_s=c909e6e6b4e5bd96fcf32a454560ae15; Hm_lvt_3b83938a8721dadef0b185225769572a=1674614068; Hm_lpvt_3b83938a8721dadef0b185225769572a=1674642615",
    'Host': 'www.cqggzy.com',
    'Origin': 'https://www.cqggzy.com',
    'Referer': 'https://www.cqggzy.com/xxhz/transaction_detail.html'}
# Raw string (r'...') so the \" escapes inside "sort" are sent to the server unchanged
data = r'{"token":"","pn":0,"rn":20,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"001","sort":"{\"istop\":\"0\",\"ordernum\":\"0\",\"webdate\":\"0\",\"rowid\":\"0\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"014","notEqual":null,"equalList":null,"notEqualList":["014001018","004002005","014001015","014005014","014008011"],"isLike":true,"likeType":2}],"time":[{"fieldName":"webdate","startTime":"' + startTime + '","endTime":"' + endTime + '"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}'
response = requests.post(url=url, data=data, headers=headers)
data_id = response.json()
filename = f"{startTime}至{endTime}.json".replace(":", "-")
with open(filename, 'w', encoding='utf-8') as fp:
    json.dump(data_id, fp=fp, ensure_ascii=False)
print("over!!!")
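A short sketch of why the r prefix matters here. In a normal Python string literal, \" is an escape sequence that collapses to a plain double quote, so the backslashes never reach the server and the nested "sort" value is no longer valid JSON. In a raw string the backslashes are kept. The shortened payload below is made up for illustration, and is_valid_json is a hypothetical helper.

```python
import json

plain = '{"sort":"{\"istop\":\"0\"}"}'   # \" collapses to " in a normal literal
raw = r'{"sort":"{\"istop\":\"0\"}"}'    # backslashes are kept in a raw literal


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(plain)
print(raw)
print(is_valid_json(plain), is_valid_json(raw))
```

Only the raw version parses, and its "sort" value decodes to the inner JSON text that the site expects as a string.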

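OP | Posted 2023-1-25 19:27:39
鱼cpython学习者 posted on 2023-1-25 19:04
I just tried it: changing data = '...' to data = r'...' fixes it.
Strange, I don't know why it ran earlier and no ...

Sorry, I copied and ran your code and it still raises an error. Could it be a problem with my runtime environment?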

OP | Posted 2023-1-25 19:29:05
I have already revised it once following your suggestions and done all I can; could you all please take another look?
import requests
import json
name = input("Enter a keyword: ")
a = input("Enter the search start date, e.g. 2022-10-10: ")
startTime = a+' 00:00:00'
b = input("Enter the search end date, e.g. 2022-12-10: ")
endTime = b+' 23:59:59'
print("Searching, please wait...")
url = 'https://www.cqggzy.com/interface/rest/esinteligentsearch/getFullTextDataNew'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.61'}
data = '{"token":"","pn":0,"rn":20,"sdt":"","edt":"","wd":"","inc_wd":"","exc_wd":"","fields":"","cnum":"001","sort":"{\"istop\":\"0\",\"ordernum\":\"0\",\"webdate\":\"0\",\"rowid\":\"0\"}","ssort":"","cl":10000,"terminal":"","condition":[{"fieldName":"categorynum","equal":"004","notEqual":null,"equalList":null,"notEqualList":["014001018","004002005","014001015","014005014","014008011"],"isLike":true,"likeType":2},{"fieldName":"titlenew","equal":"'+name+'","notEqual":null,"equalList":null,"notEqualList":null,"isLike":true,"likeType":0}],"time":[{"fieldName":"webdate","startTime":"'+startTime+'","endTime":"'+endTime+'"}],"highlights":"","statistics":null,"unionCondition":[],"accuracy":"","noParticiple":"1","searchRange":null,"noWd":true}: '
print(data)
response = requests.post(url=url,data=data,headers=headers)
data_id = response.json()
filename = a+'至'+b+'.json'
fp = open(filename,'w',encoding = 'utf-8')
json.dump(data_id,fp = fp,ensure_ascii = False)
print("over!!!")
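One way to catch problems such as the trailing ': ' before sending anything is to check that the body parses as JSON locally. This is a minimal sketch with a made-up, shortened payload; json_error is a hypothetical helper, and whether the server rejects the request for exactly this reason is an assumption.

```python
import json


def json_error(text):
    """Return None if text is valid JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"{exc.msg} at position {exc.pos}"
    return None


good = '{"noWd":true}'
bad = good + ': '   # trailing colon appended after the closing brace

print(json_error(good))
print(json_error(bad))
```

Running print(json_error(data)) right after building data in the script above would flag both the trailing colon and the collapsed \" escapes in "sort".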

Posted 2023-1-26 07:03:08
The forum is pretty lively.

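OP | Posted 2023-1-29 13:41:47
鱼cpython学习者 posted on 2023-1-24 23:28
Sorry, I forgot to mention one thing: do not add that colon at the end of data.
Also, if your name is in Chinese, sending it directly raises an error; you need to ...

Thank you. It's solved now.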