Scraping Baidu Images

大马强 · posted on 2022-1-17 23:38:50


Last edited by 大马强 on 2022-1-17 23:38

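I'm home alone again for the New Year this year, so to get through the long nights I went to Baidu to find some nice pictures of pretty girls to keep me company. Looking at them one by one wastes time, so I might as well scrape them all in one go.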

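A scraper always comes down to the same three steps:

Observe the target
Send the request and get the data
Process the data and save it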

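1. Observe the target
New images keep appearing as you scroll down, so they are loaded by background requests that we need to capture. Open the developer tools, refresh the page, scroll down, and watch what shows up in the Network tab.
Look at the content of a captured response: the images are in its data field.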


Now look at the other details of the request.


Look at the URL that requests the images:

https://image.baidu.com/search/acjson?tn=resultjson_com&logid=11008189197504139865&ipn=rj&ct=201326592&is=&fp=result&fr=ala&word=IU%E7%85%A7%E7%89%87&cg=star&queryWord=IU%E7%85%A7%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=&expermode=&nojc=&isAsync=&pn=30&rn=30&gsm=1e&1642430456120=

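Three parameters (highlighted in red in the original post) are the ones we care about: the keyword, pn and gsm. From the URL we can already tell:

word and queryWord are our keyword (queryWord is sometimes missing, as in the code below)
pn can be read as page number * 30, because one page holds 30 images
gsm: the value itself is still unknown, but the 1642430456120 that follows it is a timestamp (in milliseconds)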
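
This reading of the URL can be checked with the standard library. The sketch below uses a shortened copy of the captured URL that keeps only the parameters discussed above; the variable names are my own, not part of the original post.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the captured URL: only the parameters discussed above are kept.
captured = (
    "https://image.baidu.com/search/acjson?tn=resultjson_com"
    "&word=IU%E7%85%A7%E7%89%87&queryWord=IU%E7%85%A7%E7%89%87"
    "&pn=30&rn=30&gsm=1e&1642430456120="
)

# keep_blank_values=True keeps the trailing "1642430456120=" key, which has no value.
params = parse_qs(urlsplit(captured).query, keep_blank_values=True)

keyword = params["word"][0]    # parse_qs percent-decodes the value
offset = int(params["pn"][0])  # page number * 30
gsm = params["gsm"][0]

# The only all-digit key is the timestamp, given in milliseconds.
stamp_ms = next(int(key) for key in params if key.isdigit())
captured_at = datetime.fromtimestamp(stamp_ms / 1000, tz=timezone.utc)

print(keyword, offset, gsm, captured_at.isoformat())
```

The decoded keyword and the date of the timestamp both match the capture, which supports reading the trailing number as a millisecond timestamp rather than as part of gsm.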

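So the last step is to work out what the gsm parameter means.
Looking further, the gsm value shown in the preview of one response is exactly the gsm sent with the next request.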


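So the last parameter, gsm, is the gsm value from the previous response, followed by the timestamp (I later found that the first request always uses 1e plus the timestamp).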
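
The parameters that change from request to request could be assembled roughly as follows. This is only a sketch: build_query is a hypothetical helper, "3c" is a made-up stand-in for a gsm value read from a response, and the fixed parameters of the captured URL (tn, ipn, ct and so on) are left out. Whether the server accepts such a reduced set is an untested assumption, so in practice they would be merged back in.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://image.baidu.com/search/acjson"

def build_query(keyword, page_number, gsm="1e", page_size=30):
    """Hypothetical helper: only the parameters that change between requests."""
    stamp_ms = int(time.time() * 1000)  # 13-digit millisecond timestamp
    return {
        "word": keyword,
        "pn": page_number * page_size,  # offset = page number * 30
        "rn": page_size,                # results per page
        "gsm": gsm,                     # "1e" first, then the gsm of the previous response
        str(stamp_ms): "",              # the timestamp is sent as a key with an empty value
    }

first = build_query("IU照片", 1)
# "3c" is a made-up stand-in for the gsm field read from the first response.
second = build_query("IU照片", 2, gsm="3c")

print(BASE_URL + "?" + urlencode(first))
print(BASE_URL + "?" + urlencode(second))
```

Passing a dict through urlencode also percent-encodes the keyword, so the Chinese characters need no separate handling here.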

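2. Send the request
The request headers can simply be copied from the browser. The only one that needs attention is Referer: it also contains a word parameter whose value is the keyword, and if the keyword contains Chinese characters it has to be percent-encoded as UTF-8 (urllib.parse.quote does this; a bare encode("utf-8") only yields a bytes object).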
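
One caveat on the encoding step, shown as a small comparison (the keyword is the one from the captured URL): formatting the result of str.encode into an f-string inserts the bytes literal, while urllib.parse.quote produces the percent-encoded form that appears in the browser's request.

```python
from urllib.parse import quote

keyword = "IU照片"

# str.encode returns a bytes object; inside an f-string it renders as b'...'
as_bytes = f"word={keyword.encode('utf-8')}"

# quote() percent-encodes the UTF-8 bytes, the form seen in the captured URL
as_quoted = f"word={quote(keyword)}"

print(as_bytes)
print(as_quoted)
```

The second line is the one that matches the word=IU%E7%85%A7%E7%89%87 seen in the captured URL.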

import time
from urllib.parse import quote

ky = "IU照片"                       # example keyword
ky_ = quote(ky)                     # percent-encoded keyword for the URL and the Referer
now_time = int(time.time() * 1000)  # millisecond timestamp, as in the captured URL

url = f"https://image.baidu.com/search/acjson?tn=resultjson_com&logid=10524201727286177943&ipn=rj&ct=201326592&is=&fp=result&fr=&word={ky_}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&expermode=&nojc=&isAsync=&pn=30&rn=30&gsm=1e&{now_time}="
header = {
    "Host": "image.baidu.com",
    "Pragma": "no-cache",
    "Referer": f"https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&fm=result&pos=history&word={ky_}&dyTabStr=MCwzLDQsMSwyLDYsNSw4LDcsOQ%3D%3D",
    "sec-ch-ua": '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "Windows",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.3"
}


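3. Process the data
From the analysis above we know the image URLs sit inside data, so once we have them a plain GET request for each one is enough.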
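
Pulling the URLs out of a response might look like the sketch below. The response body is made up for illustration (the example.com URLs and the "3c" gsm value are placeholders), and it assumes that some entries in data can lack a usable hoverURL, which is why such entries are filtered out; image_urls is a hypothetical helper.

```python
import json

# Made-up response body shaped like the captured packet; the URLs are placeholders.
sample_body = """
{"gsm": "3c",
 "data": [
   {"hoverURL": "https://example.com/a.jpg"},
   {"hoverURL": ""},
   {"hoverURL": "https://example.com/b.jpg"},
   {}
 ]}
"""

def image_urls(payload):
    """Return the non-empty hoverURL values from the payload's data list."""
    return [item["hoverURL"] for item in payload.get("data", []) if item.get("hoverURL")]

payload = json.loads(sample_body)
urls = image_urls(payload)
next_gsm = payload["gsm"]  # value to send as gsm with the next request

print(urls, next_gsm)
```

Filtering first keeps the download loop simple, since every URL it receives is at least non-empty.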

Final code

import os
import time
from urllib.parse import quote

import requests

# Create the output folder ("图片" means "images") if it does not exist yet
os.makedirs("./图片", exist_ok=True)

# Initial data
ky = input("Enter a keyword: ")
num = int(input("Enter the number of pages to scrape (30 results per page): "))
ky_ = quote(ky)   # percent-encoded keyword, used in the URL and the Referer
page = 30         # pn of the first request
next_parm = "1e"  # gsm of the first request
count = 1

def make_url(page, gsm):
    """Build the request URL for the given pn offset and gsm value."""
    now_time = int(time.time() * 1000)  # millisecond timestamp, as in the captured URL
    return f"https://image.baidu.com/search/acjson?tn=resultjson_com&logid=10524201727286177943&ipn=rj&ct=201326592&is=&fp=result&fr=&word={ky_}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&expermode=&nojc=&isAsync=&pn={page}&rn=30&gsm={gsm}&{now_time}="

header = {
    "Host": "image.baidu.com",
    "Pragma": "no-cache",
    "Referer": f"https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&fm=result&pos=history&word={ky_}&dyTabStr=MCwzLDQsMSwyLDYsNSw4LDcsOQ%3D%3D",
    "sec-ch-ua": '" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "Windows",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.3"
}

for i in range(num):
    print(f"Scraping page {i + 1}")
    req = requests.get(make_url(page, next_parm), headers=header, timeout=10)
    body = req.json()
    next_parm = body["gsm"]  # gsm to send with the next request
    page += 30               # move on to the next 30 results
    for item in body["data"]:
        try:
            res2 = requests.get(item["hoverURL"], timeout=10)
            with open("./图片/" + ky + str(count) + ".jpg", "wb") as fp:
                fp.write(res2.content)
            count += 1
        except (KeyError, requests.RequestException):
            pass  # skip entries without a usable hoverURL
    time.sleep(0.5)  # pause between pages to avoid anti-scraping measures


大马强 · posted on 2022-1-18 08:37:30

@冬雪雪冬

smartsy · posted on 2022-2-26 18:00:34
