[Solved] Python crawler: scraping multiple pages

2022@lif · Posted on 2022-1-24 13:58:16

Last edited by 2022@lif on 2022-1-25 08:30

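I want to scrape the information for every product at https://list.szlcsc.com/catalog/439.html, but the URL does not change when I page through the results. Looking at the network requests, the page number appears to be carried in the form data shown under Payload.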

However, when I send a POST request carrying these parameters, I do not get the page source for that page.

data = {
'catalogNodeId': 439,
'pageNumber': 2,
'querySortBySign': 0,
'showOutSockProduct': 1,
'showDiscountProduct': 1,
'queryBeginPrice':'',
'queryEndPrice':'' ,
'queryProductArrange':'',
'queryProductGradePlateId': '',
'queryProductTypeCode': '',
'queryParameterValue': '',
'queryProductStandard': '',
'querySmtLabel': '',
'queryReferenceEncap': '',
'queryProductLabel':'',
'lastParamName': '',
'baseParameterCondition': '',
'parameterCondition': ''
}
session = requests.session()
response = session.post(lit_url,data = json.dumps(data),headers = headers).content.decode('utf-8')


The full code is as follows:

import requests
from lxml import etree
import json
import pandas as pd
import sys
url = 'https://list.szlcsc.com/'
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34'
}
# Fetch the page source
def get_page(url):
    session = requests.session()
    # A session stores the cookies it receives and sends them with later requests;
    # .content.decode('utf-8') avoids garbled Chinese text in the scraped data
    # Note: proxy is not defined anywhere in the code shown here
    page_text = session.get(url = url,headers = headers,proxies=proxy).content.decode('utf-8')
    return page_text
# Get the content at the given XPath
def get_content(text,path):
    tree = etree.HTML(text)
    content_list = tree.xpath(path)
    return content_list
resp = get_page(url)
for num in range(1,19):
    Manul_list = get_content(resp,f'/html/body/div[1]/div/div[1]/ul/li[{num}]/div/dl/dt/a[1]/text()')
    Manul = '/'.join(Manul_list)
    print(str(num)+' '+Manul)
# Choose the top-level category to query
choice = input('please select a number you would like to choice:')
lit_list = get_content(resp,f'/html/body/div[1]/div/div[1]/ul/li[{choice}]/div/dl/dd/a[1]/text()')
lit_url_list = get_content(resp,f'/html/body/div[1]/div/div[1]/ul/li[{choice}]/div/dl/dd/a[1]/@href')
lit_dict = {}
for i in range(0,len(lit_list)):
    lit_dict.update({lit_list[i]:lit_url_list[i]})
name_dict = {}
for n in range(1,len(lit_list)+1):
    print(str(n)+' '+lit_list[n-1])
    name_dict.update({n:lit_list[n-1]})
# Choose the specific subcategory to query
cho = int(input('please select a number you would like to choice:'))
lit_url = lit_dict[name_dict[cho]]
data = {
'catalogNodeId': 439,


After running the code, choose 1 and then 18.
What I get back is not the page source I want. What went wrong?

Best answer

wp231957 · Posted on 2022-1-25 09:26:18

See whether this is what you need. The page is very specialized and I do not fully follow it.

import requests,json
data = {
'catalogNodeId': 439,
'pageNumber': 2,
'querySortBySign': 0,
'showOutSockProduct': 1,
'showDiscountProduct': 1,
'queryBeginPrice':'',
'queryEndPrice':'' ,
'queryProductArrange':'',
'queryProductGradePlateId': '',
'queryProductTypeCode': '',
'queryParameterValue': '',
'queryProductStandard': '',
'querySmtLabel': '',
'queryReferenceEncap': '',
'queryProductLabel':'',
'lastParamName': '',
'baseParameterCondition': '',
'parameterCondition': ''
}
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34',
"cookie": "acw_tc=da3dd31c16430728279798128e080b8460c74973023561a97ab6c006d3; SID=471d893b-2380-43e5-ba2f-7c8cf55e7592; SID.sig=bPEGcnKdLpFDiL5G5xuKHDFqRnS7DjKR0uoun7RX0Cg; Qs_lvt_290854=1643072833; Qs_pv_290854=4487896694418019300; cpx=1; guidePage=true; noLoginCustomerFlag=929acea54e81e00b7866; noLoginCustomerFlag2=a0c9da953af694050bd3; PRO_NEW_SID=8920f7cd-dc56-4f97-bd00-f3cb9b98596b; computerKey=d87e0a87824fc7096b6a; AGL_USER_ID=85617b33-f9fb-4d6e-9d65-1b1878037a9d; Hm_lvt_e2986f4b6753d376004696a1628713d2=1643072840; Hm_lpvt_e2986f4b6753d376004696a1628713d2=1643072840; show_out_sock_product=1",
"origin":"https://list.szlcsc.com",
"referer": "https://list.szlcsc.com/catalog/439.html"
}
url="https://list.szlcsc.com/products/list"
txt=requests.post(url,headers=headers,data=data).text
js=json.loads(txt)
print(js["productRecordList"])

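Since the thread is about paging, the answer's single request can be wrapped in a loop over pageNumber. The following is only a sketch under assumptions: `build_payload` and `fetch_pages` are hypothetical helper names, `session` is expected to be something like `requests.Session()`, and treating an empty `productRecordList` as "no more pages" is a guess about how the endpoint behaves, not something confirmed in the thread.

```python
import json

# Query fields that the answer above sends as empty strings.
EMPTY_FIELDS = (
    'queryBeginPrice', 'queryEndPrice', 'queryProductArrange',
    'queryProductGradePlateId', 'queryProductTypeCode', 'queryParameterValue',
    'queryProductStandard', 'querySmtLabel', 'queryReferenceEncap',
    'queryProductLabel', 'lastParamName', 'baseParameterCondition',
    'parameterCondition',
)

def build_payload(catalog_id, page_number):
    """Build the form data from the answer; only catalog and page vary."""
    payload = {
        'catalogNodeId': catalog_id,
        'pageNumber': page_number,
        'querySortBySign': 0,
        'showOutSockProduct': 1,
        'showDiscountProduct': 1,
    }
    payload.update({name: '' for name in EMPTY_FIELDS})
    return payload

def fetch_pages(session, url, headers, catalog_id, page_numbers):
    """POST once per page and collect every productRecordList entry."""
    records = []
    for page_number in page_numbers:
        response = session.post(url, headers=headers,
                                data=build_payload(catalog_id, page_number))
        page_records = json.loads(response.text).get('productRecordList') or []
        if not page_records:
            break  # assumption: an empty page means there is nothing further
        records.extend(page_records)
    return records
```

With the `url` and `headers` from the answer, a call might look like `fetch_pages(requests.Session(), url, headers, 439, range(1, 7))`. Adding a short pause between requests would be a reasonable courtesy to the site.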

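2022@lif · Posted on 2022-1-25 10:21:54

wp231957 posted on 2022-1-25 09:26
See whether this is what you need. The page is very specialized and I do not fully follow it.

So the product information is all in the returned JSON string, is that right?
Are 'origin' and 'referer' strictly required?
Thank you very much.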

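wp231957 · Posted on 2022-1-25 10:26:53

2022@lif posted on 2022-1-25 10:21
So the product information is all in the returned JSON string, is that right?
Are 'origin' and 'referer' strictly required?
Thank you very ...

Are 'origin' and 'referer' strictly required?
You would have to test them one by one. I put them all in for convenience, and I do not know whether every one is required. My feeling is that the cookie is required.

So the product information is all in the returned JSON string, is that right?

Yes. This request returns a JSON string, not a page.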
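The "test them one by one" idea can be sketched as a small loop that drops one header at a time and checks whether the request still succeeds. `find_required_headers` and the `send` callable are hypothetical names introduced here; `send` stands for whatever function posts the request with the given headers and returns True when the response still contains usable data.

```python
def find_required_headers(send, headers):
    """Return the header names whose removal makes send() report failure.

    send(headers) -> bool is a caller-supplied function (hypothetical),
    for example one that posts the request and checks the response for
    'productRecordList'.
    """
    required = []
    for name in headers:
        # Try the request with every header except this one.
        trial = {key: value for key, value in headers.items() if key != name}
        if not send(trial):
            required.append(name)
    return required

# A possible send() for this thread (needs network, so left commented out):
# def send(trial_headers):
#     r = requests.post(url, headers=trial_headers, data=data)
#     return r.ok and 'productRecordList' in r.text
```

Dropping one header at a time does not detect cases where either of two headers would be enough on its own, so the result is a starting point rather than a definitive list.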

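2022@lif · Posted on 2022-1-25 10:32:36

wp231957 posted on 2022-1-25 10:26
Are 'origin' and 'referer' strictly required?
You would have to test them one by one. I put them all in for convenience, and I do not know ...

So do I need to work through the JSON string as dicts and lists to get the data I want?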

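wp231957 · Posted on 2022-1-25 10:35:08

2022@lif posted on 2022-1-25 10:32
So do I need to work through the JSON string as dicts and lists to get the data I want?

Yes, it depends on the returned data type: sometimes it is a list (an array in JS), sometimes a dict (an object in JS).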
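When it is unclear which parts of a parsed response are dicts and which are lists, printing a compact outline of the structure can help before writing the extraction code. This is a minimal sketch: `summarize` is a hypothetical helper, and the sample string is made-up data that only reuses field names mentioned in this thread.

```python
import json

def summarize(value):
    """Outline parsed JSON: dicts keep their keys, lists show their first item."""
    if isinstance(value, dict):
        return {key: summarize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [summarize(value[0])] if value else []
    return type(value).__name__  # leaf values are reduced to their type name

# Made-up sample shaped like the response discussed above.
sample = ('{"productRecordList": [{"productCode": "C1", '
          '"productPriceList": [{"spNumber": 1, "epNumber": 9, "thePrice": 0.5}]}]}')

outline = summarize(json.loads(sample))
print(outline)
```

Because a list is outlined from its first element only, keys that appear only in later records will not show up in the outline.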

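2022@lif · Posted on 2022-1-25 10:39:30

wp231957 posted on 2022-1-25 10:35
Yes, it depends on the returned data type: sometimes it is a list (an array in JS), sometimes a dict (an object in JS).

txt=requests.post(url,headers=headers,data=data,proxies = proxy).text

Doesn't data=data here need to be data=json.dumps(data)? I saw that online, where it was described as encoding a Python object into a JSON string. I am not too sure about it.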

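wp231957 · Posted on 2022-1-25 11:02:50

2022@lif posted on 2022-1-25 10:39
txt=requests.post(url,headers=headers,data=data,proxies = proxy).text

Doesn't data=data here need to be da ...

Generally it is not needed. Passing the dict as data= makes requests send it as form fields, which is what this page's Payload shows; json.dumps(data) would send a single JSON string as the request body instead.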
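The difference between the two bodies can be seen without sending anything. requests form-encodes a dict passed as data=, while a string passed as data= is sent unchanged. The sketch below uses the standard library's `urlencode` to stand in for that form encoding; the three-field `payload` is a shortened, illustrative version of the dict from the question.

```python
import json
from urllib.parse import urlencode

# Shortened payload for illustration; the real one has more fields.
payload = {'catalogNodeId': 439, 'pageNumber': 2, 'queryBeginPrice': ''}

form_body = urlencode(payload)   # the kind of body sent for data=payload
json_body = json.dumps(payload)  # the body sent for data=json.dumps(payload)

print(form_body)
print(json_body)
```

A server that reads form fields will not find `pageNumber` in the second body, which would be consistent with the original request not returning the expected page.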

2022@lif · Posted on 2022-1-26 08:46:38

wp231957 posted on 2022-1-25 09:26
See whether this is what you need. The page is very specialized and I do not fully follow it.

import requests,json
import pandas as pd
productinformation_list = []
#for i in range(1,7):
data = {
'catalogNodeId': 11182,
'pageNumber': 1,
'querySortBySign': 0,
'showOutSockProduct': 1,
'showDiscountProduct': 1,
'queryBeginPrice':'',
'queryEndPrice':'' ,
'queryProductArrange':'',
'queryProductGradePlateId': '',
'queryProductTypeCode': '',
'queryParameterValue': '',
'queryProductStandard': '',
'querySmtLabel': '',
'queryReferenceEncap': '',
'queryProductLabel':'',
'lastParamName': '',
'baseParameterCondition': '',
'parameterCondition': ''
}
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34',
"cookie": "acw_tc=da3dd31c16430728279798128e080b8460c74973023561a97ab6c006d3; SID=471d893b-2380-43e5-ba2f-7c8cf55e7592; SID.sig=bPEGcnKdLpFDiL5G5xuKHDFqRnS7DjKR0uoun7RX0Cg; Qs_lvt_290854=1643072833; Qs_pv_290854=4487896694418019300; cpx=1; guidePage=true; noLoginCustomerFlag=929acea54e81e00b7866; noLoginCustomerFlag2=a0c9da953af694050bd3; PRO_NEW_SID=8920f7cd-dc56-4f97-bd00-f3cb9b98596b; computerKey=d87e0a87824fc7096b6a; AGL_USER_ID=85617b33-f9fb-4d6e-9d65-1b1878037a9d; Hm_lvt_e2986f4b6753d376004696a1628713d2=1643072840; Hm_lpvt_e2986f4b6753d376004696a1628713d2=1643072840; show_out_sock_product=1",
"origin":"https://list.szlcsc.com",
"referer": "https://list.szlcsc.com/catalog/11182.html"
}
url="https://list.szlcsc.com/products/list"
txt=requests.post(url,headers=headers,data=data).text
js=json.loads(txt)
# Process the product detail data
product_list = js["productRecordList"]
print(product_list)
for n in range(0,len(product_list)):
    product_dict = product_list[n]
    productinformation = []
    productName = product_dict['productName'] # product name
    productinformation.append(productName)
    productCode = product_dict['productCode'] # product code
    productinformation.append(productCode)
    productModel = product_dict["productModel"] # product model
    productinformation.append(productModel)
    package = product_dict['encapsulationModel'] # package (footprint)
    productinformation.append(package)
    brandname = product_dict['lightBrandName'] # brand
    productinformation.append(brandname)
    packageArrange = product_dict['productArrange'] # packaging
    productinformation.append(packageArrange)
    gdWarehouseStockNumber = product_dict['gdWarehouseStockNumber']
    productinformation.append(gdWarehouseStockNumber)
    jsWarehouseStockNumber = product_dict['jsWarehouseStockNumber']
    productinformation.append(jsWarehouseStockNumber)
    # Process the price data
    # Mainland China (RMB)
    productprice_list = product_dict["productPriceList"]
    Num1 = ''
    for price in productprice_list:
        Num1 = Num1+str(price['spNumber'])+'->'+str(price['epNumber'])+':￥'+str(price['thePrice'])+'\n'
    productinformation.append(Num1)
    # Hong Kong (USD)
    productprice_list = product_dict["productHkDollerPriceList"]
    Num2 = ''
    for price in productprice_list:
        Num2 = Num2+str(price['spNumber'])+'->'+str(price['epNumber'])+':$'+str(price['thePrice'])+'\n'
    productinformation.append(Num2)
    productinformation_list.append(productinformation)
lable = ['Product name','Product code','Product model','Package','Brand','Packaging','Guangdong stock','Jiangsu stock','Mainland (RMB) per unit','Hong Kong (USD) per unit']
text = pd.DataFrame(productinformation_list,columns = lable)
text.to_csv('./1-25.csv',encoding='utf-8')
print('OK')


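When pageNumber is 1, running the code raises an error at line 68: KeyError: 'productHkDollerPriceList'. But when I print the product_list I got and search it, the 'productHkDollerPriceList' field is there.
Why is that? The 'productPriceList' field does not raise an error.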

wp231957 · Posted on 2022-1-26 13:33:37

2022@lif posted on 2022-1-26 08:46
When pageNumber is 1, running the code raises an error at line 68: KeyError: 'productHkDollerPriceList'. But ...

I really did not see this productHkDollerPriceList field.
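One possible explanation, not confirmed in the thread, is that only some records carry 'productHkDollerPriceList': searching the printed list finds the key in those records, while the loop fails on the first record that lacks it. A defensive sketch using `dict.get` is shown below; `records_missing_key` and `format_price_tiers` are hypothetical helper names.

```python
def records_missing_key(records, key):
    """Return the productCode of every record that does not contain key."""
    return [record.get('productCode') for record in records if key not in record]

def format_price_tiers(product, key, symbol):
    """Format the price tiers stored under key; empty string if key is absent."""
    tiers = product.get(key) or []  # a missing key or None both give no tiers
    return ''.join(
        f"{tier['spNumber']}->{tier['epNumber']}:{symbol}{tier['thePrice']}\n"
        for tier in tiers
    )
```

In the loop above, `Num2 = format_price_tiers(product_dict, 'productHkDollerPriceList', '$')` would then leave the Hong Kong column empty for such records instead of raising KeyError, and `records_missing_key(product_list, 'productHkDollerPriceList')` would show which records are affected.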
