I wrote a spider that crawls a cosmetics shopping site, but I ran into a problem with pagination:
Strangely, if I hard-code the next-page URL as a string, the spider follows it fine; but when I build the URL with response.css + urllib.parse.urljoin, it never moves to the next page. I have already tested in the Scrapy shell, and the CSS selector itself is correct, as shown here:
>>> next_urlid = str(response.css(".module-pagination-main.myaccount-product-list a:nth-child(3)::attr(href)").extract()[1])
>>> next_url1 = "https://www.sephora.cn/brand/givenchy-190/page3/?hasInventory=0&sortField=1&sortMode=desc"
>>> next_url2 = parse.urljoin(response.url, next_urlid)
>>> next_url2
'https://www.sephora.cn/brand/givenchy-190/page2/?hasInventory=0&sortField=1&sortMode=desc'
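As far as I understand, urljoin just resolves the href against the current URL, so the page2 result above would mean the extracted href itself points at page2, not page3. A standalone check of that behavior (plain urllib, no Scrapy; the base URL here is only an assumption for illustration):

from urllib.parse import urljoin

# Assumed current-page URL, for illustration only:
base = "https://www.sephora.cn/brand/givenchy-190/page2/?hasInventory=0&sortField=1&sortMode=desc"

# An absolute-path href replaces the whole path+query, so a page3 href joins to page3:
print(urljoin(base, "/brand/givenchy-190/page3/?hasInventory=0&sortField=1&sortMode=desc"))
# https://www.sephora.cn/brand/givenchy-190/page3/?hasInventory=0&sortField=1&sortMode=desc

# ...while a page2 href joins to page2, which matches what I see in the shell:
print(urljoin(base, "/brand/givenchy-190/page2/?hasInventory=0&sortField=1&sortMode=desc"))
# https://www.sephora.cn/brand/givenchy-190/page2/?hasInventory=0&sortField=1&sortMode=desc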
The relevant pagination code:
import scrapy
from urllib import parse

# ... inside the spider's parse(self, response):
for post in article_url:
    post_img = post.css("img::attr(src)").extract_first("")
    post_url = post.css("::attr(href)").extract_first("")
    # Request the detail page. Yield the Request directly; assigning it
    # back to `response` would break the pagination selector below,
    # which must run against the Response, not a Request object.
    yield scrapy.Request(post_url, meta={"img_url": post_img}, callback=self.parse_detail)

next_urlid = str(response.css(".module-pagination-main.myaccount-product-list a:nth-child(3)::attr(href)").extract()[1])
# Variants I also tried instead of urljoin:
# next_url = "https://www.sephora.cn" + next_urlid
# next_url1 = "https://www.sephora.cn/brand/givenchy-190/page3/?hasInventory=0&sortField=1&sortMode=desc"
next_url = parse.urljoin(response.url, next_urlid)
if next_urlid:
    yield scrapy.Request(url=next_url, callback=self.parse)  # <- the line that was highlighted
If I use next_url1 instead of next_url in the marked request above, pagination succeeds.
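One guess I have not ruled out yet: Scrapy's default duplicate filter silently drops requests for URLs it has already visited, so if the joined URL equals the page the spider is currently on (page2 above), the request would simply disappear. A quick way to test that (dont_filter is a standard scrapy.Request argument):

# Bypass the dupefilter just to test, not as a fix; if pagination
# suddenly "works" with this flag, the joined URL was a duplicate.
yield scrapy.Request(url=next_url, callback=self.parse, dont_filter=True)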
Could someone kindly take a look? This has been bothering me for days. Thanks!