I wrote a spider that crawls a cosmetics shopping site, but I ran into a problem with pagination:
Strangely, if I hard-code the next-page URL as a string, the spider follows it fine; but when I build the URL with response.css + urllib.parse.urljoin, it never moves to the next page. I have already tested in the Scrapy shell, and the CSS selector itself is correct, as shown here:
>>> next_urlid = str(response.css(".module-pagination-main.myaccount-product-list a:nth-child(3)::attr(href)").extract()[1])
>>> next_url1 = "https://www.sephora.cn/brand/givenchy-190/page3/?hasInventory=0&sortField=1&sortMode=desc"
>>> next_url2 = parse.urljoin(response.url, next_urlid)
>>> next_url2
'https://www.sephora.cn/brand/givenchy-190/page2/?hasInventory=0&sortField=1&sortMode=desc'
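As far as I understand, urljoin just resolves the href against the current URL, so the page2 result above would mean the extracted href itself points at page2, not page3. A standalone check of that behavior (plain urllib, no Scrapy; the base URL here is only an assumption for illustration):

from urllib.parse import urljoin

# Assumed current-page URL, for illustration only:
base = "https://www.sephora.cn/brand/givenchy-190/page2/?hasInventory=0&sortField=1&sortMode=desc"

# An absolute-path href replaces the whole path+query, so a page3 href joins to page3:
print(urljoin(base, "/brand/givenchy-190/page3/?hasInventory=0&sortField=1&sortMode=desc"))
# https://www.sephora.cn/brand/givenchy-190/page3/?hasInventory=0&sortField=1&sortMode=desc

# ...while a page2 href joins to page2, which matches what I see in the shell:
print(urljoin(base, "/brand/givenchy-190/page2/?hasInventory=0&sortField=1&sortMode=desc"))
# https://www.sephora.cn/brand/givenchy-190/page2/?hasInventory=0&sortField=1&sortMode=desc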
The relevant pagination code:
import scrapy
from urllib import parse

# ... inside the spider's parse(self, response):
for post in article_url:
    post_img = post.css("img::attr(src)").extract_first("")
    post_url = post.css("::attr(href)").extract_first("")
    # Request the detail page. Yield the Request directly; assigning it
    # back to `response` would break the pagination selector below,
    # which must run against the Response, not a Request object.
    yield scrapy.Request(post_url, meta={"img_url": post_img}, callback=self.parse_detail)

next_urlid = str(response.css(".module-pagination-main.myaccount-product-list a:nth-child(3)::attr(href)").extract()[1])
# Variants I also tried instead of urljoin:
# next_url = "https://www.sephora.cn" + next_urlid
# next_url1 = "https://www.sephora.cn/brand/givenchy-190/page3/?hasInventory=0&sortField=1&sortMode=desc"
next_url = parse.urljoin(response.url, next_urlid)
if next_urlid:
    yield scrapy.Request(url=next_url, callback=self.parse)  # <- the line that was highlighted
If I use next_url1 instead of next_url in the marked request above, pagination succeeds.
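One guess I have not ruled out yet: Scrapy's default duplicate filter silently drops requests for URLs it has already visited, so if the joined URL equals the page the spider is currently on (page2 above), the request would simply disappear. A quick way to test that (dont_filter is a standard scrapy.Request argument):

# Bypass the dupefilter just to test, not as a fix; if pagination
# suddenly "works" with this flag, the joined URL was a duplicate.
yield scrapy.Request(url=next_url, callback=self.parse, dont_filter=True)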
Could someone kindly take a look? This has been bothering me for days. Thanks!