[Solved] Why does XPath parsing give different results when I rewrite an already working program with Scrapy?

fishclove · posted on 2018-11-25 20:35:20

Code:
<code>
# -*- coding: utf-8 -*-
import scrapy
from twlk.items import TwlkItem


class MoviesSpider(scrapy.Spider):
    name = 'movies'
    # allowed_domains = ['twlkbt.com']

    def start_requests(self):
        start_urls = ['http://twlkbt.com/forum-2-%s.html' % str(x) for x in range(1, 3)]
        # print(start_urls)
        for url in start_urls:
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # Follow the link to each thread's detail page
        for href in response.xpath("//a[@class='xst']/@href"):
            yield response.follow(href, self.parse2)

    def parse2(self, response):
        item = TwlkItem()
        # Get the torrent link
        src_link = response.xpath("//span[@onmouseover]/a/@href")
        print('outer src_link is:', src_link)
        item['title'] = response.xpath("//*[@id='thread_subject']/text()").extract_first()
        # title = title.replace('/', '-')
        # title = title.replace(':', ' ')
        # img_url = response.xpath("//img[@onclick='zoom(this, this.src, 0, 0, 0)']/@src")
        # print(img_url)
        print('title======', item['title'])
        if src_link:
            print('first if')
            # extract_first() returns only the first match, in document order
            src_link = response.xpath("//span[@onmouseover]/a/@href").extract_first()
            print("src_link is now", src_link)
            item['src_link'] = 'http://twlkbt.com/' + src_link
            print("item['src_link'] is now", item['src_link'])
            yield item
            print("yield ok")

        else:
            print('entering else block--')
            src_link = response.xpath("//a[@onmouseover=\"showMenu({'ctrlid':this.id,'pos':'12'})\"]/@href").extract_first()
            print("src_link is now", src_link)
            item['src_link'] = 'http://twlkbt.com/' + src_link
            print("item['src_link'] is now", item['src_link'])
            yield item
            print('yield ok')
        # Download the torrent
</code>

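For the first part of the run everything is fine: src_link parses to a single result, as expected. From the middle of the run onward, however, the same expression starts matching more than one element, and the result comes out wrong.
I have thought about this for a long time and cannot work out why the same expression would go from one match to several. I even tested one URL that parses correctly and one that does not in scrapy shell, and in the shell the same expression gives the right result for both. So is the bug in my program? I can't spot it.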

Here is part of the output from the middle of the run, showing the two different outcomes:

2018-11-25 20:05:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://twlkbt.com/thread-86371-1-1.html> (referer: http://twlkbt.com/forum-2-1.html) ['partial']
outer src_link is: [<Selector xpath='//span[@onmouseover]/a/@href' data='forum.php?mod=attachment&aid=NDg2MjF8N2E'>]
title====== [11.24][中国][动作][新七侠五义之屠龙案][WEB.1080p-MKV/2G][国语中字][2018新片]
first if
src_link is now forum.php?mod=attachment&aid=NDg2MjF8N2EzOTMwZTF8MTU0MzE0NzU0MHwwfDg2Mzcx
item['src_link'] is now http://twlkbt.com/forum.php?mod=attachment&aid=NDg2MjF8N2EzOTMwZTF8MTU0MzE0NzU0MHwwfDg2Mzcx
2018-11-25 20:05:42 [scrapy.core.scraper] DEBUG: Scraped from <200 http://twlkbt.com/thread-86371-1-1.html>
{'src_link': 'http://twlkbt.com/forum.php?mod=attachment&aid=NDg2MjF8N2EzOTMwZTF8MTU0MzE0NzU0MHwwfDg2Mzcx',
'title': '[11.24][中国][动作][新七侠五义之屠龙案][WEB.1080p-MKV/2G][国语中字][2018新片]'}
yield ok
2018-11-25 20:05:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://twlkbt.com/thread-86117-1-1.html> (referer: http://twlkbt.com/forum-2-1.html) ['partial']
outer src_link is: [<Selector xpath='//span[@onmouseover]/a/@href' data='forum-37-1.html'>, <Selector xpath='//span[@onmouseover]/a/@href' data='forum.php?mod=attachment&aid=NDgzNDB8Mzg'>, <Selector xpath='//span[@onmouseover]/a/@href' data='forum-37-1.html'>]
title====== [10.18][美国][动作][第九禁区][高清BluRay.1080p-MKV/7G][国英双语中字][经典刺激]
first if
src_link is now forum-37-1.html
item['src_link'] is now http://twlkbt.com/forum-37-1.html
2018-11-25 20:05:43 [scrapy.core.scraper] DEBUG: Scraped from <200 http://twlkbt.com/thread-86117-1-1.html>
{'src_link': 'http://twlkbt.com/forum-37-1.html',
'title': '[10.18][美国][动作][第九禁区][高清BluRay.1080p-MKV/7G][国英双语中字][经典刺激]'}
yield ok
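
The second log entry shows the expression matching three nodes, with the attachment link in second place, so taking the first match returns the board link instead. The sketch below reproduces that situation on a small, hypothetical page (the real markup is not shown in the thread, and the `showNav()`/`showTip()` handlers are made up). It uses the standard library's `xml.etree.ElementTree` as a stand-in so that it runs without Scrapy; ElementTree supports only a subset of XPath, so the `href` attribute is read with `.get()` rather than `/@href`. Filtering on `mod=attachment` is one possible way to disambiguate, not the fix adopted in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a thread page.
PAGE = """
<html><body>
  <div><span onmouseover="showNav()"><a href="forum-37-1.html">Board</a></span></div>
  <table><tr><td>
    <span onmouseover="showTip()"><a href="forum.php?mod=attachment">movie.torrent</a></span>
  </td></tr></table>
  <div><span onmouseover="showNav()"><a href="forum-37-1.html">Board</a></span></div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Same idea as //span[@onmouseover]/a/@href: every <a> directly inside
# a <span> that has an onmouseover attribute, in document order.
hrefs = [a.get("href") for a in root.findall(".//span[@onmouseover]/a")]

# Taking the first match (what extract_first() or [0] does) picks the board link.
first = hrefs[0]

# Keeping only links that look like attachments removes the ambiguity.
attachments = [h for h in hrefs if "mod=attachment" in h]

print(hrefs)
print(first)
print(attachments)
```

Printing the full match list, as the poster did, is a quick way to see whether an expression is broader than intended on a given page.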

fishclove · posted on 2018-11-25 20:53:09

Could someone experienced please help?

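wongyusing · posted on 2018-11-25 21:13:05

fishclove posted on 2018-11-25 20:53
Could someone experienced please help?

What exactly are you trying to scrape? Can you describe it in detail?
I need to work out how the content can be extracted from the page.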

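fishclove · posted on 2018-11-25 21:49:52

wongyusing posted on 2018-11-25 21:13
What exactly are you trying to scrape? Can you describe it in detail?
I need to work out how the content can be extracted from the page.

The title text of each thread and the address of its torrent link.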

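wongyusing · posted on 2018-11-25 22:00:28

fishclove posted on 2018-11-25 21:49
The title text of each thread and the address of its torrent link.

What I meant was: data from which section?
The repost section?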

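wongyusing · posted on 2018-11-25 22:13:05

I had a look at the page, and I think re would work better than XPath here.

I get the impression that you only looked at Inspect Element and not at the page source.

From what I can see in the page source, your XPath expression appears to be wrong.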

fishclove · posted on 2018-11-25 22:52:34

wongyusing posted on 2018-11-25 22:00
What I meant was: data from which section?
The repost section?

Yes, the repost section.

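fishclove · posted on 2018-11-25 22:54:16

wongyusing posted on 2018-11-25 22:13
I had a look at the page, and I think re would work better than XPath here.

I get the impression that you only looked at Inspect Element and not at the page source.

But I tested the XPath expressions in scrapy shell, and the results were exactly the torrent link and the thread title.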

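wongyusing · posted on 2018-11-26 01:12:56

fishclove posted on 2018-11-25 22:54
But I tested the XPath expressions in scrapy shell, and the results were exactly the torrent link and the thread title.

Below is the XPath expression I wrote (I haven't used XPath in a long time and haven't tested it, so try it yourself):

<code>
//td[1]//span/a/@href
</code>

I took a slightly closer look at your code. It would actually be easier to create an empty dict at the start and then yield it.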
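
The suggestion to start from an empty dict and yield it can be sketched roughly as follows. Scrapy callbacks may yield plain dicts as items, so no `TwlkItem` class is needed for this. `build_item` is a hypothetical helper written for illustration, not code from the thread; inside a real spider the same join could be done with `response.urljoin(href)`. The sketch also skips the item when no link was found, which avoids the `TypeError` that `'http://twlkbt.com/' + None` would raise.

```python
from urllib.parse import urljoin

BASE_URL = "http://twlkbt.com/"


def build_item(title, href, base_url=BASE_URL):
    """Yield one plain dict per thread, or nothing when no link was found."""
    if href is None:
        # extract_first() returns None when the XPath matches nothing.
        return
    item = {}
    item["title"] = title
    # urljoin resolves the relative href against the site root.
    item["src_link"] = urljoin(base_url, href)
    yield item


# Hypothetical values standing in for what the XPath expressions return.
items = list(build_item("[example title]", "forum.php?mod=attachment"))
missing = list(build_item("[example title]", None))

print(items)
print(missing)
```

Because the helper is a generator, a spider callback could delegate to it with `yield from build_item(...)`.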

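fishclove · posted on 2018-11-26 07:33:19

wongyusing posted on 2018-11-26 01:12
Below is the XPath expression I wrote (I haven't used XPath in a long time and haven't tested it, so try it yourself)

OK, I'll try it once I'm at my computer.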

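fishclove · posted on 2018-11-26 19:53:58

wongyusing posted on 2018-11-26 01:12
Below is the XPath expression I wrote (I haven't used XPath in a long time and haven't tested it, so try it yourself)

Switching to your XPath expression fixed it...
Before writing this Scrapy version, I wrote a class-based program using the same XPath expression, and its downloads worked without any problem. Why does that XPath expression stop working in the program I wrote to get familiar with Scrapy? My earlier program is below; I would appreciate it if you could take a look.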

<code>
import requests
from lxml import etree
import os
import time
from hashlib import md5


class Twlk():  # 1 page cost time 655s
    headers = {
        # Note: the standard header name is 'User-Agent'; as written,
        # 'user_agent' is sent as an unrelated custom header.
        'user_agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
    }

    def __init__(self, url):
        self.url = url

    def get_response(self):
        response = requests.get(self.url, headers=self.headers)
        text = response.text
        self.parse_page(text)

    def parse_page(self, text):
        html = etree.HTML(text)
        href = html.xpath("//a[@class='xst']/@href")
        # title = html.xpath("//a[@class='xst']/text()")
        # Get the URL of each movie's detail page
        for i in href:
            stime = time.time()
            page_link = 'http://twlkbt.com/{}'.format(i)
            print('pagelink is:', page_link)
            res2 = requests.get(page_link, headers=self.headers)
            text2 = res2.text
            html2 = etree.HTML(text2)
            # get torrent link
            src_link = html2.xpath("//span[@onmouseover]/a/@href")
            title = html2.xpath("//*[@id='thread_subject']/text()")[0]
            title = title.replace('/', '-')
            title = title.replace(':', ' ')
            img_url = html2.xpath("//img[@onclick='zoom(this, this.src, 0, 0, 0)']/@src")
            print(img_url)
            print(title)
            if src_link:
                # [0] takes the first match, like extract_first() in Scrapy
                src_link = 'http://twlkbt.com/' + html2.xpath("//span[@onmouseover]/a/@href")[0]
            else:
                src_link = 'http://twlkbt.com/' + html2.xpath("//a[@onmouseover=\"showMenu({'ctrlid':this.id,'pos':'12'})\"]/@href")[0]
            # Download the torrent
            time.sleep(1)
            res = requests.get(src_link, headers=self.headers)
            goal = res.content
            etime = time.time()
            print('time for this iteration:', etime - stime)
            # Create the directory if it does not exist
            path = "I:/IDM_Download/TWLK/%s/" % title
            if not os.path.exists(path):
                os.makedirs(path)
            file_path = path + title + '.torrent'
            # print(file_path)
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(goal)
            # get img content
            for x in img_url:
                img_path = path + title + x.split('/')[-1]
                if x == 'static/image/common/back.gif':
                    pass
                elif not os.path.exists(img_path):
                    print("img_url is :", x)
                    res_img = requests.get(x, headers=self.headers)
                    print('1')
                    img_goal = res_img.content
                    print('about to open the path')
                    m = open(img_path, 'wb')
                    print('before writing')
                    m.write(img_goal)
                    print("img_goal save ok.")
                    m.close()
            print('save %s success' % title, '\n')

    def run(self):
        # Fetch the forum listing page
        self.get_response()
        # Get the title and src_link from the listing page
        print('start parsing the page')


def timecount(func):
    def wrapper():
        timestart = time.time()
        print(timestart)
        func()
        timeend = time.time()
        print(timeend)
        timecost = timeend - timestart
        print("time cost is %s" % timecost)
    return wrapper


@timecount
def main():
    # Build the URLs
    for x in range(2, 4):
        url = 'http://twlkbt.com/forum-2-%s.html' % str(x)
        print(url)
        twlk = Twlk(url)
        twlk.run()
        print("%s ------------- download finished!" % url)


if __name__ == "__main__":
    main()
</code>
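
One difference between the two programs can be read straight from the code: the Scrapy spider builds its listing URLs with `range(1, 3)`, while the earlier script uses `range(2, 4)`. Both take the first XPath match (`extract_first()` in one, `[0]` in the other), so on the same page they would pick the same link. The small sketch below only compares the two URL lists; `forum_pages` is a helper name introduced here for illustration. Whether this accounts for the different behaviour is not confirmed anywhere in the thread, although both failing and passing log entries above have `forum-2-1.html` as referer.

```python
def forum_pages(start, stop):
    """Build listing URLs the same way both programs do."""
    return ["http://twlkbt.com/forum-2-%s.html" % n for n in range(start, stop)]


scrapy_pages = forum_pages(1, 3)    # range(1, 3) in the Scrapy spider
requests_pages = forum_pages(2, 4)  # range(2, 4) in the earlier script

# Listing pages that only the Scrapy version visits.
only_in_scrapy = [u for u in scrapy_pages if u not in requests_pages]
# Listing pages that both versions visit.
shared = [u for u in scrapy_pages if u in requests_pages]

print(only_in_scrapy)
print(shared)
```

Running both programs over the same listing pages and comparing the extracted links thread by thread would be a fairer test of whether the parsing itself differs.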

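wongyusing · posted on 2018-11-26 20:16:29

This best answer was given by wongyusing. Thanks to wongyusing for the answer.

fishclove posted on 2018-11-26 19:53
Switching to your XPath expression fixed it...
Before writing this Scrapy version, I wrote a class-based program using the same XPath expression ...

How should I put it?
I didn't write the XPath expression from the browser's Inspect Element view.

I wrote it by reading the page source directly.

Parsing approaches such as bs4, re, pq, xpath, and pandas each come in several variations, and different people will write the same thing three or four different ways.

The reason for the difference usually lies in the page source.
It is probably because this page has some dynamically loaded content.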

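fishclove · posted on 2018-11-26 20:44:33

wongyusing posted on 2018-11-26 20:16
How should I put it?
I didn't write the XPath expression from the browser's Inspect Element view.

Thank you. Could you explain your XPath? What does the td[1] at the beginning mean? I see a lot of td elements in the source.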

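wongyusing · posted on 2018-11-26 21:55:00

fishclove posted on 2018-11-26 20:44
Thank you. Could you explain your XPath? What does the td[1] at the beginning mean? I see a lot of td elements in the source.

It should select the first td.

I wrote it casually yesterday when I was bored.

I don't feel like reading the page source today. Look it up in the documentation yourself; it shouldn't be too deep.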
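
On the `td[1]` question: in XPath, a positional predicate is evaluated relative to each parent, so `//td[1]` selects every `td` that is the first `td` child of its parent (typically the first cell of each row), not just the first `td` in the document; the latter would be written `(//td)[1]`. The sketch below illustrates the per-parent behaviour on a made-up table, again using the standard library's `xml.etree.ElementTree` as a stand-in for lxml or Scrapy selectors. The table contents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row table; the forum's real markup is not shown in the thread.
TABLE = """
<table>
  <tr><td>r1c1</td><td>r1c2</td></tr>
  <tr><td>r2c1</td><td>r2c2</td></tr>
</table>
"""

root = ET.fromstring(TABLE.strip())

# td[1] is checked against each td's own parent, so one cell per row matches.
first_cells = [td.text for td in root.findall(".//td[1]")]

# For comparison: every td in the document.
all_cells = [td.text for td in root.iter("td")]

print(first_cells)
print(all_cells)
```

This is why `//td[1]//span/a/@href` can still match in several places on a page with many tables: it restricts the search to first cells, not to a single cell.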

fishclove · posted on 2018-11-27 05:15:11
