Why isn't my spider scraping anything? Asking the experts for help, many thanks! (Python discussion board, 鱼C论坛)

chen1203 posted on 2021-9-16 05:30:06

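This post was last edited by chen1203 on 2021-9-16 05:35

Why isn't my spider scraping anything? Asking the experts for help, many thanks!
The log looks like this (I have already turned off the robots protocol setting):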
2021-09-16 05:25:33 INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

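wp231957 posted on 2021-9-16 07:14:06

Post the URL and the data you want to extract. If you would rather not share them, you will have to work it out yourself.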

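chen1203 posted on 2021-9-16 16:54:06

wp231957 posted on 2021-9-16 07:14
Post the URL and the data you want to extract. If you would rather not share them, you will have to work it out yourself.

https://fishc.com.cn/forum-173-1.html I want to extract the names of the sub-boards in the Python discussion board.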

chen1203 posted on 2021-9-16 16:55:06

chen1203 posted on 2021-9-16 16:54
https://fishc.com.cn/forum-173-1.html I want to extract the names of the sub-boards in the Python discussion board.

response.xpath("//tbody/text()").extract()
['\r\n', '\r\n', '\r\n', '\r\n', '\r\n', '\r\n', ...] (the list continues; every element is '\r\n')
This is what I get when I check it. What on earth is that?
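A note on the output above: in XPath, `text()` selects only the text nodes that are direct children of the context element. For a `<tbody>`, those are usually just the line breaks between the `<tr>` rows, while the visible names sit deeper, inside `<a>` elements. The sketch below illustrates this on a made-up fragment. It uses the standard library's `xml.etree.ElementTree` rather than Scrapy's selectors (an assumption made so that it runs without extra packages), and the fragment and board names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the real page, which is far larger.
FRAGMENT = (
    "<table><tbody>\n"
    "<tr><td><a>Board A</a></td></tr>\n"
    "<tr><td><a>Board B</a></td></tr>\n"
    "</tbody></table>"
)

tbody = ET.fromstring(FRAGMENT).find("tbody")

# Text nodes that are direct children of <tbody>: the text before the first
# child plus the "tail" after each child. This corresponds to //tbody/text().
direct_text = [tbody.text] + [child.tail for child in tbody]

# Text anywhere below <tbody>, which is where the visible names live.
nested_text = [t.strip() for t in tbody.itertext() if t.strip()]

print(direct_text)   # whitespace-only strings
print(nested_text)   # the link texts
```

In XPath terms, a descendant query such as `//tbody//a/text()` would reach the link text in a fragment shaped like this one.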

chen1203 posted on 2021-9-16 16:56:03

chen1203 posted on 2021-9-16 16:55
response.xpath("//tbody/text()").extract()
['\r\n', '\r\n', '\r\n', '\r\n', '\r\n', '\r\n', '\r\ ...

The source code is as follows:
import scrapy
from tutorial.items import TutorialItem

class QiuSpider(scrapy.Spider):
    name="qiu"

    allowed_domains=["fishc.com.cn"]
    urls=("https://fishc.com.cn/forum-173-1.html",)

    def parse(self,response):
        item=TutorialItem()

        item["content"]=response.xpath("//tbody/text()").extract()
        yield item

chen1203 posted on 2021-9-16 16:57:32

chen1203 posted on 2021-9-16 16:56
The source code is as follows: import scrapy
from tutorial.items import TutorialItem
class QiuSpider(scrapy.Spider) ...

import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    content=scrapy.Field()

chen1203 posted on 2021-9-16 16:58:09

chen1203 posted on 2021-9-16 16:57
import scrapy

from itemadapter import ItemAdapter

class TutorialPipeline:
    def process_item(self, item, spider):
        with open("date.text","wb",encoding="utf-8") as f:
            f.write(item["content"])
        return item
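A note on the pipeline above: once items actually reach it, it would fail for two reasons. `open()` rejects an `encoding` argument in binary mode (`"wb"`), and `f.write()` expects a string, whereas `extract()` returns a list. The sketch below shows one possible fix, assuming the pipeline is enabled in `ITEM_PIPELINES` and that writing one name per line is what is wanted. The `output_path` attribute is a hypothetical addition so the file location can be changed.

```python
class TutorialPipeline:
    # Hypothetical attribute; the original post hard-codes "date.text".
    output_path = "date.text"

    def process_item(self, item, spider):
        # item["content"] is a list of strings, so clean each entry first
        # and skip the whitespace-only ones.
        lines = [text.strip() for text in item["content"] if text.strip()]
        # Text mode ("a" appends) accepts encoding=; binary mode does not.
        with open(self.output_path, "a", encoding="utf-8") as f:
            for line in lines:
                f.write(line + "\n")
        return item
```

Append mode is used here so that a later item does not overwrite an earlier one. For larger crawls, opening the file once in `open_spider` and closing it in `close_spider` would avoid reopening it for every item.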

wp231957 posted on 2021-9-16 17:04:16

chen1203 posted on 2021-9-16 16:55
response.xpath("//tbody/text()").extract()
['\r\n', '\r\n', '\r\n', '\r\n', '\r\n', '\r\n', '\r\ ...

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
}
url = "https://fishc.com.cn/forum-173-1.html"
html = requests.get(url, headers=headers)
html.encoding = "gbk"  # decode the response body as GBK
obj = etree.HTML(html.text)
# Link text of each sub-board under the subforum_173 block; the path has no tbody
data = obj.xpath("//div[@id='subforum_173']/table/tr/td/dl/dt/a/text()")
print(data)

wp231957 posted on 2021-9-16 17:06:37

chen1203 posted on 2021-9-16 16:56
The source code is as follows: import scrapy
from tutorial.items import TutorialItem
class QiuSpider(scrapy.Spider) ...

There is no tbody in the XPath to the data you want.
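Some background that may explain this reply: browsers insert a `<tbody>` element into the rendered DOM of a table even when the HTML sent by the server has none, so a path copied from the browser's developer tools can contain a `tbody` step that matches nothing in the downloaded source. Whether that is what happened here is an assumption. The sketch below uses a made-up fragment and the standard library's `xml.etree.ElementTree` to show how one extra step in the path empties the result.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the path in the reply above:
# the table rows sit directly under <table>, with no <tbody>.
FRAGMENT = (
    "<div id='subforum_173'><table>"
    "<tr><td><dl><dt><a>Board A</a></dt></dl></td></tr>"
    "<tr><td><dl><dt><a>Board B</a></dt></dl></td></tr>"
    "</table></div>"
)

root = ET.fromstring(FRAGMENT)

# A path with a tbody step finds nothing, because the source has no tbody.
with_tbody = root.findall("./table/tbody/tr/td/dl/dt/a")

# The same path without the tbody step reaches the links.
without_tbody = root.findall("./table/tr/td/dl/dt/a")

print(len(with_tbody))
print([a.text for a in without_tbody])
```

Checking the path against the raw response (for example in `scrapy shell`) rather than against the browser's element inspector avoids this mismatch.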

chen1203 posted on 2021-9-16 18:17:15

wp231957 posted on 2021-9-16 17:04

This needs to be done as a Scrapy project.
