I want to use an external loop to repeatedly invoke Scrapy and crawl different pages, but right now I'm running into the problem below. Source code attached:

1. The spider:
import scrapy
import requests
import re as r
#from GoooodDesigh.items import goooodpostItem
import json
import pickle as pk
import time as ti

class goooodspider(scrapy.Spider):
    # importing center executes its module-level code, including the while-loop below
    from center import current_url
    print(current_url)
    start_urls = current_url
    name = 'goooodpro'
    allowed_domain = ['gooood.cn']   # note: Scrapy expects "allowed_domains"
    imglist = []
    namelist = []

    def parse(self, response):
        print(response.body)
        it = goooodpostItem()   # fails: the import above is commented out
        yield it
2. The external file (center.py):
import os
import pickle as pk
import time as ti

# input
def input():
    pass

# post request
def post():
    pass

# cmd
oriurl = pk.load(open('C:\\Users\\surface\\Desktop\\GoooodDesigh\\url_postlist.pkl', 'rb'))
oriurl2 = oriurl['url'].split(',')
post_url = []
for i in range(len(oriurl2)):
    # both branches are identical and append the whole list, not oriurl2[i]
    if i == 0:
        fileter = oriurl2
        true = fileter
    else:
        fileter = oriurl2
        true = fileter
    print(true)
    post_url.append(true)

d = 0
current_url = post_url
while True:
    #os.system("cd C:\\Users\\surface\\Desktop\\GoooodDesigh\\GoooodDesigh")
    os.system("python -m scrapy crawl goooodpro")
    print(d)

By default Scrapy obeys the robots exclusion rules, i.e. in settings, ROBOTSTXT_OBEY = True. Set ROBOTSTXT_OBEY = False, which tells it not to follow those rules.
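For reference, this is what that change looks like in the project's settings.py:

    # settings.py
    # stop Scrapy from filtering requests forbidden by robots.txt
    ROBOTSTXT_OBEY = False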
It complains about a missing module, at: it = goooodpostItem()
Didn't you comment that import out? Why are you still using it later?
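A minimal sketch of the fix: restore the commented-out import, and make sure GoooodDesigh/items.py actually defines goooodpostItem. The field names below are assumptions (guessed from the spider's imglist/namelist), not the poster's actual definitions:

    # GoooodDesigh/items.py (hypothetical field names)
    import scrapy

    class goooodpostItem(scrapy.Item):
        img = scrapy.Field()    # assumed field for image URLs
        name = scrapy.Field()   # assumed field for post names

    # in the spider, restore the commented-out import:
    from GoooodDesigh.items import goooodpostItem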
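On the original question of driving Scrapy from an external loop: note that the spider imports center, so every scrapy crawl process re-executes center.py, including its while True loop, which spawns more crawl processes. One common pattern is to invert the dependency and pass each URL to a fresh crawl process via a -a spider argument. This is a sketch, not the poster's code; the url argument name and the subprocess approach are assumptions:

    # run_spider.py - a minimal sketch replacing the while True loop
    import pickle as pk
    import subprocess

    # load the comma-separated URL list as before
    oriurl = pk.load(open('C:\\Users\\surface\\Desktop\\GoooodDesigh\\url_postlist.pkl', 'rb'))
    post_url = oriurl['url'].split(',')

    for url in post_url:
        # each call starts a fresh Scrapy process for one URL;
        # "-a url=..." is passed to the spider's __init__ as a keyword argument
        subprocess.run(['scrapy', 'crawl', 'goooodpro', '-a', f'url={url}'], check=True)

On the spider side, accept the argument instead of importing center:

    class goooodspider(scrapy.Spider):
        name = 'goooodpro'

        def __init__(self, url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.start_urls = [url] if url else []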