Here is part of the code:

class BizhiSpider(CrawlSpider):
    name = 'bizhi'
    # allowed_domains = ['www.xxxx.com']
    start_urls = ['http://www.netbian.com/weimei/']
    start_link = LinkExtractor(allow=r'index_\d+?\.htm')
    detail_link = LinkExtractor(allow=r'desk.+?\.htm')
    link = []
    detail = []
    for each in start_link:
        each = 'http://www.netbian.com/' + each
        link.append(each)
    for url in detail_link:
        url = 'http://www.netbian.com/' + url
        detail.append(url)
    # parse the pagination URLs
    rules = (
        Rule(link, callback='parse_item', follow=False),
        Rule(detail, callback='detail_parse', follow=True),
    )
    # callback for the pagination pages
    def parse_item(self, response):
        pass
        # li_list = response.xpath('//*[@id="main"]/div[2]/ul/li')
        # item = QuanzhanproItem()
        # for li in li_list:
        #     img_name = li.xpath('./a/@title').extract_first()
        #     item['img_name'] = img_name

    def detail_parse(self, response):
        src = response.xpath('//*[@id="main"]/div[2]/div/p/a/img/@src').extract_first()
        item = QuanzhanproItem()
        item['src'] = src
        yield item
After using the link extractors I got incomplete URLs, so I thought I could iterate over the extractors and prepend the site prefix, but that raised an error. Is there a decent way to solve this?