Here is part of the code:
class BizhiSpider(CrawlSpider):
    name = 'bizhi'
    # allowed_domains = ['www.xxxx.com']
    start_urls = ['http://www.netbian.com/weimei/']
    start_link = LinkExtractor(allow='index_\d+?.htm')
    detail_link = LinkExtractor(allow='desk.+?\.htm')
    link = []
    detail = []
    for each in start_link:
        each = 'http://www.netbian.com/' + each
        link.append(each)
    for url in detail_link:
        url = 'http://www.netbian.com/' + url
        detail.append(url)
    # rules for parsing the page-number URLs
    rules = (
        Rule(link, callback='parse_item', follow=False),
        Rule(detail, callback=detail_parse, follow=True)
    )

    # callback for the pagination pages
    def parse_item(self, response):
        pass
        # li_list = response.xpath('//*[@id="main"]/div[2]/ul/li')
        # item = QuanzhanproItem()
        # for li in li_list:
        #     img_name = li.xpath('./a/@title').extract_first()
        #     item['img_name'] = img_name

    def detail_parse(self, response):
        src = response.xpath('//*[@id="main"]/div[2]/div/p/a/img/@src').extract_first()
        item = QuanzhanproItem()
        item['src'] = src
        yield item
The link extractors seemed to give me incomplete URLs, so I thought I could iterate over them and concatenate the domain prefix, but that raised an error. Is there a reasonable way to solve this?
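A likely cause of the error: a `LinkExtractor` is not an iterable of URL strings, so the two `for` loops fail before any request is made. It only produces links when applied to a downloaded page via `extract_links(response)`, and the `Link` objects it yields already carry absolute URLs, because Scrapy resolves every relative `href` against the page URL internally. So no manual concatenation should be needed: passing `start_link` and `detail_link` straight into the `Rule(...)` objects (with the callback given as a string, e.g. `callback='detail_parse'`) ought to be enough. The resolution Scrapy performs is the same as the stdlib's `urljoin`; a minimal sketch, assuming hypothetical relative hrefs like the ones on that listing page:

```python
from urllib.parse import urljoin

# The page being crawled (from the post's start_urls):
page_url = 'http://www.netbian.com/weimei/'

# Hypothetical relative hrefs as they might appear in the page's HTML:
relative_detail = 'desk/12345.htm'    # relative to the current directory
rooted_detail = '/desk/12345.htm'     # relative to the site root

# urljoin resolves each against the page URL, which is exactly what
# Scrapy's LinkExtractor / response.urljoin do automatically:
print(urljoin(page_url, relative_detail))  # http://www.netbian.com/weimei/desk/12345.htm
print(urljoin(page_url, rooted_detail))    # http://www.netbian.com/desk/12345.htm
```

In other words, the spider can drop the `link`/`detail` lists and the loops entirely and write `rules = (Rule(start_link, callback='parse_item', follow=False), Rule(detail_link, callback='detail_parse', follow=True))`; if a callback ever does need to build a URL by hand, `response.urljoin(relative_href)` does the joining correctly.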