# -*- coding: utf-8 -*-
import scrapy

from tutorial.items import DemzItem


class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'https://www.dmoztools.net/Computers/Programming/Languages/Python/Books/',
        'https://www.dmoztools.net/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        sel = scrapy.selector.Selector(response)
        # Each listed site sits in a div with class "title-and-desc".
        sites = sel.xpath('//div/div/div[@class="title-and-desc"]')
        items = []
        for site in sites:
            item = DemzItem()
            item['title'] = site.xpath('a/div/text()').extract()
            # The href is the link; the "site-descr " div holds the description.
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('div[@class="site-descr "]/text()').extract()
            items.append(item)
        return items
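I have a question about the code above: if I change allowed_domains = ['dmoz.org'] to ['dmoztools.net'], why does the spider stop crawling anything? Why does it only work with dmoz.org? Isn't the list supposed to contain the correct domain? The code is from Lesson 63 (the Scrapy framework) of Xiao Jiayu's (小甲鱼) Python zero-basics course.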
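For reference, here is a minimal sketch of how I understand an allowed_domains check to work: a URL counts as on-site when its host equals one of the listed domains or is a subdomain of one. The helper name host_matches is hypothetical and this is not Scrapy's own code, only an illustration of the matching rule I expected to apply.

```python
from urllib.parse import urlparse


def host_matches(url, allowed_domains):
    """Return True if the URL's host is one of allowed_domains or a subdomain of one.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


url = 'https://www.dmoztools.net/Computers/Programming/Languages/Python/Books/'
print(host_matches(url, ['dmoztools.net']))  # True: www.dmoztools.net is a subdomain
print(host_matches(url, ['dmoz.org']))       # False: the host does not end in dmoz.org
```

Under this rule the start URLs match dmoztools.net and not dmoz.org, which is why the behavior I am seeing confuses me.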