Chinese text in a crawled web page shows up as garbled characters
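I just followed episode p64 of 小甲鱼's tutorial and crawled page data from chinadmoz (http://www.chinadmoz.org/), but in the crawled results the Chinese text appears to be in hexadecimal. How do I convert it back to Chinese?
# Code in the spider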
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    #allowed_domains = ["chinadmoz.org"]
    start_urls = ["http://www.chinadmoz.org/subindustry/42/",
                  "http://www.chinadmoz.org/subindustry/46/"]

    def parse(self,response):
        items = []
        rel = scrapy.selector.Selector(response)
        sites = rel.xpath('//ul/li/div')
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('h4/@title').extract()
            item['link']= site.xpath('h4/a/@href').extract()
            item['desc'] = site.xpath('p/text()').extract()
            items.append(item)
        return items
Set the encoding. Try changing the return items on line 21 to return items.encode('utf-8') https://fishc.com.cn/thread-193037-1-1.html
Give it a try?
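One caveat on the suggestion above: items is a Python list, and lists have no encode method, so items.encode('utf-8') would raise AttributeError. If the "hexadecimal" text looks like \u4e2d\u6587, it is probably not corrupted at all, just Unicode escapes as written by a JSON export (or by Python 2 when printing a list). The sketch below uses only the standard json module to show the difference; the item dict and its values are made up for illustration, and this assumes the output in question was JSON.

```python
import json

# Hypothetical scraped item: the field names mirror the spider above,
# the values are invented for this example.
item = {"title": ["中文网站"], "link": ["http://www.chinadmoz.org/"]}

# Default JSON serialization escapes non-ASCII characters as \uXXXX,
# which can look like hexadecimal at first glance.
escaped = json.dumps(item)

# ensure_ascii=False keeps the Chinese characters as they are.
readable = json.dumps(item, ensure_ascii=False)

# Both strings decode to the same data, so nothing is lost in the escaped form.
same_data = json.loads(escaped) == json.loads(readable)

print(escaped)
print(readable)
```

If the results are being written with Scrapy's JSON feed export (for example scrapy crawl dmoz -o items.json), setting FEED_EXPORT_ENCODING = 'utf-8' in the project's settings.py should make the file contain readable Chinese instead of \uXXXX escapes. That setting exists in Scrapy 1.2 and later.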