Chinese text scraped from a web page shows up garbled, Python Discussion, Programming Languages, FishC Forum

垂天之云 posted on 2021-3-28 14:07:09

Chinese text scraped from a web page shows up garbled

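I just followed episode P64 of 小甲鱼's course and scraped the page data from chinadmoz (http://www.chinadmoz.org/), but in the scraped output the Chinese text seems to come out as hexadecimal codes. How do I convert it back to Chinese?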
# Code in the spider
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    #allowed_domains = ["chinadmoz.org"]
    start_urls = ["http://www.chinadmoz.org/subindustry/42/",
                  "http://www.chinadmoz.org/subindustry/46/"]

    def parse(self, response):
        items = []
        rel = scrapy.selector.Selector(response)
        sites = rel.xpath('//ul/li/div')
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('h4/@title').extract()
            item['link'] = site.xpath('h4/a/@href').extract()
            item['desc'] = site.xpath('p/text()').extract()
            items.append(item)
        return items

名字只有七个字 posted on 2021-3-28 14:19:12

Set the encoding.
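A minimal sketch of what "setting the encoding" could mean here, assuming the items were exported to a JSON file (for example with `scrapy crawl dmoz -o items.json`; the thread does not say how the output was produced). Python's json module writes non-ASCII characters as \uXXXX escapes by default, which can easily be mistaken for hexadecimal. The sample title below is made up for illustration. In a Scrapy project, the usual way to get the same effect is to put FEED_EXPORT_ENCODING = 'utf-8' in settings.py.

```python
import json

# Hypothetical item, shaped like the DmozItem fields in the question.
item = {"title": ["中文网站"], "link": ["http://www.chinadmoz.org/"]}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the Chinese characters are kept as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to the same data, so nothing is lost either way.
assert json.loads(escaped) == json.loads(readable) == item
```

The escaped form is still valid JSON and loads back to the original Chinese text; the setting only changes how readable the file is to a person.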

名字只有七个字 posted on 2021-3-28 14:28:36

Try changing line 21, return items, to return items.encode('utf-8')
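A quick sketch for checking this suggestion before applying it, under the assumption that items is the plain Python list built in parse(). Lists have no encode() method, so calling it on the list itself would likely raise an AttributeError. Encoding an individual string does work, but it produces bytes rather than readable text. The sample data is hypothetical.

```python
# Hypothetical scraped values, standing in for the list returned by parse().
items = [{"title": ["中文网站"]}]

# A list has no encode() method, so this call raises AttributeError.
try:
    items.encode("utf-8")
    list_has_encode = True
except AttributeError:
    list_has_encode = False

# Encoding a single string works, but the result is bytes, not readable text.
title = items[0]["title"][0]
encoded = title.encode("utf-8")

# Decoding the bytes with the same codec recovers the original string.
decoded = encoded.decode("utf-8")

print(list_has_encode, encoded, decoded)
```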

Daniel_Zhang posted on 2021-3-28 16:21:20

https://fishc.com.cn/thread-193037-1-1.html

Give this a try?
