[Web scraping] Ran into some problems, Python Discussion, Programming Languages section, 鱼C论坛 (FishC Forum)

昨非 posted on 2021-3-26 16:20:03

[Web scraping] Ran into some problems

Link to the page source: view-source:http://www.xbiquge.la/1/1508/1158793.html

Page URL: http://www.xbiquge.la/1/1508/1158793.html

My code:
from urllib.request import Request,urlopen
from fake_useragent import UserAgent

url = "http://www.xbiquge.la/1/1508/1158793.html"

headers={
"User-Agent":UserAgent().chrome
}
request =Request(url,headers=headers)

response = urlopen(request)
print(response.read())

Strangely, though, the response turned out to be a pile of bytes like this:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed|iOcY\x9a\xe6\xf7\x92\xf2?\xb8\x90\xa6+B\xca\xc0A,\x95\x19\x19\xcbH\x95U\xd2\x8c\xd4S]j\xe5h\xba\xd5j\x85\x1c\xe0\x0c\xc8 \x80\x04\x93\x91\xd9\xad\x96l\x8cW\xf0\xc2\xbe\xd8\xec\x06\x82\xd5\x06\x0cx\xf7\x7f\xa9\xf4\xb9\xcb\xa7\xf8\x0b\xf3\xbc\xe7\xbd\xbe\xbe\x06\x13e\xd7|\x9d\xd4\r\xa7\xf1=\xe7=\xef\xbe\x9ds\xef\x8b\xdf\xfe\xf1\x9f\xbe\xfd\xee_\xff\xf2\'[\xbf\xeb\xfd\xa0\xed/\xff\xfb\x0f\xff\xf8?\xbf\xb5u=\xb0\xdb\xff\xcf\xe3o\xed\xf6?~\xf7G\xdb\xbf\xfc\x8f\xef\xfe\xd7?\xdaz\xba\x1f\xda\xbe\x1bu\x0c\x8d\r\xb8\x06\x86\x87\x1c\x83v\xfb\x9f\xfe\xdce\xeb\xeaw\xb9F\xbe\xb1\xdb?|\xf8\xd0\xfd\xe1q\xf7\xf0\xe8[\xfbw\xffl\xff\x99`\xf5\xd0d\xe3\xeb\x03\x97efw\x9f\xab\xaf\xeb\xd5\x17\xbfy!W\xfc\xf9\xfd\xe0\xd0\xd8\xcb\x16pz\x9e={\xc6\xd3y\xb0\xd3\xd1G\xff\x7

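After read(), the result cannot be decoded (I tried a whole bunch of encodings and none of them worked).
Could someone please take a look?
I really don't have much experience with web scraping.

逃兵 posted on 2021-3-26 17:49:50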

import requests
from lxml import html

url = "http://www.xbiquge.la/1/1508/1158793.html"

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
# requests decompresses gzip-encoded response bodies automatically
response = requests.get(url, headers=headers)

page = response.content.decode("utf-8")

selector = html.fromstring(page)

# Text nodes directly inside the chapter body <div id="content">
txt_list = selector.xpath('//div[@id = "content"]/text()')

txt = ''

for i in txt_list:
    # repr() makes \r and \xa0 visible as escape text so they can be stripped
    i = repr(i).replace(r'\r', '').replace(r'\xa0', '').replace("'", '')
    txt += i

print(txt)
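
The repr() trick above works, but it also removes every apostrophe in the text, and any newline inside a node would come out as a literal backslash-n. A minimal sketch of an alternative, assuming the text nodes contain carriage returns and non-breaking spaces (\xa0) as the snippet above suggests. clean_text is a hypothetical helper name and the sample strings are made up for illustration:

```python
def clean_text(nodes):
    """Join text nodes, dropping carriage returns and non-breaking spaces."""
    parts = []
    for node in nodes:
        # Work on the string itself rather than on its repr()
        cleaned = node.replace('\r', '').replace('\xa0', '').strip()
        if cleaned:
            parts.append(cleaned)
    return '\n'.join(parts)


# Made-up sample resembling the list returned by selector.xpath(...)
sample = [
    '\xa0\xa0\xa0\xa0First paragraph.\r',
    '\r',
    "\xa0\xa0\xa0\xa0It's the second.\r",
]
print(clean_text(sample))
```

With the code above, this would be called as clean_text(txt_list) in place of the for loop.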

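昨非 posted on 2021-3-26 17:52:29

逃兵 posted on 2021-3-26 17:49

Thanks, mate. There are a lot of expressions here I haven't seen before.
I'll go back and study it some more, and I'll mark the best answer once I'm back.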

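逃兵 posted on 2021-3-26 20:55:03

昨非 posted on 2021-3-26 17:52
Thanks, mate. There are a lot of expressions here I haven't seen before.
I'll go back and study it some more, and I'll mark the best answer once I'm back.

Web scraping is my weak spot too. I copied this from somewhere.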

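昨非 posted on 2021-3-26 22:13:17

逃兵 posted on 2021-3-26 20:55
Web scraping is my weak spot too. I copied this from somewhere.

Yeah, I've been thrown in at the deep end lately and went straight into a course project.
There are quite a lot of problems.

Daniel_Zhang posted on 2021-3-27 00:42:33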

url = "http://www.xbiquge.la/1/1508/1158793.html"

headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}

def way_1():
    import requests
    # requests decompresses gzip-encoded response bodies automatically
    html = requests.get(url, headers=headers)
    html.encoding = 'utf-8'
    return html.text

def way_2():
    from urllib.request import Request, urlopen
    import gzip
    request = Request(url, headers=headers)
    response = urlopen(request)
    # urllib returns the body as sent; this server sends it gzip-compressed
    return gzip.decompress(response.read()).decode("utf-8")

print(way_1())
print(way_2())
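
The bytes in the original question begin with \x1f\x8b, the gzip magic number, which is why no text encoding could decode them and why way_2 calls gzip.decompress. A hedged sketch for the case where a server does not always compress: decode_body is a hypothetical helper that decompresses only when the body starts with those two bytes. The sample data is built locally with gzip.compress and is not real data from the site:

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def decode_body(raw, encoding='utf-8'):
    """Return text from raw response bytes, gunzipping first if needed."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Local stand-ins for response.read(); nothing here touches the network
plain = '<div id="content">正文</div>'.encode('utf-8')
compressed = gzip.compress(plain)

print(decode_body(plain))
print(decode_body(compressed))
```

In way_2 this would be used as decode_body(response.read()). Checking response.headers.get('Content-Encoding') is the more conventional test; the magic-number check is used here only so the example runs offline.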

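Daniel_Zhang posted on 2021-3-27 00:43:40

Two methods are provided above.

The first is my own; the second is an improvement based on your code.

Baidu is a good thing, though not as good as Google.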

https://blog.csdn.net/DylanYuan/article/details/81533105

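昨非 posted on 2021-3-27 01:12:45

Daniel_Zhang posted on 2021-3-27 00:43
Two methods are provided above.

The first is my own; the second is an improvement based on your code.

True,
but the Baidu experience keeps getting worse these days.
It's nothing but ads.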

Daniel_Zhang posted on 2021-3-27 01:14:43

昨非 posted on 2021-3-27 01:12
True,
but the Baidu experience keeps getting worse these days.
It's nothing but ads.

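Still not asleep?

Think about it: who is Baidu's competitor? There isn't one, is there? Without a rival, how can there be competition?

If Yahoo and Google hadn't pulled out of the Chinese market, Baidu would long since have...

Baidu has 1.4 billion users at the very most; Google has 7 billion.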
