Ran into this problem while scraping

苏绛雪 · posted on 2020-2-25 22:44:38

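While scraping a site with pictures of girls, I ran into the problem shown in the screenshot.
The site is https://www.pexels.com/search/beautiful%20girl/
Could anyone tell me how to solve this?

My code is below: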

import requests
import re
import time

Headers = {
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"}
response = requests.get("https://www.pexels.com/search/beautiful%20girl/", headers=Headers)
html = response.text
print(html)
print("------------------------------------------------------------------------------------------------")
# Capture the first URL of each srcset. The earlier group "(.*?).*?" was lazy and always matched an empty string.
urls = re.findall(r'<a class="js-photo-link photo-item__link" style=".*?" title=".*?" href=".*?"><img srcset="([^\s",]+).*?" .*?>', html)
# Save the images (for now, just print each URL)
for url in urls:
    print(url)

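As an alternative to matching whole tags with a regular expression, here is a minimal sketch that pulls the first srcset URL out of each photo link with the standard library's html.parser. The class name photo-item__link comes from the code above; SrcsetCollector, extract_image_urls and the sample HTML are made up for illustration, and splitting srcset on commas is a simplification that assumes the URLs themselves contain no commas.

```python
from html.parser import HTMLParser


class SrcsetCollector(HTMLParser):
    """Collect the first srcset URL of every <img> found inside a photo link."""

    def __init__(self):
        super().__init__()
        self.in_photo_link = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            classes = (attrs.get("class") or "").split()
            self.in_photo_link = "photo-item__link" in classes
        elif tag == "img" and self.in_photo_link and attrs.get("srcset"):
            # srcset is a comma-separated list of "URL descriptor" entries
            parts = attrs["srcset"].split(",")[0].split()
            if parts:
                self.urls.append(parts[0])

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_photo_link = False


def extract_image_urls(html):
    parser = SrcsetCollector()
    parser.feed(html)
    return parser.urls


# Hypothetical sample markup, shaped like what the regex above expects
sample = (
    '<a class="js-photo-link photo-item__link" style="x" title="t" href="/photo/1/">'
    '<img srcset="https://example.com/a.jpg 1x, https://example.com/a@2x.jpg 2x" alt="">'
    '</a>'
)
print(extract_image_urls(sample))
```

This only helps once the response body is readable HTML; it does nothing about a response that comes back compressed or blocked.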

＿摆架_回宫、 · posted on 2020-2-25 23:55:21

import requests
import re
import time

s = requests.session()
Headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    # Note: advertising 'br' lets the server reply with Brotli, which requests only decodes when a Brotli package is installed
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
response = s.get("https://www.pexels.com/search/beautiful%20girl/", headers=Headers)
html = response.text
print(html)
print("------------------------------------------------------------------------------------------------")
# Capture the first URL of each srcset. The earlier group "(.*?).*?" was lazy and always matched an empty string.
urls = re.findall(r'<a class="js-photo-link photo-item__link" style=".*?" title=".*?" href=".*?"><img srcset="([^\s",]+).*?" .*?>', html)
# Save the images (for now, just print each URL)
for url in urls:
    print(url)


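I filled in all the headers, but the returned data looks like the sample below. From what I can see, the site seems to return its own font file; I'll try parsing it later.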

�մ���p�~9K��y�۲�Y�f7cY�IT��%v��I���.�m��%Y6��ɵS1����~�
(##eX�d��]�w��à\�Y�*թP�
�wkVN���c���5]��j5�`\�Z����d����~�F�U*�Ӥ15����*��.�~����ط�,d�*�3Ό���ɮ������l�+��d��ߒx�(;��������ʔD��4����Nn��|+�-`[a@+4t[~�D�!pU��

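One possible explanation for the unreadable output above, offered as an assumption and not a confirmed diagnosis: the request advertises br in accept-encoding, and requests only decompresses Brotli bodies when a Brotli package (brotli or brotlicffi) is installed. Without one, response.text is the raw compressed bytes. A minimal sketch that advertises only the encodings the local environment can decode; supported_accept_encoding is a hypothetical helper name:

```python
import importlib.util


def supported_accept_encoding():
    """Return an accept-encoding value limited to what this environment can decode."""
    encodings = ["gzip", "deflate"]  # both are handled via the standard library's zlib
    # Brotli needs a third-party decoder, so only advertise it when one is importable.
    if any(importlib.util.find_spec(name) for name in ("brotli", "brotlicffi")):
        encodings.append("br")
    return ", ".join(encodings)


headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "accept-encoding": supported_accept_encoding(),
}
print(headers["accept-encoding"])
```

Passing these headers to s.get(...) in place of the hard-coded accept-encoding value would be one way to check whether compression is the cause. If the site still rejects the request, the blocking is happening elsewhere.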

哈喇子淌一手 · posted on 2020-2-26 01:28:06

Request URL: https://www.pexels.com/search/beautiful%20girl/
Request Method: GET
Status Code: 200
Remote Address: 104.17.209.102:443
Referrer Policy: no-referrer-when-downgrade
age: 5610
alt-svc: h3-25=":443"; ma=86400, h3-24=":443"; ma=86400, h3-23=":443"; ma=86400
cache-control: max-age=3600
cf-cache-status: HIT
cf-ray: 56ab56e44b4b99c5-LAX
content-encoding: br
content-type: text/html; charset=utf-8
date: Tue, 25 Feb 2020 17:18:36 GMT
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
expires: Tue, 25 Feb 2020 18:18:36 GMT
server: cloudflare
status: 200
vary: Accept-Encoding
x-frame-options: ALLOWALL
x-request-id: 9e10213b-644a-4e0e-90e0-8aa664df9caa
x-runtime: 1.586400


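I tested with both a browser and Postman: the browser works fine, Postman does not.
The headers above are what the browser received. You can see that the site is served from Cloudflare's cache (cf-cache-status: HIT, server: cloudflare), so everything is proxied through Cloudflare. Cloudflare specializes in exactly this, so it looks hard to get around: it can evidently inspect the characteristics of each request and so know whether you are really using a browser.
Also, when I disabled JavaScript in the browser, all the resources could still be loaded, which suggests it really is Cloudflare doing this.
For scraping, the simplest approach is to use a real browser directly: pyqtwebengine, pythonnet + cefsharp, or simply chromedriver + selenium. Note that you can turn off the browser's JavaScript engine, which makes it very fast. Qt4's WebKit would also work.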
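The chromedriver + selenium route can be sketched roughly as follows. It assumes the selenium package and a chromedriver matching the local Chrome are installed. make_chrome_driver and fetch_rendered_html are hypothetical helper names, and headless mode is an optional choice that was not part of the suggestion above.

```python
def make_chrome_driver():
    """Start headless Chrome; needs the selenium package and a matching chromedriver."""
    from selenium import webdriver  # imported lazily so the helper below works without it

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # assumption: run without a visible window
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, make_driver=make_chrome_driver):
    """Load url in a real browser and return the resulting page source."""
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


# Usage (launches a real browser, so it is left commented out here):
# html = fetch_rendered_html("https://www.pexels.com/search/beautiful%20girl/")
```

The driver factory is passed in as a parameter so that a different browser, or a stub during testing, can be swapped in without changing the fetch logic.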

python/print · posted on 2020-2-26 09:10:46

In a Python scraper you can't use print(html).

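苏绛雪 · posted on 2020-2-26 09:49:29

＿摆架_回宫、 posted on 2020-2-25 23:55
I filled in all the headers, but the returned data looks like the sample below. From what I can see, the site seems to return its own font file; I'll try parsing it later.

It looks like an encoding error.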

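苏绛雪 · posted on 2020-2-26 09:50:45

哈喇子淌一手 posted on 2020-2-26 01:28
I tested with both a browser and Postman: the browser works fine, Postman does not.
The headers above are what the browser received. You can see ...

I really can't follow this.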

＿摆架_回宫、 · posted on 2020-2-26 10:48:59

The third reply has it right: whatever a plain GET can't handle, use chromedriver + selenium. Recent books recommend this combination.

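苏绛雪 · posted on 2020-2-26 10:56:53

＿摆架_回宫、 posted on 2020-2-26 10:48
The third reply has it right: whatever a plain GET can't handle, use chromedriver + selenium. Recent books recommend this combination.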

Traceback (most recent call last):
File "J:\python_code_guitulb\get_pexel_girls\get_p_girls.py", line 14, in <module>
print(html)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 2179-2179: Non-BMP character not supported in Tk

This is the error from the third post.
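The traceback above says that print(html) failed because the Tk-based shell could not display a character outside the Basic Multilingual Plane (for example an emoji in the page). A minimal sketch of one workaround, replacing such characters before printing; bmp_safe is a hypothetical helper name and the sample string is made up:

```python
def bmp_safe(text, replacement="?"):
    """Replace characters above U+FFFF, which the Tk shell in the traceback cannot display."""
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


html = "photo \U0001F600 caption"  # made-up sample containing one non-BMP character
print(bmp_safe(html))
```

This only affects what is shown on screen; the original html string is unchanged and can still be parsed. Running the script from a regular terminal, or writing html to a UTF-8 file, would likely avoid the error as well.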

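＿摆架_回宫、 · posted on 2020-2-26 11:28:09

苏绛雪 posted on 2020-2-26 10:56
Traceback (most recent call last):
File "J:\python_code_guitulb\get_pexel_girls\get_p_girls.py" ...

What I meant was to use selenium to control a browser directly, so that the visit is a real one. I'm still looking into this Baidu Cloud Acceleration (百度云加速); if I find anything new I'll post it here.
If you don't want to dig into the technical side and just want the images, use selenium directly.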
