[Solved] Web crawler challenge

人造人 · posted on 2023-11-28 01:56:38

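I took a quick look and it's pretty simple, ^_^
Here's the approach in brief.
First I fetched the main page and found that the directory listing isn't in it.
Then I looked at the data the browser receives and found the address you mentioned: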

http://www.038909.xyz:5678/api/fs/list


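After a bit of study I worked out how this endpoint is used.
It really just comes down to looking at the data it returns for several directories and spotting the pattern.
Then I wrote the code according to that pattern.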
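
To make the pattern concrete, here is a minimal sketch of the paging logic. It assumes the response shape that the program further down relies on (a `data` object holding `content` and `total`). `list_all`, `fetch_page` and `fake_fetch` are hypothetical names; the fake fetcher is there only so the example runs without touching the server.

```python
import itertools

def list_all(fetch_page, path, per_page=30):
    """Collect every entry of one directory by requesting pages until
    the number of collected entries matches the reported total."""
    entries = []
    for page in itertools.count(1):
        data = fetch_page(path, page, per_page)
        batch = data["content"] or []   # an empty directory may report null
        entries.extend(batch)
        # Stop on the reported total, or on an empty page as a safety net.
        if len(entries) >= data["total"] or not batch:
            break
    return entries

# Hypothetical in-memory stand-in for a POST to /api/fs/list.
FAKE = [{"name": f"file{i}", "is_dir": False} for i in range(70)]

def fake_fetch(path, page, per_page):
    start = (page - 1) * per_page
    return {"content": FAKE[start:start + per_page] or None, "total": len(FAKE)}

print(len(list_all(fake_fetch, "/")))
```

The extra "empty page" check is a design choice of this sketch: it keeps the loop from running forever if the reported total never matches.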

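Also, this site's anti-crawling measures are pretty decent.
The simplest workaround I have so far is just to sleep for a while:
one directory per second, and on an error wait 10 seconds and retry.
If this is your own site, could you make the directory tree smaller while testing crawlers?
I've been crawling for over ten minutes and have only reached /体育/健身, ^_^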
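
The sleep-and-retry idea described above can be sketched with only the standard library. This is a minimal illustration, not the mechanism the program itself uses (it relies on the third-party `retry` decorator). `call_with_retry` and `flaky` are hypothetical names, and the delays are shortened so the example finishes quickly.

```python
import time

def call_with_retry(func, *args, pause=1.0, retry_delay=10.0, max_tries=5):
    """Wait `pause` seconds before each call; on an exception wait
    `retry_delay` seconds and try again, up to `max_tries` attempts."""
    for attempt in range(1, max_tries + 1):
        time.sleep(pause)
        try:
            return func(*args)
        except Exception:
            if attempt == max_tries:
                raise
            time.sleep(retry_delay)

calls = []

def flaky(path):
    # Hypothetical request that fails twice before succeeding.
    calls.append(path)
    if len(calls) < 3:
        raise ConnectionError("simulated block")
    return f"listing of {path}"

result = call_with_retry(flaky, "/", pause=0.01, retry_delay=0.01)
print(result, len(calls))
```

Unlike a decorator with no retry limit, this sketch gives up after `max_tries` attempts, which avoids hammering a server that has already banned the client.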
The following program fetches the site's entire directory tree.

#!/usr/bin/env python
#coding=utf-8
import requests
import json
import itertools
import time
from retry import retry
import sys
import logging

logging.basicConfig(stream = sys.stderr, level = logging.WARNING)
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}

# On any exception, wait 10 seconds and try this directory again
@retry(delay = 10, logger = logging.getLogger())
def list_dir(path):
    time.sleep(1)  # throttle: one directory per second
    print(path, file = sys.stderr)
    content = []
    # The listing is paginated; request pages until every entry is collected
    for page in itertools.count(1):
        json_ = {"path": path, "password": "", "page": page, "per_page": 30, "refresh": False}
        url = 'http://www.038909.xyz:5678/api/fs/list'
        response = requests.post(url, json = json_, headers = headers)
        response.encoding = 'utf-8'
        json_ = json.loads(response.text)
        data = json_['data']
        if data['content'] is not None: content.extend(data['content'])
        total = data['total']
        if total == len(content): break
    # Recurse into each subdirectory and attach its listing
    for i in content:
        if not i['is_dir']: continue
        i['dir_content'] = list_dir(path + i['name'] + '/')
    return content

content = list_dir('/')
print(content)
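
The nested list this program prints is awkward to read. As a small sketch, the helper below flattens such a tree into full paths, assuming each entry carries the `name`, `is_dir` and `dir_content` keys used above. `iter_paths` and the sample data are hypothetical.

```python
def iter_paths(entries, prefix="/"):
    """Yield the full path of every entry; directory paths end with '/'."""
    for entry in entries:
        if entry["is_dir"]:
            path = prefix + entry["name"] + "/"
            yield path
            yield from iter_paths(entry.get("dir_content", []), path)
        else:
            yield prefix + entry["name"]

sample = [
    {"name": "a.txt", "is_dir": False},
    {"name": "docs", "is_dir": True, "dir_content": [
        {"name": "b.txt", "is_dir": False},
    ]},
]
paths = list(iter_paths(sample))
print(paths)
```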


人造人 · posted on 2023-11-28 02:03:23

fineconey posted on 2023-11-27 22:47
It came out garbled; the path is as follows.

Better to just use the API directly; piecing the URL together like this won't work.
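
One possible reason a hand-built URL comes out garbled is that non-ASCII path segments have to be percent-encoded, whereas the API takes the path as a plain string inside the JSON body. The sketch below only illustrates that difference with the standard library. It is an assumption about the cause, not a diagnosis, and the path is just an example taken from this thread.

```python
import json
from urllib.parse import quote, unquote

path = "/体育/健身/"

# In a URL, each non-ASCII character must become %XX escapes of its UTF-8 bytes.
encoded = quote(path)            # '/' is kept as-is by default
print(encoded)
print(unquote(encoded) == path)  # decoding restores the original path

# In a JSON request body the path is sent as an ordinary string.
body = json.dumps({"path": path, "password": ""}, ensure_ascii=False)
print(body)
```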

人造人 · posted on 2023-11-28 19:58:25

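One more revision: I found that the crawl wasn't picking up the files' download addresses. Each file needs one more request, to the address below: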

http://www.038909.xyz:5678/api/fs/get


It took over two hours of crawling just to reach "/体育/健身/【赛普健身专业视频】价值19800元/实用工具图表/体质判断对照图", ^_^
Then the server dropped the connection; it looks like my IP got banned, ^_^
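
Since a ban can interrupt a crawl that has already run for hours, one option is to save each finished directory listing to disk so that a restart skips what is already done. The following is a minimal sketch of that idea and is not part of the program below. `CachedLister`, the cache file name and `fake_list` are hypothetical.

```python
import json
import os
import tempfile

class CachedLister:
    """Wrap a listing function and persist its results in a JSON file."""

    def __init__(self, list_func, cache_path):
        self.list_func = list_func
        self.cache_path = cache_path
        self.cache = {}
        if os.path.exists(cache_path):
            with open(cache_path, encoding="utf-8") as f:
                self.cache = json.load(f)

    def __call__(self, path):
        if path not in self.cache:
            self.cache[path] = self.list_func(path)
            # Rewrite the whole file after each new directory; simple, not fast.
            with open(self.cache_path, "w", encoding="utf-8") as f:
                json.dump(self.cache, f, ensure_ascii=False)
        return self.cache[path]

requests_made = []

def fake_list(path):
    # Hypothetical stand-in for one network request.
    requests_made.append(path)
    return [{"name": "demo", "is_dir": False}]

cache_file = os.path.join(tempfile.mkdtemp(), "listing_cache.json")
CachedLister(fake_list, cache_file)("/体育/")   # first run: calls fake_list
CachedLister(fake_list, cache_file)("/体育/")   # simulated restart: served from disk
print(len(requests_made))
```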

#!/usr/bin/env python
#coding=utf-8
import requests
import json
import itertools
import time
from retry import retry
import sys
import logging

logging.basicConfig(stream = sys.stderr, level = logging.WARNING)
headers = {"User-Agent": "Mozilla/6.0 (X11; Linux x86_64; rv:109.0) Gecko/20120101 Firefox/139.0"}
dir_url = 'http://www.038909.xyz:5678/api/fs/list'
file_url = 'http://www.038909.xyz:5678/api/fs/get'

# One throttled request; on any exception, wait 5 seconds and retry
@retry(delay = 5, logger = logging.getLogger())
def read_json(path, url, json_):
    time.sleep(1)
    response = requests.post(url, json = json_, headers = headers)
    json_ = json.loads(response.text)
    # Treat an API-level error as a failure so that it is retried too
    if json_['code'] != 200: raise ValueError(json_['message'])
    return json_['data']

# Fetch the details of a single file, including its download address
def read_file(path):
    print(path, file = sys.stderr)
    json_ = {"path": path, "password": ""}
    return read_json(path, file_url, json_)

def read_dir(path):
    print(path, file = sys.stderr)
    content = []
    # The listing is paginated; request pages until every entry is collected
    for page in itertools.count(1):
        json_ = {"path": path, "password": "", "page": page, "per_page": 30, "refresh": False}
        data = read_json(path, dir_url, json_)
        if data['content'] is not None: content.extend(data['content'])
        total = data['total']
        if total == len(content): break
    # Recurse into directories; replace each file entry with its full details
    for i in range(len(content)):
        is_dir = content[i]['is_dir']
        name = content[i]['name']
        if is_dir: content[i]['dir_content'] = read_dir(path + name + '/')
        else: content[i] = read_file(path + name)
    return content

#content = read_dir('/游戏/PC/07.运行库/')
content = read_dir('/')
print(content)


人造人 · posted on 2023-11-28 21:03:18

fineconey posted on 2023-11-28 20:20
The structure is the same; you can try this one.

https://pan.mediy.cn/

Do you need me to keep this script running to crawl the whole directory tree?
^_^
