Yesterday I went back and scraped the first website I ever scraped after learning web scraping. Can you guess which one?

Stubborn · posted 2019-6-1 23:05:09

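Keep it to yourselves. Looking back at when I first scraped it,

it was endless Baidu searches and one problem after another, haha. Looking back now, a static site like this is so easy, and I can even scrape it in all sorts of different ways.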

from lxml import etree
from OpenSSL import SSL  # provided by the pyOpenSSL package
import requests
import re
import time
import os

HEADERS = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    'Referer': 'http://www.mzitu.com'
}
QUANTITY = 220  # number of listing pages to crawl (24 albums per page)
FILE_PATH = None  # download directory; set an absolute path, defaults to this script's directory


def get_path(name, num=None):
    '''
    Build the save path for one image of the album `name`.
    When num is None, only report whether the album directory exists.
    :param name: album name, used as the directory name
    :param num: image number within the album, used as the file name
    :return: path of the .jpg file, or a bool when num is None
    '''
    # Save under FILE_PATH if it is set, otherwise next to this script
    base_dir = FILE_PATH or os.path.dirname(os.path.realpath(__file__))
    file_path = os.path.join(base_dir, name)
    if num is None:
        return os.path.exists(file_path)
    os.makedirs(file_path, exist_ok=True)
    return os.path.join(file_path, num) + ".jpg"


def get_response(url):
    """Return the response for a URL, pausing first to go easy on the site."""
    time.sleep(2)
    return requests.get(url=url, headers=HEADERS)


def atlas():
    '''
    :yield: dict with the album name and the album URL
    '''
    # Listing pages are numbered from 1
    for i in range(1, QUANTITY + 1):
        url = "https://www.mzitu.com/page/{page}/".format(page=i)
        response = get_response(url=url).text
        imgurl_list = re.findall(r'<li><a href="(.*?)" target="_blank">', response)
        imgname_list = re.findall(r"alt='(.*?)' width=", response)
        for img_name, img_url in zip(imgname_list, imgurl_list):
            item = {}
            item["name"] = img_name
            print(img_name)
            # if not get_path(name=img_name):
            #     break
            item["img_url"] = img_url
            yield item
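

def get_download_url(item):
    """
    :param item: dict with the album name and the album URL
    :yield: the same item, updated with the URL and number of each image page
    """
    response = get_response(url=item["img_url"]).text
    etre = etree.HTML(response)
    # The fifth pagination link holds the number of the last image page
    num = etre.xpath("//div[@class='pagenavi']/a[5]/span/text()")
    # Image pages are numbered from 1
    for i in range(1, int(num[0]) + 1):
        item["download"] = item["img_url"] + "/" + str(i)
        item["number"] = str(i)
        yield item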


def download(item):
    """Fetch one image page, locate the image URL and save the image to disk."""
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        'Referer': item["download"]
    }
    try:
        time.sleep(2)
        etre = etree.HTML(requests.get(url=item["download"], headers=headers).text)
        download_url = etre.xpath("//div[@class='main-image']/p/a/img/@src")[0]
        response = requests.get(url=download_url, headers=headers).content
        file_path = get_path(item["name"], item["number"])
        with open(file_path, "wb") as fp:
            fp.write(response)
    except SSL.SysCallError as e:
        print("An error occurred: %s" % e)
        print(item)
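

def product(c):
    """Producer: prime the consumer coroutine, then send it every album."""
    c.send(None)  # run the coroutine up to its first yield
    for img in atlas():
        c.send(img)
    c.close()


def customer():
    """Consumer coroutine: download every image of each album it receives."""
    data = ""
    while True:
        n = yield data
        if not n:
            return
        for each in get_download_url(item=n):
            download(each)


def main():
    func = customer()
    product(func)


if __name__ == '__main__':
    main()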
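
For anyone puzzled by the product/customer pair at the end of the script: it is the generator-based coroutine pattern, where the producer pushes items into the consumer with send(). Below is a minimal sketch of the same idea with no network access. The names producer, consumer and results are made up for illustration and are not part of the script above.

```python
def consumer(results):
    """Coroutine: receives items via send() and records them."""
    while True:
        item = yield          # pause here until the producer sends something
        if item is None:      # explicit stop signal
            return
        results.append(item.upper())


def producer(coroutine, items):
    """Prime the coroutine, feed it every item, then shut it down."""
    next(coroutine)           # same as coroutine.send(None): run to the first yield
    for item in items:
        coroutine.send(item)
    coroutine.close()         # raises GeneratorExit inside the coroutine


results = []
producer(consumer(results), ["album-a", "album-b"])
print(results)
```

The priming call matters: sending a real value to a generator that has not yet reached its first yield raises a TypeError, which is why the script starts with c.send(None).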

青梅 · posted 2019-6-1 23:54:12

Traceback (most recent call last):
  File "C:/Users/Administrator/AppData/Local/Programs/Python/Python37-32/Save/1.py", line 1, in <module>
    from lxml import etree
ModuleNotFoundError: No module named 'lxml'
Totally confused.

Stubborn · posted 2019-6-2 00:58:43

青梅 posted 2019-6-1 23:54
Traceback (most recent call last):
File "C:/Users/Administrator/AppData/Local/Programs/Python/Pyt ...

The lxml library is missing. Install it with pip install lxml
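
To find every missing library in one go instead of hitting one ModuleNotFoundError at a time, something like the sketch below might help. It assumes the three third-party imports used by the script; the REQUIRED table and the missing_packages helper are illustrative names, not part of the original code. Note that the import name and the pip package name are not always identical: the OpenSSL module is installed by the pyOpenSSL package.

```python
import importlib.util

# Import name -> name to pass to `pip install` (they are not always the same).
REQUIRED = {
    "lxml": "lxml",
    "requests": "requests",
    "OpenSSL": "pyOpenSSL",
}


def missing_packages(required):
    """Return the pip package names whose import name cannot be found."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("pip install " + " ".join(missing))
```

Running it before the scraper prints a single pip command covering whatever is not installed yet, and prints nothing when everything is present.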

青梅 · posted 2019-6-2 23:31:08

Stubborn posted 2019-6-2 00:58
The lxml library is missing. Install it with pip install lxml

Oh, got it.

2164930278 · posted 2019-6-11 07:35:56

I can't follow this, I'm lost.

jermey1994 · posted 2019-7-25 11:13:24

You're improving incredibly fast!

censing · posted 2019-9-14 21:44:48

Such fast progress, impressive.

解技 · posted 2019-11-6 19:05:43

Traceback (most recent call last):
  File "C:/Users/14028/Desktop/爱词霸.py", line 2, in <module>
    from OpenSSL import SSL
ModuleNotFoundError: No module named 'OpenSSL'

Stubborn · posted 2019-11-6 19:23:04

解技 posted 2019-11-6 19:05
Traceback (most recent call last):
File "C:/Users/14028/Desktop/爱词霸.py", line 2, in
from ...

You're missing a library: pip install OpenSSL

解技 · posted 2019-11-6 20:26:12

Stubborn posted 2019-11-6 19:23
You're missing a library: pip install OpenSSL

ERROR: Could not find a version that satisfies the requirement OpenSSL (from versions: none)
ERROR: No matching distribution found for OpenSSL
I don't know why it won't install.

Stubborn · posted 2019-11-6 20:44:15

解技 posted 2019-11-6 20:26
ERROR: Could not find a version that satisfies the requirement OpenSSL (from versions: none)
ERRO ...

Then do without it. On line 89, in the exception handling, just use a bare except: instead of naming a specific error.
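
A bare except: also swallows KeyboardInterrupt and genuine bugs, so a narrower alternative may be worth considering. The sketch below makes the OpenSSL import optional and catches only the failures a download is likely to hit. SSL_ERRORS, EXPECTED_ERRORS and try_download are hypothetical names introduced here, and the choice of which exceptions to expect is an assumption rather than something taken from the original script.

```python
try:
    # The OpenSSL module comes from the pyOpenSSL package; treat it as optional.
    from OpenSSL import SSL
    SSL_ERRORS = (SSL.SysCallError,)
except ImportError:
    SSL_ERRORS = ()

# OSError also covers requests' exceptions (RequestException subclasses IOError)
# and file-write failures; IndexError covers an XPath query that matched nothing.
EXPECTED_ERRORS = SSL_ERRORS + (OSError, IndexError)


def try_download(download_func, item):
    """Call download_func(item); report expected failures instead of crashing."""
    try:
        download_func(item)
        return True
    except EXPECTED_ERRORS as e:
        print("Download failed for %r: %s" % (item, e))
        return False
```

With this in place the consumer loop could call try_download(download, each), provided the try/except inside download itself is removed or narrowed in the same way. Unexpected errors such as a misspelled dictionary key still surface as normal tracebacks.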
