Had some free time, so I'm posting a little crawler of my own.
I forget which member originally posted a scraper for gifjia5.com,
but when I downloaded and tried it, it didn't seem to work anymore, so I wrote my own. It holds up reasonably well in practice.
My skills are limited, so please go easy on me.
Main features:
1. Custom number of pages to crawl
2. Skips unknown errors and keeps going
3. Saves gif and jpg files under their own extensions
Shortcomings:
1. No automatic directory creation; you have to create the save directory in advance (see the sketch right after this list for one way to add it)
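One way to patch that shortcoming, as a minimal sketch assuming the same hard-coded save path the script below uses: os.makedirs with exist_ok=True creates the directory if it is missing and does nothing otherwise.

import os

save_dir = 'E:\\pic\\gif5'            # same hard-coded path as in the script below
os.makedirs(save_dir, exist_ok=True)  # harmless if the directory already exists

The full script: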
# -*- coding: utf-8 -*-
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'
}

def downpic(urls):
    a = 0
    for page_url in urls:
        html = requests.get(page_url, headers=headers)
        soup = BeautifulSoup(html.text, 'lxml')
        # Each <article> on a listing page links to a detail page.
        for article in soup.find_all('article'):
            next_url = article.a.get('href')
            next_html = requests.get(next_url, headers=headers)
            next_soup = BeautifulSoup(next_html.text, 'lxml')
            # The <article> blocks on the detail page hold the actual images.
            for pic_article in next_soup.find_all('article'):
                try:
                    pic_url = pic_article.p.img.get('src')
                    #print(pic_url)  # uncomment to check the link was extracted correctly
                    pic = requests.get(pic_url, headers=headers)
                    # Save under the matching extension; the target directory must already exist.
                    if pic_url.endswith('.gif'):
                        with open('E:\\pic\\gif5\\%s.gif' % a, 'wb') as f:
                            f.write(pic.content)
                        a += 1
                        print('No.%s' % a)
                    elif pic_url.endswith('.jpg'):
                        with open('E:\\pic\\gif5\\%s.jpg' % a, 'wb') as f:
                            f.write(pic.content)
                        a += 1
                        print('No.%s' % a)
                except Exception:
                    # Skip anything unexpected and keep crawling.
                    print('Hit an error, skipping it')
            time.sleep(3)  # pause between detail pages to go easy on the site

def appendurl(num):
    # Build the listing-page URLs: page 1 has no /page/ suffix.
    urls = []
    for i in range(1, int(num) + 1):
        if i == 1:
            urls.append(r'http://www.gifjia5.com/tag/老司机')
        else:
            urls.append(r'http://www.gifjia5.com/tag/老司机' + r'/page/' + ('%s' % i))
    return urls

page = input('Number of pages to crawl: ')
urllist = appendurl(page)
#print(urllist)
downpic(urllist)
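Not something the script above does, but the duplicated .gif/.jpg branches could be collapsed by reading the extension straight off the URL. A sketch with a hypothetical save_pic helper, assuming the src attribute ends in a bare file extension with no query string:

import os

def save_pic(pic_url, content, index, save_dir='E:\\pic\\gif5'):
    # Hypothetical helper: take the extension from the URL instead of branching on endswith.
    ext = os.path.splitext(pic_url)[1]  # e.g. '.gif' or '.jpg'
    if ext in ('.gif', '.jpg'):
        with open(os.path.join(save_dir, '%s%s' % (index, ext)), 'wb') as f:
            f.write(content)

The try/except block would then call save_pic(pic_url, pic.content, a) in place of the two if branches.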