Had some free time, so I'm posting a little crawler of my own.
I forget which member originally posted a scraper for gifjia5.com,
but when I downloaded and tried it, it didn't seem to work anymore, so I wrote my own. It holds up reasonably well in practice.
My skills are limited, so please go easy on me.
Main features:
1. Custom number of pages to crawl
2. Skips unknown errors and keeps going
3. Saves gif and jpg files under their own extensions
Shortcomings:
1. No automatic directory creation; you have to create the save directory in advance (see the sketch right after this list for one way to add it)
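One way to patch that shortcoming, as a minimal sketch assuming the same hard-coded save path the script below uses: os.makedirs with exist_ok=True creates the directory if it is missing and does nothing otherwise.

import os

save_dir = 'E:\\pic\\gif5'            # same hard-coded path as in the script below
os.makedirs(save_dir, exist_ok=True)  # harmless if the directory already exists

The full script: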
# -*- coding: utf-8 -*-
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'
}

def downpic(urls):
    a = 0
    for page_url in urls:
        html = requests.get(page_url, headers=headers)
        soup = BeautifulSoup(html.text, 'lxml')
        # Each <article> on a listing page links to a detail page.
        for article in soup.find_all('article'):
            next_url = article.a.get('href')
            next_html = requests.get(next_url, headers=headers)
            next_soup = BeautifulSoup(next_html.text, 'lxml')
            # The <article> blocks on the detail page hold the actual images.
            for pic_article in next_soup.find_all('article'):
                try:
                    pic_url = pic_article.p.img.get('src')
                    #print(pic_url)  # uncomment to check the link was extracted correctly
                    pic = requests.get(pic_url, headers=headers)
                    # Save under the matching extension; the target directory must already exist.
                    if pic_url.endswith('.gif'):
                        with open('E:\\pic\\gif5\\%s.gif' % a, 'wb') as f:
                            f.write(pic.content)
                        a += 1
                        print('No.%s' % a)
                    elif pic_url.endswith('.jpg'):
                        with open('E:\\pic\\gif5\\%s.jpg' % a, 'wb') as f:
                            f.write(pic.content)
                        a += 1
                        print('No.%s' % a)
                except Exception:
                    # Skip anything unexpected and keep crawling.
                    print('Hit an error, skipping it')
            time.sleep(3)  # pause between detail pages to go easy on the site

def appendurl(num):
    # Build the listing-page URLs: page 1 has no /page/ suffix.
    urls = []
    for i in range(1, int(num) + 1):
        if i == 1:
            urls.append(r'http://www.gifjia5.com/tag/老司机')
        else:
            urls.append(r'http://www.gifjia5.com/tag/老司机' + r'/page/' + ('%s' % i))
    return urls

page = input('Number of pages to crawl: ')
urllist = appendurl(page)
#print(urllist)
downpic(urllist)
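Not something the script above does, but the duplicated .gif/.jpg branches could be collapsed by reading the extension straight off the URL. A sketch with a hypothetical save_pic helper, assuming the src attribute ends in a bare file extension with no query string:

import os

def save_pic(pic_url, content, index, save_dir='E:\\pic\\gif5'):
    # Hypothetical helper: take the extension from the URL instead of branching on endswith.
    ext = os.path.splitext(pic_url)[1]  # e.g. '.gif' or '.jpg'
    if ext in ('.gif', '.jpg'):
        with open(os.path.join(save_dir, '%s%s' % (index, ext)), 'wb') as f:
            f.write(content)

The try/except block would then call save_pic(pic_url, pic.content, a) in place of the two if branches.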