Last edited by Levin-e on 2020-3-24 10:25
Part 2 has been posted:
https://fishc.com.cn/thread-162027-1-1.html
This is my first post. I've been browsing the forum for a while, and after seeing the crawlers other members have written I wanted to try one myself.
So I spent a day writing this crawler. The basic features work, but the details could still be polished; improvements from more experienced members are welcome.
The extra dependencies are requests and BeautifulSoup from bs4.
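These can be installed with pip. Note that the code also passes 'lxml' to BeautifulSoup as the parser, so lxml needs to be installed as well:

```shell
pip install requests beautifulsoup4 lxml
```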
Target site: https://www.nvshens.net/
It downloads the full-resolution originals, not the thumbnails.
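The jump from thumbnail to original relies on the thumbnail URL carrying an extra '/s' path segment, which the scraper strips with a string replace. A quick check of the idea (the URL below is made up purely to illustrate the pattern):

```python
# Hypothetical thumbnail URL; the real site inserts an '/s' segment before the image path
thumb_src = 'https://img.example.com/s/2020/03/001.jpg'
original_src = thumb_src.replace('/s', '')
print(original_src)  # https://img.example.com/2020/03/001.jpg
```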
While it runs, the console prints download progress. A single error message can be ignored, but two consecutive errors mean one particular image failed to download.
This bug still needs fixing.
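One way to handle that failure case is to retry a request a few times before giving up. A minimal sketch (the helper name and retry counts are my own, not from the post):

```python
import time

import requests


def fetch_with_retry(url, headers=None, retries=3, delay=2):
    """Retry a GET a few times before giving up (hypothetical helper)."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()  # turn HTTP error codes into exceptions
            return resp
        except requests.RequestException:
            if attempt == retries:
                raise  # out of attempts: let the caller decide what to do
            time.sleep(delay)
```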
Usage details are in the code comments.
Finally, the source code:
```python
import math
import os
import re
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/80.0.3987.106 Safari/537.36 Edg/80.0.361.54'
}


def download_page(soup, save_dir):
    """Download every full-size image linked from one gallery page."""
    for img in soup.find('div', class_='gallery_wrapper').find_all('img'):
        name = img.get('alt') + '.jpg'
        # The thumbnail URL contains an extra '/s'; stripping it gives the original
        src = img.get('src').replace('/s', '')
        file_name = os.path.join(save_dir, name)
        try:
            resp = requests.get(src, headers=HEADERS)
            resp.raise_for_status()
            with open(file_name, 'wb') as f:
                f.write(resp.content)
            print('%s downloaded' % file_name)
            time.sleep(3)  # pause between downloads to go easy on the server
        except requests.RequestException:
            print('Error: failed to download %s' % src)


if __name__ == '__main__':
    first_url = 'https://www.nvshens.net/g/31118/'  # change to the album you want, same URL style
    save_dir = r'E:\Movie\photo\azhua'              # change to your save location
    os.makedirs(save_dir, exist_ok=True)

    r = requests.get(first_url, headers=HEADERS)
    soup = BeautifulSoup(r.text, 'lxml')

    # Read the photo count from the album header to work out the page count
    page_info = soup.find('div', id='dinfo').find('span').get_text()
    pic_num = int(re.findall(r'\d+', page_info)[0])
    page_num = math.ceil(pic_num / 3)  # the site shows three photos per page
    print('This album has %d pages' % page_num)
    print('This album contains %s' % page_info)

    download_page(soup, save_dir)

    # Pages 2..N live at <album URL><n>.html
    for page in range(2, page_num + 1):
        url = '%s%d.html' % (first_url, page)
        req = requests.get(url, headers=HEADERS)
        soup = BeautifulSoup(req.text, 'lxml')
        download_page(soup, save_dir)
```
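The page-count and page-URL arithmetic can be checked on its own: with three photos per page, an album of N photos spans ceil(N/3) pages, where page 1 is the album URL itself and later pages append '<n>.html' (the photo count below is made up for the demonstration):

```python
import math

first_url = 'https://www.nvshens.net/g/31118/'  # album URL from the post
pic_count = 40                                  # hypothetical photo count
page_num = math.ceil(pic_count / 3)             # three photos per page
# Page 1 is the album URL itself; pages 2..N append '<n>.html'
urls = [first_url] + ['%s%d.html' % (first_url, p) for p in range(2, page_num + 1)]
print(page_num)   # 14
print(urls[1])    # https://www.nvshens.net/g/31118/2.html
```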