Last edited by 夜深听雨 on 2017-9-7 10:15
After learning regular expressions, I tried to rewrite 小甲鱼老师's old program that crawls the 妹子图 pictures from jandan.net so that it uses regular expressions instead. But my regexes seem to be wrong: they don't match the image URLs or the page number I want, and the page number I do get back can't be converted with int(). How should I fix this? Thanks.
import urllib.request
import os
import re

def url_open(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36')
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

def get_page(url):
    html = url_open(url).decode('utf-8')
    p = r'<spanclass ="current-comment-page">[(\d\d)]</span>'  # something is probably wrong here
    html = re.findall(p, html)
    return html

def find_image(url):
    html = url_open(url).decode('utf-8')
    image_address = []
    p = r'<img src="([^"]+\.jpg)" style="max-width: 480px; max-height: 750px;">'
    address = re.findall(p, html)
    image_address.append(address)  # not sure whether append is appropriate here
    return image_address

def save_image(folder, image_address):
    for each in image_address:
        if 'http:' not in each:
            filename = each.split('/')[-1]
            eachhttp = 'http:' + each
            print(eachhttp)
            with open(filename, 'wb') as f:
                img = url_open(eachhttp)
                f.write(img)
        else:
            filename = each.split('/')[-1]
            print(each)
            urllib.request.urlretrieve(each, filename)

def download_mm(folder='pics', pages=10):
    os.mkdir(folder)
    os.chdir(folder)
    url = "http://jandan.net/ooxx/"
    page_number = get_page(url)  # seems like int is needed here
    for i in range(pages):
        page_number -= i
        page_url = url + 'page-' + str(page_number) + '#comments'
        image_address = find_image(page_url)
        save_image(folder, image_address)

if __name__ == '__main__':
    download_mm()
I solved the find_image problem myself; here is the code:
def find_image(url):
    html = url_open(url).decode('utf-8')
    p = r'<img src="([^"]+\.jpg)"'
    image_address = re.findall(p, html)
    return image_address
The remaining question is how to match the page number with a regular expression, and whether the matched result can then be processed with int().
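A minimal sketch of a working get_page, assuming the page markup is of the form <span class="current-comment-page">[N]</span> (which is what jandan.net served at the time). The fixes compared to the regex above: a space between "span" and "class", no space around "=", escaped brackets \[ \] (an unescaped [] is a character class, not a literal), and \d+ instead of \d\d so the page number can have any number of digits. It is tested against a hard-coded sample snippet rather than a live request:

```python
import re

def get_page(html):
    # Corrected pattern: literal brackets escaped, one capture group for the digits.
    p = r'<span class="current-comment-page">\[(\d+)\]</span>'
    matches = re.findall(p, html)
    # findall returns a list of strings (one per match); take the first one
    # and convert it with int() so arithmetic like page_number - i works.
    return int(matches[0])

# Sample snippet mimicking the relevant part of the page source.
sample = '<span class="current-comment-page">[2017]</span>'
print(get_page(sample))  # 2017
```

Note that int() is applied to a single matched string, not to the list that findall returns; calling int() on the whole list is what raises the "can't be int-ed" error.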