FishC Forum

Thread starter: 玩酷子弟lv

[Project Showcase] A simple Python crawler example: scraping photos of girls from Douban

Posted on 2018-2-6 03:24:53
Let me take a look, hehehe

Posted on 2018-2-6 22:52:10
Here to learn

Posted on 2018-2-6 23:33:07
Here to learn

Posted on 2018-2-7 14:44:47
Learning

Posted on 2018-2-7 21:10:26
wow genius!

Posted on 2018-2-7 23:39:55
Really nice indeed!

Posted on 2018-2-8 00:23:31
66666666666

Posted on 2018-2-8 20:02:42
666

Posted on 2018-2-10 16:38:22
Reply


Posted on 2018-2-12 00:33:13
# Python 2 script: crawls wallpaper listings on desk.zol.com.cn and
# saves the full-size images into a local 'pic' directory.
import os
import re
import requests

def get_urls(url, regex):
    """Fetch a page and return absolute URLs for every href inside
    the first region of the page matched by `regex`."""
    urls = []
    base_url = 'http://desk.zol.com.cn'
    content = requests.get(url).content
    area = re.search(regex, content, re.S).group(0)
    tails = re.findall(r'href="(.*?)"', area)
    for tail in tails:
        urls.append(base_url + tail)
    return urls

def download_picture(url, count):
    """Download the big image on a detail page and save it as pic/<count>.<ext>."""
    target_dir = 'pic'
    # If a plain file occupies the target name, remove it first,
    # then make sure the directory actually exists before writing into it.
    if os.path.exists(target_dir) and not os.path.isdir(target_dir):
        os.remove(target_dir)
    if not os.path.isdir(target_dir):
        os.mkdir(target_dir)
    content = requests.get(url).content
    picture_url = re.search(r'<img id="bigImg" src="(.*?)"', content).group(1)
    picture = requests.get(picture_url).content
    suffix = re.sub(r'.*\.', '.', picture_url)  # keep only the file extension
    with open(os.path.join(target_dir, str(count) + suffix), 'wb') as f:
        f.write(picture)

def spider(url, count):
    """Walk one list page: visit each album, then download every picture in it."""
    regex1 = r'<ul class="pic-list2  clearfix">.*?</ul>'  # album list block
    regex2 = r'<ul id="showImg".*?</ul>'                  # thumbnail list block
    urls = get_urls(url, regex1)
    for each_url in urls:
        picture_urls = get_urls(each_url, regex2)
        for each_picture_url in picture_urls:
            download_picture(each_picture_url, count)
            print 'Downloading picture ' + str(count)
            count += 1
    return count

def get_next_page_url(url):
    """Return the absolute URL of the next list page."""
    base_url = 'http://desk.zol.com.cn'
    content = requests.get(url).content
    tail = re.search(r'<a id="pageNext" href="(.*?)"', content).group(1)
    return base_url + tail

if __name__ == '__main__':
    url = 'http://desk.zol.com.cn/meinv/'
    count = 1
    count = spider(url, count)
    while True:
        key = raw_input('Input y/Y to download the next page, or anything else to exit: ')
        if re.match(r'Y', key, re.I):
            url = get_next_page_url(url)
            count = spider(url, count)
        else:
            exit()
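
The script above targets Python 2 (print statement, raw_input). For anyone on Python 3, here is a minimal untested port of the same logic; it assumes desk.zol.com.cn still serves the same markup as in 2018, and it uses .text so the regexes run on decoded strings:

# Minimal Python 3 sketch of the same crawler (assumes the 2018 page markup).
import os
import re
import requests

BASE_URL = 'http://desk.zol.com.cn'

def get_urls(url, regex):
    # .text decodes the response body to str, which Python 3's re module expects
    area = re.search(regex, requests.get(url).text, re.S).group(0)
    return [BASE_URL + tail for tail in re.findall(r'href="(.*?)"', area)]

def download_picture(url, count):
    os.makedirs('pic', exist_ok=True)  # replaces the manual exists/mkdir checks
    page = requests.get(url).text
    picture_url = re.search(r'<img id="bigImg" src="(.*?)"', page).group(1)
    suffix = re.sub(r'.*\.', '.', picture_url)  # keep only the file extension
    with open(os.path.join('pic', str(count) + suffix), 'wb') as f:
        f.write(requests.get(picture_url).content)

def spider(url, count):
    for album_url in get_urls(url, r'<ul class="pic-list2  clearfix">.*?</ul>'):
        for picture_url in get_urls(album_url, r'<ul id="showImg".*?</ul>'):
            download_picture(picture_url, count)
            print('Downloading picture', count)
            count += 1
    return count

def get_next_page_url(url):
    tail = re.search(r'<a id="pageNext" href="(.*?)"', requests.get(url).text).group(1)
    return BASE_URL + tail

if __name__ == '__main__':
    url = 'http://desk.zol.com.cn/meinv/'
    count = spider(url, 1)
    while input('y/Y for the next page, anything else to exit: ').strip().lower().startswith('y'):
        url = get_next_page_url(url)
        count = spider(url, count)

Both versions send requests with the default User-Agent; if the site starts rejecting the crawler, passing a browser-like headers dict to requests.get is the usual first fix.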

Posted on 2018-2-12 17:11:59
Awesome

Posted on 2018-2-24 14:10:50
Here to learn~

Posted on 2018-2-25 16:30:31 from FishC Mobile
Can someone explain this?

Posted on 2018-2-25 16:47:56
Thanks for sharing

Posted on 2018-2-25 18:06:26
k

Posted on 2018-2-25 18:07:02
Learning...

Posted on 2018-2-25 18:11:00
Thanks for sharing

Posted on 2018-2-25 18:25:36
zhnega

Posted on 2018-2-26 15:10:36
Here to learn
