鱼C论坛 (FishC Forum)

Views: 2228 | Replies: 6

[Solved] Getting an error while scraping girl pictures from jandan.net with Python — so frustrating! Could any expert help a beginner? Thanks!

Posted 2021-2-9 10:41:17

import urllib.request
import re

def open_url(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36')
    page = urllib.request.urlopen(req)
    html = page.read().decode('UTF-8')
    return html

def get_img(html):
    p = r'<img src="([^"]+\.jpg)"'
    imglist = re.findall(p,html)

    for each in imglist:
          print(each)

    for each in imglist:
        filename = each.split('/')[-1]
        urllib.request.urlretrieve('http://' + each,filename,None)


if __name__ == '__main__':
    url = 'http://jandan.net/ooxx/MjAyMTAyMDgtOTk=#comments'
    get_img(open_url(url))

The error screenshot is attached below:

(attachment: 报错.png)

The URL is from the recent "random snapshots" (随手拍) column on jandan.net. Thanks in advance, everyone!


Best Answer (posted 2021-2-9 14:58:12)
import urllib.request as u_request
import os, re, base64, requests

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}

def url_open(url):
    html = requests.get(url, headers=header).text
    return html

def find_images(url):
    html = url_open(url)

    # The src attributes on the page are protocol-relative ("//host/...jpg"),
    # so prepend "http:" (not "http://") to make them absolute.
    m = r'<img src="([^"]+\.jpg)"'
    match = re.findall(m, html)

    for each in range(len(match)):
        match[each] = 'http:' + match[each]
        print(match[each])

    return match

def save_images(folder, img_addrs):
    for each in img_addrs:
        try:
            req = u_request.Request(each, headers=header)
            response = u_request.urlopen(req)
            cat_image = response.read()
            filename = each.split('/')[-1]
            with open(filename, 'wb') as f:
                f.write(cat_image)
        except (OSError, ValueError) as error:
            # skip images that fail to download or have a bad URL
            print(error)
            continue

def web_link_encode(url, folder):
    # Jandan encodes each page as base64 of a "YYYYMMDD-page" string,
    # appended to the column URL.
    for i in range(180, 200):
        string_date = '20201216-' + str(i)
        str_base64 = base64.b64encode(string_date.encode('utf-8'))
        page_url = url + str_base64.decode() + '=#comments'
        print(page_url)
        img_addrs = find_images(page_url)
        save_images(folder, img_addrs)

def download_the_graph(url):
    folder = 'graph'
    os.mkdir(folder)
    os.chdir(folder)
    web_link_encode(url, folder)

if __name__ == '__main__':
    url = 'http://jandan.net/pic/'
    download_the_graph(url)
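The base64 page tokens this answer generates follow the same scheme as the token in the question's URL; decoding that token shows the date-page structure directly:

```python
import base64

# page token taken from the URL in the question
token = 'MjAyMTAyMDgtOTk='
print(base64.b64decode(token).decode())  # 20210208-99  (date 2021-02-08, page 99)
```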

Posted 2021-2-9 11:47:21
Last edited by Daniel_Zhang at 2021-2-9 11:49

I think something is wrong with that URL — I get a 404 Not Found here.

My own code fails on it too; the site has probably added anti-scraping measures. Try a different site.
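If the page really is returning 404, it helps to surface that explicitly instead of letting it show up later as an empty match list. A minimal stdlib sketch — the error object at the bottom is constructed locally for illustration, no network involved:

```python
import urllib.request
import urllib.error

def fetch_html(url):
    """Fetch a page; urlopen raises urllib.error.HTTPError on a 404/403,
    so a blocked or missing page fails loudly instead of silently."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as page:
        return page.read().decode('utf-8')

# What such a failure carries (constructed locally for illustration):
err = urllib.error.HTTPError('http://jandan.net/ooxx/x', 404, 'Not Found', None, None)
print(err.code, err.reason)  # 404 Not Found
```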


Posted 2021-2-9 14:16:31
import requests
import re


url = "http://jandan.net/ooxx/MjAyMTAyMDgtOTk=#comments"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36",
}
req = requests.get(url, headers=headers).text

p = r'<img src="([^"]+\.jpg)"'
img_list = re.findall(p, req)

print(img_list)

This part works fine on my end.

Posted 2021-2-9 14:17:34
In line 21, the URL picks up an extra pair of slashes ('http://' is prepended to a src that already starts with '//'); drop them and it works:
import urllib.request
import re

def open_url(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36')
    page = urllib.request.urlopen(req)
    html = page.read().decode('UTF-8')
    return html

def get_img(html):
    p = r'<img src="([^"]+\.jpg)"'
    imglist = re.findall(p,html)

    for each in imglist:
        print(each)

    for each in imglist:
        filename = each.split('/')[-1]
        # the matched src already begins with "//", so prepend only "http:"
        urllib.request.urlretrieve('http:' + each,filename,None)


if __name__ == '__main__':
    url = 'http://jandan.net/ooxx/MjAyMTAyMDgtOTk=#comments'
    get_img(open_url(url))
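To see why the original `'http://' + each` failed: the `src` attributes on the page are protocol-relative, i.e. they already begin with `//`, so the concatenation produced four slashes. A small illustration with a hypothetical host name:

```python
# hypothetical protocol-relative src, as the regex would capture it
each = '//img.example.com/photo.jpg'

print('http://' + each)  # http:////img.example.com/photo.jpg  (malformed)
print('http:' + each)    # http://img.example.com/photo.jpg    (valid)
```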


Posted 2021-2-9 15:35:17
import requests
from lxml import etree
from urllib.request import urlretrieve


def getit(session, url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
    }
    res = session.get(url, headers=headers)
    xmls = etree.HTML(res.text)
    comments = xmls.xpath("//ol[@class='commentlist']/li/div/div/div[@class='text']/p/img/@src")
    for cm in comments:
        print(cm)
        urlretrieve(url="http:" + cm, filename=cm.split("/")[-1])
    if xmls.xpath("//div[@class='comments']/div[@class='cp-pagenavi']/a[last()]/@href"):
        next = "http:" + xmls.xpath("//div[@class='comments']/div[@class='cp-pagenavi']/a[last()]/@href")[0]
        getit(session, next)


if __name__ == '__main__':
    session = requests.Session()
    start = "http://jandan.net/ooxx/comments"
    getit(session=session, url=start)
