The code is below. When I test it, the find_imgs(url) function does not return any image URLs, and I don't know why. Could someone help?
import os
import urllib.request


def url_open(url):
    """Fetch url with a browser-like User-Agent and return the raw bytes."""
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36')
    # Open the Request object, not the bare URL, or the header is never sent
    response = urllib.request.urlopen(req)
    html = response.read()
    return html


def get_page(url):
    """Return the current comment page number as a string."""
    html = url_open(url).decode('utf-8')
    a = html.find('current-comment-page') + 23
    b = html.find(']', a)
    return html[a:b]


def find_imgs(url):
    """Collect every .jpg address that appears in an img src= attribute."""
    html = url_open(url).decode('utf-8')
    img_addrs = []
    a = html.find('img src=')
    while a != -1:
        b = html.find('.jpg', a, a + 255)
        if b != -1:
            img_addrs.append(html[a + 9:b + 4])
        else:
            b = a + 9
        a = html.find('img src=', b)
    return img_addrs


def save_imgs(folder, img_addrs):
    """Download each image into the current working directory."""
    for each in img_addrs:
        filename = each.split('/')[-1]  # was .apilt(), a typo for .split()
        with open(filename, 'wb') as f:
            img = url_open(each)  # was open_url(), which is not defined
            f.write(img)


def download_mm(folder='oooxx', pages=10):
    os.makedirs(folder, exist_ok=True)
    os.chdir(folder)
    url = "http://jandan.net/ooxx/"
    page_num = int(get_page(url))
    for i in range(pages):
        # Step back one page per iteration; "page_num -= i" skipped pages
        page_url = url + 'page-' + str(page_num - i) + '#comments'
        img_addrs = find_imgs(page_url)
        save_imgs(folder, img_addrs)


if __name__ == '__main__':
    download_mm()
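One way to tell whether the parsing or the site is at fault is to run the extraction on a fixed HTML string instead of the live page. The sketch below is only an illustration: `extract_jpg_urls`, `IMG_JPG_RE`, and the sample HTML are made up for this test, and it uses a regular expression rather than the `find` loop from the post. It assumes double-quoted `src` attributes placed directly after `img`.

```python
import re

# Hypothetical helper: matches <img src="....jpg"> with a double-quoted src.
IMG_JPG_RE = re.compile(r'<img\s+src="([^"]+?\.jpg)"')


def extract_jpg_urls(html):
    """Return every .jpg address found in an img src attribute."""
    return IMG_JPG_RE.findall(html)


# Made-up sample page: two .jpg images and one .gif that should be skipped.
sample = (
    '<p><img src="http://example.com/pics/one.jpg" /></p>'
    '<p><img src="http://example.com/pics/blank.gif" /></p>'
    '<p><img src="http://example.com/pics/two.jpg" /></p>'
)
print(extract_jpg_urls(sample))
```

If this finds the addresses in the sample but returns an empty list for the downloaded page, the page source most likely contains no plain `img src="...jpg"` tags for the pictures.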
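This post was last edited by wongyusing on 2018-10-10 16:05
The site now has anti-scraping measures, so this five-year-old code no longer works.
If you still want to scrape it, you need to grab the hash values from the page source.
Decoding them with the JS code below should then give the real image URLs.
Note that I still could not get it to work, because I could not find where the site gets the hexadecimal MD5 function (hex_md5) that this code calls.
So it is easier to switch to a different site.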
function md5(a) {
    return hex_md5(a) // the hex_md5 function could not be found
}
var jd82tylpAK1P0Tvmga2rljssRTRVhio67x = function(n, t, e) {
    var f = "DECODE";
    var t = t ? t : "";
    var e = e ? e : 0;
    var r = 4;
    t = md5(t);
    var d = n;
    var p = md5(t.substr(0, 16));
    var o = md5(t.substr(16, 16));
    if (r) {
        if (f == "DECODE") {
            var m = n.substr(0, r)
        }
    } else {
        var m = ""
    }
    var c = p + md5(p + m);
    var l;
    if (f == "DECODE") {
        n = n.substr(r);
        l = base64_decode(n)
    }
    var k = new Array(256);
    for (var h = 0; h < 256; h++) {
        k[h] = h
    }
    var b = new Array();
    for (var h = 0; h < 256; h++) {
        b[h] = c.charCodeAt(h % c.length)
    }
    for (var g = h = 0; h < 256; h++) {
        g = (g + k[h] + b[h]) % 256;
        tmp = k[h];
        k[h] = k[g];
        k[g] = tmp
    }
    var u = "";
    l = l.split("");
    for (var q = g = h = 0; h < l.length; h++) {
        q = (q + 1) % 256;
        g = (g + k[q]) % 256;
        tmp = k[q];
        k[q] = k[g];
        k[g] = tmp;
        u += chr(ord(l[h]) ^ (k[(k[q] + k[g]) % 256]))
    }
    if (f == "DECODE") {
        if ((u.substr(0, 10) == 0 || u.substr(0, 10) - time() > 0) && u.substr(10, 16) == md5(u.substr(26) + o).substr(0, 16)) {
            u = u.substr(26)
        } else {
            u = ""
        }
        u = base64_decode(d)
    }
    return u
};
function jandan_load_img(b) {
    var d = $(b);
    var f = d.next("span.img-hash");
    var e = f.text();
    f.remove();
    var c = jd82tylpAK1P0Tvmga2rljssRTRVhio67x(e, "Bd7a2II4A1V50tQ92EKDtvTIUxJ9Smvt");
    var a = $('<a href="' + c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.(gif|jpg|jpeg))/, "$1large$3") + '" target="_blank" class="view_img_link">[查看原图]</a>');
    d.before(a);
    d.before("<br>");
    d.removeAttr("onload");
    d.attr("src", location.protocol + c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.gif)/, "$1thumb180$3"));
    if (/\.gif$/.test(c)) {
        d.attr("org_src", location.protocol + c);
        b.onload = function() {
            add_img_loading_mask(this, load_sina_gif)
        }
    }
}
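In the decode function above, the last assignment before `return u` is `u = base64_decode(d)`, where `d` is the untouched input, so the returned value appears to be just the Base64 decoding of the hash. A minimal Python sketch of that reading follows. It is untested against the live site. The names `IMG_HASH_RE`, `decode_img_hash`, and `find_hashed_imgs` are hypothetical, the exact `<span class="img-hash">` markup is an assumption based on the `span.img-hash` selector, and the sample URL is made up so the example runs offline.

```python
import base64
import re

# Assumed markup: the hash is the text of <span class="img-hash">...</span>.
IMG_HASH_RE = re.compile(r'<span class="img-hash">([^<]+)</span>')


def decode_img_hash(img_hash):
    """Base64-decode one hash, re-adding '=' padding if it was stripped."""
    padded = img_hash + '=' * (-len(img_hash) % 4)
    return base64.b64decode(padded).decode('utf-8')


def find_hashed_imgs(html):
    """Return the decoded address for every img-hash span in html."""
    return [decode_img_hash(h.strip()) for h in IMG_HASH_RE.findall(html)]


# Made-up sample: encode a protocol-relative address the way the page might.
sample_url = '//wx1.sinaimg.cn/mw600/example.jpg'
sample_hash = base64.b64encode(sample_url.encode('utf-8')).decode('ascii')
sample_html = ('<img onload="jandan_load_img(this)" />'
               '<span class="img-hash">' + sample_hash + '</span>')
print(find_hashed_imgs(sample_html))
```

The decoded addresses start with `//`, which matches the JS prepending `location.protocol`, so a scheme such as `http:` would need to be added before downloading. If the site changes its encoding, this shortcut stops working.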