Last edited by 炽离 on 2015-4-21 22:36
Over the past two days I watched 小甲鱼's Python crawler videos and tried writing a script that scrapes the links to the FishC Python tutorial videos. Every technique used here is covered in 小甲鱼's videos; many thanks to 小甲鱼 and FishC for their work. It's rough beginner code, so please bear with it.
//---------------------------------------------- code start ---------------------------------------------------------
import re, urllib2, random, time

def dlfile(xlurl, iplist, f):
    urlobj = urllib2.Request(xlurl)
    urlobj.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36')
    # The proxies kept failing (403, 502 or other errors), so this block
    # is disabled. Uncomment it if you want to experiment with them.
    '''
    proxy_support = urllib2.ProxyHandler({'http': random.choice(iplist)})
    opener = urllib2.build_opener(proxy_support)
    urllib2.install_opener(opener)
    '''
    temp = urllib2.urlopen(urlobj)
    urlopen = temp.read()
    h = urlopen.find(r'file_url')
    if h != -1:
        t = urlopen.find(r'file_size', h)
        fileurl = urlopen[h+10:t-2]
        namehead = urlopen.find(r'file_name="')
        filename = urlopen[namehead+11:h-2]
        print filename
        print fileurl
        f.write(fileurl + '\n')
    else:
        pass
##    time.sleep(15)

iplist = ['58.215.187.2:82', '163.177.79.5:80', '101.71.27.120:80', '14.29.80.34:80', '218.204.140.106:8118', '101.69.199.99:80', '120.132.50.69:8080', '121.14.4.111:80']
url_1 = "http://blog.fishc.com/category/python/page/"
f = open("download.txt", 'w')
i = 1
# there are 8 pages in total
try:
    while i <= 8:
        url_2 = url_1 + str(i)
        i += 1
        # url_2 is a listing page of video posts; pages 1-7 carry 10 links each, page 8 carries 2
        print url_2
        url2 = urllib2.urlopen(url_2).read()
        urllist = re.findall(r'<h2><a href="http://blog.fishc.com/(?:\d){4}\.html', url2)
        for each in urllist:
            url_3 = each[13:]
            print url_3
            print '------------------------------------'
            url3 = urllib2.urlopen(url_3).read()
            xlurl_h = url3.find(r'download_button_part')
            # if the download link is found on this page, follow it
            if xlurl_h != -1:
                xlurl_t = url3.find(r'target="_blank"', xlurl_h)
                xlurl = url3[xlurl_h+52:xlurl_t-2]
                print xlurl
                dlfile(xlurl, iplist, f)
            # if no download link is found, the post is split across sub-pages
            else:
                p = re.compile(r'%s/\d{1}' % url_3)
                fenye = p.findall(url3)
                if len(fenye) == 1:
                    lastpage = fenye[0]
                else:
                    lastpage = fenye[-2]
                    # the last button is "next page", so the second-to-last one is the real last page
                # print lastpage
                url4 = urllib2.urlopen(lastpage).read()
                xlurl_h = url4.find(r'download_button_part')
                xlurl_t = url4.find(r'target="_blank"', xlurl_h)
                xlurl = url4[xlurl_h+52:xlurl_t-2]
                print xlurl
                dlfile(xlurl, iplist, f)
        print '+++++++++++++++++++++++++++'
finally:
    f.close()
//---------------------------------------------- code end -----------------------------------------------------------
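A side note on the parsing style: dlfile() above leans on find() plus hand-counted offsets (h+10, t-2), which silently breaks if the Xunlei page markup shifts by even one character. The same two fields can be pulled out with regexes instead. A minimal Python 3 sketch against an invented fragment of the page (the field names file_url/file_name/file_size come from the original code; the sample values are made up):

```python
import re

# Invented fragment in the shape the original dlfile() parses with
# find() and fixed offsets (a real page carries much more around it).
page = 'file_name="071GUI.zip", file_url="http://example.com/071GUI.zip", file_size="1433600"'

# One anchored pattern per field instead of offset arithmetic;
# [^"]+ stops at the closing quote, so no -2 corrections are needed.
name = re.search(r'file_name="([^"]+)"', page)
fileurl = re.search(r'file_url="([^"]+)"', page)

if name and fileurl:
    print(name.group(1))     # 071GUI.zip
    print(fileurl.group(1))  # http://example.com/071GUI.zip
```

This survives small markup changes (extra spaces, reordered attributes elsewhere on the page) that would throw the fixed offsets off by a few characters.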
Shell output (this is my first post and the image upload failed, so the output from the Shell is pasted below):
>>> ================================ RESTART ================================
>>>
http://blog.fishc.com/category/python/page/1
http://blog.fishc.com/4371.html
------------------------------------
http://kuai.xunlei.com/d/BdsUAwL8BQD1gy9V369
071GUI的终极选择:Tkinter8.zip
http://gdl.lixian.vip.xunlei.com/071GUI%E7%9A%84%E7%BB%88%E6%9E%81%E9%80%89%E6%8B%A9%EF%BC%9ATkinter8%2Ezip?fid=W0VtzI6nRhj6sN6oVMTfDHaUOyOkungFAAAAAP9okG7nWcSLPiaBMmOf2l53BlIs&mid=666&threshold=150&tid=5B0E55B36C54C398A06AE4168DEF2A4C&srcid=6&verno=1&g=FF68906EE759C48B3E268132639FDA5E7706522C&ui=xlkuaichuan&s=91798180&pk=kuaichuan&ak=8:0:999:0&e=1431545367&ms=1433600&ci=&ck=08F8281FC3EA3B6D5E4A448A78515CE0&at=AA74E3C8664266115B744B5315B65C02&n=08597553A4D648E79AEDA5D96B659F81E9E9CB84682AEEBC9A3D290B8DF764723847380B9383010000
http://blog.fishc.com/4366.html
------------------------------------
http://kuai.xunlei.com/d/BdsUAwLoAgCIqChVe1e
070GUI的终极选择:Tkinter7.zip
http://gdl.lixian.vip.xunlei.com/070GUI%E7%9A%84%E7%BB%88%E6%9E%81%E9%80%89%E6%8B%A9%EF%BC%9ATkinter7%2Ezip?fid=vhdVjX4DGQVlZAdAwuLv0NqzgNp16TcHAAAAAEN9JbqkmtLAEBH+ugf5FEEjRog4&mid=666&threshold=150&tid=5414059D3984524BAB1209A13619309D&srcid=6&verno=1&g=437D25BAA49AD2C01011FEBA07F9144123468838&ui=xlkuaichuan&s=121104757&pk=kuaichuan&ak=8:0:999:0&e=1431372339&ms=1433600&ci=&ck=9319AABE9FBCA7E638257B4F2C160C80&at=831379C054553A8E5C1E24B828E7CEC1&n=09A9DBF40F8D44E79A1D0B7FC03E9381E9196522C371E2BC9ACD87AD26AC687237B796AD38D80D0000
http://blog.fishc.com/4353.html
------------------------------------
http://kuai.xunlei.com/d/BdsUAwIfKgBltSZVea0
http://blog.fishc.com/4349.html
------------------------------------
http://kuai.xunlei.com/d/BdsUAwJ6IAC28hZVee5
068GUI的终极选择:Tkinter5.zip
http://gdl.lixian.vip.xunlei.com/068GUI%E7%9A%84%E7%BB%88%E6%9E%81%E9%80%89%E6%8B%A9%EF%BC%9ATkinter5%2Ezip?fid=ltglEL/Ctob98T9P/K6XPm+a4GchBk8FAAAAAJF5sjDvTmvYlXBS5Xo3DkO/LC9K&mid=666&threshold=150&tid=1F5F43D740160FC608CAFF46766CE022&srcid=6&verno=1&g=9179B230EF4E6BD8957052E57A370E43BF2C2F4A&ui=xlkuaichuan&s=89064993&pk=kuaichuan&ak=8:0:999:0&e=1431250897&ms=1433600&ci=&ck=2465183EFFD12EA7B9D55108E9726E89&at=E8B9FC4090F436EF35232D49A8533D89&n=0C55F6AA961436E39AE1272959A7E185E9E549745AE890B89A31ABFBBF351A76354BBAFBA1417F0400
...
Problems hit while implementing the code:
Problem: after scraping a certain number of Xunlei links, Xunlei pops up its wretched captcha and blocks the link addresses.
Workaround: use proxies (the only fix I know so far). Since the proxy IPs were grabbed at random off the web, they frequently throw 403, 502 or other errors for no apparent reason. Each time that happens I change the value of i and restart the scrape from page i; it takes roughly 2-3 runs to grab all the video links.
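One way to soften the flaky-proxy problem is to retry each request through several randomly chosen proxies before giving up, rather than dying on the first 403/502. A minimal Python 3 sketch, assuming a caller-supplied fetch(url, proxy) callable (fetch_with_retry and fake_fetch are hypothetical names, not part of the script above):

```python
import random

def fetch_with_retry(url, iplist, fetch, max_tries=3):
    """Try up to max_tries proxies chosen at random from iplist.
    fetch(url, proxy) is any callable that returns the page body
    or raises (e.g. on a 403/502 from a dead proxy)."""
    last_err = None
    for _ in range(max_tries):
        proxy = random.choice(iplist)
        try:
            return fetch(url, proxy)
        except Exception as err:
            last_err = err  # remember the failure, try another proxy
    raise last_err  # every attempt failed; surface the last error

# Toy fetch that only succeeds through one "good" proxy:
def fake_fetch(url, proxy):
    if proxy != 'good.example:80':
        raise IOError('502 Bad Gateway')
    return b'page body'

print(fetch_with_retry('http://blog.fishc.com/4371.html',
                       ['good.example:80'], fake_fetch))  # b'page body'
```

Only when every attempt fails does the error escape, which maps onto the "change i and rerun" workflow above: the script would restart from page i only after all retries are exhausted, instead of on every single bad proxy.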