[Solved] Crawling the favicons of 1000 websites takes 60 minutes, which is too slow. How can I speed it up?

胡萝卜_yx · Posted on 2016-3-15 11:17:13

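I started two threads, each reading 500 URLs, for example www.51.la. (The code below actually starts two processes with multiprocessing.Process.)
1: Turn each URL that is read, e.g. www.51.la, into http://www.51.la/favicon.ico, fetch it with re=urllib.urlopen(url), read the body with html=re.read(), and write html to a file.
2: If step 1 fails, visit www.51.la and use the re module to match the URL of the ico icon.
The problem: some sites use http and some use https. When matching with the re module, some links are complete addresses of the http:// kind, while others are short forms such as /favico.ico. Checking all of these cases makes the script very slow; the crawl took 60 minutes.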
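
The link forms mentioned above (complete addresses and short forms such as /favico.ico) can be resolved in one step instead of by a chain of string checks. The following is only a minimal sketch, written for Python 3 and not taken from the original script. It assumes the page HTML has already been downloaded; the names IconLinkParser and find_icon_url are made up for illustration. It relies on html.parser and urllib.parse.urljoin from the standard library (in Python 2 the same function is urlparse.urljoin).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Collects href values of <link> tags whose rel contains the token "icon"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower()
        href = attr.get("href")
        # Matches rel="icon" and rel="shortcut icon".
        if "icon" in rel.split() and href:
            self.hrefs.append(href)


def find_icon_url(page_url, html):
    """Return an absolute icon URL for the page, or None if no <link> declares one."""
    parser = IconLinkParser()
    parser.feed(html)
    if not parser.hrefs:
        return None
    # urljoin resolves absolute, protocol-relative (//host/x) and relative hrefs
    # against the URL of the page they were found on.
    return urljoin(page_url, parser.hrefs[0])


# Usage: the scheme of the page URL is carried over to short-form links.
print(find_icon_url("http://www.51.la/", '<link rel="shortcut icon" href="/favicon.ico">'))
# prints http://www.51.la/favicon.ico
```

Because the scheme is inherited from the page URL, the http and https cases no longer need separate branches.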

  Please advise: I just want to fetch the icons for 1000 URLs quickly. The URLs come in both http and https variants.

import re
import urllib
import urllib2
from multiprocessing import Process

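def url_open_gif(filename, pic_name, name):
    # filename is the icon URL, e.g. "http://" + host + "/favicon.ico"
    # pic_name is the output file name, e.g. host + ".gif"
    # Returns None on success, or name on failure so the caller can retry.
    try:
        url = urllib2.urlopen(filename)
        if url.getcode() != 200:
            return name
        html = url.read()
        url.close()
        # 'wb': the icon is binary data; text mode can corrupt it on Windows.
        fp = open(r'C:/Python27/lesson/test_3_13/picture/' + pic_name, 'wb')
        fp.write(html)
        fp.close()
        return None
    except Exception:
        return name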
def url_findall(name2):
    # Fallback: fetch the home page and search the HTML for a favicon link.
    # Tries http first, then https. Returns None on success, name on failure.
    name = str(name2).strip()
    pic_name = name + ".gif"
    for scheme in ('http', 'https'):
        try:
            respon = urllib.urlopen(scheme + '://' + name)
            html = respon.read()
            respon.close()
            # [^"]* keeps the match inside a single href value.
            string = re.findall(r'href="([^"]*favico[^"]*ico)"', html)
            if not string:
                continue
            pic_ico = string[0]
            if pic_ico.startswith('http'):
                url_new = pic_ico                    # complete address
            elif pic_ico.startswith('//'):
                url_new = scheme + ':' + pic_ico     # protocol-relative link
            else:
                url_new = scheme + '://' + name + '/' + pic_ico.lstrip('/')
            if url_open_gif(url_new, pic_name, name) is None:
                return None
        except Exception:
            pass
    # Both schemes failed: report the host.
    print name
    return name

def urltask(filename):
    # Process one input file that holds one host name per line.
    fp = open(filename, 'r')
    for line in fp:
        name = line.strip()
        if not name:
            continue
        icon_url = "http://" + name + "/favicon.ico"
        pic_name = name + ".gif"
        # Try the default location first; parse the page only if that fails.
        fail_url = url_open_gif(icon_url, pic_name, name)
        if fail_url is not None:
            url_findall(fail_url)
    fp.close()

def works(func, worknum):
    # Start one process per input file: 0.txt, 1.txt, ...
    proc_record = []
    for i in range(worknum):
        arg = str(i) + ".txt"
        p = Process(target=func, args=(arg,))
        p.start()
        proc_record.append(p)
    for p in proc_record:
        p.join()

if __name__ == '__main__':
    procs = 2
    works(urltask, procs)

Best answer

wei_Y

2016-3-15 11:58:28

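I think you could use multiple processes instead of multiple threads.
The basic idea is:
1. First collect all the URLs you need and sort them into groups.
2. Divide the collected URLs among the processes, deciding how many URLs each process requests.
3. Use the requests library instead of urllib.
--
An example would make this even better.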
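
The steps above could be sketched roughly as follows. This is only an illustration for Python 3 under stated assumptions, not the answerer's code: the function names split_into_chunks, fetch_icon, fetch_chunk and fetch_all, the default of 8 workers and the 10-second timeout are all made up here. To avoid a third-party dependency the sketch downloads with urllib.request; with requests, as step 3 suggests, the download line would become requests.get(url, timeout=10).content.

```python
import urllib.request
from multiprocessing import Pool


def split_into_chunks(items, n):
    """Split items into at most n non-empty lists of near-equal length (step 2)."""
    if not items:
        return []
    n = max(1, min(n, len(items)))
    return [items[i::n] for i in range(n)]


def fetch_icon(host, timeout=10):
    """Return the bytes of the host's /favicon.ico, or None if both schemes fail."""
    for scheme in ("https", "http"):
        url = "{}://{}/favicon.ico".format(scheme, host)
        try:
            # The timeout keeps one unresponsive site from stalling a worker.
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except Exception:
            # Any network or protocol error: try the next scheme.
            continue
    return None


def fetch_chunk(hosts):
    """Worker entry point: fetch every host in one chunk."""
    return [(host, fetch_icon(host)) for host in hosts]


def fetch_all(hosts, workers=8):
    """Fetch icons for all hosts, one chunk per worker process."""
    chunks = split_into_chunks(hosts, workers)
    if not chunks:
        return {}
    with Pool(processes=len(chunks)) as pool:
        results = pool.map(fetch_chunk, chunks)
    # Flatten the per-chunk lists into {host: bytes or None}.
    return {host: data for chunk in results for host, data in chunk}
```

On Windows, fetch_all has to be called from under if __name__ == '__main__':, as the original script already does for its Process calls. Hosts whose value is None in the result are the ones that still need the page-parsing fallback.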
