鱼C论坛

[Share] [Test] Timing Python URL scraping under different thread counts, with results written to Excel

Posted on 2019-6-2 23:27:27

1. Scrape the image links from pages 2-33 of https://www.mzitu.com/mm/page/ and return the URLs
2. Time serial, single-process, and multi-process crawls under identical conditions and return the results as a list
3. Write the results to Excel
4. Repeat the test several times and compare the average, maximum, and 80th-percentile values side by side to see what difference multiprocessing makes

import re
import time
from multiprocessing import Pool   # process pool (processes, not threads)
from bs4 import BeautifulSoup
import requests                    # HTTP requests
import xlwt                        # Excel writing

def url_open(url):
    headers = {
        'Referer': 'https://www.mzitu.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    div = soup.find('div', attrs={'id': 'pins'})
    links = re.findall(r'<a href="([^"]+\d)"', str(div))
    # dedup the URLs while preserving their original order
    url_list = list(set(links))
    url_list.sort(key=links.index)
    #print("Page %s: %d valid links" % (url, len(url_list)))
    return url_list

# run one test round and return the timings as a list
def getTime(i):
    urls = ['https://www.mzitu.com/mm/page/{}/'.format(n) for n in range(2, 32)]

    # serial baseline
    start = time.time()
    for url in urls:
        url_open(url)
    time_0 = time.time() - start
    print('serial crawl took:', time_0)

    # pools of 1 to 4 worker processes, same URL set each time
    pool_times = []
    for n in range(1, 5):
        start = time.time()
        with Pool(processes=n) as pool:
            pool.map(url_open, urls)
        elapsed = time.time() - start
        pool_times.append(elapsed)
        print('%d-process crawl took:' % n, elapsed)

    return [i, time_0] + pool_times

if __name__ == "__main__":
    # create the Excel workbook
    workbook = xlwt.Workbook(encoding='ascii')
    worksheet = workbook.add_sheet('My Worksheet')
    workbook.save('Excel_Workbook.xls')
    for x in range(1, 2):          # raise the upper bound to run more rounds
        timeData = getTime(x)
        # write the result list into row x, one value per column
        for col in range(6):
            worksheet.write(x, col, label=timeData[col])
        print("Test round %d done; results written to Excel" % x)
        workbook.save('Excel_Workbook.xls')
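Point 4 above (average, maximum, 80th percentile across repeated runs) is not implemented in the posted script. A minimal sketch of that aggregation, assuming you collect one timing list per crawl mode (`summarize` and the sample numbers are made up for illustration; the percentile uses the common nearest-rank definition):

```python
from statistics import mean

def summarize(times):
    """Average, maximum, and nearest-rank 80th percentile of repeated runs."""
    s = sorted(times)
    idx = max(0, int(0.8 * len(s) + 0.5) - 1)  # nearest-rank index for p80
    return {'avg': mean(s), 'max': s[-1], 'p80': s[idx]}

# made-up timings (seconds) from five repeated rounds of one crawl mode
runs = [12.4, 11.8, 13.1, 12.0, 12.7]
print(summarize(runs))
```

Running this per column of the Excel sheet gives the side-by-side comparison the post describes.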


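The `url_open` step dedups links with `list(set(...))` and then restores first-seen order by sorting on the original list's `index`. Since Python 3.7, `dict.fromkeys` gives the same result in a single pass (a small illustration with toy data, not from the original post):

```python
urls = ['a', 'b', 'a', 'c', 'b']

# the post's approach: set() to dedup, then re-sort by first occurrence
dedup_sort = list(set(urls))
dedup_sort.sort(key=urls.index)

# idiomatic one-pass equivalent (dicts preserve insertion order in 3.7+)
dedup_dict = list(dict.fromkeys(urls))

print(dedup_sort, dedup_dict)  # both ['a', 'b', 'c']
```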
Reading and writing Excel:
https://www.cnblogs.com/xuxaut-558/p/10166642.html
Multi-process crawling:
https://www.cnblogs.com/MrLJC/p/3715783.html
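A side note: the title says "threads", but `multiprocessing.Pool` spawns processes. For I/O-bound scraping like this, a thread pool with the identical `map` API is available from `multiprocessing.dummy`. A self-contained sketch with a stand-in fetch function (`fake_fetch` is hypothetical; no network is involved):

```python
import time
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool, same API

def fake_fetch(url):
    """Stand-in for url_open: simulate an I/O-bound request."""
    time.sleep(0.1)
    return url

urls = ['page-{}'.format(n) for n in range(8)]

start = time.time()
with ThreadPool(4) as pool:          # 4 worker threads instead of processes
    results = pool.map(fake_fetch, urls)
elapsed = time.time() - start

print(len(results))  # 8 (roughly 2 batches of 0.1 s, not 8 x 0.1 s serially)
```

Threads avoid the process start-up and pickling overhead, which matters when the work per URL is mostly waiting on the network.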
Practicality is what counts: imitate first, then improve.