Merging data from multiple Excel files into separate sheets of one Excel file, and making it faster, Python交流, 编程语言专区, 鱼C论坛

神兽坤坤 posted on 2024-12-24 15:36:35

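What I want to do is merge eight or nine Excel files into one Excel file with multiple sheets, each sheet named after its source file. I have already written this as a loop with pandas, but each file holds several thousand to tens of thousands of rows, so the data volume is fairly large and I would like it to run faster. I tried parallelism, but the speed did not seem to improve much; a full run takes a bit over two minutes. The faster the better. Does anyone here have any good ideas?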
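import glob
import os
import pandas as pd

# Collect every workbook in the result folder
flist = glob.glob(r".\result\*.xlsx")

# The context manager saves and closes the output file on exit
with pd.ExcelWriter('总结123.xlsx') as writer:
    for i in flist:
        df = pd.read_excel(i)
        # Sheet name = file name without directory or extension
        fname = os.path.splitext(os.path.basename(i))[0]
        print(fname)
        df.to_excel(writer, sheet_name=fname, index=False)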

FishC posted on 2024-12-24 15:36:47

import glob
import os
import pandas as pd
from multiprocessing import Pool

def process_file(file):
    """Read one Excel file and return its data and sheet name."""
    df = pd.read_excel(file)
    # Sheet name = file name without directory or extension
    fname = os.path.splitext(os.path.basename(file))[0]
    return df, fname

def main():
    flist = glob.glob(r".\result\*.xlsx")

    # Read the files in parallel with a process pool
    with Pool() as pool:
        results = pool.map(process_file, flist)

    # Write sequentially; the context manager saves the file on exit
    with pd.ExcelWriter('总结123.xlsx', engine='xlsxwriter') as writer:
        for df, fname in results:
            df.to_excel(writer, sheet_name=fname, index=False)

if __name__ == "__main__":
    main()

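This is a solution based on multiple processes. It uses the `Pool` class from the `multiprocessing` library to read the Excel files in parallel and so speed up processing. The `process_file` function reads one Excel file and returns its data and file name. `main()` then calls `Pool().map()` to process the list of files in parallel.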
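
On the file-name step, here is a minimal sketch of deriving the sheet name with `pathlib` rather than splitting on `\` and `.` by hand, which would cut a name such as `sales.2024.xlsx` down to `sales`. The helper name `sheet_name_from_path` is hypothetical; the 31-character limit and the forbidden characters are Excel's own sheet-name rules.

```python
import re
from pathlib import PureWindowsPath

# Characters Excel does not allow in a sheet name
_INVALID = re.compile(r"[\[\]:*?/\\]")

def sheet_name_from_path(path: str) -> str:
    """Hypothetical helper: file name without folder or extension, made sheet-safe."""
    stem = PureWindowsPath(path).stem   # accepts both "\\" and "/" separators
    cleaned = _INVALID.sub("_", stem)   # replace forbidden characters
    return cleaned[:31]                 # Excel caps sheet names at 31 characters

if __name__ == "__main__":
    print(sheet_name_from_path(r".\result\sales.2024.xlsx"))
```

The result can be passed directly as `sheet_name=` to `df.to_excel(...)`. Note that truncating to 31 characters could make two long file names collide, which this sketch does not handle.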

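Note: multiprocessing may not speed things up much, especially when I/O dominates the run time. Make sure disk read/write speed is not the bottleneck. You can also try converting the Excel files to a more efficient format for processing, or consider batch tools and other tools that can read and merge Excel files directly.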
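
To see whether reading or writing dominates before parallelizing anything, a small timing helper can be wrapped around each phase. This is a minimal sketch using only the standard library; `timed` and `totals` are hypothetical names, and the `sum(...)` calls merely stand in for `pd.read_excel(...)` and `df.to_excel(...)`.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)  # seconds accumulated per phase label

@contextmanager
def timed(label):
    """Add the wall-clock time spent inside the with-block to totals[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[label] += time.perf_counter() - start

if __name__ == "__main__":
    for _ in range(3):
        with timed("read"):    # in the real script: df = pd.read_excel(path)
            data = sum(range(100_000))
        with timed("write"):   # in the real script: df.to_excel(writer, ...)
            data = sum(range(200_000))
    for label, seconds in totals.items():
        print(f"{label}: {seconds:.4f} s")
```

If the write phase turns out to be the larger share, reading the files in parallel cannot help much, because every sheet still goes through a single writer.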

The reply above is from the FishC AI assistant. If it did not answer your question correctly, please ask a follow-up.

很cool的阳 posted on 2024-12-24 15:40:00

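import glob
import os
import threading
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

write_lock = threading.Lock()

def process_file(file_path, writer):
    df = pd.read_excel(file_path)
    fname = os.path.splitext(os.path.basename(file_path))[0]
    # Serialize writes: the shared writer is not documented as thread-safe
    with write_lock:
        df.to_excel(writer, sheet_name=fname, index=False)

flist = glob.glob(r".\result\*.xlsx")

# The context manager saves and closes the output file on exit
with pd.ExcelWriter('总结123.xlsx') as writer:
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(process_file, path, writer) for path in flist]
        for future in futures:
            future.result()  # wait for every task and surface any exception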

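神兽坤坤 posted on 2024-12-24 16:02:32

很cool的阳 posted on 2024-12-24 15:40

Thank you very much, it is indeed somewhat faster and better than mine, but it still does not feel fast enough. One pass over my data still takes a little over 1 minute 40 seconds, and I would like it faster still, ideally just over a minute. Is there a faster approach or a handy library for this?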

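神兽坤坤 posted on 2024-12-24 16:31:49

FishC posted on 2024-12-24 15:36
This is a solution based on multiple processes. It uses the `Pool` class from the `multiprocessing` library to read the Excel files in parallel, to ...

Are there any other approaches or functions that would make this more efficient and run faster? The faster the better.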

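FishC posted on 2024-12-24 16:44:31

神兽坤坤 posted on 2024-12-24 16:31
Are there any other approaches or functions that would make this more efficient and run ...

You can try the following to make the Excel file operations more efficient: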

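1. **Use the `openpyxl` library**: `pandas` reads `.xlsx` files through `openpyxl`, and you can select it explicitly with `engine='openpyxl'`. Recent pandas versions already use `openpyxl` by default for `.xlsx` (`xlrd` handles only legacy `.xls` files), so on its own this setting is unlikely to change the speed.

2. **Reduce the processing steps**: where possible, do less work on the data, for example drop columns you do not need, or specify the required columns directly when reading.

3. **Read the data in batches**: if a file is very large, consider reading it in batches to keep memory usage down. Note that `pd.read_excel` has no `chunksize` parameter (that belongs to `pd.read_csv`); use `skiprows` and `nrows` to select a block of rows instead.

4. **Use a more efficient file format**: convert the data from Excel to a more efficient format (CSV, Parquet, and so on) for processing, then convert back to Excel afterwards if necessary.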
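
Points 2 and 3 can be sketched with the `usecols`, `skiprows`, and `nrows` parameters of `pd.read_excel`. The function names `read_columns` and `read_batch` and the throwaway `demo.xlsx` file are made up for illustration, and the sketch assumes the first sheet row is the header.

```python
import os
import tempfile
import pandas as pd

def read_columns(path, columns):
    """Read only the named columns (point 2)."""
    return pd.read_excel(path, usecols=columns)

def read_batch(path, start, size):
    """Read `size` data rows starting at data row `start` (point 3).

    Sheet row 0 is the header, so data row k is sheet row k + 1.
    """
    skipped = list(range(1, start + 1))  # data rows before the batch
    return pd.read_excel(path, skiprows=skipped, nrows=size)

if __name__ == "__main__":
    # Build a small throwaway workbook so the sketch runs on its own
    path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
    frame = pd.DataFrame({"a": range(10), "b": range(10, 20), "c": range(20, 30)})
    frame.to_excel(path, index=False)
    print(read_columns(path, ["a", "c"]).shape)
    print(read_batch(path, start=4, size=3)["a"].tolist())
```

Each `read_batch` call opens the workbook again, so batching mainly bounds memory use; it should not be expected to shorten the total run time.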

Here is example code using `openpyxl`:

import glob
import os
import pandas as pd

def main():
    flist = glob.glob(r".\result\*.xlsx")

    # The context manager saves and closes the output file on exit
    with pd.ExcelWriter('总结123.xlsx', engine='openpyxl') as writer:
        for file in flist:
            df = pd.read_excel(file, engine='openpyxl')
            # Sheet name = file name without directory or extension
            fname = os.path.splitext(os.path.basename(file))[0]
            df.to_excel(writer, sheet_name=fname, index=False)

if __name__ == "__main__":
    main()

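Reducing the amount of data and using faster libraries can both help with speed and efficiency. Note that the gain depends heavily on the data structure, file size, and similar factors, so you may need to try and measure several options to find the most suitable one.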


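很cool的阳 posted on 2024-12-24 17:32:10

神兽坤坤 posted on 2024-12-24 16:02
Thank you very much, it is indeed somewhat faster and better than mine, but it still does not feel fast enough. One pass over my data still takes ...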

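import glob
import os
import pandas as pd
from multiprocessing import Pool
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

def process_file(file_path):
    df = pd.read_excel(file_path)
    # Sheet name = file name without directory or extension
    fname = os.path.splitext(os.path.basename(file_path))[0]
    return fname, df

if __name__ == '__main__':
    flist = glob.glob(r".\result\*.xlsx")
    # Read the files in parallel
    with Pool(processes=4) as pool:
        results = pool.map(process_file, flist)

    wb = Workbook()
    if results:
        wb.remove(wb.active)  # drop the empty default sheet
    # Write each DataFrame row by row into its own sheet
    for fname, df in results:
        ws = wb.create_sheet(title=fname)
        for r in dataframe_to_rows(df, index=False, header=True):
            ws.append(r)
    wb.save('总结123.xlsx')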
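
Since the data is only being copied, one further variant is to skip the DataFrame step and stream rows straight from each input to the output using openpyxl's read-only and write-only modes. The following is a minimal sketch, not a measured result: `merge_workbooks` and the output name `merged.xlsx` are hypothetical, it copies cell values only (no formatting), and it takes just the first sheet of each file, as `pd.read_excel` does by default. Whether it is actually faster depends on the data.

```python
import glob
import os
from openpyxl import Workbook, load_workbook

def merge_workbooks(paths, out_path):
    """Copy the first sheet of each input file into its own sheet of out_path."""
    out = Workbook(write_only=True)  # streaming writer, starts with no sheets
    for path in paths:
        # Sheet name = file name without extension, capped at Excel's 31 characters
        name = os.path.splitext(os.path.basename(path))[0][:31]
        ws_out = out.create_sheet(title=name)
        # read_only streams rows; data_only returns cached values, not formulas
        src = load_workbook(path, read_only=True, data_only=True)
        try:
            for row in src.worksheets[0].iter_rows(values_only=True):
                ws_out.append(row)
        finally:
            src.close()
    out.save(out_path)

if __name__ == "__main__":
    paths = sorted(glob.glob(os.path.join("result", "*.xlsx")))
    if paths:
        merge_workbooks(paths, "merged.xlsx")
```

The write-only workbook writes rows out as they are appended, so it keeps memory low for sheets with tens of thousands of rows, at the cost of not being able to go back and edit cells already written.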
