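To parallelize the original code with a process pool (a thread pool is used the same way), modify it as follows:

import os
from concurrent.futures import ProcessPoolExecutor, as_completed
from PyPDF2 import PdfReader

def search_keywords_in_pdf(file_path, keyword):
    # Runs in a worker process; returns (file_path, page_num) for each matching page.
    hits = []
    with open(file_path, 'rb') as f:
        pdf = PdfReader(f)
        for page_num, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ''  # treat a missing result as empty text
            if keyword in text:
                hits.append((file_path, page_num))
    return hits

def search_files(directory, keyword, executor):
    # Submit one task per PDF file, then print the hits as each task finishes.
    futures = []
    for file in os.listdir(directory):
        if file.endswith('.pdf'):
            file_path = os.path.join(directory, file)
            futures.append(executor.submit(search_keywords_in_pdf, file_path, keyword))
    for future in as_completed(futures):
        for file_path, page_num in future.result():
            print(f"found '{keyword}' in {file_path} (Page {page_num})")

if __name__ == '__main__':  # this guard is required for process pools on Windows
    # Directory to search and the keyword to look for
    directory = r'e:\年报'
    keyword = input('Enter the text to search for: ')
    # One worker process per CPU core
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
        search_files(directory, keyword, executor)

The revised code uses ProcessPoolExecutor to spread the work over several processes, one task per PDF file, so a multi-core machine can search many files in parallel. Text extraction is CPU-bound, which is why processes are used here; ThreadPoolExecutor offers the same submit interface and is the lighter choice when the work is mostly waiting on I/O. The executor is passed to search_files as an argument instead of being read from a global, and each worker returns its hits so that all printing happens in the main process.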
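The submit / as_completed pattern can be tried without any PDF files. The following is a minimal sketch, not the original author's code: `find_pages`, `search_documents`, and the in-memory `DOCUMENTS` dictionary are hypothetical names, with plain strings standing in for extracted page text. It uses ThreadPoolExecutor only so that it runs anywhere; a ProcessPoolExecutor could be passed in the same way.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def find_pages(name, pages, keyword):
    """Return (name, page_num) for each page whose text contains keyword."""
    return [(name, n) for n, text in enumerate(pages, start=1) if keyword in text]

def search_documents(documents, keyword, executor):
    """Submit one task per document and collect hits as the tasks finish."""
    futures = [executor.submit(find_pages, name, pages, keyword)
               for name, pages in documents.items()]
    hits = []
    for future in as_completed(futures):
        hits.extend(future.result())
    # Completion order is not deterministic, so sort for a stable result.
    return sorted(hits)

# Hypothetical in-memory documents standing in for extracted PDF page text.
DOCUMENTS = {
    "a.pdf": ["annual revenue rose", "no match here"],
    "b.pdf": ["nothing", "revenue fell", "revenue flat"],
}

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as executor:
        print(search_documents(DOCUMENTS, "revenue", executor))
```

Because tasks finish in an unpredictable order, the sketch sorts the collected hits before returning them.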
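Note that process and thread pools consume system resources: with a large number of PDF files, running too many workers at once can overload the machine. Adjust the max_workers parameter to suit your situation and cap the level of concurrency.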
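One way to pick a value for max_workers is sketched below. This is only an illustration under stated assumptions: `choose_max_workers` is a hypothetical helper, and the default cap of 8 is an arbitrary number to tune for your machine.

```python
import os

def choose_max_workers(n_tasks, cap=8):
    """Pick a pool size: at least 1, and no more than the CPU count,
    the number of tasks, or the given cap (an arbitrary default)."""
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(cpus, n_tasks, cap))

if __name__ == "__main__":
    # With 3 files to search, never start more than 3 workers.
    print(choose_max_workers(3))
```

The result could then be passed as `ProcessPoolExecutor(max_workers=choose_max_workers(len(pdf_paths)))`, where `pdf_paths` is a hypothetical list of the PDF files found in the directory.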