What's wrong?, Python Discussion, Programming Languages Section, FishC Forum

ssqchina posted on 2023-7-18 23:25:08

What's wrong?

from pdf2image import convert_from_path
import pytesseract

# Convert the PDF into a list of page images
def pdf_to_image(pdf_path):
    images = convert_from_path(pdf_path)
    return images

# Run OCR on each image
def recognize_text(images):
    text_list = []
    for image in images:
        text = pytesseract.image_to_string(image, lang='chi_sim')
        text_list.append(text)
    return text_list

# Save the recognized text to a file
def save_text(text_list, file_path):
    with open(file_path, 'w', encoding='utf-8') as f:
        for text in text_list:
            f.write(text + '\n\n')

if __name__ == '__main__':
    # Convert the PDF to images
    images = pdf_to_image(r'f:\123.pdf')

    # Run OCR on the images
    text_list = recognize_text(images)

    # Save the recognized text to a text file
    save_text(text_list, r'f:\123.txt')

Error message
Traceback (most recent call last):
  File "C:\Users\ssq\AppData\Roaming\Python\Python39\site-packages\pdf2image\pdf2image.py", line 568, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "C:\Program Files (x86)\Python39-32\lib\subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Program Files (x86)\Python39-32\lib\subprocess.py", line 1420, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: The system cannot find the file specified.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Python\pdf2ocr.py", line 25, in <module>
    images = pdf_to_image(r'f:\123.pdf')
  File "D:\Python\pdf2ocr.py", line 6, in pdf_to_image
    images = convert_from_path(pdf_path)
  File "C:\Users\ssq\AppData\Roaming\Python\Python39\site-packages\pdf2image\pdf2image.py", line 127, in convert_from_path
    page_count = pdfinfo_from_path(
  File "C:\Users\ssq\AppData\Roaming\Python\Python39\site-packages\pdf2image\pdf2image.py", line 594, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Process finished with exit code 1

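isdkz posted on 2023-7-18 23:25:09

Your problem appears to be that poppler, a dependency of the pdf2image library, is either not installed correctly or not on the system PATH. That is what triggers the error pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

The steps to fix it are:

1. First, make sure poppler is installed. On Windows, you can download a precompiled poppler from [this link](http://blog.alivate.com.au/poppler-windows/). On Linux, install it through your package manager (such as apt or yum), for example sudo apt-get install poppler-utils.

2. Next, add poppler's bin directory to your system PATH. On Windows, proceed as follows:

· In Windows Explorer, right-click This PC (or My Computer) and choose Properties.
· In the menu on the left, choose "Advanced system settings".
· In the System Properties window that opens, click the "Environment Variables" button.
· In the Environment Variables window, find the Path variable under System variables and double-click it.
· In the Edit environment variable window, click New, then browse to your poppler\bin folder.

This should solve your problem. If the error persists after these steps, something else may be causing it; in that case, please describe the problem in more detail.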
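
To confirm that the PATH change took effect before rerunning the OCR script, a quick check along these lines may help. It is only a sketch: it assumes the two Poppler executables that pdf2image calls are pdfinfo and pdftoppm, and the helper name find_poppler_tools is made up for illustration.

```python
import shutil

# Poppler executables that pdf2image invokes (assumed for this sketch).
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def find_poppler_tools(path=None):
    """Map each tool name to its resolved location, or None if it is not found.

    `path` is an optional search path string; None means "use the current PATH".
    """
    return {tool: shutil.which(tool, path=path) for tool in POPPLER_TOOLS}


if __name__ == "__main__":
    for tool, location in find_poppler_tools().items():
        print(f"{tool}: {location or 'NOT FOUND on PATH'}")
```

If either tool is reported as not found, the PATH entry is probably missing or points at the wrong folder. Remember to open a new terminal or restart the IDE first, since already-running programs keep the old PATH.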

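sfqxx posted on 2023-7-18 23:30:16

This error means that poppler is either not installed on your system or has not been added to your system path (PATH).

The pdf2image library relies on the poppler toolkit to convert PDF pages into images. Make sure poppler is installed and has been added to your system path.

Windows users can install and configure poppler as follows:

1. Download the poppler binaries for Windows.

2. Once the download finishes, extract the archive.

3. Add the bin directory inside the extracted folder to your system PATH environment variable, as follows:

· In the computer's Properties, choose "Advanced system settings".
· Click the "Environment Variables" button.
· In the "System variables" area, scroll to and select Path, then click the "Edit" button.
· In the Edit environment variable window, click "New", then paste the path of your poppler bin folder.
· Save the changes and restart your development environment (Command Prompt, PowerShell, PyCharm, and so on).

You should now be able to run your code successfully. If problems remain, recheck your poppler installation and configuration.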
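
If changing the system-wide setting is inconvenient, one alternative is to extend PATH only for the running script. The following is a rough sketch rather than a recommended setup: prepend_to_path is a hypothetical helper, and C:\poppler\bin is a placeholder that must be replaced with the real bin folder.

```python
import os


def prepend_to_path(directory):
    """Move or add `directory` to the front of PATH for this process only.

    The change is inherited by child processes started afterwards, but it
    does not touch the system-wide environment variable.
    """
    current = os.environ.get("PATH", "")
    # Drop empty entries and any existing copy so the directory appears once.
    entries = [e for e in current.split(os.pathsep) if e and e != directory]
    os.environ["PATH"] = os.pathsep.join([directory] + entries)
    return os.environ["PATH"]


if __name__ == "__main__":
    # Placeholder location (an assumption); use your actual poppler bin folder.
    prepend_to_path(r"C:\poppler\bin")
    print(os.environ["PATH"].split(os.pathsep)[0])
```

Call it before the first pdf2image conversion so that the poppler executables can be found when they are launched.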

Please mark this as the best answer.

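ssqchina posted on 2023-7-19 10:42:03

Last edited by ssqchina on 2023-7-19 12:52

isdkz posted on 2023-7-18 23:27
Your problem appears to be that poppler, a dependency of the pdf2image library, is either not installed correctly or not on the system PATH. This causes the pdf2i ...

https://github.com/oschwartz10612/poppler-windows/releases can only be opened through a proxy.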

ssqchina posted on 2023-7-19 12:51:18

Last edited by ssqchina on 2023-7-19 16:31

The working program is in the next post, but it still fails on large PDF files.

ssqchina posted on 2023-7-19 16:23:15

from pdf2image import convert_from_path
import pytesseract

# Locations of the poppler binaries and the Tesseract executable
poppler_path = r'C:\Program Files\poppler-23.07.0\Library\bin'
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Convert the PDF into a list of page images
def pdf_to_image(pdf_path):
    images = convert_from_path(pdf_path, poppler_path=poppler_path)
    return images

# Run OCR on each image
def recognize_text(images):
    text_list = []
    for image in images:
        text = pytesseract.image_to_string(image, lang='chi_sim')
        text_list.append(text)
    return text_list

# Save the recognized text to a file
def save_text(text_list, file_path):
    with open(file_path, 'w', encoding='utf-8') as f:
        for text in text_list:
            f.write(text + '\n\n')

if __name__ == '__main__':
    # Convert the PDF to images
    images = pdf_to_image(r'f:\123.pdf')

    # Run OCR on the images
    text_list = recognize_text(images)

    # Save the recognized text to a text file
    save_text(text_list, r'f:\123.txt')

ssqchina posted on 2023-7-19 16:32:01

Small PDF files are recognized fine, but large ones fail with an error saying something about exceeding capacity.
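
The exact error text is not quoted, so the cause is uncertain. One plausible guess is memory: the program above renders every page before OCR starts, and the traceback paths (Python39-32) suggest a 32-bit Python, which has limited address space. Under that assumption, a sketch that converts and recognizes a few pages at a time might look like the following. The names page_batches and ocr_pdf_in_batches and the batch size of 5 are made up for illustration; first_page, last_page and pdfinfo_from_path come from pdf2image.

```python
def page_batches(page_count, batch_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, page_count + 1, batch_size):
        yield first, min(first + batch_size - 1, page_count)


def ocr_pdf_in_batches(pdf_path, out_path, poppler_path=None,
                       batch_size=5, lang="chi_sim"):
    """OCR a PDF a few pages at a time, writing text to out_path as it goes."""
    # Imported here so page_batches can be used without these packages installed.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    info = pdfinfo_from_path(pdf_path, poppler_path=poppler_path)
    page_count = int(info["Pages"])

    with open(out_path, "w", encoding="utf-8") as f:
        for first, last in page_batches(page_count, batch_size):
            # Only the pages of the current batch are held in memory.
            images = convert_from_path(
                pdf_path,
                first_page=first,
                last_page=last,
                poppler_path=poppler_path,
            )
            for image in images:
                f.write(pytesseract.image_to_string(image, lang=lang) + "\n\n")
```

Usage would be something like ocr_pdf_in_batches(r'f:\123.pdf', r'f:\123.txt', poppler_path=poppler_path). If the error turns out to be unrelated to memory, posting the full traceback would make it easier to diagnose.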

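sfqxx posted on 2023-7-19 16:38:48

ssqchina posted on 2023-7-19 16:32
Small PDF files are recognized fine, but large ones fail with an error saying something about exceeding capacity.

In future, please don't mark a best answer before the problem is actually solved.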
