[Solved] How do I convert an image-only PDF to a txt file?

ssqchina · posted 2023-7-18 16:37:20

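How do I convert an image-only PDF to a txt file? When I open the PDF in Microsoft Edge, I can't search any of the text in it, so the pages are probably stored as images. Any help from the experts here would be appreciated!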

Best answer

sfqxx

2023-7-18 16:58:37

What you need is an Optical Character Recognition (OCR) tool to read the images in the PDF and convert them to text. Python has libraries that can help with this, such as `pdf2image` and `pytesseract`.

First, install the following libraries (if you haven't already):

pip install pdf2image
pip install pytesseract
pip install Pillow

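You also need to install [Tesseract](https://github.com/tesseract-ocr/tesseract/wiki) on your system. Note that `pdf2image` additionally relies on Poppler being installed and on your PATH.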

Then you can use the following code to convert the page images in the PDF to text:

from pdf2image import convert_from_path
import pytesseract

# Path to your PDF file
pdf_path = 'path_to_your_pdf.pdf'

# Convert the PDF into a list of PIL Image objects, one per page
images = convert_from_path(pdf_path)

# Start with an empty string to collect the text
result_text = ''

# Loop over all page images
for img in images:
    # OCR the image; lang='chi_sim' selects Simplified Chinese recognition
    text = pytesseract.image_to_string(img, lang='chi_sim')

    # Append the recognized text to the result
    result_text += text

# Save the result to a txt file
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(result_text)

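Note: this is only a basic example, and you may need to adapt it to your needs. Depending on your environment, you may have to specify the path to Tesseract (via `pytesseract.pytesseract.tesseract_cmd`), and when processing large PDF files you may need to think about memory management.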
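
As a minimal sketch of the `tesseract_cmd` point in the note above, the helper below looks for the Tesseract executable and points pytesseract at it. The function names `resolve_tesseract` and `configure_pytesseract`, and the install path in the usage comment, are assumptions made for illustration and are not part of the original answer.

```python
import shutil
from pathlib import Path


def resolve_tesseract(candidates=()):
    """Return the first existing Tesseract executable path, or None.

    `candidates` are explicit paths to try first (assumed install
    locations); if none exists, fall back to searching PATH.
    """
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return shutil.which("tesseract")


def configure_pytesseract(candidates=()):
    """Point pytesseract at the located executable; raise if none is found."""
    # Imported lazily so resolve_tesseract works without pytesseract installed.
    import pytesseract

    path = resolve_tesseract(candidates)
    if path is None:
        raise FileNotFoundError("Tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = path
    return path


# Usage (the path is an assumed location; change it to match your system):
# configure_pytesseract([r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
```

Calling `configure_pytesseract` once before any `image_to_string` call should be enough, since `tesseract_cmd` is a module-level setting.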

Please mark this as the best answer.

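ssqchina · posted 2023-7-18 16:49:44

陶远航 posted 2023-7-18 16:37
If you've run into an image-only PDF, meaning the text in the PDF can't be searched or copied, it may be because the text was saved as images ...

How can this be done in Python?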

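ssqchina · posted 2023-7-18 16:51:50

isdkz posted 2023-7-18 16:37
Your question is about converting an image-only PDF to a txt file. PDFs like this usually come from scanned images, so it isn't possible to directly extract their ...

So Python can't solve this problem perfectly yet.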

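ssqchina · posted 2023-7-18 21:38:11

陶远航 posted 2023-7-18 16:57
You can use an OCR (optical character recognition) library in Python to convert an image-only PDF into a searchable text file. An OCR library can help you ...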

import pytesseract
from pdf2image import convert_from_path

def pdf_to_txt(pdf_path, txt_path):
    # Convert the PDF into a list of images
    images = convert_from_path(pdf_path)

    # Create an empty text file
    with open(txt_path, 'w') as f:
        # Run OCR on each image and write the result to the text file
        for i, image in enumerate(images):
            text = pytesseract.image_to_string(image, lang='eng')
            f.write(f'Page {i+1}:\n\n{text}\n\n')

    print(f'Conversion complete! Text file saved to: {txt_path}')

# Usage example
pdf_path = 'f:\\'+input('Enter the name of the file to convert: ')+'.pdf'
txt_path = 'f:\\'+input('Enter the name of the file to save: ')+'.txt'
pdf_to_txt(pdf_path, txt_path)

Error message
Enter the name of the file to convert: 456
Enter the name of the file to save: 789
Traceback (most recent call last):
  File "C:\Users\ssq\AppData\Roaming\Python\Python39\site-packages\pdf2image\pdf2image.py", line 568, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "C:\Program Files (x86)\Python39-32\lib\subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Program Files (x86)\Python39-32\lib\subprocess.py", line 1420, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Python\pdf2ocr.py", line 20, in <module>
    pdf_to_txt(pdf_path, txt_path)
  File "D:\Python\pdf2ocr.py", line 6, in pdf_to_txt
    images = convert_from_path(pdf_path)
  File "C:\Users\ssq\AppData\Roaming\Python\Python39\site-packages\pdf2image\pdf2image.py", line 127, in convert_from_path
    page_count = pdfinfo_from_path(
  File "C:\Users\ssq\AppData\Roaming\Python\Python39\site-packages\pdf2image\pdf2image.py", line 594, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
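
The `PDFInfoNotInstalledError` above indicates that pdf2image could not launch Poppler's `pdfinfo` program, which usually means Poppler is not installed or its `bin` folder is not on PATH. The sketch below checks for `pdfinfo` before converting and passes an explicit folder through pdf2image's `poppler_path` parameter. The helper names `find_poppler_bin` and `convert_pdf`, and the folder in the usage comment, are hypothetical.

```python
import shutil


def find_poppler_bin(poppler_path=None):
    """Return the full path of Poppler's `pdfinfo` tool, or None if not found.

    `poppler_path` is an optional folder to search instead of PATH.
    """
    return shutil.which("pdfinfo", path=poppler_path)


def convert_pdf(pdf_path, poppler_path=None):
    """Convert a PDF to page images, failing early if Poppler is missing."""
    if find_poppler_bin(poppler_path) is None:
        raise RuntimeError(
            "Poppler's pdfinfo was not found; install Poppler and add its "
            "bin folder to PATH, or pass poppler_path explicitly"
        )
    # Imported lazily so the check above works without pdf2image installed.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, poppler_path=poppler_path)


# Usage (the folder is an assumed location; use wherever you unpacked Poppler):
# images = convert_pdf(r"f:\456.pdf", poppler_path=r"C:\poppler\Library\bin")
```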

ssqchina · posted 2023-7-18 21:40:51

sfqxx posted 2023-7-18 16:58
What you need is an Optical Character Recognition (OCR) tool to read the images in the PDF and convert them to text. Python has some ...

from pdf2image import convert_from_path
import pytesseract
from PIL import Image
import io

# Path to your PDF file
pdf_path = 'f:\\'+input('Enter the name of the file to convert: ')+'.pdf'

# Convert the PDF into a list of PIL Image objects
images = convert_from_path(pdf_path)

# Start with an empty string to collect the text
result_text = ''

# Loop over all page images
for i, img in enumerate(images):
    # OCR the image
    text = pytesseract.image_to_string(img, lang='chi_sim')  # 'chi_sim' selects Simplified Chinese recognition

    # Append the recognized text to the result
    result_text += text

# Save the result to a txt file
with open('f:\\'+input('Enter the name of the file to save: ')+'.txt', 'w', encoding='utf-8') as file:
    file.write(result_text)

Error message
Enter the name of the file to convert: 456
Traceback (most recent call last):
  File "C:\Users\ssq\AppData\Roaming\Python\Python39\site-packages\pdf2image\pdf2image.py", line 568, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "C:\Program Files (x86)\Python39-32\lib\subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Program Files (x86)\Python39-32\lib\subprocess.py", line 1420, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Python\pdf2ocr.py", line 10, in <module>
    images = convert_from_path(pdf_path)
  File "C:\Users\ssq\AppData\Roaming\Python\Python39\site-packages\pdf2image\pdf2image.py", line 127, in convert_from_path
    page_count = pdfinfo_from_path(
  File "C:\Users\ssq\AppData\Roaming\Python\Python39\site-packages\pdf2image\pdf2image.py", line 594, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Process finished with exit code 1

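ssqchina · posted 2023-7-19 16:25:48

陶远航 posted 2023-7-18 16:57
You can use an OCR (optical character recognition) library in Python to convert an image-only PDF into a searchable text file. An OCR library can help you ...

What `pip install tesseract` installs doesn't come with Chinese support.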
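
On the missing Chinese support: the Tesseract OCR engine is a separate program rather than a pip package, and each language it can recognize comes from a `.traineddata` file (for example `chi_sim.traineddata`) in its `tessdata` folder. A rough way to see which languages are available is to ask the command-line tool with `tesseract --list-langs`. The sketch below does that from Python; the function names `parse_list_langs` and `installed_languages` are hypothetical, and it assumes the `tesseract` executable can be found on PATH.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # The first line is a header such as "List of available languages (3):".
    return [line for line in lines if line and not line.startswith("List of")]


def installed_languages(tesseract_cmd="tesseract"):
    """Run the Tesseract CLI and return the language codes it reports."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list to stderr rather than stdout,
    # so both streams are parsed; stray warnings may need extra filtering.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


# Usage:
# if "chi_sim" not in installed_languages():
#     print("chi_sim.traineddata is missing from the tessdata folder")
```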

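ssqchina · posted 2023-7-19 16:36:33

sfqxx posted 2023-7-18 16:58
What you need is an Optical Character Recognition (OCR) tool to read the images in the PDF and convert them to text. Python has some ...

Your reply included the Chinese recognition setting, so I'm giving you best answer. However, my modified version only works on small PDF files; large files fail with an error saying a capacity limit was exceeded.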
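
The exact error for large files isn't shown in the thread, so the following is only a guess at a workaround. If the failure is memory related, one option is to render and OCR a few pages at a time using pdf2image's `first_page` and `last_page` parameters, writing text to disk as you go instead of holding every page image at once. The names `page_ranges` and `ocr_pdf_in_batches` and the default batch size are assumptions for illustration.

```python
def page_ranges(total_pages, batch_size=5):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(pdf_path, txt_path, lang="chi_sim", batch_size=5, dpi=200):
    """OCR a PDF a few pages at a time, writing text to txt_path as it goes."""
    # Imported lazily so page_ranges can be used without the OCR libraries.
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    with open(txt_path, "w", encoding="utf-8") as out:
        for first, last in page_ranges(total_pages, batch_size):
            # Only this batch of pages is rendered and held in memory.
            images = convert_from_path(
                pdf_path, dpi=dpi, first_page=first, last_page=last
            )
            for image in images:
                out.write(pytesseract.image_to_string(image, lang=lang))
                out.write("\n")


# Usage (paths taken from the posts above):
# ocr_pdf_in_batches("f:\\456.pdf", "f:\\789.txt")
```

Lowering `dpi` or `batch_size` should reduce peak memory further, at some cost in recognition quality or speed.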
