[Solved] A strange problem with dict

blackantt · Posted 2023-4-28 11:18:21


Last edited by blackantt on 2023-4-28 11:21

import typing
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleTextExtraction


def main():

    # read the Document
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("whose-button-is-this.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])

    # check whether we have read a Document
    assert doc is not None

    # print the extracted text, the number of entries, and the container type
    print(l.get_text(), len(l.get_text()), type(l.get_text()))   # <class 'dict'>

    print(l.get_text()[3])
    for i in range(0, len(l.get_text())):                        # adding these 2 lines causes the error
        print('----------------page', i, l.get_text()[i])        # adding these 2 lines causes the error




if __name__ == "__main__":
    main()

----------------page 0 Whose
button
is this?
James Woolley Paul Kennedy Louise Gale
----------------page 1 Whose button is this?
This book belongs to
Traceback (most recent call last):
File "bord抽取pdf里的text到列表1.py", line 29, in <module>
main()
File "bord抽取pdf里的text到列表1.py", line 23, in main
print('----------------page', i, l.get_text()[i])
KeyError: 2

Best answer

isdkz

2023-4-28 12:22:47

blackantt posted on 2023-4-28 11:32
I may have been iterating over the dict the wrong way

for key in l.get_text():

Your first attempt raised a `KeyError`, which means the code looked up a key that the dict does not contain. Here the loop asked for the third page (key 2), but the dict has no such key.
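The failure can be reproduced without borb. This is a minimal sketch, assuming only what the traceback shows: the dict has no entry for key 2. The `pages` dict and its texts are made-up placeholder data standing in for `l.get_text()`, not the real PDF contents.

```python
# Hypothetical stand-in for l.get_text(): there is no entry for page 2.
pages = {0: "cover", 1: "title page", 3: "first story page"}

# len() counts entries; it says nothing about which keys exist.
print(len(pages))

missing = []
for i in range(len(pages)):
    try:
        pages[i]
    except KeyError:
        # range(len(pages)) visits 0, 1, 2, and key 2 is absent
        missing.append(i)

print(missing)
```

The point of the sketch is that `range(len(d))` only matches a dict's keys when those keys happen to be exactly 0, 1, 2, ... with no gaps.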

To iterate over all the pages in the dict, you can use the following code:

# Iterate through the pages
for page, text in l.get_text().items():
    print(f"Page {page}: {text}")

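This prints the page number and text of every page in the dict. Note that the keys are not guaranteed to run consecutively from 0 to `len(l.get_text()) - 1`: in your output the lookups for 0 and 1 succeeded and only 2 was missing. `.items()` iterates over the keys and values that actually exist, so it never asks for a missing page.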

If you only want to iterate over the text, you can use the following code:

# Iterate through all text
for text in l.get_text().values():
    print(text)

This prints the text of every page in the dict.
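If the page order matters, or a specific page number has to be looked up, two standard dict idioms may help: iterating over `sorted(...)` keys, and `.get()` with a default value. The sketch below uses placeholder data; `pages` stands in for `l.get_text()` and `text_for_page` is a hypothetical helper, not part of borb.

```python
from typing import Dict


def text_for_page(pages: Dict[int, str], page: int) -> str:
    """Return the text stored under `page`, or an empty string if the key is absent."""
    return pages.get(page, "")


# Placeholder data standing in for l.get_text(); note the missing key 2.
pages = {3: "first story page", 0: "cover", 1: "title page"}

# sorted() on a dict yields its keys in ascending order.
for page in sorted(pages):
    print(f"Page {page}: {pages[page]}")

# A lookup for a missing page falls back to the default instead of raising KeyError.
print(repr(text_for_page(pages, 2)))
```

Whether an empty string is the right fallback depends on what the caller does with the text; returning `None` instead would make a missing page easier to tell apart from an empty one.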


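isdkz · Posted 2023-4-28 11:21:02

Your code reads a PDF file named "whose-button-is-this.pdf" and tries to extract its content as text. The `SimpleTextExtraction` class extracts the text into a dict whose keys are page numbers (starting from 0) and whose values are the text of the corresponding pages. You mention a "strange problem" with this dict but do not describe it. Here are some suggestions for what you may be running into.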

1. Your code indexes the dict incorrectly. The dict returned by `l.get_text()` is not indexed by position the way a list is: `l.get_text()[3]` looks up the key 3. You need to use a page number as the key to get the text of that page.

The improved code is as follows:

def main():

    # read the Document
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("whose-button-is-this.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])

    # check whether we have read a Document
    assert doc is not None

    # print the text on the first Page
    print(l.get_text(), len(l.get_text()), type(l.get_text()))  # <class 'dict'>

    # print the text on the third page
    print(l.get_text()[2])

    # Iterate through the pages
    for i in range(0, len(l.get_text())):
        print('----------------page', i, l.get_text()[i])


if __name__ == "__main__":
    main()

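Note that I changed `l.get_text()[3]` to `l.get_text()[2]`, because page numbers start at 0.

2. If you run into other problems when running the code, please post the full error message so that we can give more specific help.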
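When a single page is wanted by number, a membership test with `in` avoids the exception and shows which keys are available. A small sketch with placeholder data (`pages` again stands in for `l.get_text()`; the texts are invented):

```python
# Placeholder data standing in for l.get_text(); key 2 is deliberately absent.
pages = {0: "cover", 1: "title page", 3: "first story page"}

wanted = 2
if wanted in pages:
    # Safe to index: the key is known to exist.
    message = pages[wanted]
else:
    # Report the keys that do exist instead of raising KeyError.
    message = f"no text stored for page {wanted}; available keys: {sorted(pages)}"

print(message)
```

Printing the available keys this way is often the quickest check when a lookup fails unexpectedly.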

blackantt · Posted 2023-4-28 11:25:53

isdkz posted on 2023-4-28 11:21
Your code reads a PDF file named "whose-button-is-this.pdf" and tries to extract its content as text. `SimpleTextExtr ...

Traceback (most recent call last):
File "c:\Users\dengz\Downloads\pdf\1\Untitled-1.py", line 29, in <module>
main()
File "c:\Users\dengz\Downloads\pdf\1\Untitled-1.py", line 22, in main
print(l.get_text()[2])
KeyError: 2

blackantt · Posted 2023-4-28 11:32:52

Last edited by blackantt on 2023-4-28 11:39

isdkz posted on 2023-4-28 11:21
Your code reads a PDF file named "whose-button-is-this.pdf" and tries to extract its content as text. `SimpleTextExtr ...

I may have been iterating over the dict the wrong way

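# print every key (page number)
for key in l.get_text():
    print('----------------page', key)
# print every value (page text)
for value in l.get_text().values():
    print(value)

# print each key together with its text
for key in l.get_text():
    print(key, str(l.get_text()[key]))
    print('------------------------------')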
