|
Posted on 2023-5-1 16:58:35
|
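With a two-column PDF, table extraction can go wrong because the text on each page is not necessarily laid out in a single left-to-right order. You can try to work around this by passing the table_settings parameter. For example, you can give the left and right coordinates of the table edges (via explicit_vertical_lines) so that they fit the page width. Here is a modified code example: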
- import pdfplumber
- import pandas as pd
-
- pd.set_option("display.max_columns", None)
- pdf_path = "乐清湾盐沼湿地有机碳密度及碳储量估算_陈雅慧.pdf"
-
- # Settings not listed here keep pdfplumber's defaults.
- TABLE_SETTINGS = {
-     "vertical_strategy": "lines",
-     "horizontal_strategy": "text",
-     # Put x / y coordinates here to pin the table edges explicitly
-     "explicit_vertical_lines": [],
-     "explicit_horizontal_lines": [],
-     "snap_tolerance": 0.01,
-     "join_tolerance": 0.1,
-     "edge_min_length": 2,
-     "min_words_vertical": 10,
-     "min_words_horizontal": 1,
-     "text_tolerance": 0.3,
-     "intersection_tolerance": 0.5,
- }
-
- def read_page(page):
-     # Pass table_settings to tune table detection
-     tables = page.extract_tables(table_settings=TABLE_SETTINGS)
-     for i, table in enumerate(tables):
-         # Strip line breaks inside each cell; empty cells come back as None
-         cleaned = [[(cell or "").replace("\n", "") for cell in row] for row in table]
-         df = pd.DataFrame(cleaned[1:], columns=cleaned[0])
-         print(df)
-         df.to_csv("stu_info.csv", mode="a", encoding="utf-8", index=False)
-         print(str(i) + " table extracted successfully")
-         print("\n")
-
- with pdfplumber.open(pdf_path) as pdf:
-     for k, page in enumerate(pdf.pages):
-         print("page " + str(k) + " is being parsed")
-         try:
-             read_page(page)
-         except Exception as e:
-             # Assumed handler (the snippet is cut off here): report and move on
-             print("page " + str(k) + " failed: " + str(e))
|
|
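If tuning table_settings alone does not help with the two-column layout, another thing you could try is cropping each page into a left and a right half and extracting tables from each half separately. This is a different technique from the settings above, and only a rough sketch: the helper names `column_bboxes` and `extract_tables_by_column` and the `gutter` parameter are made up for illustration, and it assumes the two columns split the page roughly down the middle. It relies only on pdfplumber's `page.bbox`, `page.crop()` and `extract_tables()`.

```python
def column_bboxes(bbox, gutter=0.0):
    """Split a page bounding box (x0, top, x1, bottom) into left and right halves.

    `gutter` is a hypothetical parameter: the width of the blank strip
    between the two columns, removed symmetrically around the midpoint.
    """
    x0, top, x1, bottom = bbox
    mid = (x0 + x1) / 2
    left = (x0, top, mid - gutter / 2, bottom)
    right = (mid + gutter / 2, top, x1, bottom)
    return left, right


def extract_tables_by_column(page, table_settings=None, gutter=0.0):
    """Run extract_tables() on each half of a pdfplumber page, left column first."""
    tables = []
    for column_bbox in column_bboxes(page.bbox, gutter):
        # crop() returns a page-like object restricted to the given box
        column = page.crop(column_bbox)
        tables.extend(column.extract_tables(table_settings=table_settings))
    return tables
```

Inside read_page you could then call extract_tables_by_column(page, ...) in place of page.extract_tables(...). Note that a table spanning both columns would be cut in two by this approach, so it probably only suits pages where every table sits inside one column.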