This post was last edited by Poiink on 2019-12-1 15:14
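I've recently been trying to write a small script that extracts the text of a PDF file and reports the page numbers where given terms appear. From material I found online I learned that the pdfminer module is needed, so I installed it with pip (the installation result is in the image at the end of the post). However, when I tried the imports from the sample code I found online, only some of the names imported successfully; the others raised errors.
The code I'm trying to follow is: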
- # -*- coding: utf-8 -*-
- from pdfminer.pdfparser import PDFParser,PDFDocument
- from pdfminer.pdfinterp import PDFResourceManager,PDFPageInterpreter
- from pdfminer.converter import PDFPageAggregator
- from pdfminer.layout import LTTextBoxHorizontal,LAParams
- from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
- def parsePDFtoTXT(pdf_path):
-     fp = open(pdf_path, 'rb')
-     parser = PDFParser(fp)
-     document= PDFDocument()
-     parser.set_document(document)
-     document.set_parser(parser)
-     document.initialize()
-     if not document.is_extractable:
-         raise PDFTextExtractionNotAllowed
-     else:
-         rsrcmgr=PDFResourceManager()
-         laparams=LAParams()
-         device=PDFPageAggregator(rsrcmgr,laparams=laparams)
-         interpreter=PDFPageInterpreter(rsrcmgr,device)
-         for page in document.get_pages():
-             interpreter.process_page(page)
-             layout=device.get_result()
-             print(layout)
-             output=str(layout)
-             for x in layout:
-                 if (isinstance(x,LTTextBoxHorizontal)):
-                     text=x.get_text()
-                     output+=text
-             with open('C:\\Users\\user\\Desktop\\pdfoutput.txt','a',encoding='utf-8') as f:
-                 f.write(output)
- def get_word_page(word_list):
-     f=open('C:\\Users\\user\\Desktop\\pdfoutput.txt',encoding='utf-8')
-     text_list=f.read().split('<LTPage')
-     n=len(text_list)
-     for w in word_list:
-         page_list=[]
-         for i in range(1,n):
-             if w in text_list[i]:
-                 page_list.append(i)
-         with open('C:\\Users\\user\\Desktop\\result.txt','a',encoding='utf-8') as f:
-             f.write(w+str(page_list)+'\n')
-
- if __name__=='__main__':
-     parsePDFtoTXT('C:\\Users\\user\\Desktop\\群体药动学原理建立卡马西平和丙戊酸的定时定量给药模型及临床应用_林玮玮.pdf')
-     get_word_page(['群体药动学','服药时间','知情同意书','NONMEM','贝叶斯反馈'])
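To narrow down which of the imports above actually fail, a small diagnostic along these lines might help. This is only a sketch: `check_imports` and `WANTED` are hypothetical names made up for illustration, and the module/name pairs are simply copied from the import lines in the post. One possible cause (an assumption, not something confirmed here) is that the pip-installed distribution lays out its modules differently from the one the sample code was written for.

```python
import importlib


def check_imports(wanted):
    """Report, for each (module, name) pair, whether it can be found.

    `wanted` maps a module path to the list of names expected inside it.
    Returns a dict keyed by (module, name) with one of three statuses:
    "ok", "name not found", or "module not found".
    """
    report = {}
    for module_name, names in wanted.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # The module itself is missing, so every name in it is too.
            for name in names:
                report[(module_name, name)] = "module not found"
            continue
        for name in names:
            found = hasattr(module, name)
            report[(module_name, name)] = "ok" if found else "name not found"
    return report


# Module/name pairs copied from the import lines in the post.
WANTED = {
    "pdfminer.pdfparser": ["PDFParser", "PDFDocument"],
    "pdfminer.pdfinterp": [
        "PDFResourceManager",
        "PDFPageInterpreter",
        "PDFTextExtractionNotAllowed",
    ],
    "pdfminer.converter": ["PDFPageAggregator"],
    "pdfminer.layout": ["LTTextBoxHorizontal", "LAParams"],
}

if __name__ == "__main__":
    for (module_name, name), status in check_imports(WANTED).items():
        print(f"{module_name}.{name}: {status}")
```

Any entry reported as "name not found" means the module exists but does not define that name in the installed version, which would point to a version or distribution mismatch rather than a failed installation.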