There is pdftotext which does basically the same but this assumes pdftotext in /usr/local/bin whereas I am using this in AWS lambda and wanted to use it from the current directory.ītw: For using this on lambda you need to put the binary and the dependency to libstdc++.so into your lambda function. Res = n(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE) SCRIPT_DIR = os.path.dirname(os.path.abspath(_file_)) pikepdf does not support text extraction ( source)Īfter trying textract (which seemed to have too many dependencies) and pypdf2 (which could not extract text from the pdfs I tested with) and tika (which was too slow) I ended up using pdftotext from xpdf (as already suggested in another answer) and just called the binary from python directly (you may need to adapt the path to pdftotext): import os, subprocess.Pymupdf import fitz # install using: pip install PyMuPDF Please note that those packages are not maintained: Give it a try :-) from pypdf import PdfReader The community improved the text extraction a lot in 2022. I became the maintainer of pypdf and PyPDF2 in 2022! □ Having said that, the results from November 2022: That means if your use-case requires those points, you might perceive the quality differently.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |