There are many useful PDF documents on the internet. They come in handy when we need data for training.
PyPDF2 is a library for manipulating PDF files via Python.
You can install PyPDF2 via pip.
pip install PyPDF2
PdfFileReader Class - Official PyPDF2 document
Open the PDF file in binary read mode (rb), then pass the file object to PyPDF2.PdfFileReader(file_object).
import PyPDF2

# Open in binary mode and keep the file open while the reader is in use
with open("sample.pdf", mode="rb") as in_file:
    pdf_reader = PyPDF2.PdfFileReader(in_file)
    the_number_of_pages = pdf_reader.getNumPages()
    page_one = pdf_reader.getPage(0)  # page numbers start at 0
    page_one_text = page_one.extractText()
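For NLP work you usually want the text of every page, not just the first one. The following is a minimal sketch of that loop, using only the reader calls shown in this post (getNumPages, getPage, extractText). The helper names extract_all_text and read_pdf_text are made up for illustration. Note that newer PyPDF2 releases rename these calls (PdfReader, reader.pages, extract_text), so this assumes an older release where the names used in this post still exist.

```python
def extract_all_text(pdf_reader):
    """Join the text of every page of an already-open reader object.

    pdf_reader is assumed to offer getNumPages() and getPage(n),
    like the PyPDF2.PdfFileReader used in this post.
    """
    pages_text = []
    for page_number in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(page_number)
        pages_text.append(page.extractText())
    # One newline between pages keeps page boundaries visible
    return "\n".join(pages_text)


def read_pdf_text(path):
    """Hypothetical convenience wrapper: open a file and extract all text."""
    # Imported here so extract_all_text stays usable without PyPDF2 installed
    import PyPDF2

    with open(path, mode="rb") as in_file:
        return extract_all_text(PyPDF2.PdfFileReader(in_file))
```

Calling read_pdf_text("sample.pdf") would return the whole document as one string. How clean that text is depends heavily on how the PDF was produced.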
PyPDF2 also provides a writer class.
PdfFileWriter Class - Official PyPDF2 document
But it can’t generate a new PDF file from scratch. More precisely, PdfFileWriter can copy pages from an existing PDF and save them to another file, but it can’t modify their content or write new text into a document [1].
To generate a new PDF with your own text, ReportLab is a better fit.
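The copying that PdfFileWriter can do looks roughly like this. It is only a sketch under the same assumption of an older PyPDF2 release with the camelCase API (addPage, write). The names copy_pages and copy_first_page are hypothetical.

```python
def copy_pages(pdf_reader, pdf_writer, page_numbers):
    """Add the selected pages of pdf_reader to pdf_writer, unchanged."""
    for page_number in page_numbers:
        pdf_writer.addPage(pdf_reader.getPage(page_number))
    return pdf_writer


def copy_first_page(src_path, dst_path):
    """Hypothetical example: save page 0 of src_path as a new file."""
    # Imported here so copy_pages stays usable without PyPDF2 installed
    import PyPDF2

    # Keep the input open until write() finishes, since the output is
    # written while the source file is still being read from
    with open(src_path, "rb") as in_file, open(dst_path, "wb") as out_file:
        pdf_reader = PyPDF2.PdfFileReader(in_file)
        pdf_writer = copy_pages(pdf_reader, PyPDF2.PdfFileWriter(), [0])
        pdf_writer.write(out_file)
```

The pages go into the new file as they are. Nothing here changes the text on them.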
ReportLab - PDF Library User Guide
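For comparison, here is a small sketch of writing plain text lines to a new PDF with ReportLab's canvas. I have not tested it as part of this post, so treat it as a starting point. The function names layout_lines and write_text_pdf, and the coordinate values, are my own assumptions. The ReportLab calls used are canvas.Canvas, drawString, showPage and save.

```python
def layout_lines(lines, left=72, top=770, line_height=14):
    """Return an (x, y, text) tuple for each line, in points.

    The PDF origin is the bottom-left corner, so y decreases
    as we move down the page.
    """
    return [(left, top - index * line_height, text)
            for index, text in enumerate(lines)]


def write_text_pdf(path, lines):
    """Hypothetical helper: draw the lines on a single A4 page."""
    # Imported here so layout_lines stays usable without reportlab installed
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    pdf_canvas = canvas.Canvas(path, pagesize=A4)
    for x, y, text in layout_lines(lines):
        pdf_canvas.drawString(x, y, text)
    pdf_canvas.showPage()  # finish the current page
    pdf_canvas.save()      # write the file to disk
```

For example, write_text_pdf("hello.pdf", ["Hello", "PDF"]) should produce a one-page file. This sketch does no line wrapping or page breaking, so long input would run off the page.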
I only use PyPDF2 to extract text from PDF files for NLP, so… I gave up investigating how to write them :(
[1] As far as I have tried, I haven’t found a way to generate a PDF file that contains simple text with PyPDF2.