PyPDF2 - read text from a PDF file

March 29, 2020

Concept

There are many useful PDF documents on the internet, and they are a valuable source of text when we collect training data.

PyPDF2 is a library for manipulating PDF files via Python.

Install

You can install PyPDF2 via pip.

pip install PyPDF2

How to use - read a PDF file

PdfFileReader Class - Official PyPDF2 document

Open the file in binary read mode (rb).
Then pass the file object to PyPDF2.PdfFileReader(file_object).

import PyPDF2

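# Open in binary mode; do all the reading while the file is still open.
with open("sample.pdf", mode="rb") as in_file:
    pdf_reader = PyPDF2.PdfFileReader(in_file)
    # Total number of pages in the document
    the_number_of_pages = pdf_reader.getNumPages()
    # Pages are zero-indexed, so 0 is the first page
    page_one = pdf_reader.getPage(0)
    # Extract the text of that page as a string
    page_one_text = page_one.extractText()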
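
The snippet above reads only the first page. As a minimal sketch, the same calls can be looped over every page. The helper names extract_pages and read_pdf_text are hypothetical (they are not part of PyPDF2), and the code assumes the same legacy API used in this post. Newer PyPDF2 releases (3.x) rename these calls to PdfReader, len(reader.pages), reader.pages[i] and extract_text(), so adjust the names if you are on a recent version.

```python
def extract_pages(pdf_reader):
    """Return the extracted text of every page as a list of strings.

    Assumes pdf_reader offers the legacy PyPDF2 methods used in this post:
    getNumPages() and getPage(index), with extractText() on each page.
    """
    texts = []
    for page_number in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(page_number)
        texts.append(page.extractText())
    return texts


def read_pdf_text(path):
    """Open the PDF at `path` and return its per-page text (hypothetical helper)."""
    import PyPDF2  # imported here so extract_pages stays usable without it

    # Extraction has to happen while the file is still open.
    with open(path, mode="rb") as in_file:
        return extract_pages(PyPDF2.PdfFileReader(in_file))
```

Calling read_pdf_text("sample.pdf") would then give one string per page, which can be joined or cleaned up further before using it as NLP input.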

Write a PDF file - don’t try with PyPDF2

At first glance, PyPDF2 does have a writer class.

PdfFileWriter Class - Official PyPDF2 document

But it can’t generate a new PDF file from scratch. More precisely, PdfFileWriter can copy pages from an existing PDF and save them to another file, but it can’t modify their content or write a new text document¹.
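
The copying that PdfFileWriter does support can be sketched roughly as follows. This assumes the legacy PyPDF2 calls PdfFileWriter(), addPage(page) and write(stream); the helper names copy_pages and copy_first_page are hypothetical and only here for illustration.

```python
def copy_pages(pdf_reader, pdf_writer, page_numbers):
    """Add the selected pages of pdf_reader to pdf_writer, unchanged.

    Assumes the legacy PyPDF2 methods getPage(index) and addPage(page).
    Returns how many pages were added.
    """
    count = 0
    for page_number in page_numbers:
        pdf_writer.addPage(pdf_reader.getPage(page_number))
        count += 1
    return count


def copy_first_page(src_path, dst_path):
    """Save page 0 of src_path as a new one-page PDF (hypothetical helper)."""
    import PyPDF2  # imported here so copy_pages stays usable without it

    # Keep the source file open until the writer has written its output.
    with open(src_path, mode="rb") as in_file:
        pdf_writer = PyPDF2.PdfFileWriter()
        copy_pages(PyPDF2.PdfFileReader(in_file), pdf_writer, [0])
        with open(dst_path, mode="wb") as out_file:
            pdf_writer.write(out_file)
```

Note that the pages are copied as they are: nothing in this sketch creates or edits text on a page.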

To generate a new PDF file, “reportlab” is the better choice.

ReportLab - PDF Library User Guide
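
As a rough sketch of what writing simple text with reportlab can look like, the code below uses its pdfgen canvas (Canvas, drawString, showPage, save). The helper names draw_lines and write_text_pdf, the file name, and the default margins are assumptions made for illustration, and the sketch does not handle text that runs past the bottom of the page.

```python
def draw_lines(pdf_canvas, lines, x=72, top=720, leading=14):
    """Draw each string in `lines` on its own row, from top to bottom.

    Coordinates are in points (1/72 inch) measured from the bottom-left
    corner of the page, so y decreases as we move down the page.
    Returns the list of y positions that were used.
    """
    positions = []
    for index, line in enumerate(lines):
        y = top - index * leading
        pdf_canvas.drawString(x, y, line)
        positions.append(y)
    return positions


def write_text_pdf(path, lines):
    """Create a one-page PDF at `path` containing `lines` (hypothetical helper)."""
    from reportlab.pdfgen import canvas  # needs: pip install reportlab

    pdf_canvas = canvas.Canvas(path)
    draw_lines(pdf_canvas, lines)
    pdf_canvas.showPage()  # finish the current page
    pdf_canvas.save()      # write the file to disk
```

For example, write_text_pdf("hello.pdf", ["Hello", "PDF"]) should produce a single page with two lines of text near the top-left corner.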

I use PyPDF2 to extract text from PDF files for NLP, so I gave up investigating how to write PDFs with it :(

¹ As far as I have tried, I haven’t found a way to generate a PDF file containing even simple text with PyPDF2.