PyPDF2 - read text from PDF files

Concept

There are many useful PDF documents on the internet, and they can be a valuable source of training data.

PyPDF2 is a library for manipulating PDF files via Python.

PyPDF2 Official Documentation

Install

You can install PyPDF2 via pip.

pip install PyPDF2

How to use - read a PDF file

PdfFileReader Class - Official PyPDF2 document

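  • Open the file in binary read mode (rb).
  • Pass the file object to PyPDF2.PdfFileReader(file_object).

import PyPDF2

# These names match PyPDF2 releases before 3.0; PyPDF2 3.x uses
# PdfReader, len(reader.pages), reader.pages[0] and extract_text().
with open("sample.pdf", mode="rb") as in_file:
    pdf_reader = PyPDF2.PdfFileReader(in_file)
    # Total number of pages in the document
    the_number_of_pages = pdf_reader.getNumPages()
    # Pages are zero-indexed, so 0 is the first page
    page_one = pdf_reader.getPage(0)
    # Extract the page's text as a string
    page_one_text = page_one.extractText()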
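
For NLP data you usually want the text of every page, not just the first one. The following is a minimal sketch of that, assuming the same pre-3.0 PyPDF2 method names used above. The helper names extract_all_text and read_pdf_text are my own and are not part of PyPDF2.

```python
def extract_all_text(pdf_reader):
    """Return the text of every page, joined by newlines.

    `pdf_reader` is expected to offer getNumPages() and getPage(i),
    as PyPDF2.PdfFileReader does in PyPDF2 releases before 3.0.
    """
    page_texts = []
    for page_number in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(page_number)
        page_texts.append(page.extractText())
    return "\n".join(page_texts)


def read_pdf_text(path):
    """Open `path` in binary mode and return the text of all its pages."""
    # Imported here so extract_all_text can be used without PyPDF2 installed.
    import PyPDF2

    with open(path, mode="rb") as in_file:
        # Extract while the file is still open; the reader needs the stream.
        return extract_all_text(PyPDF2.PdfFileReader(in_file))
```

With a PDF at hand, read_pdf_text("sample.pdf") should return one string for the whole document. The quality of the extracted text depends on how the PDF was produced.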

Write a PDF file - don’t try with PyPDF2

At first glance, PyPDF2 does have a writer class.

PdfFileWriter Class - Official PyPDF2 document

However, it cannot generate a new PDF file from scratch. More precisely, PdfFileWriter can copy pages from an existing PDF and save them to another file, but it cannot modify their content or write a new text document [1].
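
The copying that PdfFileWriter does support can be sketched as follows, again assuming the pre-3.0 PyPDF2 method names (getPage, addPage, write). The helper names copy_pages and copy_first_page are hypothetical, chosen only for this illustration.

```python
def copy_pages(pdf_reader, pdf_writer, page_numbers):
    """Add the selected pages of pdf_reader to pdf_writer, unchanged."""
    for page_number in page_numbers:
        pdf_writer.addPage(pdf_reader.getPage(page_number))
    return pdf_writer


def copy_first_page(in_path, out_path):
    """Save the first page of the PDF at in_path to a new file at out_path."""
    # Imported here so copy_pages can be used without PyPDF2 installed.
    import PyPDF2

    with open(in_path, mode="rb") as in_file:
        pdf_reader = PyPDF2.PdfFileReader(in_file)
        pdf_writer = copy_pages(pdf_reader, PyPDF2.PdfFileWriter(), [0])
        # Write before the input file is closed, since the writer may
        # still need to read page data from it.
        with open(out_path, mode="wb") as out_file:
            pdf_writer.write(out_file)
```

The pages are passed through as they are, so this only rearranges or extracts existing pages. It does not add any new text.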

Instead, ReportLab is good at generating new PDF files.

ReportLab - PDF Library User Guide
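
I have not explored ReportLab in depth, but a minimal sketch of writing a few lines of plain text with its pdfgen canvas might look like this. It assumes ReportLab is installed (pip install reportlab). The helper names line_positions and write_text_pdf, and the margin and spacing values, are my own choices for illustration.

```python
def line_positions(line_count, top=750, line_height=14):
    """Return the y coordinate (in points) for each line, top to bottom."""
    return [top - index * line_height for index in range(line_count)]


def write_text_pdf(path, lines, left=72):
    """Write each string in `lines` on its own line of a one-page PDF."""
    # Imported here so line_positions can be used without ReportLab installed.
    from reportlab.pdfgen import canvas

    pdf_canvas = canvas.Canvas(path)
    # The canvas origin is the bottom-left corner, so y decreases down the page.
    for line, y in zip(lines, line_positions(len(lines))):
        pdf_canvas.drawString(left, y, line)
    pdf_canvas.showPage()  # finish the current page
    pdf_canvas.save()      # write the file to disk
```

For example, write_text_pdf("hello.pdf", ["Hello", "World"]) should produce a single page with two lines of text. Long documents would need page breaks and line wrapping, which this sketch does not handle.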

I use PyPDF2 to get text data from PDF files for NLP. So… I gave up investigating the writing method :(


  1. As far as I have tried, I could not find a way to generate a PDF file that contains simple text.