PDF is one of the most widely used document formats because PDF files are compact and look the same on every platform. Plenty of PDF readers and writers exist, but you might think it’d be hard to extract text from a PDF programmatically. With Python, though, it’s easy.
Let’s say you want to develop a document classification application based on machine learning models trained on PDF documents. To do that, you’d need a way to read the PDF files and extract their text programmatically. That’s what we’re going to talk about today. We’ll show you how to read PDF documents in a Python application using PyPDF2. PyPDF2 is an awesome Python library capable of reading PDF documents and writing text to a PDF file.
It’s important to mention that PyPDF2 can only read PDF documents that store their content as text. Scanned PDF documents, where the text exists only as images, cannot be read by PyPDF2, so you’d need to run OCR (optical character recognition) on the images first.
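To install the PyPDF2 library, run the following pip command in your terminal. The examples in this tutorial use the PdfFileReader interface, which was removed in PyPDF2 3.0, so the command pins an earlier release.
$ pip install "PyPDF2<3.0"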
To demonstrate how to read a PDF file from your local drive, we’re going to use the PDF file found here. Download this file and save it as “sample.pdf” to your local file system. If you open the file, you’ll see that it contains 2 pages with some dummy data.
To read a PDF file with Python, you first have to import the PyPDF2 module. Next, you need to open the PDF file you want to read using Python’s built-in open() function. Since PDF files contain data in binary format, the mode for open() should be set to rb (read binary). Once you open the file, the file object returned by open() is passed to the constructor of the PdfFileReader class of the PyPDF2 module. The PdfFileReader object can then be used to read text from the PDF document. Here’s an example showing what we’ve covered so far:
import PyPDF2
sample_pdf = open(r'C:\Datasets\sample.pdf', mode='rb')
pdfdoc = PyPDF2.PdfFileReader(sample_pdf)
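Before handing a file to PyPDF2, it can help to confirm that it really is a PDF. Here is a minimal sketch, using only the standard library, that opens the file in rb mode inside a with block (so the file is closed automatically) and checks for the %PDF- marker that a well-formed PDF starts with. The helper name looks_like_pdf is made up for this illustration and is not part of PyPDF2.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header marker."""
    # 'rb' reads raw bytes; text mode would try to decode the binary data.
    with open(path, mode="rb") as handle:
        # Hypothetical check: only the first five bytes are inspected.
        return handle.read(5) == b"%PDF-"
```

You could call it as looks_like_pdf(r'C:\Datasets\sample.pdf') before creating the PdfFileReader object. It only inspects the first five bytes, so treat it as a quick sanity check rather than full validation.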
Once you’ve read your PDF file, you can print all kinds of information about the file. Here’s an example that prints some file details, including the creator of the file and the creation date, using the documentInfo attribute.
pdfdoc.documentInfo
Output:
{'/Creator': 'Rave (http://www.nevrona.com/rave)', '/Producer': 'Nevrona Designs', '/CreationDate': 'D:20060301072826'}
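The '/CreationDate' value is a plain string in the PDF date format, which starts with D: followed by the year, month, day, hour, minute and second. Here is a small sketch of how you might turn it into a Python datetime. The function name parse_pdf_date is invented for this example, and the sketch assumes all 14 digits are present and ignores any time zone suffix that may follow them.

```python
from datetime import datetime


def parse_pdf_date(raw):
    """Parse the leading 'D:YYYYMMDDHHmmSS' part of a PDF date string."""
    # Drop the optional 'D:' prefix, then read the 14-digit timestamp.
    digits = raw[2:] if raw.startswith("D:") else raw
    # Assumption: anything after the first 14 digits (a time zone) is ignored.
    return datetime.strptime(digits[:14], "%Y%m%d%H%M%S")


created = parse_pdf_date("D:20060301072826")
print(created.isoformat())  # 2006-03-01T07:28:26
```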
You can also print the number of pages in the PDF using the numPages attribute.
pdfdoc.numPages
Since the PDF document you read contains 2 pages, you should see 2 in the output.
Output:
2
Now for what you came for. To read text from a PDF document, you first have to specify the page number you want to extract the data from. The getPage() method returns the object for the page number passed to it as a parameter. Next, you can call the extractText() method from the page object to extract the text on that page. The following script prints text from the first page of your PDF document. Remember, the page numbering starts from 0, where 0 refers to the first page.
page_one = pdfdoc.getPage(0)
print(page_one.extractText())
The output shows all the text from the first page of the PDF document.
Output:
A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 .
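Because getPage() counts from 0 while people usually count pages from 1, off-by-one mistakes are easy to make. Here is a tiny sketch of a conversion helper with a bounds check. The name page_index is hypothetical and not something PyPDF2 provides.

```python
def page_index(page_number, total_pages):
    """Convert a 1-based page number to the 0-based index getPage() expects."""
    if not 1 <= page_number <= total_pages:
        raise ValueError(f"page {page_number} is outside 1..{total_pages}")
    return page_number - 1


print(page_index(1, 2))  # 0
print(page_index(2, 2))  # 1
```

With the sample file you could then write pdfdoc.getPage(page_index(1, pdfdoc.numPages)) to fetch the first page.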
To read text from all the pages, you can simply loop over the page numbers, fetch each page with the getPage() method, and then call the extractText() method on each page to extract the corresponding text. Here’s an example Python script that prints text from every page of your PDF.
for i in range(pdfdoc.numPages):
    current_page = pdfdoc.getPage(i)
    print("===================")
    print("Content on page:" + str(i + 1))
    print("===================")
    print(current_page.extractText())
Since the PDF document contains 2 pages, you can see text from the corresponding pages in the output below:
Output:
===================
Content on page:1
===================
A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 .
===================
Content on page:2
===================
Simple PDF File 2 . continued from page 1. Yet more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Oh, how boring typing this stuff. But not as boring as watching paint dry. And more text. And more text. And more text. And more text. Boring. More, a little more text. The end, and just as well.
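If you want the text in a variable rather than printed to the screen, for example to feed the document classifier mentioned at the start, the same loop can be wrapped in a function. This is only a sketch: the name extract_all_text and the FakeReader and FakePage stand-ins are invented here so the snippet runs without a PDF on disk. In practice you would pass in the pdfdoc object created earlier.

```python
def extract_all_text(reader):
    """Return a list with the text of each page, in page order.

    `reader` is expected to expose numPages and getPage(), as the
    PdfFileReader object in this tutorial does.
    """
    return [reader.getPage(i).extractText() for i in range(reader.numPages)]


# Hypothetical stand-ins so the sketch runs without PyPDF2 or a PDF file.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]
        self.numPages = len(self._pages)

    def getPage(self, index):
        return self._pages[index]


pages = extract_all_text(FakeReader(["A Simple PDF File", "Simple PDF File 2"]))
print(len(pages))  # 2
```

Returning a list keeps the page boundaries, so you can still join the pages into one string or process them one at a time.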
You can also use PyPDF2 to read remote PDF files, like those saved on a website. Though PyPDF2 doesn’t contain any specific method to read remote files, you can use Python’s urllib.request module to first download the remote file as bytes, wrap those bytes in an io.BytesIO object, and then pass that object to the PdfFileReader constructor. The rest of the process is similar to reading a local PDF file.
In the following script, you use the urlopen() method from the urllib.request module to download the remote file located at http://www.africau.edu/images/default/sample.pdf as bytes. Once you do that, the bytes are wrapped in an io.BytesIO object, which is passed to the constructor of the PdfFileReader class of the PyPDF2 module.
import urllib.request
import PyPDF2
import io
URL = 'http://www.africau.edu/images/default/sample.pdf'
req = urllib.request.Request(URL, headers={'User-Agent': "Magic Browser"})
remote_file = urllib.request.urlopen(req).read()
remote_file_bytes = io.BytesIO(remote_file)
pdfdoc_remote = PyPDF2.PdfFileReader(remote_file_bytes)
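The download step can be split into two small helpers, which makes it easier to reuse for other URLs. The sketch below uses only the standard library. The function names build_request and download_to_stream are made up for this illustration, and nothing is downloaded until you call download_to_stream() yourself.

```python
import io
import urllib.request


def build_request(url, user_agent="Magic Browser"):
    """Create a Request that sends a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_to_stream(url):
    """Download `url` and return its content as a seekable in-memory stream."""
    with urllib.request.urlopen(build_request(url)) as response:
        return io.BytesIO(response.read())


req = build_request("http://www.africau.edu/images/default/sample.pdf")
# urllib stores header names capitalized, hence 'User-agent' here.
print(req.get_header("User-agent"))  # Magic Browser
```

The io.BytesIO wrapper matters because the PDF reader needs to jump around in the file, and a raw HTTP response cannot be rewound. The stream returned by download_to_stream() can be passed to PdfFileReader the same way remote_file_bytes is in the script above.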
The rest of the process is familiar. You can check the total number of pages with the numPages attribute.
pdfdoc_remote.numPages
Output:
2
Similarly, you can read the text from all the pages in the remote PDF using the getPage() and extractText() methods. Here’s the full Python code for reading a remote PDF file and printing the contents to your screen:
import urllib.request
import PyPDF2
import io
URL = 'http://www.africau.edu/images/default/sample.pdf'
req = urllib.request.Request(URL, headers={'User-Agent': "Magic Browser"})
remote_file = urllib.request.urlopen(req).read()
remote_file_bytes = io.BytesIO(remote_file)
pdfdoc_remote = PyPDF2.PdfFileReader(remote_file_bytes)
# Loop over the remote document, not the local pdfdoc object
for i in range(pdfdoc_remote.numPages):
    current_page = pdfdoc_remote.getPage(i)
    print("===================")
    print("Content on page:" + str(i + 1))
    print("===================")
    print(current_page.extractText())
Output:
===================
Content on page:1
===================
A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 .
===================
Content on page:2
===================
Simple PDF File 2 . continued from page 1. Yet more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Oh, how boring typing this stuff. But not as boring as watching paint dry. And more text. And more text. And more text. And more text. Boring. More, a little more text. The end, and just as well.