If you came here looking for a solution to overcome the limitations imposed by your free PDF reader while trying to merge your pdf documents or extract individual pages from them, you came to the right place.
Python, with the help of Google Colaboratory (Colab), lets you do this in your web browser with very little code.
All you have to do is open a new Colab notebook and follow the simple instructions in the article to accomplish our tasks.
When you open Google Colab, a start dialog appears, and creating a new notebook is as simple as clicking the ‘New notebook’ button at the bottom right of that dialog.
Merge PDF
First of all, as we are working in the cloud storage provided by Google Colab, you will have to upload the individual PDF files that you wish to merge into the working folder. Notice the area to the left of your Colab notebook where the folder icon appears. Click the folder icon if the pane is collapsed. That opens the folder pane, from where you can upload the PDF files to be merged by clicking the upload icon.
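Once you have your files in place, it’s time to start scripting in Python. Let’s first install the required library, PyPDF2, using the pip command below. The code in this article uses the PdfFileMerger, PdfFileReader and PdfFileWriter classes, which were removed in PyPDF2 3.0, so we pin an earlier release:
!pip3 install "PyPDF2<3.0"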
And then we are ready to import the required modules and start coding.
import os
from PyPDF2 import PdfFileMerger, PdfFileReader

# Create the PdfFileMerger object that collects the pages
mergedObject = PdfFileMerger()
# Create a sorted list of all pdf files in the current directory,
# skipping the output file in case this cell has been run before
files = sorted(i for i in os.listdir() if i.endswith('.pdf') and i != 'merged.pdf')
# Loop through all the pdf documents in the list and append them to the merged object
for file in files:
    # PdfFileReader takes the file name; its second argument is not a file mode
    mergedObject.append(PdfFileReader(file))
# Write all the files into a file which is named as shown below
mergedObject.write("merged.pdf")
mergedObject.close()
In the above code we use the operating system module, os, for file management and PyPDF2 for merging the individual documents. If it ran correctly, you should find your merged PDF in the folder pane on the left. If you don’t find it there even though the code cell ran without error, refresh the folder pane by clicking the refresh icon next to the upload icon, and ‘merged.pdf’ should appear. Remember to download the merged document before you close the window, because Colab deletes all the files in the session folder when the session ends. If you forget, you can still upload your original documents again and rerun the code, which is usually saved automatically in the notebook even after a session ends.
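One thing worth knowing about the merge is that os.listdir() returns file names in no guaranteed order, and the files are merged in whatever order the list has. The sketch below is one possible way to build the list in a predictable order. The helper names natural_key and pdfs_in_merge_order are my own for illustration and are not part of PyPDF2, and the sketch assumes you want numbered files such as ‘part2.pdf’ to come before ‘part10.pdf’.

```python
import re


def natural_key(name):
    """Split a file name into text and number chunks so 'part2' sorts before 'part10'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


def pdfs_in_merge_order(names, output_name="merged.pdf"):
    """Keep only PDF file names, skip the merge output itself, and sort naturally."""
    pdfs = [name for name in names
            if name.lower().endswith(".pdf") and name != output_name]
    return sorted(pdfs, key=natural_key)
```

With this in place, the list of files could be built as files = pdfs_in_merge_order(os.listdir()) before the loop that appends them.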
Extracting pages from PDF
Extracting pages is also easily achieved using PyPDF2. For this purpose we additionally need the PdfFileWriter class.
As before, we upload the file from which we wish to extract pages and then start with the imports.
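from PyPDF2 import PdfFileReader, PdfFileWriter

# Open the source document for reading (PdfFileReader takes the file name)
source = PdfFileReader('sampledoc.pdf')
# Create an empty document to write into
pdf = PdfFileWriter()
# Page indexes start at 0, so index 3 is the fourth page of the document
pdf.addPage(source.getPage(3))
# The with block closes the file automatically once the page is written
with open('extracted.pdf', 'wb') as f:
    pdf.write(f)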
We are creating two objects, ‘source’ for reading and ‘pdf’ for writing. The index of the page to be extracted is passed as an argument to the getPage method called on the source object. Page indexes start at 0, so getPage(3) tells Python to get the fourth page of the document represented by the variable ‘source’. Finally, we write the extracted page to a new file opened in binary write mode; the ‘with’ statement closes the file handle for us.
If there is more than one page to be extracted, we can pass the page indexes as a list and iterate over it with a ‘for’ loop to extract all the pages and combine them into a separate PDF.
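The loop described above could look something like the sketch below. It is only one way to do it, and it makes a few assumptions: the function names to_page_indexes and extract_pages are my own for illustration, the list holds page numbers as a reader would count them (starting at 1), and the legacy PyPDF2 API (a version below 3.0) is installed.

```python
def to_page_indexes(page_numbers, page_count):
    """Convert 1-based page numbers into the 0-based indexes that getPage expects."""
    indexes = []
    for number in page_numbers:
        if not 1 <= number <= page_count:
            raise ValueError(f"page {number} is outside 1..{page_count}")
        indexes.append(number - 1)
    return indexes


def extract_pages(source_path, output_path, page_numbers):
    """Copy the listed pages of source_path, in the given order, into output_path."""
    # Imported inside the function so the helper above works without PyPDF2.
    # Assumption: legacy API names, which exist only in PyPDF2 below 3.0.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    source = PdfFileReader(source_path)
    pdf = PdfFileWriter()
    for index in to_page_indexes(page_numbers, source.getNumPages()):
        pdf.addPage(source.getPage(index))
    with open(output_path, "wb") as f:
        pdf.write(f)
```

For example, extract_pages('sampledoc.pdf', 'extracted.pdf', [1, 3, 5]) would write the first, third and fifth pages into ‘extracted.pdf’. Checking the page numbers up front means a typo in the list raises a clear error instead of failing halfway through the loop.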
Like before remember to download the extracted pages before closing the session. And pay attention to the name of the file fed into source. I have used ‘sampledoc.pdf’, which you have to update to the actual name of your document.
I hope you enjoyed this little tutorial on using open source libraries and infrastructure to build your own application for merging PDFs and extracting pages from them. Let me know your views in the comments section.