How to extract data from PDF into Excel or CSV using Python

This Python script allows to extract tables from PDF files and save them in Excel or CSV format.

Firstly, we have to import libraries we are going to use, which are Pandas (here we will need it to convert the tables we are going to extract into dataframes and save as Excel files).

import tabula
import pandas as pd

After this we specify the location of the PDF we want to extract data from:

pdf_in = "D:/Folder/File.pdf"

And we record all of the tables into PDF variable. The tables are going to be extracted as nested lists.

PDF = tabula.read_pdf(pdf_in, pages='all', multiple_tables=True)

where pages='all' and multiple_tables=True are optional parameters.

After we got the info from the .pdf file into PDF variable we can save it as Excel or CSV.

In order to do that first we have to specify the full path and filenames of the files we want to get:

pdf_out_xlsx = "D:\Temp\From_PDF.xlsx"
pdf_out_csv = "D:\Temp\From_PDF.csv"

To save it as .xlsx we convert it into pandas dataframe and use pandas.DataFrame.to_excel:

PDF = pd.DataFrame(PDF)
PDF.to_excel(pdf_out_xlsx,index=False)

To save it as CSV we use Tabula's convert_into.

tabula.convert_into (input_PDF, pdf_out_csv, pages='all',multiple_tables=True)
print("Done")

Full script:

# Script to export tables from PDF files
# Requirements:
# Pandas (cmd --> pip install pandas)
# Java   (https://www.java.com/en/download/)
# Tabula (cmd --> pip install tabula-py)
# openpyxl (cmd --> pip install openpyxl) to export to Excel from pandas dataframe

import tabula
import pandas as pd

# Path to input PDF file
pdf_in = "D:/Folder/File.pdf" #Path to PDF

# pages and multiple_tables are optional attributes
# outputs df as list
PDF = tabula.read_pdf(pdf_in, pages='all', multiple_tables=True)

#View result
print ('\nTables from PDF file\n'+str(PDF))

#CSV and Excel save paths
pdf_out_xlsx = "D:\Temp\From_PDF.xlsx"
pdf_out_csv = "D:\Temp\From_PDF.csv"

# to Excel
PDF = pd.DataFrame(PDF)
PDF.to_excel(pdf_out_xlsx,index=False) 

# to CSV
tabula.convert_into (input_PDF, pdf_out_csv, pages='all',multiple_tables=True)
print("Done")