Extracting Data from Tricky PDFs with Google Gemini in 10 lines of Python


In this guide, I’ll show you how to extract structured data from PDFs using vision-language models (VLMs) like Gemini Flash or GPT-4o.

Gemini, Google’s latest series of vision-language models, has shown state-of-the-art performance in text and image understanding. Its improved multimodal capability and long context window make it particularly useful for processing visually complex PDF content that traditional extraction models struggle with, such as figures, charts, tables, and diagrams.

With these models, you can easily build your own tool for extracting data from visual files and web pages. Here’s how:

Gemini’s long context window and multimodal capability make it particularly useful for processing visually complex PDF data where traditional extraction models struggle.

Setting Up Your Environment

Before we dive into extraction, let’s set up our development environment. This guide assumes you have Python installed on your system. If not, download and install it from https://www.python.org/downloads/

⚠️ Note: if you don’t want to use Python, you can upload your files to the cloud platform at thepi.pe and download the results as a CSV without writing any code.

Install Required Libraries

Open your terminal or command prompt and run the following commands:

pip install git+https://github.com/emcf/thepipe
pip install pandas

For those new to Python, pip is the package installer for Python, and these commands will download and install the necessary libraries.

Set Up Your API Key

To use thepipe, you need an API key.

Disclaimer: While thepi.pe is a free and open-source tool, the API has a cost of roughly $0.00002 per token. If you want to avoid this cost, check out the local setup instructions on GitHub. Note that you will still have to pay your LLM provider of choice.
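
To get a feel for what that rate means in practice, the arithmetic can be sketched as below. The per-token price is the rough figure quoted above; the tokens-per-page value is purely an illustrative assumption, not a number published by thepi.pe.

```python
# Rough cost estimate. PRICE_PER_TOKEN_USD is the approximate rate quoted above;
# the tokens-per-page figure passed in below is an assumed, illustrative value.
PRICE_PER_TOKEN_USD = 0.00002


def estimate_cost_usd(num_pages: int, tokens_per_page: int) -> float:
    """Return the approximate API cost in USD for one document."""
    total_tokens = num_pages * tokens_per_page
    return total_tokens * PRICE_PER_TOKEN_USD


if __name__ == "__main__":
    # Hypothetical 50-page PDF at an assumed 1,000 tokens per page
    print(f"${estimate_cost_usd(50, 1000):.2f}")
```

Under those assumptions, a 50-page document works out to about one dollar. Your real token counts will depend on the document and the model.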

Here’s how to get your API key and set it up:

Visit https://thepi.pe/platform/

Create an account or log in
Find your API key in the settings page

Now, you need to set this as an environment variable. The process varies depending on your operating system:

Copy the API key from the settings menu on thepi.pe Platform

For Windows:

Search for “Environment Variables” in the Start menu
Click “Edit the system environment variables”
Click the “Environment Variables” button
Under “User variables”, click “New”
Set the variable name as THEPIPE_API_KEY and the value as your API key
Click “OK” to save

For macOS and Linux:
Open your terminal and add this line to your shell configuration file (e.g., ~/.bashrc or ~/.zshrc):

export THEPIPE_API_KEY=your_api_key_here

Then, reload your configuration:

source ~/.bashrc # or ~/.zshrc
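
Before moving on, it can help to confirm that Python actually sees the variable (on Windows, you may need to open a new terminal first). The sketch below uses only the standard library. The helper name get_api_key is my own and is not part of thepipe.

```python
import os


def get_api_key(env=os.environ, name="THEPIPE_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Set it as described above, then open a new terminal."
        )
    return key


if __name__ == "__main__":
    try:
        get_api_key()
        print("THEPIPE_API_KEY is set")
    except RuntimeError as err:
        print(err)
```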

Defining Your Extraction Schema

The key to successful extraction is defining a clear schema for the data you want to pull out. Let’s say we’re extracting data from a Bill of Quantity document:

An example page from the Bill of Quantity document. The data on each page is independent of the other pages, so we do our extraction “per page”. Each page contains multiple pieces of data to extract, so we set multiple_extractions to True.

Looking at the column names, we might want to extract a schema like this:

# Field names mapped to the type of value to extract for each
schema = {
    "item": "string",
    "unit": "string",
    "quantity": "int",
}

You can modify the schema to your liking on thepi.pe Platform. Clicking “View Schema” will give you a schema you can copy and paste for use with the Python API.
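
Since the schema is just a dictionary mapping field names to type names, it can also double as a checklist for the rows you get back. The following is a minimal sketch of that idea. The validate_row helper and the mapping from type names to Python types are assumptions for illustration, not part of thepipe.

```python
# Hypothetical helper: check one extracted row against the schema above.
schema = {"item": "string", "unit": "string", "quantity": "int"}

# Assumed mapping from schema type names to Python types
TYPE_MAP = {"string": str, "int": int, "float": float, "bool": bool}


def validate_row(row: dict, schema: dict) -> list:
    """Return a list of problems found in row (empty if it matches the schema)."""
    problems = []
    for field, type_name in schema.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], TYPE_MAP[type_name]):
            problems.append(f"{field} should be {type_name}")
    return problems


if __name__ == "__main__":
    # The quantity here is a string, so it is flagged
    print(validate_row({"item": "Cement", "unit": "bag", "quantity": "20"}, schema))
```

A check like this is handy for spotting pages where the model returned a number as text or skipped a column entirely.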

Extracting Data from PDFs

Now, let’s use extract_from_file to pull data from a PDF:

from thepipe.extract import extract_from_file

# Send each page to the model separately and allow several rows per page
results = extract_from_file(
    file_path="bill_of_quantity.pdf",
    schema=schema,
    ai_model="google/gemini-flash-1.5b",
    chunking_method="chunk_by_page",
    multiple_extractions=True,
)

Here, we set chunking_method="chunk_by_page" because we want to send each page to the AI model individually (the PDF is too large to feed in all at once). We also set multiple_extractions=True because each page of the PDF contains multiple rows of data. Here’s what the results look like:

The results of the extraction for the Bill of Quantity PDF as viewed on thepi.pe Platform
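
Conceptually, per-page chunking with multiple extractions boils down to a loop like the one below. This is only an illustration of the idea, not thepipe’s implementation: fake_extract is a stand-in for the model call, and the page contents are made up.

```python
from typing import Callable, Dict, List


def extract_per_page(
    pages: List[str],
    extract_rows: Callable[[str], List[Dict]],
) -> List[Dict]:
    """Run the extractor on each page separately and collect all rows."""
    all_rows = []
    for page_number, page in enumerate(pages, start=1):
        # Each page is processed independently and may yield several rows
        for row in extract_rows(page):
            all_rows.append({"page": page_number, **row})
    return all_rows


def fake_extract(page: str) -> List[Dict]:
    """Stand-in for the model call: one row per 'item,unit,quantity' line."""
    rows = []
    for line in page.splitlines():
        item, unit, quantity = line.split(",")
        rows.append({"item": item, "unit": unit, "quantity": int(quantity)})
    return rows


if __name__ == "__main__":
    # Made-up page contents for illustration only
    pages = ["Cement,bag,20\nSand,m3,5", "Bricks,pcs,1000"]
    for row in extract_per_page(pages, fake_extract):
        print(row)
```

Because no page depends on another, a bad page only affects its own rows, and each request stays small.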

Processing the Results

The extraction results are returned as a list of dictionaries. We can process these results to create a pandas DataFrame:

import pandas as pd
df = pd.DataFrame(results)
# Display the first few rows of the DataFrame
print(df.head())

This creates a DataFrame with all the extracted information, including textual content and descriptions of visual elements like figures and tables.
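
Model output is not always perfectly typed, so a light cleanup pass before building the DataFrame can save trouble later. The sketch below assumes each result is a flat dictionary with the item, unit, and quantity fields from the schema. The actual structure returned by thepipe may differ, and clean_rows is a hypothetical helper.

```python
def clean_rows(rows):
    """Coerce quantity to int and drop rows that cannot be repaired."""
    cleaned = []
    for row in rows:
        item = str(row.get("item") or "").strip()
        if not item:
            continue  # skip rows without an item name
        raw_quantity = str(row.get("quantity", "")).replace(",", "").strip()
        try:
            quantity = int(raw_quantity)
        except ValueError:
            continue  # skip rows whose quantity is not a whole number
        unit = str(row.get("unit") or "").strip()
        cleaned.append({"item": item, "unit": unit, "quantity": quantity})
    return cleaned


if __name__ == "__main__":
    # Made-up rows showing typical problems
    sample = [
        {"item": " Cement ", "unit": "bag", "quantity": "1,000"},
        {"item": "", "unit": "m3", "quantity": 5},
        {"item": "Sand", "unit": "m3", "quantity": "n/a"},
    ]
    print(clean_rows(sample))
```

The cleaned list can then be passed to pd.DataFrame in the same way as the raw results.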

Exporting to Different Formats

Now that we have our data in a DataFrame, we can easily export it to various formats. Here are some options:

Exporting to Excel

# Writing .xlsx files requires the openpyxl package (pip install openpyxl)
df.to_excel("extracted_research_data.xlsx", index=False, sheet_name="Research Data")

This creates an Excel file named “extracted_research_data.xlsx” with a sheet named “Research Data”. The index=False parameter prevents the DataFrame index from being included as a separate column.

Exporting to CSV

If you prefer a simpler format, you can export to CSV:

df.to_csv("extracted_research_data.csv", index=False)

This creates a CSV file that can be opened in Excel or any text editor.

Ending Notes

The key to successful extraction lies in defining a clear schema and utilizing the AI model’s multimodal capabilities. As you become more comfortable with these techniques, you can explore more advanced features like custom chunking methods, custom extraction prompts, and integrating the extraction process into larger data pipelines.
