Using Python and AI to Extract Data From PDF Documents
During my nearly two years working in a data role, I spent close to 100% of my time using SQL. I analyzed and transformed data in Excel a handful of times and recently began building Power BI solutions.
But I have a soft spot in my heart for Python. Unfortunately, my work projects never required Python skills.
Until recently.
I discovered a way to use Python and AI to extract data from PDF documents and backfill the data into our database.
This post is about that project. I’ll walk you through how I initiated it and give a high-level explanation of what I did without revealing any sensitive information.
How I Initiated the Project
My team worked with our software developers to build an application feature that allowed our stakeholder team to manage product data more efficiently. The team previously managed the product data using Excel files stored in a shared cloud folder and PDF files accessible through our in-house software system.
The stakeholder team maintained the data this way easily enough when the company was smaller, but as our number of SKUs grew, the process became unwieldy.
So, my team and our developers stepped in to build a feature that automated the process and provided a friendly, easy-to-use UI.
I served as the data lead, architecting the data model, building SQL views for the application, and developing stored procedures to run calculations and process data. This gave me a strong understanding of how the stakeholder team operated their process and the pain points they struggled with.
For the feature to work optimally, the team required most of their data to be backfilled into the system. The issue was that the most critical data was stored in PDF files—nearly 1,000 of them. We had two options. One was for a temp or intern to manually go through each PDF and input data into the system. I wouldn’t want to be the person stuck with that task. The other was for my team to find a way to automate the data backfill.
I figured AI could help.
I fed an example document into ChatGPT and asked it to extract the data I wanted. This worked well and seemed easy.
But if there’s one thing I’ve learned during my time in the data industry, it’s that the seemingly easy tasks tend to be more complex once you dive deeper into the problem-solving process. And that’s what happened. I set up an OpenAI API account and wrote a Python script to feed the same document to the API. This is where I discovered the OpenAI API only accepts text inputs, not documents.
So, my idea was a bust.
My team and I concluded that we could not backfill the data. We had no choice but to make it our users’ responsibility to manually input the data into our system.
Two months after shipping the feature, the project manager asked me to query the database and check how much data the stakeholder team had entered into the system. The database showed a whopping zero records.
Two months of hard work and creativity culminated in a neglected application.
I don’t blame the stakeholders. They were already swamped with their usual day-to-day work, and no one wanted to take the time (or had the time) to manually enter data.
Around this time, I spent most of my free time improving my Python skills. I felt more comfortable with the language, so I decided to return to my idea of leveraging the OpenAI API for data extraction.
And this time, I built a proof of concept that worked.
I presented my findings to the software and stakeholder teams, and they loved the idea. I moved forward with my development, and after much stress, built a cool tool that solved the data backfill issue.
The Process
I’ll keep things vague here because I don’t want to reveal too many technical details.
First, I wrote a SQL query that pulled document links from our database’s document table. We didn’t actually store documents there, since it’s bad practice to store files in a relational database. Instead, the table contained metadata about the documents: attributes like the document name, its S3 bucket link, the product it related to, and whether it was active or inactive.
The query returned over 900 documents. That’s a lot.
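To give a rough sense of that step, here’s a minimal sketch using pyodbc (the library I mention later for querying the DB). The connection string, table, and column names are placeholders, not our real schema:

```python
import pyodbc

# Placeholder connection details and schema, for illustration only.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-server;DATABASE=my_db;Trusted_Connection=yes;"
)

query = """
    SELECT d.DocumentName, d.S3Key, d.ProductId
    FROM dbo.Document AS d
    WHERE d.IsActive = 1;
"""

cursor = conn.cursor()
document_rows = cursor.execute(query).fetchall()  # one row of metadata per document
cursor.close()

print(f"Found {len(document_rows)} document records")
```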
Next, I used the boto3 library (the AWS SDK for Python) to access the proper S3 bucket and download the 900+ documents to my local machine. This was fun. Even though I could always access the data we stored in S3, I never needed to access it in bulk. So, this project taught me the Python basics of pulling data out of somewhere other than a relational database.
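In case it helps, the bulk download looked roughly like this; the bucket name, keys, and local folder below are made up, and in practice the keys came from the metadata query above:

```python
import boto3
from pathlib import Path

# Placeholder bucket and keys for illustration.
BUCKET = "my-product-documents"
s3_keys = ["docs/example-1.pdf", "docs/example-2.pdf"]

local_dir = Path("downloads")
local_dir.mkdir(exist_ok=True)

s3 = boto3.client("s3")

for key in s3_keys:
    target = local_dir / Path(key).name
    s3.download_file(BUCKET, key, str(target))  # pull each PDF down to the local folder
```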
Once I downloaded the documents, my next step was to extract the text from them.
Since the OpenAI API only accepts text inputs, I realized I could first extract text from the documents and then feed that text into the API. For the text extraction, I used Python’s PyPDF2 library.
I hit a slight roadblock here.
I learned that not all PDF documents are stored the same way. There are two types of PDFs: text-based and image-based. Pulling text from the former was easy, but the latter gave me trouble. The PyPDF2 library only works with text-based PDFs. Image-based PDFs require OCR to identify the text, which I had never worked with before. I could have learned to use OCR, but since my normal day-to-day work kept me busy, I didn’t have the time.
Fortunately, over 85% of the documents were text-based, so it was enough for me to give the stakeholder team something to work with.
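For what it’s worth, here’s a simplified sketch of that extraction step. It uses PyPDF2’s PdfReader and treats any PDF that yields no text as likely image-based; the paths and that detection heuristic are my own illustration rather than the exact code:

```python
from pathlib import Path
from PyPDF2 import PdfReader

def extract_pdf_text(path: Path) -> str:
    """Concatenate the text of every page in a PDF."""
    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

texts = {}
skipped = []  # likely image-based PDFs that would need OCR

for pdf_path in Path("downloads").glob("*.pdf"):
    text = extract_pdf_text(pdf_path)
    if text.strip():
        texts[pdf_path.name] = text
    else:
        skipped.append(pdf_path.name)

print(f"Extracted text from {len(texts)} PDFs; {len(skipped)} look image-based")
```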
After extracting all the text I could, I wrote some Python functions to feed the text data into the OpenAI API. This is where the magic happened.
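Those functions looked something like the sketch below, using the openai Python client’s chat completions endpoint. The model name, system message, and wording are placeholders, not what I actually ran:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_fields(pdf_text: str) -> str:
    """Send one document's extracted text to the API and return the model's response."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model your account supports
        messages=[
            {"role": "system", "content": "You extract product data from document text."},
            {"role": "user", "content": f"Extract the product fields from this text:\n\n{pdf_text}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```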
I tested a few prompts, and the first few produced dirty data. To make the data useful, I needed to clean and manipulate it more than I would have liked. The process gave me a headache, and I nearly scrapped the project.
Then, I had an epiphany…
Rather than continuing to write functions that cleaned and transformed the response data, I realized I could improve the prompt I gave the AI model. I clarified the instructions and added more context, and something beautiful happened… The API returned data in a clean, usable form.
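I can’t share the real prompt, but the improved version looked something like this in spirit: an explicit output format, named fields, and rules about missing values. The field names below are invented for illustration:

```python
# Invented field names and rules, just to show the level of specificity that helped.
# Meant to be filled in with .format(pdf_text=...) before sending to the API.
EXTRACTION_PROMPT = """
You will receive the raw text of a product specification PDF.
Return ONLY a JSON object with exactly these keys:
  "product_name": string
  "sku": string
  "weight_kg": number or null
  "dimensions_cm": string or null
Rules:
- Do not invent values; use null when a field is missing from the text.
- Return only the JSON object, with no explanation or markdown.

Document text:
{pdf_text}
"""
```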
The data still wasn’t perfect. There were some bad conversions and inconsistencies between the PDF data and what our database required. However, I never expected OpenAI to return perfect data. I knew the stakeholder team would have to review what I came up with, and they were okay with that. The goal from the start was to backfill the data and create an application feature for the users to review and approve it.
What I came up with was enough.
My final step was to insert the data into our database.
I could have used Python’s pyodbc library, which I had previously used to query the DB, to insert the data. However, I hesitated here. While I had used Python to run database inserts and updates for side projects, I had never done so in a production DB.
So, instead, I used Python to generate SQL insert statements and load them into a .sql file.
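Here’s a simplified sketch of that step, with a hypothetical table, columns, and records; the real schema and escaping rules were more involved:

```python
# Hypothetical table, columns, and records for illustration only.
def sql_literal(value) -> str:
    """Render a Python value as a SQL literal, escaping single quotes in strings."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"

def build_insert(record: dict) -> str:
    """Turn one extracted record into a single INSERT statement."""
    return (
        "INSERT INTO dbo.ProductData (ProductId, Sku, WeightKg) VALUES ("
        f"{sql_literal(record['product_id'])}, "
        f"{sql_literal(record['sku'])}, "
        f"{sql_literal(record['weight_kg'])});"
    )

records = [
    {"product_id": 101, "sku": "ABC-123", "weight_kg": 2.5},
    {"product_id": 102, "sku": "DEF-456", "weight_kg": None},
]

with open("backfill_inserts.sql", "w", encoding="utf-8") as f:
    f.write("\n".join(build_insert(r) for r in records))
```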
I tested the statements in our staging databases, and they executed properly. It was a massive relief. After too much research, experimenting, and stress, the tool worked.
That made me happy.
A Quick Reflection
We only recently ran the script in production, so I don’t yet know whether the project should be considered a success. There’s still a chance the stakeholder team won’t get through the review and approval process anytime soon, but they’ll get to it eventually.
When that happens, I’ll discover the project’s true success.
But the project’s completion immediately benefited me personally. This happened in two ways…
The first is that this project put my programming skills in front of people in my organization. Since I never needed Python at work, no one knew I could code with it. But now they know, and that opens me up to more Python (or AI) related projects in the future.
In fact, the department head for the stakeholder team already asked me to use my Python skills for a similar project. He found another business process that required a lot of manual data extraction from PDF files, so he asked me to use my Python tool for that.
Unfortunately, this one wasn’t a success. The overwhelming majority of documents were image-based, so I couldn’t deliver enough data. That’s fine. Not all projects succeed. At least my name is still out there, and my programming skills are visible.
I can’t ask for more.
The second way I benefited was the mentality I developed.
A few months ago, I would have assumed everything I did here was impossible. I doubted I had the skills, yet here I am. I did it anyway.
I realized that while ideas may seem impossible, few truly are. That doesn’t mean every idea is easy or worth pursuing. “Impossible” ideas may provide no value to anyone, and those that do will surely be difficult to execute. If it were easy, the idea wouldn’t seem impossible in the first place.
This project made me more confident in my technical skills and ability to do difficult things. I now know I have what it takes to “do the impossible,” and that will serve me well in the future.
It’s a dangerous, yet lucrative, mentality to adopt.