We are running a proof of concept (POC) for PDF data extraction with several tools, including Claude, Copilot, Power Automate, and Textract. Extraction from machine-printed PDFs is fairly straightforward and accurate, but we are running into a challenge with handwritten PDFs, as the image clarity is often very poor. Running OCR first and then parsing the output has not worked well. Does anyone have a better suggestion or solution for this?

Manager, Data Science (21 days ago)

Take a screenshot and feed it to the LLMs directly; they do a better job that way. Also try Google's Document AI; I had a good experience with it.

IT Manager (22 days ago)

We had a good experience using the Docling library (https://docling-project.github.io/docling/). It was able to extract accurate data even from low-quality scans.

Program Director, Intelligent Automation + Entrepreneur in Healthcare and Biotech (22 days ago)

We have found quite a bit of success using Microsoft AI Document Intelligence. It's worth checking out.

1 Reply
No title (22 days ago)

*With a human in the loop for validation.

Team Leader (22 days ago)

Handwritten PDFs can be tricky, especially when the scans are low quality. 

Clean up the images first (contrast, noise removal, deskewing)

Use OCR that’s designed for handwriting (like Azure Read or TrOCR)

Or even let a vision-capable LLM look at the image directly to extract info

For tricky parts, a quick human check can save a lot of headaches

This approach usually works much better than just running standard OCR.
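As a rough illustration of the clean-up step above, here is a minimal sketch in plain Python that applies a 3x3 median filter (noise removal) followed by a percentile-based contrast stretch. It assumes the page is already a grayscale image held as a list of rows of 0-255 integers; the function names and percentile defaults are made up for this example. Deskewing is not covered here, and in practice an image library such as OpenCV or Pillow would do all of this faster and better.

```python
from statistics import median


def median_filter(img):
    """3x3 median filter; border pixels use whichever neighbours exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


def stretch_contrast(img, low_pct=2, high_pct=98):
    """Linearly rescale so the low/high percentiles map to 0 and 255."""
    flat = sorted(p for row in img for p in row)
    lo = flat[(len(flat) - 1) * low_pct // 100]
    hi = flat[(len(flat) - 1) * high_pct // 100]
    if hi == lo:
        # Flat image: nothing to stretch, return an unchanged copy.
        return [row[:] for row in img]
    scale = 255 / (hi - lo)
    return [
        [min(255, max(0, round((p - lo) * scale))) for p in row]
        for row in img
    ]


def preprocess(img):
    """Denoise first, then stretch, so isolated specks do not set the range."""
    return stretch_contrast(median_filter(img))
```

The cleaned image would then go to a handwriting-capable OCR engine or a vision-capable LLM, as suggested above.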

IT Coordinator in Education (23 days ago)

It was a while ago, but given the nature of the content, errors in the OCR text recognition were detrimental to the overall project. We ended up using OCR text recognition for most of the material and hand-keying the portions of documents deemed high-value, high-impact content. The split was roughly 70/30: about 70% came from accurate OCR output and about 30% required manual intervention in the qualifying areas.
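A split like this can be sketched as a simple routing rule: anything flagged as high value, or anything the OCR engine is unsure about, goes to a person. The sketch below is only an illustration. It assumes the OCR step returns a per-field confidence between 0 and 1, and the `Field` class, the `route` function, and the 0.85 threshold are all hypothetical names and values, not something from the project described above.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    text: str
    confidence: float         # assumed: 0.0-1.0 score reported by the OCR engine
    high_value: bool = False  # assumed: set by a business rule per field


def route(fields, threshold=0.85):
    """Split fields into (auto_accepted, needs_manual_keying)."""
    auto, manual = [], []
    for f in fields:
        # High-value fields are always reviewed, regardless of confidence.
        if f.high_value or f.confidence < threshold:
            manual.append(f)
        else:
            auto.append(f)
    return auto, manual


fields = [
    Field("name", "Jane Doe", 0.95),
    Field("dose", "5 mg", 0.97, high_value=True),
    Field("notes", "illegible", 0.40),
]
auto, manual = route(fields)
print([f.name for f in auto], [f.name for f in manual])
```

The threshold would need tuning against a hand-checked sample, since confidence scores from different engines are not directly comparable.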
