PDF Types and PDF data extraction

What is PDF?

PDF stands for Portable Document Format. Adobe introduced it in 1992, and it has been an ISO standard (ISO 32000) since 2008.

A PDF file is a container that can hold text and images. It can also hold other objects such as curves, annotations and forms, but we skip these here for simplicity.

Two key elements:

Text is represented in the PDF file by letters. If the text is “invoice”, the PDF contains this very text.

An image (also called a raster or bitmap) is a mosaic of picture elements called pixels. Each pixel has a single colour, chosen from two colours (black and white), shades of grey, or the full colour range, depending on the image type. Pixels are usually square. When the word “invoice” appears in an image, it is made up of small coloured pixels. The file stores only that mosaic of pixels, not the text “invoice” as such. This is basically how computers store all raster images.
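To make the difference concrete, here is a minimal Python sketch. The 5×5 bitmap of the letter “I” is invented for illustration; real images are much larger and are usually stored in compressed form.

```python
# Text: the characters themselves are stored, so searching works directly.
text = "invoice"
print("voice" in text)  # True

# Image: a hypothetical 5x5 black-and-white bitmap that shows the letter "I".
# 1 = black pixel, 0 = white pixel. Nothing in this grid says "I";
# it is only a mosaic of pixel values.
bitmap_I = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
]


def render(bitmap):
    """Draw the bitmap with '#' for black pixels and '.' for white pixels."""
    return "\n".join(
        "".join("#" if pixel else "." for pixel in row) for row in bitmap
    )


print(render(bitmap_I))
```

A human recognizes the letter in the rendered grid at once, but a program sees only numbers. Turning such a grid back into characters is the job of OCR software.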

Types of PDF documents

Some of the most common types of PDF documents are:

App-generated PDF documents
Scanned PDF documents
Searchable PDF documents

1. App-generated PDF documents

As the name suggests, these files are generated by applications such as MS Word, MS Excel, accounting software and ERP systems. App-generated PDF documents generally contain text in “plain” form, often with a logo or similar graphic added. The picture below shows the text “CONTOSO”, which is stored as text in the PDF, with a logo on the left side stored as a picture.

2. Scanned PDF documents

Scanning is the process of converting paper documents into digital images.

The scanner takes a picture of the page, creating a mosaic of pixels. The mosaic above was created on a computer for clarity; authentic scanned images look a bit different. The images below show a few examples.

3. Searchable PDF documents

Searchable means that the document contains text that you can search. A searchable PDF is created from a scanned PDF file using OCR (Optical Character Recognition) software, which tries to guess which letters are present in the pixel mosaic. Note that some copy machines have OCR software built in. The OCR process is inherently prone to errors. In the illustration below you can see the image layer with the text layer on top of it. The text layer is usually not visible, but you can search it, provided that the OCR has recognized the text properly. Note that the image was created on a computer for clarity, and the OCR error rate was exaggerated for illustrative purposes.
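The text layer can be pictured as a list of recognized words, each tied to a position on the page image. The Python sketch below illustrates the idea. The `OcrWord` structure, the coordinates and the misread word are all made up for this example and do not reflect how any particular OCR product stores its results.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str    # what the OCR software believes the pixels say
    x: int       # left edge of the word on the page image, in pixels
    y: int       # top edge of the word
    width: int
    height: int


# Hypothetical text layer of a scanned invoice. In the second occurrence of
# "invoice" the OCR software misread the letter "o" as the digit zero.
text_layer = [
    OcrWord("Invoice", 100, 80, 210, 40),
    OcrWord("number", 320, 80, 190, 40),
    OcrWord("inv0ice", 100, 400, 150, 30),
    OcrWord("date", 260, 400, 90, 30),
]


def search(layer, query):
    """Return the words whose recognized text equals the query (case-insensitive)."""
    wanted = query.lower()
    return [word for word in layer if word.text.lower() == wanted]


hits = search(text_layer, "invoice")
print(len(hits))  # 1
```

Both occurrences are clearly readable in the image, yet the search finds only the one that the OCR recognized correctly.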

How to recognize different PDF types

Open the PDF in Adobe Acrobat and press Ctrl+A to select all text. If no text is selected, you’ve got yourself a scanned PDF. If text is selected, zoom in on the letters. If you see a mosaic of pixels, it’s a searchable PDF. If the letters stay smooth regardless of how much you zoom in, you’re dealing with an app-generated PDF.
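The same decision can be sketched in code. The Python example below is only an illustration of the logic, not the utility mentioned in this article. It assumes that a PDF library has already measured, for each page, how many characters can be extracted and how much of the page is covered by raster images; the `PageSummary` structure and the 0.9 threshold are hypothetical. Instead of zooming in on the letters, it treats a page-filling image under the text as the sign of a searchable PDF.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PageSummary:
    text_chars: int        # extractable characters found on the page
    image_coverage: float  # share of the page covered by raster images, 0.0 to 1.0


def classify_pdf(pages, full_page=0.9):
    """Classify a document from its page summaries (heuristic, see assumptions above)."""
    pages_with_text = [page for page in pages if page.text_chars > 0]
    if not pages_with_text:
        # Nothing to select: the same outcome as an empty Ctrl+A selection.
        return "scanned"
    if all(page.image_coverage >= full_page for page in pages_with_text):
        # Text sits on top of a page-filling image: most likely an OCR text layer.
        return "searchable"
    return "app-generated"


# Made-up measurements for three files.
batch = {
    "invoice_a.pdf": [PageSummary(text_chars=1800, image_coverage=0.05)],
    "invoice_b.pdf": [PageSummary(text_chars=0, image_coverage=1.0)],
    "invoice_c.pdf": [PageSummary(text_chars=1650, image_coverage=1.0)],
}

types = {name: classify_pdf(pages) for name, pages in batch.items()}
stats = Counter(types.values())
print(types)
print(stats)
```

The heuristic can be fooled, for example by an app-generated page with a full-page background picture, so borderline files still deserve a manual look.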

Recognizing PDF types in this way can be tedious, especially if you’re dealing with a large number of files. That’s why XCENTER DIGITAL provides customers and partners with a utility that recognizes PDF types automatically. It can even produce statistics on the PDF types in your files. Contact your representative to get the utility.

How to extract data from PDF with 100% accuracy?

With xtractor™ software for PDF data extraction

xtractor™ is a product developed by XCENTER DIGITAL. We created a simple, accurate, and fast process to transform and digitize your documents. When extracting data from app-generated PDF files with xtractor™, you achieve 100% accuracy of the extracted information. This significantly cuts the cost of document digitization, as it eliminates the need for manual work.

Impact of PDF Type on Data Extraction with xtractor™

xtractor™ works best with app-generated PDF files, because these contain the text as such. We apply precise (deterministic) rules to the text in order to extract header fields as well as tables. In principle, it’s also possible to process searchable and even scanned PDF documents: xtractor™ can use OCR software to estimate the letters and then work on top of the OCR results. But since OCR can never be guaranteed to recognize every letter correctly, the result loses the 100% accuracy that xtractor™ is known for. Therefore, for best results, we strongly recommend using xtractor™ with app-generated PDF files.
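The Python sketch below shows why deterministic rules depend on exact text. It does not show how xtractor™ works internally. The field names, the patterns and the sample invoice text are all hypothetical.

```python
import re

# Hypothetical rules: each header field is found by one fixed pattern.
RULES = {
    "invoice_number": re.compile(r"Invoice No\.\s*(\d+)\b"),
    "total": re.compile(r"Total:\s*([\d.,]+)\s*EUR"),
}


def extract(text):
    """Apply every rule to the text; a field is None when its rule does not match."""
    result = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


# Text exactly as the application wrote it into the PDF.
app_generated = "CONTOSO\nInvoice No. 20240117\nTotal: 1,250.00 EUR"

# The same page after scanning and OCR, with each zero in the invoice
# number misread as the letter "O".
ocr_output = "CONTOSO\nInvoice No. 2O24O117\nTotal: 1,250.00 EUR"

print(extract(app_generated))
print(extract(ocr_output))
```

With the exact text, both fields are extracted. With the OCR output, the invoice number rule no longer matches and the field comes back empty. A looser rule might instead return a wrong value without any warning, which is the bigger risk.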
