Atalasoft Knowledge Base

HOWTO: How to OCR a PDF

: 6 Years Ago
: Administrator
: DotImage

NOTE From Support:

This article has been flagged for review. It contains possibly outdated information.

You may wish to review the Searchable PDF demo, as it contains correct, tested code for this use case.

Original Article Content:

The OCR process is most efficient when you use a class derived from ImageSource that lazily loads each image one at a time, so that all of the pages of the document are not kept in memory. For PDF documents, we have created PdfImageSource, which you will find in the PDF Reader add-on, in the Atalasoft.Imaging.ImageSources namespace. It has the following features:

Lazy loads each page on request
Extracts the exact image from the page if the page is a single image (like from a scanned document)
Rasterizes pages that are not a single image

An instance of this class can be passed to Translate() and Recognize() on any OcrEngine. This assumes that the OcrEngine has been initialized and that it supports searchable PDF translation.
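To make the lazy-loading idea concrete, here is a minimal sketch in plain C#. It is not the Atalasoft implementation and uses no Atalasoft types: LazyPageSource, its loadPage callback, and LazyPageDemo are hypothetical names invented for illustration, and the byte arrays stand in for decoded page images. It only shows the general pattern of producing each page when the consumer asks for it, rather than loading the whole document up front.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a page-at-a-time image source.
// This is NOT the Atalasoft API; it only illustrates the pattern.
public sealed class LazyPageSource
{
    private readonly int pageCount;
    private readonly Func<int, byte[]> loadPage;

    public LazyPageSource(int pageCount, Func<int, byte[]> loadPage)
    {
        this.pageCount = pageCount;
        this.loadPage = loadPage;
    }

    // Each page is loaded only when the consumer requests the next item,
    // so the iterator never holds more than one page at a time.
    public IEnumerable<byte[]> Pages()
    {
        for (int i = 0; i < pageCount; i++)
        {
            yield return loadPage(i);
        }
    }
}

public static class LazyPageDemo
{
    public static void Main()
    {
        int loads = 0;
        var source = new LazyPageSource(3, index =>
        {
            loads++;                           // count pages actually loaded
            return new byte[] { (byte)index }; // placeholder for decoded pixel data
        });

        foreach (byte[] page in source.Pages())
        {
            Console.WriteLine($"Processing page {page[0]}; pages loaded so far: {loads}");
        }
    }
}
```

Running the demo shows the load counter advancing in step with the page being processed, which is the behavior an OCR engine benefits from when it consumes a multi-page document one page at a time.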

C# Sample Code:

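   // Namespaces used by this sample (PdfImageSource lives in the PDF Reader add-on).
   using System.IO;
   using Atalasoft.Imaging.ImageSources;
   using Atalasoft.Ocr;

   // Assumes ocrEng is already initialized and supports searchable PDF translation.
   public void TranslatePdftoSearchablePdf(OcrEngine ocrEng, string pdfIn, string searchablePdfOut)
   {
      using (Stream pdfStream = File.OpenRead(pdfIn))
      using (PdfImageSource pdfSource = new PdfImageSource(pdfStream))
      {
         // Pages are pulled from pdfSource one at a time as the engine translates them.
         ocrEng.Translate(pdfSource, "application/pdf", searchablePdfOut);
      }
   }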

Original Article:
Q10301 - HOWTO: How to OCR a PDF


Details

Last Modified: 6 Years Ago

Last Modified By: Administrator

Type: HOWTO
