Tuesday, August 05, 2008 2:12 PM
by loufranco
Adobe Blog about why PDF Text Extraction is Hard
Jim King's Inside PDF blog is a great way to keep up with what's happening with PDF (especially standardization). Last week he detailed why extracting text from a PDF is so difficult:
PDF specifies text content of pages as glyphs not characters. That is,
one of the appearances for an "a" is chosen by the creator of the PDF
file by choosing a font from which the "a" glyph can be taken. PDF page
contents do not specify characters such as just the Latin letter "a".
The rub comes when we want to work with characters not glyphs. Unicode
is widely used because it is a character encoding technology not a
glyph encoding one. In fact, for many purposes, such as searching for
text strings, we do not want to search by appearance but we want to
search by the Latin letters (or commonly by the Unicode encoding of
characters).
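To make the glyph/character distinction concrete, here is a minimal Python sketch. It assumes a hypothetical embedded font whose glyph codes are mapped back to Unicode through a lookup table, similar in spirit to the optional ToUnicode map a PDF font can carry. The table contents, glyph codes, and the `glyphs_to_text` helper are all invented for illustration and are not taken from King's post or from any particular library.

```python
# Hypothetical glyph-code-to-Unicode table for one embedded font.
# The codes and mappings are made up for this example; a real PDF may
# supply comparable information for a font, or may omit it entirely.
TO_UNICODE = {
    0x01: "o",
    0x02: "f",
    0x03: "c",
    0x04: "e",
    0x05: "ffi",  # one ligature glyph that stands for three characters
}


def glyphs_to_text(glyph_codes, to_unicode):
    """Map glyph codes to characters; unmapped glyphs become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in glyph_codes)


# The word "office" as it might be drawn on the page: o, ffi ligature, c, e.
page_glyphs = [0x01, 0x05, 0x03, 0x04]
text = glyphs_to_text(page_glyphs, TO_UNICODE)

# Searching works on characters, not on the glyphs that were painted.
found = "office" in text
```

The page holds four glyphs but the recovered string has six characters, so a search for the letters only succeeds after the mapping step. When a glyph has no entry in the table, the sketch falls back to the replacement character, which is roughly where extraction starts to get hard.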
Of course, if you need to extract text from a PDF, there are tools that can do that for you.