Understanding Content Conversion: Unfortunately, There’s No ‘Easy’ Button

By Mark Gross, President, CEO, Data Conversion Laboratory

Data Conversion Laboratory, the company I founded, has been doing document conversion for thirty years, and every once in a while someone I haven’t seen in a while still asks me, “Are you still doing that?” or “Isn’t there software that does all that?”

The truth is that if it were easy, it would indeed all be automated, as is already the case for news feeds, financial transactions, and other standardized data flows. But when it comes to documents and books, creativity will not be bound by rules and style sheets, especially at deadline, when one wants a certain look and MS Word chooses not to cooperate. A document can contain anything, and computer software doesn’t work well with ‘randomness’.

Even what to the human eye looks like a simple book – a book with simple text, no tables, and no links – still contains complications that will thwart software not meant to deal with it. In a recent test of three free software packages, not one book came out perfectly. Each and every one of them had problems. To complicate matters, each book had different problems (recorded webinar includes examples).

Let’s get flexible

In order to deliver the high quality, customized information that consumers expect – and in some cases, demand – we’re going to have to start thinking seriously about creating automated streams of standardized content – something most organizations don’t have today.

For organizations that need to distribute and monetize large collections of information, the clearest solution seems to be adopting a flexible, standardized content model and maintaining that content in an appropriate content management system. In our mobile-connected, always-on world, this means creating content and delivering it to customers on the device of their choosing. It also means future-proofing that content so it can be quickly and efficiently prepared and delivered to the many devices that have yet to hit the market.

There are many content standards to choose from: HTML, EPUB, XML, PDF, and DITA. But which one is right for your purposes? For your audience? For the audience of the future? Will you need to support more than one?

For now, let’s assume that some robust form of XML will be the right way to store your information; the specific form best suited to your content will need more discussion. However, for most large collections, moving to one of the simpler formats like HTML or EPUB can be a risky investment because of the limited flexibility they offer.

EPUB, for example, now on version 3.0, is specialized to the needs of electronic books as they currently exist. It’s very possible that not all the features of your content are easily definable within the EPUB standard or displayable on current devices. If you limit yourself to converting to the current version of EPUB, you may be limiting your content as new capabilities, not currently envisioned, are introduced. The same is true of HTML, which is designed for display of information.

To preserve your investment in converting to a standardized format, the safer approach is to convert and store the content in a more robust XML format, such as DITA, DocBook, NLM, TEI, S1000D, or one of the various other XML standards created for specific purposes. If the XML is properly designed, you will then be able to automatically convert your content to EPUB, HTML, PDF, and other final formats in the future.
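
To make the idea of converting from a structured master format concrete, here is a minimal sketch in Python. The tiny `chapter`/`title`/`para` vocabulary, the `TAG_MAP` table, and the `to_html` function are all made up for illustration; they are not DITA, DocBook, or any other real standard, and a production pipeline would more likely use XSLT or a dedicated publishing toolchain.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical structured source; the element names are invented for this sketch.
SOURCE = (
    "<chapter>"
    "<title>No Easy Button</title>"
    "<para>Tables &amp; hyphens are hard.</para>"
    "</chapter>"
)

# Assumed mapping from the made-up source elements to HTML elements.
TAG_MAP = {"title": "h1", "para": "p"}

def to_html(xml_text):
    """Convert each known child of the root element to an HTML element."""
    root = ET.fromstring(xml_text)
    parts = []
    for child in root:
        tag = TAG_MAP.get(child.tag)
        if tag is None:
            continue  # this sketch silently skips elements it does not know
        parts.append(f"<{tag}>{escape(child.text or '')}</{tag}>")
    return "\n".join(parts)

print(to_html(SOURCE))
```

Because the source records what each piece of text is (a title, a paragraph), producing a different output format is mostly a matter of swapping the mapping table.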

OK, why is it so difficult to convert?

If everyone wrote their documents with the intent that they be standardized and converted, life would be easy (and we wouldn’t have that much to do). But the reality is most content is not easily extractable, and lacks the details needed for a full conversion. Much needs to be corrected, and much needs to be inferred based on the content.

As an example, let’s look at the difficulties in extracting content from PDF files. Since PDF is a print format, PDF documents are typically less-structured versions of their word-processor originals. While PDF content is laid out to look good, it includes very little structure—that is, it contains few clues as to the function of text elements (e.g., paragraphs, spaces, line breaks) or how they ought to be displayed in a different context (for instance, an e-book). While converting thoroughly structured content to XML is straightforward, PDF doesn’t contain explicit structuring. But an even more basic problem has to do with properly extracting the content from the PDF to begin with.

Examples of problems you are likely to encounter with commercial packages include the following:

Incorrect Word Spaces

PDF documents create spaces visually; that is, a space is not actually labeled as “one standard space” or “two standard spaces.” Spacing is usually extracted correctly, but conversion software sometimes misinterprets the gaps between words, adding or deleting spaces incorrectly during PDF-to-Word extraction. That’s why ebooks that have not been fully reviewed will have words run together or otherwise incorrectly spaced.
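
A rough sketch of how an extractor might guess at word spaces from glyph positions follows. The glyph tuple format, the `join_glyphs` name, and the `gap_ratio` threshold are assumptions for illustration, not the method of any particular tool.

```python
from typing import List, Tuple

# Each glyph is (character, x_start, x_end) in page units; a hypothetical input format.
Glyph = Tuple[str, float, float]

def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild a line of text, inserting a space wherever the horizontal gap
    between two consecutive glyphs exceeds gap_ratio times the average glyph width."""
    if not glyphs:
        return ""
    avg_width = sum(x1 - x0 for _, x0, x1 in glyphs) / len(glyphs)
    out = [glyphs[0][0]]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur[1] - prev[2]
        if gap > gap_ratio * avg_width:
            out.append(" ")
        out.append(cur[0])
    return "".join(out)

line = [("t", 0, 5), ("o", 5.5, 10.5), ("b", 13, 18), ("e", 18.5, 23.5)]
print(join_glyphs(line))                 # to be
print(join_glyphs(line, gap_ratio=0.6))  # tobe
```

The same line comes out as two words or one depending on the threshold, which is the kind of guess that goes wrong when tracking, justification, or kerning stretches or squeezes the gaps.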

Paragraph Delineation

In most cases, PDF documents contain no explicit information to indicate where a paragraph begins or ends, so this too must be guessed at by conversion software, based on “visual” interpretation of the appearance of chunks of text. While conversion software frequently does guess correctly, paragraph delineation can be a source of extraction errors, particularly when paragraphs are very short or span pages, or when images and tables get in the way.
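
The kind of “visual” guess described above can be sketched as follows. The per-line tuple format, the function name, and the `gap_factor` and `indent_min` thresholds are illustrative assumptions; real tools use more signals than these two.

```python
from typing import List, Tuple

# (text, baseline_y, left_x) for each extracted line; hypothetical data in which
# y grows down the page.
Line = Tuple[str, float, float]

def group_paragraphs(lines: List[Line], leading: float, margin: float,
                     gap_factor: float = 1.5, indent_min: float = 5.0) -> List[str]:
    """Start a new paragraph when a line is indented from the margin or
    follows an unusually large vertical gap; otherwise append to the current one."""
    paragraphs: List[List[str]] = []
    prev_y = None
    for text, y, x in lines:
        big_gap = prev_y is not None and (y - prev_y) > gap_factor * leading
        indented = (x - margin) >= indent_min
        if prev_y is None or big_gap or indented:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
        prev_y = y
    return [" ".join(p) for p in paragraphs]

lines = [
    ("It was a simple book,", 100, 72),
    ("or so it seemed.", 112, 72),
    ("Then came the tables.", 124, 84),  # indented first line
    ("Short.", 150, 72),                 # extra vertical gap
]
for paragraph in group_paragraphs(lines, leading=12, margin=72):
    print(paragraph)
```

A paragraph that continues on the next page, or one interrupted by a figure, produces a large gap with no real paragraph break, so a rule like this one would split it incorrectly.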

Hyphens

Hyphens pose a problem because they serve several purposes that an automated system cannot reliably distinguish. While the hyphen joining a term such as “half-life” should appear no matter where the words are placed within a document, a hyphen that appears halfway through a word because of a line break (e.g., hyphen-ated) becomes an ugly error once the word is moved to the middle of a line. This is also something you’ll see often in ebooks you download.
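
One common heuristic, sketched here under stated assumptions, is to rejoin a word split at a line end only when the joined form appears in a word list. The tiny `KNOWN_WORDS` set and the `dehyphenate` function are stand-ins for illustration; a real pipeline would need a full dictionary and would still face ambiguous cases.

```python
import re

# Tiny stand-in word list; purely illustrative.
KNOWN_WORDS = {"hyphenated", "conversion", "document"}

def dehyphenate(text, known_words=KNOWN_WORDS):
    """Join a word split by an end-of-line hyphen when the joined form is a
    known word; otherwise keep the hyphen and drop only the line break."""
    def fix(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in known_words:
            return joined
        return f"{left}-{right}"
    return re.sub(r"(\w+)-\n(\w+)", fix, text)

sample = "a hyphen-\nated word and a half-\nlife term"
print(dehyphenate(sample))  # a hyphenated word and a half-life term
```

The dictionary lookup cannot settle cases where both forms are valid words (for example “re-sign” and “resign”), which is why a human review step remains necessary.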

Emphasis

Depending on how a document is rendered in PDF, extracting the correct emphasis from a PDF document can sometimes pose problems for conversion software. Again, this is because PDF structure is nothing more than a visual representation; while text may appear emphasized, the PDF does not tag it as “emphasized”—conversion software must make its best guess based on what it can glean from the text’s appearance.
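
As an illustration of such a best guess, the sketch below infers emphasis from the name of the font used for a run of text. The function name and the returned dictionary are assumptions made for this example; the approach fails whenever a font name does not advertise its weight or slant, or when emphasis is produced some other way.

```python
def guess_emphasis(font_name):
    """Guess bold/italic from substrings of a font name such as 'Times-BoldItalic'."""
    name = font_name.lower()
    return {
        "bold": "bold" in name,
        "italic": "italic" in name or "oblique" in name,
    }

print(guess_emphasis("Times-BoldItalic"))
print(guess_emphasis("Helvetica-Oblique"))
print(guess_emphasis("Times-Roman"))
```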

Superscripting and Subscripting

Since PDF documents’ treatment of superscripts and subscripts is limited to the way they appear when laid out in the PDF (rather than by some kind of “superscript” or “subscript” tag), extraction software tends to run into problems with determining the vertical alignment of text. As a result, superscripts and subscripts are frequently misinterpreted by extraction software.
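
A simplified sketch of the vertical-alignment guess: compare a character’s baseline and size with those of the surrounding line. The parameter names and the `tolerance` value are assumptions for illustration only.

```python
def classify_script(char_baseline, char_size, line_baseline, line_size, tolerance=0.15):
    """Return 'super', 'sub', or 'normal' for one character.

    Coordinates are assumed to grow upward. A character counts as a script
    only if it is smaller than the line's text and its baseline is shifted
    by more than tolerance * line_size.
    """
    if char_size >= line_size:
        return "normal"
    shift = char_baseline - line_baseline
    if shift > tolerance * line_size:
        return "super"
    if shift < -tolerance * line_size:
        return "sub"
    return "normal"

print(classify_script(104, 7, 100, 10))    # super
print(classify_script(98, 7, 100, 10))     # sub
print(classify_script(100.5, 7, 100, 10))  # normal
```

Small caps, footnote markers set at full size, and slightly uneven baselines can all fall on the wrong side of a threshold like this one.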

Special Characters

In PDF documents, special characters like foreign or mathematical symbols are frequently represented by unusual or proprietary fonts. In order to extract them to a word processor, these characters first need to be converted to a more standard character representation (e.g., ISO or Unicode). While many conversion software suites build conversion tables to handle such characters, it is impossible to keep up with the vast variety of atypical and proprietary fonts in use, and so many special characters fail to extract properly.
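
A conversion table of the kind described above might look like the following sketch. The three entries follow the classic Symbol font layout, in which Latin letter codes are drawn as Greek glyphs; the `FONT_MAPS` structure, the `to_unicode` name, and the choice to flag unmapped characters with U+FFFD are assumptions for this example.

```python
# Illustrative remapping tables keyed by font name (structure is hypothetical).
FONT_MAPS = {
    "Symbol": {"a": "\u03b1", "b": "\u03b2", "p": "\u03c0"},  # alpha, beta, pi
}

def to_unicode(text, font_name, font_maps=FONT_MAPS, unknown="\ufffd"):
    """Translate text set in a special font to Unicode.

    Text in fonts without a table is returned unchanged. Characters missing
    from a font's table are replaced with `unknown` so a reviewer can find them.
    """
    table = font_maps.get(font_name)
    if table is None:
        return text
    return "".join(table.get(ch, unknown) for ch in text)

print(to_unicode("ab", "Symbol"))
print(to_unicode("ab", "Times-Roman"))
```

Every font without a table, and every character missing from a table, is a place where extraction either fails silently or needs manual attention.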

Sub-fonting

PDF’s approach to font embedding is another obstacle to proper extraction. Sometimes when PDFs are created, the PDF document does not store the information for the entire font, but rather stores only the parts of the font that are used in a given document. The characters within this “sub-font” are accessed via an indirect table within the PDF document itself, making correct interpretation and extraction of sub-fonted characters difficult. Many conversion tools cannot extract these characters at all, and produce “garbage” text instead of accurately extracted content.

Tables

Tables are among the trickiest document elements to extract. This is because the appearance of even a simple table is determined by numerous attributes, including but not limited to column and row delineation, header and body delineation, vertical and horizontal cell spanning, cell separators, and vertical and horizontal cell alignment. With none of this information included in the source PDF, it is nearly impossible for an automated tool to reproduce a table exactly as it appeared in the original document.

While some short or simple documents may be able to undergo a PDF-to-Word (and subsequent Word-to-EPUB) conversion with minimal difficulty, any long or complex document set will encounter several of these obstacles. The obstacles inherent in any PDF text extraction should underscore, first, the utility of retaining original versions of source documents in word processor format, if possible; and second, the critical importance of a good quality assurance strategy in any conversion process.

So what do I do now?

Obviously, many millions of pages do get converted; we convert millions of pages ourselves. There are solutions and approaches for dealing with all of the above, and more, which will be discussed in future columns.

Questions about conversion?

If you have specific questions about conversion that you’d like me to answer, use the comment feature of this website to submit them. I’ll do my best to find an answer for you.

  • http://twitter.com/opinaripeople Jean Purcell

    Excellent.

  • http://twitter.com/BookDesignGirl Colleen Cunningham

    Fervent believer of the No Easy Button stance and proclaim it near and far. Now I truly understand why PDF extraction is so difficult. Thanks for the ammo. :)

  • http://twitter.com/ErikaNygaard Erika Schulz Nygaard

    Having completed a PDF to ebook conversion recently, I can vouch for the points in this piece. And I feel better for being reassured I was not crazy, just going crazy over all those missing word spaces and reconstructing every paragraph. It’s nice to see the issues clarified so nicely here.

  • Sherri Henkin

    I’ve had to do the pdf to Word conversion and it’s not pleasant. Thanks, Mark, for explaining why!

  • Lamastep

    dumb question, total newbie
    why is the content conversion thru Word and not something like in-design 5

  • http://twitter.com/gmtwriter GreenMountainWriter

    I wish I’d read this before starting on my latest project. What a mess we’ve encountered! An InDesign file incorrectly laid out (think one giant, 128 page paragraph with any formatting such as indents, para spacing, etc. manually applied).  

  • Reuben Tozman

    Great Post. Check out sLML content structure for learning on source forge
