Cleaning Up eBook Conversion Messes: Tips For Success
By Mark Gross, with Devorah Bloom, DCL
In my previous column, Understanding Content Conversion: Unfortunately, There’s No ‘Easy’ Button, I examined the various areas that trip up the eBook conversion process: special characters, tables, hyphenation, and so on. This column follows on that theme, based on a question from a consultant with a major consulting firm:
“…I am researching the time it takes on average to convert from PDF to EPUB, including the time it takes to edit the ‘rough’ EPUB file created by the automated conversion to clean up the formatting errors, resulting in a ‘clean’ EPUB suitable for display on an eReader device. I am having trouble locating such a statistic.”
Why is it so difficult to find this information, you ask? That’s easy. Because there isn’t one specific answer. The time it takes to clean up the errors left behind by an automated conversion depends on many factors:
- complexity of content in the source document
- type of source file
- completeness of the automated converter
- post-conversion, manual cleanup
- skill level of the people doing the work
- how clean do you want it
Most novels are easier to clean up than multi-column textbooks or complex technical content containing formatted tables, images, poetry, sidebars, and footnotes. Complexity increases with the number of such elements that must be converted and reconfigured to fit the often much smaller screens common on smartphones, tablets, and eBook readers. Even novels are not that simple to convert, as discussed in Devorah Bloom’s webinar, What to Expect from Automated Conversion to eBook, but they are usually quicker to do than complex scientific articles with math and chemistry notation.
The format of your source content will impact how long it will take to prepare your materials for conversion. It’s important to carefully consider all sources available. It may be obvious that converting from paper will be the most costly, as proofreading will be necessary to make sure everything is correct, but even with electronic files, problems can be expected. PDFs, for instance, are the most common type of source file. They come in many variations, and while some are much better sources than others, common problems introduced in PDF files include:
- word spacing
- paragraph delineation
- special characters
Word processing files, proprietary software formats, standard file types (XML, HTML, etc.), and pretty much every other type of file will also introduce challenges.
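To make the PDF problems listed above a little more concrete, here is a minimal Python sketch of the kind of first-pass text normalization a cleanup script might do. It is an illustration only, not a description of any particular converter: the function name clean_extracted_text and the specific rules are assumptions, and it presumes the text has already been extracted from the PDF with blank lines separating paragraphs.

```python
import re

# Ligature code points that PDF text extraction often leaves behind
# (an illustrative subset, not a complete list).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def clean_extracted_text(raw: str) -> list[str]:
    """Return a list of cleaned paragraphs from raw extracted text.

    Assumes paragraphs are separated by at least one blank line.
    """
    text = raw.replace("\r\n", "\n")

    # Special characters: expand ligatures and drop soft hyphens.
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    text = text.replace("\u00ad", "")

    # Rejoin words split across a line break ("con-\nversion").
    # Caveat: this also joins real compounds such as "well-\nknown",
    # so the results still need a human review.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Paragraph delineation: split on blank lines.
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for paragraph in paragraphs:
        # Word spacing: collapse runs of whitespace, including the
        # single newlines left inside a paragraph.
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)
    return cleaned
```

Even a simple script like this shows why tuning matters: each rule fixes one class of problem and can introduce another, which is exactly what the cleanup phase has to catch.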
Results from automated conversion software (and automated conversion scripts) vary widely, but in most cases if the software you opt to use isn’t tuned to your content, the conversion will be rough. Rough conversions require serious clean-up. It’s almost always best to invest the effort up front to tune the software to the content being converted.
Sometimes, tuning isn’t enough. Some content conversion software, commercial and freeware alike, is overly strict and inflexible. While conversion rules are necessary, you’ll want to use tools that provide the flexibility needed to handle all the situations you’re likely to encounter when converting source content that isn’t as clean as you’d like.
While writers don’t maliciously introduce problems into the documents they create, they are the source of many conversion challenges. It’s not their fault, actually. They lack an understanding of how their actions create conversion problems and have never been equipped with the knowledge (or the tools) needed to produce easy-to-convert documents. Knowing this, however, will make it easier for you to select the approach that works best for your organization.
One approach is to take whatever content you have been provided and work to clean it up, manually.
Another approach is to use commercial conversion software to make a first pass. When problems crop up (and they will), go back and modify the original documents so that they fit the software’s expectations. While this approach is workable, it’s time-consuming and expensive.
A third approach (that is particularly useful on large document sets) is to work with a firm that specializes in content conversion. Look for a company that has developed conversion software which is designed to be continually adjusted (tuned) to meet new needs. This approach will allow you to continually leverage the power of the conversion software to do as much of the work as possible. At DCL, we use this approach and we do so because every tiny accuracy improvement we make pays tremendous dividends in the clean-up phase.
The manual cleanup phase is where the big variances enter the equation. If the conversion process worked effectively, this phase would be just a review phase and would go very quickly. However, if the conversion is rough, having left behind a lot of debris, it takes longer since you have to find and fix things, and some of those things are difficult and time-consuming to fix by hand. If you’re still intent on doing all this yourself, you should test the results of the conversion on a small, representative sample of your content to better understand what’s involved.
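One way to turn a sample test into a planning number is to time the cleanup of the sample and scale it up. The Python sketch below does that arithmetic. The function name estimate_cleanup_hours and the figures in the usage note are hypothetical, and the linear scaling is itself an assumption: it only holds if the sample really is representative of the whole document set.

```python
def estimate_cleanup_hours(sample_pages: int, sample_minutes: float,
                           total_pages: int) -> float:
    """Extrapolate total cleanup time from a timed sample.

    Assumes cleanup effort scales linearly with page count, which is
    only reasonable when the sample is representative of the content.
    """
    if sample_pages <= 0:
        raise ValueError("sample_pages must be positive")
    minutes_per_page = sample_minutes / sample_pages
    return minutes_per_page * total_pages / 60
```

For example, if a hypothetical 20-page sample took 60 minutes to clean up, estimate_cleanup_hours(20, 60, 300) suggests about 15 hours for a 300-page book. Treat the result as a rough starting point, not a quote.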
After cleanup, everything has to be reviewed. It’s a necessary step that far too many people skip, leading to content quality problems. Additional, device-specific review and testing will need to be conducted if you’re outputting your content to multiple device types. This step is not intended to clean up errors, but rather to ensure everything worked well. It’s an important task, not to be relegated to clerical staff. Instead, it is best conducted by people who understand the content, the audience, and the devices on which the content will be displayed.
Review is more than just comparing the original copy to the final result. Because changes are introduced in the eBook versions that were not part of the source file (text is repositioned to support device screen size and orientation, for example), it’s important that reviewers are equipped with the knowledge necessary to know what to check for and what to ignore.
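A simple script can help reviewers focus by flagging likely conversion debris for a human to look at. The Python sketch below is one possible approach, not a standard tool: the three checks and the name flag_debris are assumptions chosen for illustration, and it works on plain text that has already been pulled out of the eBook. It deliberately ignores line breaks and reflow, since those are expected to differ from the source.

```python
import re

# Illustrative checks only; a real list would be tuned to the content.
CHECKS = {
    # U+FFFD usually means a special character was lost in conversion.
    "replacement character": re.compile("\ufffd"),
    # A hyphen followed by a space inside a word, e.g. "conver- sion".
    "stray hyphen inside a word": re.compile(r"\b[a-z]+- [a-z]+\b"),
    # Two or more spaces in a row often signal word-spacing problems.
    "repeated spaces": re.compile(r" {2,}"),
}


def flag_debris(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, check label, matched text) for each finding."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS.items():
            for match in pattern.finditer(line):
                findings.append((line_no, label, match.group(0)))
    return findings
```

A report like this only surfaces candidates. Deciding which findings are real errors and which can be ignored is still the job of a reviewer who knows the content.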
How clean do you want it?
In the traditional book publishing world, perfection was the standard, but that seems to have changed with the rush to get eBooks to market – especially with short run books that need to get out quickly. While a medical text requires checking, double-checking, and triple-checking, other kinds of books might be acceptable with the occasional extraneous hyphen and bullets that don’t wrap exactly right. I’m a little old-fashioned on this, and prefer the perfection approach, but I do recognize that there are short-cuts that some may feel comfortable taking.
So the short answer to the question of how long it takes to produce clean eBook content is that it depends on a number of factors. Each of these variables contributes to the total amount of time you’ll need to spend on correcting an automated PDF to EPUB conversion; it may be 3-4 hours, but it can also take 3-4 days, or longer. It all depends…