Cleaning Up eBook Conversion Messes: Tips For Success

By Mark Gross, with Devorah Bloom, DCL

In my previous column, Understanding Content Conversion: Unfortunately, There’s No ‘Easy’ Button, I examined the various areas that trip up the eBook conversion process: special characters, tables, hyphenations, and so on. This column follows on that theme, based on a question from a consultant with a major consulting firm:

“…I am researching the time it takes on average to convert from PDF to EPUB, including the time it takes to edit the ‘rough’ EPUB file created by the automated conversion to clean up the formatting errors, resulting in a ‘clean’ EPUB suitable for display on an eReader device. I am having trouble locating such a statistic.”

Why is it so difficult to find this information, you ask? That’s easy: there isn’t one specific answer. The time it takes to clean up the errors left by an automated conversion depends on many factors:

  1. complexity of content in the source document
  2. type of source file
  3. completeness of the automated converter
  4. post-conversion, manual cleanup
  5. skill level of the people doing the work
  6. how clean do you want it

 

Complexity

Most novels are easier to clean up than multi-column textbooks or complex technical content containing formatted tables, images, poetry, sidebars, and footnotes. Complexity increases with the number of elements that must be reconfigured to fit the much smaller screens common on smartphones, tablets, and eBook readers. Even novels are not that simple to convert, as discussed in Devorah Bloom’s webinar, What to Expect from Automated Conversion to eBook, but they are usually quicker to do than complex scientific articles containing math, chemistry, and other specialized notation.

Source file

The format of your source content will impact how long it will take to prepare your materials for conversion. It’s important to carefully consider all sources available. It may be obvious that converting from paper will be the most costly, as proofreading will be necessary to make sure everything is correct, but even with electronic files, problems can be expected. PDFs, for instance, are the most common type of source file. They come in many variations, and while some are much better sources than others, common problems introduced in PDF files include:

  • word spacing
  • paragraph delineation
  • hyphens
  • emphasis
  • special characters.

Word processing files, proprietary software formats, standard file types (XML, HTML, etc.), and pretty much every other type of file will also introduce challenges.
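
To make a few of the PDF problems listed above concrete, here is a minimal Python sketch of the kind of repair a conversion script might attempt on text extracted from a PDF: rejoining words split by end-of-line hyphens, unwrapping hard line breaks, and collapsing extra word spacing. It is an illustration only, not DCL's software. The function name `repair_extracted_text` is hypothetical, and the sketch assumes the extractor marks paragraph boundaries with blank lines, which many do not.

```python
import re

def repair_extracted_text(raw: str) -> str:
    """Rejoin hyphen-split words and unwrap hard line breaks in extracted text."""
    # Assumption: a blank line marks a paragraph boundary in the extracted text.
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    cleaned = []
    for para in paragraphs:
        # Join a word broken across lines, e.g. "conver-\nsion" -> "conversion".
        # Only lowercase continuations are merged; real compounds such as
        # "well-\nknown" would be merged too, so the result still needs review.
        para = re.sub(r"(\w)-\n(?=[a-z])", r"\1", para)
        # Remaining line breaks inside a paragraph are layout only.
        para = re.sub(r"\s*\n\s*", " ", para)
        # Collapse runs of spaces left behind by justified text.
        para = re.sub(r"[ \t]{2,}", " ", para)
        cleaned.append(para.strip())
    return "\n\n".join(cleaned)

if __name__ == "__main__":
    sample = "The conver-\nsion was  rough\nbut usable.\n\nNext para-\ngraph here."
    print(repair_extracted_text(sample))
```

Even a simple rule like this has exceptions (a hyphenated compound that happens to break at the end of a line will lose its hyphen), which is one reason automated output still needs cleanup and review.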

Conversion software

Results from automated conversion software (and automated conversion scripts) vary widely, but in most cases if the software you opt to use isn’t tuned to your content, the conversion will be rough. Rough conversions require serious clean-up. It’s almost always best to invest the effort up front to tune the software to the content being converted.

Sometimes, tuning isn’t enough. Some content conversion software, commercial and free alike, is overly strict and lacks flexibility. While conversion rules are necessary, you’ll want to use tools that provide the flexibility needed to handle all the situations you’re likely to encounter when converting source content that isn’t as clean as you’d like.

While writers don’t maliciously introduce problems into the documents they create, they are the source of many conversion challenges. It’s not their fault, actually. They lack an understanding of how their actions create conversion problems and have never been equipped with the knowledge (or the tools) needed to produce easy-to-convert documents. But, knowing this fact will make it easier for you to select the approach that works best for your organization.

One approach is to take whatever content you have been provided and work to clean it up, manually.

Another approach is to use commercial conversion software to make a first pass. When problems crop up (and they will), go back and modify the original documents so that they fit the software’s expectations. While this approach is workable, it’s time-consuming and expensive.

A third approach (that is particularly useful on large document sets) is to work with a firm that specializes in content conversion. Look for a company that has developed conversion software which is designed to be continually adjusted (tuned) to meet new needs. This approach will allow you to continually leverage the power of the conversion software to do as much of the work as possible. At DCL, we use this approach and we do so because every tiny accuracy improvement we make pays tremendous dividends in the clean-up phase.

Cleanup

This is where the big variances enter the equation. If the conversion process worked effectively, this phase would just be a review phase and would go very quickly. However, if the conversion is rough and has left behind a lot of debris, it takes longer, since you have to find and fix things, and some of those things are difficult and time-consuming to fix by hand. If you’re still intent on doing all this yourself, you should test the results of the conversion on a small, representative sample of your content to better understand what’s involved.
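
One way to run such a sample test is to count the obvious debris in a handful of converted chapters before committing to the whole job. The Python sketch below is only a rough aid under stated assumptions: the pattern names, the patterns themselves, and the helper names `pick_sample` and `count_debris` are all hypothetical, and the counts flag candidates for a human to check rather than confirmed errors.

```python
import random
import re

# Hypothetical patterns for common conversion debris; tune them to your content.
DEBRIS_PATTERNS = {
    "stray_hyphen": re.compile(r"\w-\s+[a-z]"),    # e.g. "conver- sion"
    "double_space": re.compile(r" {2,}"),          # leftover word spacing
    "replacement_char": re.compile("\ufffd"),      # character that failed to decode
    "ligature": re.compile("[\ufb00-\ufb04]"),     # ff, fi, fl, ffi, ffl ligatures
}

def pick_sample(chapters, size, seed=0):
    """Pick a reproducible random sample of chapters (or files) to inspect."""
    rng = random.Random(seed)
    return rng.sample(chapters, min(size, len(chapters)))

def count_debris(text):
    """Count matches of each debris pattern in one converted text."""
    return {name: len(pattern.findall(text)) for name, pattern in DEBRIS_PATTERNS.items()}

if __name__ == "__main__":
    converted = ["The conver- sion of  the \ufb01le failed\ufffd.", "A clean chapter."]
    for chapter in pick_sample(converted, size=2):
        print(count_debris(chapter))
```

A random sample is only representative if the content is fairly uniform; for a mixed collection, pick samples from each kind of content (tables, poetry, footnotes) separately.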

Review

After cleanup, everything has to be reviewed. It’s a necessary step that far too many people skip, leading to content quality problems. Additional, device-specific review and testing will need to be conducted if you’re outputting your content to multiple device types. This step is not intended to clean up errors, but rather to ensure everything worked well. It’s an important task, not to be relegated to clerical staff. Instead, it is best conducted by people who understand the content, the audience, and the devices on which the content will be displayed.

Review is more than just comparing the original copy to the final result. Because changes are introduced in the eBook versions that were not part of the source file (text is repositioned to support device screen size and orientation, for example), it’s important that reviewers are equipped with the knowledge necessary to know what to check for and what to ignore.

How clean do you want it?

In the traditional book publishing world, perfection was the standard, but that seems to have changed with the rush to get eBooks to market – especially with short run books that need to get out quickly. While a medical text requires checking, double-checking, and triple-checking, other kinds of books might be acceptable with the occasional extraneous hyphen and bullets that don’t wrap exactly right. I’m a little old-fashioned on this, and prefer the perfection approach, but I do recognize that there are short-cuts that some may feel comfortable taking.

Conclusion

So the short answer to the question of how long it should take to produce clean eBook content is that it depends on a number of factors. Each of these variables contributes to the total amount of time you’ll need to spend on correcting an automated PDF to EPUB conversion; it may be 3-4 hours, but it can also take 3-4 days or longer. It all depends…

  • Alistair McAlpine

    Thanks Mark, good pointers. One thing to be aware of is that the material may be available in formats other than pdf already. It sometimes pays to check, especially if there’s a lot of it.

  • mattrsullivan

    Hi Mark, I’d like to “second” Alistair’s comment…

    Since PDF is a delivery format, anyone producing ePub should also have access to the source content. By single-sourcing from the original content, you’re likely to get a cleaner mapping to ePub.  

    Do you agree?

  • Linda Ettinger Llieberman

    These are all very good points. Always start with the cleanest original material you can find. PDF is OK, but only after you have completed all up-front copyediting and proofing.

    Our firm originally purchased full packages from CreateSpace for our non-fiction commercial works. As we have learned the ins and outs of On-Demand publishing, we have moved more of the production steps in-house. Our works contain figures and tables as well as illustrations and photos. Therefore, they are more complex. We still have CreateSpace do the manuscript conversion for us, although it would be cheaper to do it in-house. 

    After completing two books with CreateSpace and having compared other On-Demand publishing alternatives, we are now embarking on publishing a series of approximately 50 books on our form of conjoint analysis called Mind Genomics TM.

    • http://www.thecontentwrangler.com/ Scott Abel

      I’m not sure I’d say PDF is okay, but if it’s all you have, it might make sense to get it out of there first, then start the process. There are, of course, varieties of PDF, depending on how the source content was created, how (and with what software) the PDF was created, and whether or not the proper settings were employed during the PDFing process. PDF files can be just a photo of a document, which is not so useful for conversion needs. Or, they can be metadata-rich XML files that have the information needed to easily convert content, at least compared to other alternatives.

      Good luck on your series of 50 books. I’d love to hear more about your project (and your lessons learned) once you’ve accomplished your goal.

  • http://www.adi-mps.com/ Rose Rummel-Eury

    Nice article. I certainly learned the hard way that ebook conversion did not follow the same QA practices as hard-copy book production, the world I was from as a publishers’ comp vendor. We converted over 1 million pages in 2009, ready for the ereader explosion for our customers. Once people started buying the books, and seeing the double hyphenation, etc., our customers complained back to us.

    That’s when I discovered we were using machine checks only. Now, we continually write new scripts to fix bugs, AND we have humans eyeballing the results on emulators and on the devices. I still recommend a level of QA at the publishers’ level, as well.

    Rose

  • http://twitter.com/gmtwriter GreenMountainWriter

    Sometimes the best solution is to start over from scratch. We are putting legacy content from 2004 into Kindle and Nook formats. The original files (InDesign) were not designed or composed with an eye towards multiple formats. What looks good printed doesn’t look good as an Ebook conversion. We are having to do a lot of work “fixing” the problems that have come up.  Still, it has made us step back and re-evaluate our process for all our documentation, and that’s been helpful. Thanks for a good article. 

    • http://www.thecontentwrangler.com/ Scott Abel

      All I can say is, “I hear ya!” This new turf that we call digital publishing creates the need to re-think the book — and the book publishing process. Along with that come all the surprises you no doubt discovered on your own. Please let us know if we can help you in any way. Well, I may not be able to magically reformat your books, but if you need any guidance, advice or help, let me know. I’m always happy to connect those who know the answers with those who seek them.

  • Anthony

    Thank you – helpful, informative article. Many of the techies I know tell me e-book conversion is a simple process and can’t understand why my company would even need to hire a converter; on the other hand, three of the converters we’ve hired have failed to produce clean copies without a lot of correction from us. Your article helps to explain why this isn’t as smooth as I have been expecting.

  • Chandra Shakher

    Hi, I am a PHP developer.
    I have a problem: when I convert any PDF file into an EPUB file, the mathematical formulas don’t convert properly.
    How can I resolve it? Please, someone help me.

    Thanks in advance.
