Migrating to DITA: How Automated Conversion Works and Why it Matters to You
by Patrick Baker, VP Development and Professional Services, Stilo International
An unavoidable part of moving to the Darwin Information Typing Architecture (DITA), or any other structured authoring system, is converting your existing content into the new format. Most organizations that make the move to structured writing have to make the change while still continuing to meet their regular delivery schedules. This means you need to convert content and get it up and running correctly in the new system between two product cycles, and usually without much in the way of added staff or resources.
Content conversion is a key part of your migration strategy, and the quality and completeness of that conversion are essential to getting the migration done within the constraints of your schedule. However, content conversion is a bit of a black box for many people. It is hard for writers and managers to anticipate how difficult the conversion process is going to be, how long it is going to take, how much it is going to cost, and how much cleanup of the output is going to be required after the conversion is done. The purpose of this article is to lift the lid off that black box. In particular, understanding how automated conversion works will help you form a more reasonable expectation about how your own conversion project is going to go, and what you can do to make it go more smoothly.
No automated conversion is ever 100% clean, but the difference between an 80% clean conversion and one that is 95% clean is huge – it means a fourfold difference in cleanup costs. What makes the difference between 80% clean and 95% clean? Between 95% clean and 98% clean? Such outcomes certainly depend upon how well managed the conversion process is. However, having the right approach to content conversion is of critical importance.
Knowledge is the key to intelligent content conversion
There are three essential mechanisms that content conversion technology may leverage. They are:
- patterns
- context
- guided conversion
When encoding content in a semantically rich format such as DITA, it is important to understand the meaning of the content in order to apply the correct tags. While people can understand the full meaning of the text they are reading, a computer does not, at least not very deeply. What a computer is exceedingly good at is recognizing patterns in the content. But patterns don’t provide the full solution. Patterns, when found in a given context, carry much more insight as to the meaning of the text in question. A sequence of 5 digits, for example, may represent a zip code, in the context of a US postal address, or an ICD-9 diagnostic code, in the health care sector. Guided conversion is supported by the provision of high-level mapping rules that hint at the current context so that patterns are interpreted correctly by the automated conversion tool. Compiling these hints depends on having an intimate familiarity with the document set destined for conversion. It is the content owners, armed with this content knowledge, who are best positioned to specify the mapping rules.
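As a minimal sketch of how a context hint changes the interpretation of the same pattern, the Python fragment below wraps five-digit sequences in a different element depending on a context name supplied by a mapping rule. The context names and the element names `zip-code` and `icd9-code` are hypothetical, chosen only for illustration; they are not part of DITA or of any particular conversion tool.

```python
import re

# The pattern alone: any stand-alone run of exactly five digits.
FIVE_DIGITS = re.compile(r"\b\d{5}\b")

# Hypothetical mapping rules: context name -> element used to wrap the match.
CONTEXT_RULES = {
    "us-postal-address": "zip-code",
    "diagnosis": "icd9-code",
}


def tag_five_digit_codes(text, context):
    """Wrap five-digit sequences in the element that the context calls for."""
    element = CONTEXT_RULES.get(context)
    if element is None:
        # No rule for this context: leave the text untouched.
        return text
    return FIVE_DIGITS.sub(
        lambda m: f"<{element}>{m.group(0)}</{element}>", text
    )
```

The pattern never changes; only the rule selected by the context does, which is why the content owner's knowledge of where each kind of number appears matters so much.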
Patterns are everywhere in content. Patterns occur both in the content itself, and in the file format that contains the content. The foundation of all content conversion tools is the ability to recognize patterns. People also use patterns to recognize things in content.
For instance, a reader will immediately recognize what these numbers mean based on their pattern:
- +1 (613) 745-4242
Software can recognize these patterns as well, so if your target format requires semantic markup for dates, telephone numbers, or monetary amounts, a simple pattern-matching algorithm can find them and supply the markup. For example, a conversion program could recognize the sequence:
- "+" numbers space "(" number*3 ")" space number*3 "-" number*4
It can then capture each number sequence in this pattern and write it out using whatever XML format you choose for phone numbers, for instance:
- <phone-number country="1" area="613" exchange="745" number="4242"/>
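The pattern above can be sketched as a regular expression. This is only an illustration in Python, not the way any particular conversion product implements it; the function name `tag_phone_numbers` is an assumption, and the surrounding text is passed through unescaped for brevity.

```python
import re
from xml.sax.saxutils import quoteattr

# "+" digits, space, "(" 3 digits ")", space, 3 digits, "-", 4 digits
NANP_PHONE = re.compile(r"\+(\d+) \((\d{3})\) (\d{3})-(\d{4})")


def tag_phone_numbers(text):
    """Replace each matching phone number with a <phone-number/> element."""

    def to_xml(match):
        country, area, exchange, number = match.groups()
        return (
            f"<phone-number country={quoteattr(country)} "
            f"area={quoteattr(area)} "
            f"exchange={quoteattr(exchange)} "
            f"number={quoteattr(number)}/>"
        )

    return NANP_PHONE.sub(to_xml, text)
```

Each parenthesized group in the expression captures one of the number sequences, so the captured values can be written out under whatever attribute names the target format uses.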
Of course, recognizing phone numbers is a bit more complicated than this. For one thing, people do not always include the country code when they write a phone number. People often omit the parentheses and the dash from the number, especially when the country code is used. This is one place where local knowledge of your content comes in – if you have a corporate style for phone numbers, you can tell your conversion software exactly what to look for. Otherwise, the conversion program can use multiple patterns to detect phone numbers in different formats.
Also, this pattern only works for North American phone numbers. Many other countries write their phone numbers differently. This is a case where we can use a context clue to improve our detection of phone numbers. For instance, we can use the country code to determine which pattern to expect. The following pattern detects a UK phone number:
- “+44” numbers-and-spaces
UK phone numbers use a different format from North America, so our original pattern will not detect them correctly. A conversion program can detect phone numbers as a two-step process. First you detect the country code to determine which country the number belongs to, then you select a pattern appropriate to the chosen country to fully analyse the number.
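The two-step process might look like the following Python sketch. The table `COUNTRY_PATTERNS` and the function `detect_phone` are illustrative names, and the UK pattern is deliberately as loose as the one given in the text ("+44" followed by digits and spaces); a real rule set would be stricter.

```python
import re

# Step 2 patterns, keyed by country code.
COUNTRY_PATTERNS = {
    "1": re.compile(r"\+1 \((\d{3})\) (\d{3})-(\d{4})"),
    "44": re.compile(r"\+44((?: ?\d)+)"),
}

# Step 1 pattern: a "+" followed by one of the known country codes.
# "44" is listed before "1" so the longer code is tried first.
COUNTRY_CODE = re.compile(r"\+(44|1)")


def detect_phone(text):
    """Return (country_code, matched_text) for the first phone number, or None."""
    for code_match in COUNTRY_CODE.finditer(text):
        code = code_match.group(1)
        # Apply the country-specific pattern at the position of the "+".
        number_match = COUNTRY_PATTERNS[code].match(text, code_match.start())
        if number_match:
            return code, number_match.group(0)
    return None
```

Supporting another country then means adding one entry to the table and one alternative to the country-code pattern, which is the kind of extension the next paragraph describes.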
You can expect support for matching common patterns, such as phone numbers, to be built in to conversion software. However, it should be easy to extend the system with new patterns specific to the vocabulary of a particular domain.
Patterns, though they are an indispensable part of automated conversion, cannot on their own address the challenge of imparting to the content the depth of meaning, or understanding, required for the intelligent application of semantic markup. This is where context comes in.
For example, consider a list in an Adobe FrameMaker document. In FrameMaker, while a table is a distinct type of object, a list is not. In FrameMaker, you create a list simply by adding a bullet or number style to a set of paragraphs. The result is something that looks like a list in the output. However the FrameMaker file format does not record the fact that the content is a list. The human eye can see the list in the output, but it is a little more challenging for a conversion program to figure out where a list begins and ends and what belongs to each item in a list.
Why does the conversion program have to figure out where the list begins and ends? Because most XML formats treat lists as distinct objects. When an XML document is styled, the style is generally applied to the list as a whole, rather than to the individual paragraphs in the list. This is usually the only way that an XML-based system provides for styling lists, so if the conversion software does not recognize the list in the source and create a proper XML list element in the output, chances are that the list will not be styled properly in the final output.
Example: a nested list
Quick-drop cookies
1. Prepare the dough.
   a. Beat the egg in a large bowl.
   b. Add flour.
   c. Stir in milk.
2. Prepare the topping.
   a. Mix brown sugar and cinnamon in another bowl.
3. Form 1-inch round balls of dough.
   It is helpful to use a spoon when forming these balls.
4. Roll each ball in the topping.
5. Place each ball on an ungreased cookie sheet.
Bake at 425 °F for 12 to 15 minutes.
This is the kind of construct that often occurs in complex procedures in technical documentation. The conversion program has to deal with multiple paragraphs within a single list item, as well as nested lists.
In this example, a paragraph that begins with a numeral indicates a first level list item, while a paragraph beginning with a letter indicates a nested, second level list item. An automated conversion should leverage this pattern to determine the logical nesting level of each item. Alternatively, it should identify nesting level by the indentation or styles that were used. Regardless, the conversion needs to track the current nesting level in order to ensure that the lists are properly opened and closed, and that each list item belongs to the correct list. For our example, this means emitting an opening <ol> each time we transition from an outer list item to a more deeply nested list item, and emitting a closing </ol> when transitioning in the other direction. The correct output is:
<p>Quick-drop cookies</p>
<ol>
  <li>Prepare the dough.
    <ol>
      <li>Beat the egg in a large bowl.</li>
      <li>Add flour.</li>
      <li>Stir in milk.</li>
    </ol>
  </li>
  <li>Prepare the topping.
    <ol>
      <li>Mix brown sugar and cinnamon in another bowl.</li>
    </ol>
  </li>
  <li><p>Form 1-inch round balls of dough.</p>
    <p>It is helpful to use a spoon when forming these balls.</p>
  </li>
  <li>Roll each ball in the topping.</li>
  <li>Place each ball on an ungreased cookie sheet.</li>
</ol>
<p>Bake at 425 degrees Fahrenheit for 12 to 15 minutes.</p>
Note that the list markers (1., 2., a., etc.) have been removed by the conversion.
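The nesting-level tracking described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: every input paragraph starts with a list marker ("1." for level one, "a." for level two), nesting deepens one level at a time, and continuation paragraphs such as the spoon tip are not handled. The function names are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Marker patterns: a numeral means level 1, a lowercase letter means level 2.
MARKERS = (
    (1, re.compile(r"^\d+\.\s+(.*)$")),
    (2, re.compile(r"^[a-z]\.\s+(.*)$")),
)


def classify(paragraph):
    """Return (nesting_level, text_without_marker) for one paragraph."""
    for level, pattern in MARKERS:
        match = pattern.match(paragraph.strip())
        if match:
            return level, match.group(1)
    raise ValueError(f"no list marker found: {paragraph!r}")


def to_nested_ol(paragraphs):
    """Emit nested <ol>/<li> markup, tracking the current nesting depth."""
    out = []
    depth = 0  # number of <ol> elements currently open
    for paragraph in paragraphs:
        level, text = classify(paragraph)
        if level > depth:
            # Going deeper: open a list inside the still-open parent item.
            while depth < level:
                out.append("<ol>")
                depth += 1
        else:
            # Same level or shallower: close the previous item first,
            # then close any deeper lists along with their parent items.
            out.append("</li>")
            while depth > level:
                out.append("</ol></li>")
                depth -= 1
        out.append(f"<li>{escape(text)}")
    if depth:
        # End of input: close the last item and every list still open.
        out.append("</li>")
        while depth > 1:
            out.append("</ol></li>")
            depth -= 1
        out.append("</ol>")
    return "".join(out)
```

The markers are dropped by `classify`, so they do not appear in the output, and the single `depth` counter is what keeps each item inside the correct list.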
So, how can we establish the appropriate context of a given piece of content? The most reliable authority on this is the content owner who is familiar with the content. A mechanism is required which enables the content owner to easily express what the correct context is for any document content. This must be a high-level interface that does not require the user to be a programmer or technical expert.
Example: task steps
Upon further reflection, the markup provided by the previous example is not ideal. An improved DITA markup of these instructions for preparing the quick-drop cookies would use steps within a task topic. But, to target a semantically rich content model such as a DITA task, a conversion tool requires guidance. Such guidance may be provided by means of annotations attached to portions of the content, as illustrated in the table below.
The task title annotation can be based on the formatting properties of bold and underline. The annotation of step level 1 or 2 can be based on the presence of the list markers or the indentation level of the text. The tip might be recognized by the paragraph styling. The conversion should be smart enough to fit the last sentence into the task in a way that makes sense and that the DITA task content model permits. The elements <result>, <example> and <postreq> are good candidates. A preference can be set for the documentation set, and in this case <postreq> is the best choice.
Guided by these annotations, the conversion software should produce the following output:
<title>Quick-drop cookies</title>
<taskbody>
  <steps>
    <step>
      <cmd>Prepare the dough.</cmd>
      <substeps>
        <substep><cmd>Beat the egg in a large bowl.</cmd></substep>
        <substep><cmd>Add flour.</cmd></substep>
        <substep><cmd>Stir in milk.</cmd></substep>
      </substeps>
    </step>
    <step>
      <cmd>Prepare the topping.</cmd>
      <substeps>
        <substep><cmd>Mix brown sugar and cinnamon in another bowl.</cmd></substep>
      </substeps>
    </step>
    <step>
      <cmd>Form 1-inch round balls of dough.</cmd>
      <info><note type="tip">It is helpful to use a spoon when forming these balls.</note></info>
    </step>
    <step><cmd>Roll each ball in the topping.</cmd></step>
    <step><cmd>Place each ball on an ungreased cookie sheet.</cmd></step>
  </steps>
  <postreq>Bake at 425 degrees Fahrenheit for 12 to 15 minutes.</postreq>
</taskbody>
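To make the role of the annotations concrete, here is a hedged Python sketch that turns a sequence of annotated paragraphs into a DITA task using the standard library's ElementTree. The annotation names (`title`, `step1`, `step2`, `tip`, `postreq`) and the `task_id` parameter are assumptions made for this example; an actual conversion tool would define its own vocabulary and do far more validation.

```python
import xml.etree.ElementTree as ET


def build_task(task_id, annotated):
    """Build a <task> from (annotation, text) pairs given in document order."""
    task = ET.Element("task", id=task_id)
    title = ET.SubElement(task, "title")
    body = ET.SubElement(task, "taskbody")
    steps = ET.SubElement(body, "steps")
    step = None      # the <step> currently being filled
    substeps = None  # the <substeps> of the current step, if any
    for annotation, text in annotated:
        if annotation == "title":
            title.text = text
        elif annotation == "step1":
            step = ET.SubElement(steps, "step")
            substeps = None
            ET.SubElement(step, "cmd").text = text
        elif annotation == "step2":
            if substeps is None:
                substeps = ET.SubElement(step, "substeps")
            substep = ET.SubElement(substeps, "substep")
            ET.SubElement(substep, "cmd").text = text
        elif annotation == "tip":
            # Attach the tip to the current top-level step.
            info = ET.SubElement(step, "info")
            ET.SubElement(info, "note", type="tip").text = text
        elif annotation == "postreq":
            ET.SubElement(body, "postreq").text = text
        else:
            raise ValueError(f"unknown annotation: {annotation!r}")
    return task
```

A short usage example, with the recipe abbreviated:

```python
task = build_task("cookies", [
    ("title", "Quick-drop cookies"),
    ("step1", "Prepare the dough."),
    ("step2", "Beat the egg in a large bowl."),
    ("step1", "Form 1-inch round balls of dough."),
    ("tip", "It is helpful to use a spoon when forming these balls."),
    ("postreq", "Bake at 425 degrees Fahrenheit for 12 to 15 minutes."),
])
print(ET.tostring(task, encoding="unicode"))
```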
Typical problems to look out for
Here are some examples of the types of conversion issues that cause problems for conversion solutions that do not make full and integrated use of patterns, context, and guided conversion.
Multiple sets of steps within a task topic
A DITA task topic must contain only one procedure. However, many existing user guides are not written that way, and may have more than one procedure in a section. If you are converting sections into topics, and a section has more than one procedure, the conversion software needs to do something to produce valid output that includes both procedures.
Some control of context is required even to recognize that this problem exists. A conversion that depended solely on pattern matching would not even notice that it was creating an illegal second procedure. For a conversion tool to avoid this error, it has to be aware of the context of the procedure, not only in the input it is reading, but in the output it is creating.
Though the content cannot be automatically re-authored, the conversion software can insert an empty task <title> based on context, effectively breaking the topic into two tasks. This allows the conversion software to apply the semantically correct <step> and <cmd> markup to the content of the second procedure. The user still needs to provide the proper text for the title of the second procedure, post-conversion, but this is much quicker and easier, and less error-prone, than re-authoring the topic, either in the input or in DITA.
Procedures authored as a table
A number of organizations use tables to lay out the steps of a procedure. For a generic conversion program, this structure is going to look like a table, not a procedure, and the result will be that the content will come out as a table rather than a task in the DITA XML, which is not what you want.
Guided conversion can identify such tables based on, for example, the content of the first column (Step 1, Step 2, etc.), the header row, or possibly the table style. The identified tables can be stripped of their table markup, and their contents automatically mapped into step commands, info, examples, and so on. Again, the paragraphs can be identified based on the fact that they were contained in such a table, so there is no need to rely on styles.
Tables that contain definition lists, advisories, or any other content, can be similarly identified and stripped of their table markup.
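As an illustration of the procedure-table case described above, the Python sketch below recognizes a table by the content of its first column and maps each row to a step. It assumes the table has already been read into rows of cell text with the header row removed, that the second column holds the command, and that any further non-empty cells become `<info>`; these layout assumptions and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# First-column cue: "Step 1", "Step 2", ...
STEP_LABEL = re.compile(r"^Step\s+\d+$", re.IGNORECASE)


def is_procedure_table(rows):
    """True if every row's first cell looks like a step label."""
    return bool(rows) and all(STEP_LABEL.match(row[0].strip()) for row in rows)


def table_to_steps(rows):
    """Drop the table structure and emit one <step> per row."""
    steps = ET.Element("steps")
    for row in rows:
        step = ET.SubElement(steps, "step")
        ET.SubElement(step, "cmd").text = row[1].strip()
        for extra in row[2:]:
            if extra.strip():
                ET.SubElement(step, "info").text = extra.strip()
    return steps
```

Tables that fail the `is_procedure_table` check would simply be converted as ordinary tables.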
Conditional text
Some conversion tools have trouble working with files that contain conditional text. Sometimes the tool requires that all conditions be turned on before conversion, and the conditions are then lost in the output.
Guided conversion should be used to specify a rule which indicates how different conditions in the source content map to XML. The conversion rule can target the DITA otherprops attribute, or a specialization of the props filtering attribute, for the capture of the conditional information. A guided conversion rule could also cause conditions of a specified type to lead to the creation of entries in the relationship table of the DITA map.
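A condition-mapping rule of this kind could be expressed as a simple lookup table. The Python sketch below is illustrative only: the source condition names (`PrintOnly`, `OnlineOnly`) and the values they map to are invented for the example, and only the `otherprops` attribute mentioned above is targeted. Multiple values are joined with spaces, which is how DITA filtering attributes carry more than one value.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: source condition name -> (DITA attribute, value).
CONDITION_RULES = {
    "PrintOnly": ("otherprops", "print"),
    "OnlineOnly": ("otherprops", "online"),
}


def apply_conditions(element, source_conditions):
    """Record the source conditions on the element as filtering attributes."""
    for condition in source_conditions:
        rule = CONDITION_RULES.get(condition)
        if rule is None:
            # No rule for this condition: leave it for manual review.
            continue
        attribute, value = rule
        values = element.get(attribute, "").split()
        if value not in values:
            values.append(value)
        element.set(attribute, " ".join(values))
    return element
```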
Constructing book and map files
While the aim of a conversion to DITA is to be able to reuse topics in many places, the first place you are probably going to want to use your converted topics is in the same book they came from. That means you will need a ditamap and/or bookmap that reproduces the structure of the converted book. Your conversion tool should be able to produce the required ditamap and bookmap for you.
Discerning the hierarchy is not always as simple as matching heading levels. Not every heading marks a change in hierarchy, and authors do not always use headings in strict hierarchical sequence. Additionally, different topic divisions may be indicated by the use of different heading types. Managing all of these issues requires sophisticated management of context informed by a detailed knowledge of the content and the style conventions that were used to create it.
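A minimal Python sketch of the map-building step follows, assuming each heading has already been reduced to a level number and the file name of its converted topic. It uses a stack of open ancestors so that a heading that skips a level (for example, a level 4 directly under a level 2) is still attached to the nearest shallower heading. The function name and input shape are assumptions, and the sketch ignores the harder cases mentioned above, such as headings that do not mark a change in hierarchy.

```python
import xml.etree.ElementTree as ET


def build_map(book_title, headings):
    """Build a <map> from (level, href) pairs; level 1 is the outermost."""
    root = ET.Element("map")
    ET.SubElement(root, "title").text = book_title
    # Stack of (level, element) for the currently open ancestors.
    # The map itself sits at level 0 and is never popped.
    stack = [(0, root)]
    for level, href in headings:
        # Close every ancestor that is not shallower than this heading.
        while stack[-1][0] >= level:
            stack.pop()
        topicref = ET.SubElement(stack[-1][1], "topicref", href=href)
        stack.append((level, topicref))
    return root
```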
Another important issue is discovering the book information such as publication date, document number, etc. For some organizations, this may involve the creation of a customized bookmap, if the standard DITA bookmap does not capture all of the publication information the organization uses.
This information is not always easy to find in the source files. No generic conversion software can ever accurately detect, extract, and preserve this publication information, since its format and location are always specific to an individual organization. However, with guided conversion pinpointing the location, pattern, and context of this information, a conversion tool can build the correct map.
In some cases, important metadata is found in the headers and footers rather than the main text flow of the document. Once again, guided conversion can pinpoint the data of interest and relate it correctly to the ditamap and bookmap files you are building.
Choosing your conversion strategy
Knowledge, as defined by Joe Gollner:
“Knowledge is the meaningful organization of information, expressing an evolving understanding of a subject and establishing a basis for judgment and the potential for action.”
The level of success that an automated conversion technology can hope to achieve is bounded by the depth of knowledge it can attain of the content to be converted. Context, supported by guided conversion, provides for the meaningful organization of the information revealed by patterns. The conversion software can act on this evolved understanding of your content to produce the richest XML possible. Knowledge is the key to intelligent content conversion.
Because intimate familiarity with the content is so important to specifying the patterns and the context that will produce a high quality conversion that requires little cleanup, you probably don’t want to simply send your files away to be converted. Without your specialist knowledge to supply the patterns and context clues, the conversion you get back is going to be pretty generic, and that is going to mean you will have to do a lot of manual cleanup before the content is really usable.
On the other hand, the people with this knowledge are writers and editors in your organization, and they generally don’t know how to express these kinds of context clues in a programming language. Trying to learn to do conversion programming, so that you can write your own conversions that exploit your knowledge of the content, is going to be even more time consuming than cleaning up all the problems left by a generic conversion.
To get the best of both worlds, you need to work with a conversion service provider who understands the importance of patterns, context, and knowledge of the content in the conversion process, and who will work with you to define the conversion rules that will greatly improve the quality of your conversion output, and thus save you weeks or months of cleanup effort. You need a conversion service provider that possesses the intelligent conversion tools that allow you to capture and express all the context recognition rules in a high-level human-readable way, without the need for programming or technical expertise.