What is XML Really About?

by Mark Baker

We all know that XML is a good thing. The pundits and the vendors all tell us so. XML opens up many possibilities for content. But exactly how (and why) it does so is not always made clear. Getting your content into XML is not, by itself, enough to deliver any or all of the things that are promised for it. To understand why, we need to look at what XML is really about.

I’m going to assume you know the basic mechanics of XML and that you recognize elements, attributes, and the tags that define them. I’m assuming, therefore, that you recognize that the sample below, which is an excerpt from an XHTML document, contains a <p> element with two <i> elements inside it, and that these elements are defined using tags, which are the things inside the angle brackets:

<p><i>War and Peace</i> is a <i>very</i> long book.</p>

These tags are metadata. When people talk about metadata, they often think of it only as a label attached to a piece of content, usually by a content management system, and used to track the document and make it easier to find. That is certainly one important use of metadata, but metadata is much broader than that. Metadata means data that describes other data. There are many useful forms of metadata, including XML tags. (For more on this, see The Meaning of Metadata).

The reason that XML is so useful is that it allows us to assign metadata to content at any level of granularity, not just to a document as a whole, but to an individual sentence or an individual word or phrase. In the sample above, the <i> tag tells us that the string “War and Peace” should be rendered in italics. (This is what <i> means in XHTML. An <i> tag could mean something completely different in another tagging language.)
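To make the point that tags are data about the text, here is a minimal Python sketch that parses the sample with the standard library's xml.etree.ElementTree module and pairs each tag with the text it marks up. The variable names are my own and are not part of XHTML or of any particular toolchain.

```python
import xml.etree.ElementTree as ET

# The XHTML excerpt from the sample above.
sample = "<p><i>War and Peace</i> is a <i>very</i> long book.</p>"

paragraph = ET.fromstring(sample)

# Pair each tag (the metadata) with the text it describes (the data).
# iter() visits the <p> element itself first, then its descendants in order.
marked_up = [(element.tag, "".join(element.itertext())) for element in paragraph.iter()]

for tag, text in marked_up:
    print(f"<{tag}> describes: {text!r}")
```

The <p> tag describes the whole sentence, while each <i> tag describes only a word or phrase inside it, which is the "any level of granularity" idea in miniature.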

Of course, “render this in italics” is not the most sophisticated piece of metadata in the world. Also, it seems to violate the oft-cited rule that XML separates content from formatting. Actually, that oft-cited rule is wrong. Some XML tagging languages, such as XHTML and XSL-FO, are designed specifically to apply formatting to content. Some XML languages are designed to separate content from formatting. Others have nothing to do with content at all. A much better rule would be to say that XML allows you to apply metadata to content, and that the metadata you apply can express just about anything you want it to.

So, XHTML does attach formatting to content, and an XHTML processing application (like, say, a web browser) would render our sample content like this:

War and Peace is a very long book.

That’s fine, as far as it goes, but XML can enable us to do a lot more.

Separating text from formatting

If we do want to separate content from formatting, we need to add some metadata to create the separation. XHTML provides a way for us to take the first step along that road by using the <em> (emphasis) tag rather than the <i> tag:

<p><em>War and Peace</em> is a <em>very</em> long book.</p>

Rather than specifying that “War and Peace” and “very” be printed in italics, this markup simply says that they are to be emphasized. It leaves it up to the processing application to decide how to emphasize them. For instance, it could do this:

War and Peace is a very long book.

But there is a problem here. The processing application has every right to print the emphasized content in bold red text, because making text bold and red certainly does emphasize it. But there is a convention for how you show the title of a book: it should be rendered in italics.

If we are going to separate content from formatting, therefore, we had better do it properly. If we just use <em> as a synonym for <i>, then we are not actually separating the content from the formatting and we would be better served to stick to <i> since it is actually a better way of saying what we mean.

If we truly want to separate content from formatting, we had better find a more discriminating way to go about it than simply replacing <i> with <em> everywhere. If we are not going to format the text directly, then we need to give the processing application enough metadata that it can distinguish things that ought to be formatted differently.

In order to give the processing application enough information to format “War and Peace” correctly, we need to provide metadata that says that the string “War and Peace” is a title:

<p><title>War and Peace</title> is a <em>very</em> long book.</p>

Now we have provided enough information for an XML processor to render the sentence appropriately:

War and Peace is a very long book.

However, in adding the <title> tag to our tagging language, we have moved away from XHTML. XHTML does support a <title> tag, but it uses it inside the <head> element to capture the title of the current document. It does not support the use of <title> inside the <p> element for marking up the titles of books.

If we are no longer using XHTML, what are we using? We are now using a tagging language of our own invention. I’ll call it YAMLX (Yet Another Markup Language eXample). We are beginning to capture metadata that is specific to our business. That means that we have to start taking responsibility for our own markup design, and either find an existing markup language that provides the metadata we need, or create one ourselves. For purposes of this article, it doesn’t matter whether YAMLX is a publicly available language or one you create yourself. What matters is that YAMLX captures the metadata you need to run your content process efficiently.

Of course, since YAMLX is not XHTML, web browsers will not understand it directly. In order to publish it, you will need to process it in some way so that browsers will know how to render it. There are a couple of ways to do this. One is to create a CSS style sheet to tell the browser how to display YAMLX elements, and another is to create an XSLT script to convert YAMLX to HTML. Some of the stuff we are going to add to YAMLX later will move us in the direction of using XSLT, so that is what we will look at here.

Fortunately, YAMLX (so far) only differs from XHTML in one tag, so we can write an XSLT template to convert a <title> element occurring inside a <p> element to a valid XHTML tag and just copy the rest over. Here’s the template that converts the <title> element to an <i> element:

<xsl:template match="p/title">
  <i>
    <xsl:apply-templates/>
  </i>
</xsl:template>

Don’t let this markup confuse you. It’s very simple. It says to the content rendering engine, if you see a <title> element inside a <p> element, output an <i> element in its place. It doesn’t matter if you get exactly how this works, but as you can tell, it isn’t rocket science.
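If XSLT is unfamiliar, the same idea can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree module. This is only an illustration of what the template does, not a replacement for it, and the function name yamlx_to_xhtml is my own invention.

```python
import xml.etree.ElementTree as ET

yamlx = "<p><title>War and Peace</title> is a <em>very</em> long book.</p>"

def yamlx_to_xhtml(source: str) -> str:
    """Rename every <title> directly inside a <p> to <i>; copy the rest as is."""
    root = ET.fromstring(source)
    # iter("p") visits the root itself as well as any nested <p> elements.
    for paragraph in root.iter("p"):
        for title in paragraph.findall("title"):
            title.tag = "i"
    return ET.tostring(root, encoding="unicode")

xhtml = yamlx_to_xhtml(yamlx)
print(xhtml)
```

The <em> element passes through untouched because XHTML already understands it. Only the one tag that YAMLX added needs converting.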

Semantic markup

Okay, so we have separated text from formatting, and made a distinction between titles and general emphasis so that they can be formatted differently, but the only thing we can really do with this markup is apply formatting to it. Separating content from formatting isn’t exactly a productivity revolution if all you are going to do is slap them back together again. To get any real benefit from the separation, we need to do more. So let’s update YAMLX to capture some more useful metadata:

<p><title isbn="1400079985">War and Peace</title> is a <em>very</em> long book.</p>

Here we have added some more metadata to the <title> tag in the form of an isbn attribute. With this additional metadata, the markup does not merely identify “War and Peace” as a title, it identifies it as the title of a particular work.

What can we do with this additional metadata? An ISBN number is the key to a large amount of data about a published book. If we have the ISBN number, we can look up all sorts of other information. For instance, we can use the ISBN to look up publication details using a web service like ISBNdb.

Most web services return information in XML, which is perfect for us, since our content is in XML. A hypothetical ISBN web service might return an XML document that looked like this (this is not what ISBNdb returns, just a simplified example):

<book>
  <isbn>1400079985</isbn>
  <title>War and Peace</title>
  <author>Leo Tolstoy</author>
  <publisher>Vintage</publisher>
  <publication-year>2008</publication-year>
  <page-count>1296</page-count>
</book>

We could then pull pieces from that XML document to add to our own content, thus allowing us to produce output like this:

War and Peace (Leo Tolstoy, Vintage, 2008, 1296 pages) is a very long book.

Of course, we don’t do this by hand. We use a script to do it. Just to demonstrate that this is not rocket science either, here is a snippet of XSLT code that does this:

<xsl:template match="p/title">
  <!-- capture the isbn number to look up -->
  <xsl:variable name="isbn" select="@isbn"/>

  <!-- call the web service to get book info using the isbn -->
  <xsl:variable name="book-info" select="document(concat('http://example.com/isbn/lookup?', $isbn))"/>

  <!-- output the book title -->
  <i>
    <xsl:apply-templates/>
  </i>

  <!-- output the additional book info -->
  <xsl:text> (</xsl:text>
  <xsl:value-of select="$book-info/book/author"/>
  <xsl:text>, </xsl:text>
  <xsl:value-of select="$book-info/book/publisher"/>
  <xsl:text>, </xsl:text>
  <xsl:value-of select="$book-info/book/publication-year"/>
  <xsl:text>, </xsl:text>
  <xsl:value-of select="$book-info/book/page-count"/>
  <xsl:text> pages)</xsl:text>
</xsl:template>

Again, not rocket science, but this basic technique opens all kinds of doors. The publication information was not in the original XML. It was pulled from another source using metadata captured in the XML. The power of semantic markup to enable the merging of information from different sources is enormous. Here are just a few of the tricks we could pull using information retrieved using the ISBN number:

  • Pull in a picture of the book cover.
  • Create a link to an article on War and Peace on your website.
  • Create a link to an online bookstore, where the reader could buy the book. If you belonged to an affiliate program for an online bookstore, you could pick up some cash every time a reader followed your link and bought a book. Now the ISBN number metadata has turned a casual reference into a potential source of revenue.
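For readers who would rather see the merge step in a general-purpose language, here is a rough Python sketch of the same enrichment. It makes some loud assumptions: the web service is replaced by an in-memory dictionary called FAKE_SERVICE, the response uses the simplified format shown earlier, and the names enrich and DETAILS are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for the hypothetical ISBN web service: ISBN -> response document.
# A real service would be called over the network instead.
FAKE_SERVICE = {
    "1400079985": """<book>
      <isbn>1400079985</isbn>
      <title>War and Peace</title>
      <author>Leo Tolstoy</author>
      <publisher>Vintage</publisher>
      <publication-year>2008</publication-year>
      <page-count>1296</page-count>
    </book>""",
}

# Which details to publish, and in what order. Editing this one list changes
# the output for all content at once.
DETAILS = ["author", "publisher", "publication-year"]

def enrich(paragraph_xml: str) -> str:
    """Return the paragraph as plain text with book details after each title."""
    paragraph = ET.fromstring(paragraph_xml)
    for title in paragraph.findall("title"):
        # The isbn attribute is the key that links our content to the other source.
        book = ET.fromstring(FAKE_SERVICE[title.get("isbn")])
        details = ", ".join(book.findtext(name) for name in DETAILS)
        pages = book.findtext("page-count")
        title.text = f"{title.text} ({details}, {pages} pages)"
    return "".join(paragraph.itertext())

source = '<p><title isbn="1400079985">War and Peace</title> is a <em>very</em> long book.</p>'
enriched = enrich(source)
print(enriched)
```

Nothing in the source paragraph mentions Tolstoy or Vintage. Everything in the parentheses arrives by way of the one attribute the author supplied.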

Making authors more efficient

There are also some major process efficiencies to be realized by capturing this kind of metadata in your XML content. If you can use metadata keys to pull information from external sources, authors don’t have to look up all that information themselves when they write. Authors don’t have to decide which of the book details are going to appear in the final output. That decision is made by editing the XSLT stylesheet, and it can be changed, for all your existing content, simply by changing the stylesheet.

As you can see, inserting one simple piece of metadata into our XML lets us save a lot of time when authoring, and leaves all our options open as to which details will be published. This efficiency and flexibility can turn into substantial cost savings and increased revenues when dealing with a large content set.

Further refinement of the metadata

Though including the ISBN number in YAMLX gives us a lot of options, there are some problems with this markup.

The first problem is the accuracy of the metadata. An ISBN number does not identify a literary work directly. It identifies a particular edition of a book from a particular publisher in a particular binding in a particular year. This distinction can be important. There are many other editions of War and Peace, in many languages. War and Peace is a very long book in all those editions and all those languages. The paragraph is not referring specifically to the Vintage edition of 2008. It is referring to War and Peace as a novel generally. Using the ISBN actually makes the metadata more specific than the text it is marking up. That could be a problem for some of the ways we might want to query this content.

Suppose we wanted to make a list of all the statements about the novel War and Peace in our content. We can do this easily enough using an XPath expression like this:

p[title="War and Peace"]

This says, give me all the <p> elements that contain a <title> element with the content “War and Peace”. Again, don’t worry about the syntax. My only reason for showing you this is to show that none of this is rocket science.

There is a problem with this XPath expression, however. It will return all the paragraphs that contain a title “War and Peace”. But this might include references to the title of the movie of the same name, or the opera, or the non-fiction book called “War and Peace”, and that is not what we want. We just want statements about the novel.
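Here is a small Python sketch of that over-matching, using the limited XPath support in the standard library's xml.etree.ElementTree module. The <article> wrapper and the second and third paragraphs are sample content I made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical YAMLX content: one reference to the novel, one to a film of
# the same name, and one to a different book.
content = ET.fromstring(
    "<article>"
    "<p><title>War and Peace</title> is a <em>very</em> long book.</p>"
    "<p>The 1956 film <title>War and Peace</title> starred Audrey Hepburn.</p>"
    "<p><title>Anna Karenina</title> is somewhat shorter.</p>"
    "</article>"
)

# Same predicate as the XPath expression above: <p> elements with a <title>
# child whose text is "War and Peace".
matches = content.findall("p[title='War and Peace']")
matched_text = ["".join(paragraph.itertext()) for paragraph in matches]

for text in matched_text:
    print(text)
```

The query has no way to tell the novel from the film, because the markup gives it nothing to tell them apart with.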

To narrow it down to the novel, we look for additional metadata that can help narrow the focus. One such piece of metadata is the ISBN, so we could try this:

p[title/@isbn="1400079985"]

This says, give me all the <p> elements that contain a <title> element with an isbn attribute that has a value of “1400079985”. That will certainly eliminate any movies, operas and non-fiction books, but it could also miss some references to the novel.

The problem is that there is more than one ISBN number that can refer to War and Peace the novel, since it exists in many different editions. There is nothing to say that every author who marks up a reference to War and Peace will look up the same ISBN. This is why it is a problem that our metadata is more specific than the content it describes: it can cause us to miss some instances of the content.
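The under-matching can be sketched in Python as well. In this made-up sample, two authors have marked up the same novel with different ISBNs. The value "0000000000" is a placeholder for the ISBN of some other edition, not a real ISBN. Python's ElementTree does not support an attribute step inside a predicate, so the ISBN filter is written as a list comprehension instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical content in which two authors chose different editions.
content = ET.fromstring(
    "<article>"
    '<p><title isbn="1400079985">War and Peace</title> is a very long book.</p>'
    '<p><title isbn="0000000000">War and Peace</title> has a huge cast of characters.</p>'
    "</article>"
)

# Paragraphs whose <title> carries the one ISBN we happened to search for.
by_isbn = [
    paragraph
    for paragraph in content.findall("p")
    if paragraph.find("title[@isbn='1400079985']") is not None
]

# Paragraphs whose <title> text is "War and Peace", whatever the ISBN.
by_title = content.findall("p[title='War and Peace']")

print(len(by_isbn), "of", len(by_title), "references found by ISBN")
```

The second paragraph is about the same novel but is invisible to the ISBN query.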

A second problem is author productivity. The author who wrote the paragraph probably doesn’t know the ISBN of any particular edition of War and Peace off the top of their head. If the markup called for an ISBN, the author would have to stop and look one up. Saving authors from having to stop and look things up can produce some significant productivity benefits. It can also potentially increase your pool of available authors.

So, using the ISBN as metadata is too precise and makes life difficult for authors. We need to come up with some markup that is at the right level of precision and is easier for authors to create. For example, we could do this:

<p><novel author="Leo Tolstoy">War and Peace</novel> is a <em>very</em> long book.</p>

Here, we have replaced the <title> tag with the more specific <novel> tag, and replaced the too-specific isbn attribute with the just-specific-enough author attribute (just in case another author has also written a novel called War and Peace).

This markup is obviously easier for authors to create. It only asks them for the things they already know, so they won’t have to stop while authoring to look anything up.

What about selecting every paragraph that refers to the novel by Tolstoy? We can do this more accurately as well, using an XPath like this:

p[novel[@author="Leo Tolstoy"] = "War and Peace"]

This says, give me all the <p> elements that have a <novel> element that has an author attribute with the value “Leo Tolstoy” and whose content is “War and Peace”. This is what we wanted, so it looks like we have got our metadata correct now.
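As a sanity check, the same selection can be sketched in Python. The sample content, including the fictional author "A. N. Other", is invented for the illustration, and the helper name mentions_novel is my own. The nested predicate is beyond what the standard library's ElementTree supports, so the test is spelled out in ordinary code.

```python
import xml.etree.ElementTree as ET

# Hypothetical content: the novel we want, a same-titled novel by another
# (fictional) author, and a different novel by Tolstoy.
content = ET.fromstring(
    "<article>"
    '<p><novel author="Leo Tolstoy">War and Peace</novel> is a <em>very</em> long book.</p>'
    '<p><novel author="A. N. Other">War and Peace</novel> is a different book.</p>'
    '<p><novel author="Leo Tolstoy">Anna Karenina</novel> is shorter.</p>'
    "</article>"
)

def mentions_novel(paragraph, title, author):
    """True if the paragraph has a <novel> child with this author and title."""
    return any(
        novel.get("author") == author and "".join(novel.itertext()) == title
        for novel in paragraph.findall("novel")
    )

matches = [
    paragraph
    for paragraph in content.findall("p")
    if mentions_novel(paragraph, "War and Peace", "Leo Tolstoy")
]

for paragraph in matches:
    print("".join(paragraph.itertext()))
```

Both conditions have to hold for the same <novel> element, which is what rules out the other two paragraphs.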

Or have we? With the ISBN metadata, we were able to pull in publication information by using the ISBN number to query the ISBN database. Without an ISBN number, how can we get that data? We still can get that data, but we have to use a different query to extract it. Our original code did the lookup like this:

<xsl:variable name="book-info" select="document(concat('http://example.com/isbn/lookup?', $isbn))"/>

Now we need to change it to do the lookup based on the metadata we have: category (novel), title and author:

<xsl:template match="p/novel">
  <!-- capture the metadata to look up -->
  <xsl:variable name="title" select="."/>
  <xsl:variable name="author" select="@author"/>

  <!-- call the web service to get book info -->
  <xsl:variable name="book-info" select="document(concat('http://example.com/isbn/lookup?category=novel&amp;title=', $title, '&amp;author=', $author))"/>

  <!-- output the title and the book info as in the earlier template -->
</xsl:template>

The only thing different about the results we will get from this query is that it may return more than one book for that title and author (actually, it certainly will, since there are many editions of War and Peace in print). So the code that adds the book info to the content will need to pick one of the alternatives based on some relevant piece of publication data (such as the most recent publication date). A side benefit of this is that the publication information we show will be consistent wherever we refer to War and Peace in our content.
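Picking one edition from several is simple enough to sketch in Python. The response format here is an assumption: I have wrapped several <book> elements in a <books> element, which the simplified service example earlier did not show. Only the Vintage entry comes from that earlier example. The "Example Press" edition is made up.

```python
import xml.etree.ElementTree as ET

# Invented response from the hypothetical lookup service, listing two editions.
response = ET.fromstring(
    "<books>"
    "<book><title>War and Peace</title><author>Leo Tolstoy</author>"
    "<publisher>Example Press</publisher><publication-year>1999</publication-year>"
    "<page-count>1400</page-count></book>"
    "<book><title>War and Peace</title><author>Leo Tolstoy</author>"
    "<publisher>Vintage</publisher><publication-year>2008</publication-year>"
    "<page-count>1296</page-count></book>"
    "</books>"
)

def most_recent(books):
    """Pick the edition with the latest publication year."""
    return max(
        books.findall("book"),
        key=lambda book: int(book.findtext("publication-year")),
    )

chosen = most_recent(response)
print(chosen.findtext("publisher"), chosen.findtext("publication-year"))
```

Because the rule lives in one function rather than in each author's head, every reference to the novel ends up showing the same edition.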

So should we always mark up “War and Peace” using the <novel> tag and an author attribute? Not necessarily. What we are really seeing here is that when we mention “War and Peace” in our content, we could actually be referring to different things. We could, as in the case we are looking at, be talking about the novel generally. But in another circumstance, we might be referring to a specific published edition of the novel. Even though the string is “War and Peace” in both cases, that text is referring to different things. One of the most important roles of XML-based metadata is to make these kinds of distinctions clear so that we can process each case appropriately.

What we actually need in YAMLX are two different tags so that we can apply the right metadata to the words “War and Peace” depending on what they mean in each case. For references to the novel, we could write:

<p><novel author="Leo Tolstoy">War and Peace</novel> is a <em>very</em> long book.</p>

For references to a particular edition of the novel, we could extend YAMLX to include an edition element which takes an isbn attribute:

<p><edition isbn="1400079985">War and Peace</edition> is <em>still</em> on back order.</p>

Now our processing application can recognize that the words mean something different in each case, and can process them accordingly.
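A processing application might tell the two cases apart along these lines. This is a sketch only: the function name describe and the wording of its output are invented, and a real application would do something more useful than print a description.

```python
import xml.etree.ElementTree as ET

def describe(element):
    """Say what a marked-up reference points to, based on its tag."""
    name = "".join(element.itertext())
    if element.tag == "novel":
        return f"the novel {name} by {element.get('author')}"
    if element.tag == "edition":
        return f"the edition of {name} with ISBN {element.get('isbn')}"
    return name

paragraphs = [
    '<p><novel author="Leo Tolstoy">War and Peace</novel> is a <em>very</em> long book.</p>',
    '<p><edition isbn="1400079985">War and Peace</edition> is <em>still</em> on back order.</p>',
]

# In both samples the reference is the first child of the <p> element.
descriptions = [describe(ET.fromstring(paragraph)[0]) for paragraph in paragraphs]

for description in descriptions:
    print(description)
```

The text inside the two elements is identical. It is the tag and its attribute that let the code treat them as different things.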

Own the data format, own the functionality

So what is the right tag to pick to mark up the string “War and Peace”: <i>, <em>, <title>, <novel>, or <edition>? There is no single right answer. It all depends on what you want to do with your data. Content is data. Content becomes data as soon as you enter it into a computer system. Whether it is a Microsoft Word document, an Adobe FrameMaker file, or XML, content is data. The difference is that with XML you own the structure of your own content data.

Owning the data structure of your content is important, because owning the data structure puts you in control of the content creation and publishing functionality. Using a binary format like Word or FrameMaker means you get the functionality of Word or FrameMaker. With some scripting, you may be able to add some additional functionality, but only insofar as the Word or FrameMaker data model supports it. With XML, you decide on the file format, and you decide what functionality you will implement to process your data. You can define the data format that best meets your particular business needs.

None of this means much, however, unless you do something with your data that you couldn’t have done with Word or FrameMaker — something like automatically pulling in additional content from a database in order to enhance your content and save your authors work.

The real take-away here is that XML makes your content into a database. Because XML allows you to apply metadata to your content at any level of granularity, it allows you to query your content at any level of granularity, and that lets you process your content at any level of granularity. At the same time, it allows you to embed metadata in your content that you can use to create a database query that pulls in other data (as we saw with the ISBN example).

This is what every XML tagging language does — it turns content into a database. The differences between different XML languages lie in how detailed and specific the structure of that database is, and how well it aligns with a particular set of business needs. This has a really important consequence whether you are shopping for an authoring system or building your own: The data format is more important than the application functionality.

When you buy a conventional off-the-shelf tool like Word or FrameMaker, your buying decision is based largely on the application’s functionality. You are not looking at what it might do in the future, but at what it actually does today. How the file format that the application uses supports its functionality is not something you generally worry about. You are buying functionality, not data structure.

But with XML, the primary consideration should not be what functionality you get out of the box, but what functionality the data structure of the XML supports. Even if you use a tool or a tool kit with existing functionality, you are not confined to that functionality, and it should not be the primary thing you base your decision on. You can always add any functionality you need, if your data structure supports it. But you can’t build or buy functionality that the data does not support.

This is what you need to know, therefore, when you consider an XML solution:

  • XML turns your content into a database. You can query that database to organize and structure your content and to combine content from different sources.
  • The design of that database determines what you can do with the data. You need to pay careful attention to your markup design to make sure it allows you to implement the functionality you need to streamline your content development process.
  • You can improve author productivity, and bring more authors into the fold, by designing the markup to ask for information they already know, and then using that information to pull in related data.
  • The key decision you have to make is not what functionality a particular tool supports out of the box, but what functionality the format of your content will support.

About the author

Mark Baker, President of Analecta Communications Inc., has over 20 years of experience in the technical communications industry and over 15 years designing, implementing, and using structured authoring systems. He blogs at everypageispageone.com.
  • http://twitter.com/georgebina George Bina

    Really nice article Mark, congratulations!
    One small typo that you may want to correct:
    War and Peace is a still on back order.
    should probably be:
    War and Peace is still on back order.

    Regards,
    George

  • Mike McNamara

     Some good background information here.

  • http://twitter.com/juliov27612 Julio Vazquez

    Very nice article, Mark. I like the description of how you can leverage the power of the content if there is enough semantic richness to allow you to query the data correctly. Great examples and it definitely gives anybody reading a lot to chew on.

  • Joakim_N

    Great article! Thanks!
