First, in 15 seconds, give me a quick introduction to XML.
eXtensible Markup Language (XML) is a technology that allows data to be 1) self-describing, 2) extremely portable, and 3) easily extended with new information. XML is really just a way of formatting information in text data files so you can read it. XML looks like this:
<occupation>Smart Alec… errrrr… I mean consultant</occupation>
In this example, we see three “tags” (firstname, lastname, and occupation) that surround the actual data in the file. The closing tag is prefixed with a slash. Tags can be nested inside of each other to support complex data structures.
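As a minimal sketch of nesting, the Python snippet below parses a small document with the standard library's xml.etree.ElementTree module. The wrapper element "person" and the first and last name values are placeholders invented for illustration; only the three tag names come from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <person> and the name values are placeholders.
PERSON_XML = """
<person>
  <firstname>Pat</firstname>
  <lastname>Example</lastname>
  <occupation>consultant</occupation>
</person>
"""

person = ET.fromstring(PERSON_XML)

# Each child element carries its label (the tag) right next to its data (the text).
fields = {child.tag: child.text for child in person}
print(fields)
```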
So how is XML used?
Even though it was not originally intended for this purpose, XML has become the language of choice for transferring data across disparate computer systems. It is the underlying layer for a number of more sophisticated technologies that depend on portability of files, such as XHTML, WSDL, SOAP, and BPEL. XML is also used for just about every application’s configuration file nowadays.
Since we began by stating that XML is self-describing, portable, and extensible, let’s take a very quick look at what each of those really means in practice.
What do you really mean by "self-describing"?
XML is self-describing since information resides right next to the descriptive text that helps people (and software) easily understand the meaning of the file’s contents. Note that all of this description comes at the expense of file size and parsing efficiency. So, for example, it is not easy to understand the purpose of a file with only the text “100” in it unless you happen to know what software accesses that file and how that software uses the file’s contents. Obviously, a file that contains the text “My bank account balance is: 100” is much easier to read since it has explanatory text right next to the data, and this is the idea behind XML. We can translate that same account balance information into XML, like this:
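The account balance could be expressed along the lines of the sketch below, which wraps the value in the "bankAccountBalance" element used later in this section and reads it back with Python's standard xml.etree.ElementTree module. Treat it as an illustration, not as the original listing.

```python
import xml.etree.ElementTree as ET

BALANCE_XML = "<bankAccountBalance>100</bankAccountBalance>"

element = ET.fromstring(BALANCE_XML)

# The tag name says what the value means; the text is the value itself.
print(f"{element.tag} = {element.text}")
```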
What do you really mean by "portable"?
XML is extremely portable because all information in an XML file is text (rather than binary), and therefore it is easily passed from computer to computer regardless of who made the hardware or the operating system. Even numbers in an XML document are text. This portability comes at the price of having to constantly translate between text and whatever the native form is, prior to communication. But regardless of the performance costs, XML is widely embraced.
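The translation cost mentioned above can be sketched in a few lines of Python. The deposit amount is an arbitrary value chosen for illustration, and Decimal is just one reasonable choice of native numeric type.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

element = ET.fromstring("<bankAccountBalance>100</bankAccountBalance>")

raw = element.text               # always a str, even for numeric data
balance = Decimal(raw)           # explicit text-to-number conversion
balance += Decimal("25.50")      # arithmetic happens on the native type

element.text = str(balance)      # convert back to text before sending it on
serialized = ET.tostring(element, encoding="unicode")
print(serialized)
```

Every reader and writer of the file repeats this round trip, which is where the performance cost comes from.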
What do you really mean by "Extensible"?
You can easily add new information to XML files without breaking existing applications that read the file. For example, we could change our sample file to add overdraft information without breaking any existing applications:
Here we have added the additional information that there have been two overdrafts this year without breaking any programs that use this file.
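The sketch below illustrates the idea in Python. The wrapper element "bankAccount" and the new element "overdraftsThisYear" are hypothetical names, and read_balance stands in for an existing application that looks up only the element it knows about.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: <bankAccount> and <overdraftsThisYear> are made-up names.
ORIGINAL_XML = """
<bankAccount>
  <bankAccountBalance>100</bankAccountBalance>
</bankAccount>
"""

EXTENDED_XML = """
<bankAccount>
  <bankAccountBalance>100</bankAccountBalance>
  <overdraftsThisYear>2</overdraftsThisYear>
</bankAccount>
"""


def read_balance(xml_text):
    """An 'existing application' that only knows about the balance element."""
    account = ET.fromstring(xml_text)
    return account.findtext("bankAccountBalance")


# The old reader simply ignores the element it does not know about.
print(read_balance(ORIGINAL_XML))
print(read_balance(EXTENDED_XML))
```

This holds for readers that look elements up by name and ignore the rest; a reader that validates against a strict schema may still reject the new element.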
So is XML easy to learn?
There are, of course, lots of books on the syntax of XML, but it’s usually not too hard to figure out the essence of what is contained in an XML document. After looking at a few XML files, you can readily see that tags can be nested inside of each other, tags are case sensitive, and comments are written in this manner: “<!-- Here’s a comment -->”. To save space, an element with nothing between its tags (like “<tag></tag>”) can be written in a slightly shorter equivalent form (as in “<tag/>”); any values such an element carries then live in attributes inside the tag.
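These syntax rules are easy to check with Python's standard parser. In the sketch below, the element names "tag", "account", and "balance" are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# An empty element can be written long-hand or with the self-closing form.
long_form = ET.fromstring("<tag></tag>")
short_form = ET.fromstring("<tag/>")
same_when_empty = ET.tostring(long_form) == ET.tostring(short_form)

# Comments are skipped by the default parser.
doc = ET.fromstring("<account><!-- Here's a comment --><balance>100</balance></account>")
child_tags = [child.tag for child in doc]

# Tag names are case sensitive, so a mismatched closing tag is an error.
try:
    ET.fromstring("<Balance>100</balance>")
    mismatch_rejected = False
except ET.ParseError:
    mismatch_rejected = True

print(same_when_empty, child_tags, mismatch_rejected)
```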
What's the difference between XML and HTML?
XML is often confused with HTML. It looks just like it. But even though XML and HTML documents look very similar, both use tags, and both have the same parent technology, HTML focuses on formatting information for display while XML emphasizes formatting data for easy understanding. XML uses free form opening tags (like “<bankAccountBalance>”) and closing tags (like “</bankAccountBalance>”) to surround data. With HTML, the tags must be from a pre-defined list of tags, including:
- <b>words to appear in bold </b>
- <i>words to appear in italics</i>
- <p>text to appear in a paragraph</p>
XML element names can be anything you want them to be since it is your responsibility to write the software that interprets those tags. We could have used “<accountBalance>” rather than “<bankAccountBalance>” as long as we correspondingly adjusted the software that looks for and uses the account balance. By contrast, when constructing a web page, you can’t just decide to use the tag “<boldText>” instead of the more cryptic but proper “<b>” tag since browsers expect and interpret only HTML tags that are defined in the HTML standard. “<boldText>” is not in that standard and the browser won’t know what to do if it encounters a tag like that.
So what happens if I give a browser an XML file?
It’s worth noting that you can actually drag and drop XML files into most browsers. Even though XML is not HTML, it is so close to HTML that browsers can easily take a crude guess at how to display an XML file, and most do. With technologies like Cascading Style Sheets (CSS) and Extensible Stylesheet Language Transformations (XSLT), you don’t have to accept a browser’s crude guess. With XSLT, for example, you can ensure that browsers display your XML file using the fonts, colors, and fancy formatting options that you prefer. XSLT is used to associate HTML formatting instructions with data prior to display, but without interfering with other software that uses the XML data file. So an XML file might contain all the items in a company’s product catalog, while an XSLT file might extract some items from that XML file, sort them, and make that catalog of products appear nicely on the company’s web page.
Does XML provide any protection against typographical errors?
While it is true that XML allows you to define your own element names, it is frequently desirable to enforce proper content. You obviously want to avoid misspellings (like “<acountbalance>”). XML provides two optional mechanisms to do this – Document Type Definitions, or DTDs, and XML Schema Definitions, or XSD (also known as XML Schema language). DTDs are on the way out, so XSD is the important technology to focus on.
There aren't many topics drier than XML Schemas. You've got 10 seconds to bring me up to speed.
XML schemas (XSD files) are used to enforce correctness in XML documents. The XML parser can detect errors in data files when the files do not conform to a given XSD schema or schemas. XSD defines what elements and attributes are valid in an XML document, and which ones must appear. It defines what the child elements should be, if any, and what order they should appear in, and also how many there should be. It defines what elements can be empty and what default values exist. And very importantly, XSD defines what data types (like floats, integers, strings, etc.) each element and attribute should be. The schema often serves to further document expected data content (beyond the description in the XML file alone) and can serve as a communications contract between multiple companies. XSD is compatible with web services which return XML files. Schemas allow you to define custom types and support object inheritance, and they have fancy features not found in DTDs. XML schemas are covered in detail in our XML Schema page (simultaneously our most excruciatingly boring and funniest material).
Tell me just the tiniest bit about the syntax of XML
The remainder of this section covers a little more about the syntax of XML, in case you’re not just looking for a high level summary. If you were just looking for a high level summary, this is a good place to allow your eyes to start mindlessly drifting across paragraphs as though you are really taking them in.
You may notice that tags sometimes have additional information embedded in them. For example, you may see an open tag that looks like this: “<bankAccountBalance dataValidatedOn="1/1/2008">” rather than this: “<bankAccountBalance>”. Note that the attribute value must be enclosed in quotes. In this example, we’ve used the tag attribute “dataValidatedOn” to specify even more information about my depressing account balance: that the data was checked for validity on January 1st. Tag attributes, if properly designed, are supposed to be used to further describe the enclosed data and not to contain actual data. In this example, the date the data was validated is an attribute rather than another field in the data set because it is about the data and not about the bank account itself. But it’s not unlikely that the data validation date might be designed as a field rather than a tag attribute, despite any best practices. Tag attributes are information about information (called meta-information) and not information itself. Anyone got an aspirin?
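A short Python sketch shows how a program reads the data and the attribute separately, and why the quotes around the attribute value matter. It uses only the standard xml.etree.ElementTree module and the element and attribute names from the paragraph above.

```python
import xml.etree.ElementTree as ET

# Attribute values must be quoted in well-formed XML.
BALANCE_XML = '<bankAccountBalance dataValidatedOn="1/1/2008">100</bankAccountBalance>'

element = ET.fromstring(BALANCE_XML)

balance = element.text                          # the data itself
validated_on = element.get("dataValidatedOn")   # information about the data
print(balance, validated_on)

# Without the quotes the parser rejects the document.
try:
    ET.fromstring("<bankAccountBalance dataValidatedOn=1/1/2008>100</bankAccountBalance>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
```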
A big problem with XML is tag collisions. To illustrate this problem, consider the following two fragments of XML:
XML fragment #1:
<FORM action="http://www.PracticeSafeTechs.com/prog/example" method="post">
<INPUT name="sex" value="female" />
</FORM>
That fragment is really part of a web page, but it’s valid XML too. Now consider this second XML fragment.
XML fragment #2:
<FORM>
<observation>Not so bored with this example now, are you?</observation>
</FORM>
If a single program attempts to read these two XML fragments, an “element name conflict” error will occur because both fragments define a “FORM” element with opening and closing “<FORM>” tags. There is confusion about what contents to expect in the FORM element. To solve this problem of two developers using the same element name, XML introduced 1) prefixes and 2) namespaces. If you’re going to read XML, you really need to master these two concepts. Unfortunately prefixes and namespaces make formerly easy-to-read files into impossible-to-read files, which is why it’s important to understand them.
What are all these letters doing in my XML files?
Prefixes allow XML developers to establish an area where their chosen element names remain unique by simply prepending some letters to each opening tag. “<FORM>” becomes “<gh:FORM>”, “<observation>” becomes “<gh:observation>”, and so on, as each element is prefixed with a few letters (in this case, we use the author’s initials, but the choice of letters is up to you). Prefixes help to ensure uniqueness. They also serve to group together like XML tags into a common vocabulary. For example, one set of tags may pertain to text, so the <tx:bold>, <tx:italics>, <tx:underline> tags might each be conceptually grouped together by the prefix “tx”. Another set of tags might pertain to employees, so the tags <em:name>, <em:address>, <em:department> might be prefixed with “em” to conceptually group the attributes that describe an employee. As you read through XML documents and see words prefixed like this, just remember that these prefixes are telling you something about the source and intent of each prefixed word.
Prefixes are especially important to avoid tag collisions, but prefixes by themselves do not rule collisions out. Two organizations are very likely to select the same prefixes. This is because prefixes are often two or three letters, and many companies use prefixes to group the same types of objects like orders, products, inventory, etc. So even with prefixes, collisions are still likely.
What are namespaces?
Since prefixes help, but do not solve the name uniqueness problem, XML uses namespaces to make tag name collisions even less likely. While prefixes can be used independently from namespaces, XML provides namespaces in order to make the prefixes themselves even more unique and to further group common tags together into a single vocabulary. If you’re a manager, you may not need to master namespaces, but it still helps to know a little about them so you can sympathize with your developers when you hear them say “sheeesh this XML file is hard as hell to read”. If you’re a developer or if you’re likely to get directly involved with XML files, there’s nothing more important than understanding namespaces. You’ll be lost if you don’t.
An example helps to understand namespaces. In the first line of the XML below, an XML namespace (xmlns) called “HereIsAVeryUniqueNameSpace” is associated with the letters “abc”. Everywhere the XML parser sees the prefix “abc” it will automatically substitute the more unique characters “HereIsAVeryUniqueNameSpace”.
So, by using the “HereIsAVeryUniqueNameSpace” namespace, this XML:
automatically gets converted by the XML parser into this XML:
Obviously the converted XML is harder to read with so much lengthy repetitive text. This example demonstrates why namespaces are needed. Namespaces help keep prefixes short but still help to ensure their uniqueness. While prefixes do not have to be only a few letters, they are by convention to preserve readability.
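The substitution can be observed with Python's standard xml.etree.ElementTree module, which reports every element under its expanded name. The document below is a hypothetical reconstruction that binds the prefix “abc” to “HereIsAVeryUniqueNameSpace”; the observation text is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the prefix "abc" is bound to the namespace name
# "HereIsAVeryUniqueNameSpace" on the root element.
PREFIXED_XML = """
<abc:FORM xmlns:abc="HereIsAVeryUniqueNameSpace">
  <abc:observation>placeholder text</abc:observation>
</abc:FORM>
"""

root = ET.fromstring(PREFIXED_XML)

# ElementTree reports expanded names in the form {namespace}localname.
expanded_names = [element.tag for element in root.iter()]
print(expanded_names)
```

The parser does not rewrite the text file itself; it resolves the short prefix to the long namespace name internally, which is how the file on disk stays short.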
Even with namespaces, collisions are still possible. After all, two companies might choose the same namespace. So XML developers further guarantee the uniqueness of their XML namespaces by using their company’s web address within the namespace. So instead of using the string “HereIsAVeryUniqueNameSpace”, you are more likely to see “http://www.PracticeSafeTechs.com/HereIsAVeryUniqueNameSpace” in the XML. By prefixing the corporate URL to your namespace, you now only need to coordinate with people in your own company for unique names rather than contending with the entire world. Developers may append their department to the company’s URL to make a namespace even more unique.
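To see why URL-based namespaces help, consider two companies that both picked the prefix “gh”. The sketch below is an illustration in Python; the first namespace is the URL used in this section and the second, on example.com, is a placeholder rather than a real vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments from two companies that both picked the prefix "gh".
# The second URL is a placeholder, not a real vocabulary.
COMPANY_A_XML = (
    '<gh:FORM xmlns:gh="http://www.PracticeSafeTechs.com/HereIsAVeryUniqueNameSpace"/>'
)
COMPANY_B_XML = '<gh:FORM xmlns:gh="http://www.example.com/forms"/>'

form_a = ET.fromstring(COMPANY_A_XML)
form_b = ET.fromstring(COMPANY_B_XML)

# Same prefix and same local name, but the expanded names differ.
print(form_a.tag)
print(form_b.tag)
names_collide = form_a.tag == form_b.tag
```

Software that compares expanded names rather than prefixes can therefore tell the two FORM elements apart.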
Sidebar - Does the URL that’s used to make a namespace unique really resolve to an actual web site?
If you type http://www.PracticeSafeTechs.com/HereIsAVeryUniqueNameSpace or any XML namespace URL into a browser, what will be displayed? Probably nothing, but if you’re asking this question, you’ve stumbled into a hotly contested issue because this is a very confusing part of XML. Developers constantly attempt to hit these namespace URLs from browsers because they naturally, albeit incorrectly, expect a web page whenever they see a URL. Sometimes companies really do set up a web page to correspond with their XML namespace URL even though that web site is never accessed when parsing the XML. To make matters even more confusing, if a company actually does provide a web page for the namespace URL, it is frequently something technical looking, like an “RDDL” document or a suggested schema, which looks an awful lot like it might be needed at runtime to parse the XML file that references the page. Sure - you may elect to download a local copy of one of these suggested schemas so your XML parser can enforce integrity at run time, but the namespace web page itself will never be referenced by the XML parser automatically. If it were, your computer would always have to be connected to the internet to parse XML files, which would be unacceptable for lots of reasons. Namespace URLs are frequently a major source of confusion when people attempt to read XML for the first time. The namespace URL is just a unique string - that’s it. It’s not a web site.
Why are namespaces so important?
Namespaces become particularly important when dealing with XML Schema Definition Language (XSDL). XSDL allows developers to define consistent data structures within their organization but still extend them. XSDL is the topic of the next chapter and it’s important to understand namespaces before trying to tackle XSDL. In the XML Schema chapter, we’ll see how namespaces can be used to allow multiple developers to augment pre-existing schemas.
What is the default namespace?
It’s possible to define a “default namespace”. The default namespace applies to any element that is written without a prefix. Default namespaces prevent having to enter repetitive prefixes. Many XML developers don’t use them since they can detract from understandability.
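As a rough illustration, the Python sketch below parses the same hypothetical document twice, once with an explicit prefix and once with a default namespace declared through a plain “xmlns” attribute. The element names are reused from earlier in this section.

```python
import xml.etree.ElementTree as ET

NAMESPACE = "http://www.PracticeSafeTechs.com/HereIsAVeryUniqueNameSpace"

# Hypothetical documents: one uses a prefix, the other a default namespace.
PREFIXED_XML = f'<abc:FORM xmlns:abc="{NAMESPACE}"><abc:observation/></abc:FORM>'
DEFAULT_XML = f'<FORM xmlns="{NAMESPACE}"><observation/></FORM>'

prefixed_names = [e.tag for e in ET.fromstring(PREFIXED_XML).iter()]
default_names = [e.tag for e in ET.fromstring(DEFAULT_XML).iter()]

# Both spellings resolve to the same expanded element names.
print(prefixed_names == default_names)
```

The second form is shorter to type, but nothing on an individual tag tells the reader which namespace it belongs to, which is the readability cost mentioned above.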
How do I parse an XML file?
You can use either a document-based (DOM) or event-based (SAX) parser to interpret the contents of an XML document. A document-based parser pulls the entire XML file into memory at once and you have random access to any part of the tree of information. An event-based parser is used for large files where you don’t want to pull the entire contents into memory but instead want to plow through the whole file and process everything in it sequentially. With event-based parsers, you read some, process it, discard what you processed, then ask for the next chunk of information when you’re ready. So SAX parsers are good for large files. In addition to DOM and SAX parsers, there are also XSLT processors, which are used to prepare XML files to be displayed in HTML browsers.
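The two styles can be compared side by side with Python's standard library, which ships both a DOM parser (xml.dom.minidom) and a SAX parser (xml.sax). The catalog document and the PriceTotalHandler class below are hypothetical, made up for this sketch.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical catalog document used by both parsers.
CATALOG_XML = (
    "<catalog>"
    "<product><name>Widget</name><price>10</price></product>"
    "<product><name>Gadget</name><price>25</price></product>"
    "</catalog>"
)

# DOM: the whole tree is in memory, so any node can be reached at any time.
document = minidom.parseString(CATALOG_XML)
names = document.getElementsByTagName("name")
last_product_name = names[-1].firstChild.data


class PriceTotalHandler(xml.sax.ContentHandler):
    """SAX: react to events as they stream past and keep only a running total."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.buffer = []
        self.total = 0

    def startElement(self, name, attrs):
        if name == "price":
            self.in_price = True
            self.buffer = []

    def characters(self, content):
        if self.in_price:
            self.buffer.append(content)  # text may arrive in several chunks

    def endElement(self, name):
        if name == "price":
            self.total += int("".join(self.buffer))
            self.in_price = False


handler = PriceTotalHandler()
xml.sax.parseString(CATALOG_XML.encode("utf-8"), handler)

print(last_product_name, handler.total)
```

The DOM half jumps straight to the last product, while the SAX half never holds more than one price in memory, which is the trade-off described above.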
What are the common and serious mistakes that companies repeatedly make with XML?
So how well does XML really work? Everyone seems to want to use it for everything. Should they? When is it appropriate NOT to use XML? What are the practical alternatives? What are the most common mistakes that companies make with XML? A little extra knowledge now can save you a lot of grief later. Don't just settle for the book excerpt on this web page - get the answers to these questions, but get them in a form that you can stand to read – brief, humorous, and insightful. The book is a business expense, and you'll gain key insights on a multitude of other technologies as well. Click on the “Buy Now” button below before your project explodes, spreads debris for miles and miles, and kills innocent bystanders!