Developer Drain Brain

March 9, 2010

Normalised XML

Filed under: Development. Posted by rcomian @ 9:09 am

Compound data in Xml is a problem that’s been highlighted many times before by others, but it’s something that still occasionally trips developers up.

Every now and then you’ll come across an Xml document that stores multiple data items in a single location. A good example is the settings* or vsprops** file for Visual Studio 9, which stores C++ search paths inside a single element. Go look, it’s fun***. You usually find these when you’re trying to take advantage of one of the main strengths of Xml: tool support like XPath or Xslt, the kind of thing that means you don’t have to write or import specialised parsing libraries to manipulate and use the data.

Coming across one of these compound formats when using a generic tool means that work inevitably grinds to a halt. As if XPath and Xslt aren’t hard enough, now you’re thinking of using regex, or getting the values into variables and writing some procedural code, which, incidentally, doesn’t fit at all inside your Xslt, so now you have to completely change your approach to the problem. A simple tool that was going to be scripted from a command line has become a real development effort.

And it’s not just scripting languages; even making use of these values in normal code can be odd. There you are with your document object, just extracting and using the values in a single operation as and when you need them, then all of a sudden you’re calling ‘split’ and iterating over the results. Of course you’re also throwing away the empty result at the end of the split, because whatever formats the data for that value always puts a delimiter on the end of every string. But you’re not throwing away all the empty results, are you? An empty value in the middle of the sequence is fine; it’s just the last one that gets ignored.
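As a rough illustration of that split logic, here is a minimal Python sketch. The function name `split_compound` and the sample value are made up for this example, and it assumes a single-character delimiter that never appears inside an item.

```python
def split_compound(value, delimiter=";"):
    """Split a delimited string, ignoring only one trailing empty item."""
    if value == "":
        return []
    items = value.split(delimiter)
    # The writer puts a delimiter after every item, so the final piece is
    # an empty string that is not real data. Drop that one piece only.
    if items[-1] == "":
        items.pop()
    return items


# The empty item in the middle survives; the trailing one does not.
print(split_compound("C:\\inc;;C:\\lib;"))
```

Even this small helper has to make a decision about what an empty string means, which is exactly the sort of rule that ends up undocumented.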

And the Xml gets very hard to read very quickly when we’re using compound data. Do your individual values need quoting? Ok, now your document contains a load of unnecessary &quot; values as well. And exactly when is that ; the end of an entity, and when is it a delimiter?
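To see how quickly the quoting piles up, here is a small sketch using Python’s standard `xml.etree.ElementTree`. The element name `Tool` and attribute name `SearchPaths` are hypothetical, chosen only for the example; the value mimics quoted, semicolon-delimited paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names. Each path is wrapped in double
# quotes and followed by a semicolon, as in the compound formats above.
tool = ET.Element("Tool", {"SearchPaths": '"C:\\inc";"C:\\lib";'})

# Serialising an attribute value forces every double quote to be escaped,
# so each one becomes an entity that itself ends in a semicolon.
serialised = ET.tostring(tool, encoding="unicode")
print(serialised)
```

In the serialised form, the semicolons that close the entities sit right next to the semicolons that separate the paths, which is what makes the value so hard to read by eye.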

There are many really subtle complexities like this that can come up when you’re storing data in your own format. The more you’re stuffing into that value, the more likely you are to come across the awkward edge cases.

If you’re anything like me, when you’re writing this complex split logic, or trying to visually parse out the quotes from the semicolons, you’re thinking: but formatting data like this is what Xml is for, so why am I having to write this code?

There are many possible causes of this stuffing. Xml does look very inefficient: a single-character delimiter takes a lot less space than the bracketing of element names. But if you’re using Xml at all, is that really a concern? And if it is, would zipping the results give you similar gains? Xml compresses extremely well, and the various zips are all fairly ubiquitous tools that won’t cause trouble in many situations.
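One way to check the size question for your own data is to measure it. The sketch below uses Python’s standard `zlib` on synthetic, made-up paths, so the numbers it prints are only indicative; real documents will compress differently.

```python
import zlib

# Synthetic data for illustration only: 200 similar-looking search paths.
paths = [f"C:\\projects\\lib{i}\\include" for i in range(200)]

# The same data as one delimited value, and as one element per path.
compound = "<Paths>" + ";".join(paths) + "</Paths>"
normalised = "<Paths>" + "".join(f"<Path>{p}</Path>" for p in paths) + "</Paths>"


def sizes(document):
    """Return (raw byte count, zlib-compressed byte count) for a document."""
    raw = document.encode("utf-8")
    return len(raw), len(zlib.compress(raw))


for label, doc in (("compound", compound), ("normalised", normalised)):
    raw_size, packed_size = sizes(doc)
    print(f"{label}: {raw_size} bytes raw, {packed_size} bytes compressed")
```

Repeated element names are exactly the kind of redundancy a general-purpose compressor removes, so the gap between the two layouts tends to narrow a lot once both are compressed.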

The other main reason is, of course, the one-size-fits-all schema that doesn’t extend. But if that’s your schema, shouldn’t you be designing something that’s flexible enough to at least store multiple values? Name-value pairs are great, but an Xml document that’s a glorified name-value pair system is a bit weak.

Most people are pretty good at designing reasonably normalised databases, so why are we still stuffing so many values into a single column when it comes to Xml?
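For comparison, here is a sketch of what a normalised layout could look like, with one element per value. The element and attribute names (`SearchPaths`, `Platform`, `Path`, `name`) are invented for this example and are not taken from any Visual Studio file. It uses the limited XPath support in Python’s standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical normalised layout: one element per platform, one per path.
DOCUMENT = """
<SearchPaths>
  <Platform name="Win32">
    <Path>C:\\sdk\\include</Path>
    <Path>C:\\boost</Path>
  </Platform>
  <Platform name="x64">
    <Path>C:\\sdk64\\include</Path>
  </Platform>
</SearchPaths>
"""

root = ET.fromstring(DOCUMENT)

# A single path expression selects the first Win32 search path:
# no splitting, no trailing-delimiter rules, no quote stripping.
first_win32 = root.find("./Platform[@name='Win32']/Path[1]")
print(first_win32.text)
```

With the structure expressed in elements, the generic tools do the work, and the question of what a delimiter means never comes up.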

* In the case of the settings file, it’s not just an array, but an array of arrays. The outer array consists of the pattern: Platform1|Paths1|Platform2|Paths2. The paths themselves are stored as semicolon-delimited paths: path1;path2. Good luck pulling the first search path for Win32 out of that with XPath, or even regex.
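For what it’s worth, this is roughly the procedural code that pattern pushes you towards. It is a minimal Python sketch: the function name `first_path` and the sample value are made up, and it assumes that neither | nor ; ever appears inside a platform name or a path.

```python
def first_path(value, platform):
    """Return the first path for a platform from 'Plat1|p1;p2|Plat2|p3' data."""
    fields = value.split("|")
    # Even positions hold platform names, odd positions hold their path lists.
    for name, paths in zip(fields[0::2], fields[1::2]):
        if name == platform:
            return paths.split(";")[0]
    return None


# Made-up sample following the Platform1|Paths1|Platform2|Paths2 pattern.
sample = "Win32|C:\\sdk\\include;C:\\boost|x64|C:\\sdk64\\include"
print(first_path(sample, "Win32"))
```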

** VSprops is a single array of semicolon-separated paths, but often stores them quoted, for some reason. So you literally get the construct: "path1";"path2";.

*** For some definition of fun.
