Groovy, Programming, Scripting, Work

Using Groovy to create XML Schemas

I have previously found XML Schema to be completely invaluable in defining interface points between systems. Normally file interfaces between systems are done in formats that are deceptively simple: CSV or structured text files. However, in nearly all cases the initial simplicity leads to a lot of problems. There is the issue of how to escape characters within your fields, particularly the field separator. The free text field is often used as exactly that, free text. Something is supposed to be a number field… until the letter X appears in it. Historical CSV is the worst, as the exact meaning and origin of each field is often undocumented and long since lost. I have even come across CSV generators that map meaningless constants to the output just to keep the number of fields the same. The receiving systems ignore those same fields, or sometimes even hinge workflow off a value that will never vary in practice. The whole thing ends up being a nightmare.

Introducing an XML Schema reduces that nightmare but does bring in a lot more complexity. Being able to specify the type and order of the fields comes at a price. Previously, when I have wanted to develop a new schema, I have simply used the Xerces tools at the command line and an XML editor to generate both the Schema and a sample data file. It works but it is quite laborious. Speeding this up would be great, because the point of capturing the complexity of the data transfer is often so that the business or the architects can see how complex the integration is and decide whether they really want to do it before a lot of integration code gets written.

Looking through the Groovy website I came across this example of how to validate an XML document, and it sparked an idea. Groovy's multi-line string syntax is a neat feature (borrowed from Python, I think) and is, to my mind, a more elegant solution than the Ruby/Perl here-document syntax. It would allow me to define my schema, my sample document and my validation code in the same file. I would be more productive during iterations, and once the interface is captured I just publish the final schema.

So I’ve knocked together a simple PoC and it seems to work pretty soundly. The easiest way to work with it is to go from the sample document to the Schema, but the TDD approach is to define the Schema first and work back from the validation errors. The latter approach tends to avoid the situation where you’re validating your test document rather than your document template.
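
To give a feel for the single-file idea, here is a minimal sketch rather than the original PoC. It assumes the standard javax.xml.validation API that ships with the JDK; the "order" schema, the sample documents and the validate closure are all made up for illustration.

```groovy
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import org.xml.sax.SAXException

// Hypothetical schema for an illustrative "order" interface.
def xsd = '''
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="customer" type="xs:string"/>
        <xs:element name="quantity" type="xs:positiveInteger"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
'''

// Hypothetical sample document that should pass validation.
def sample = '''
<order>
  <customer>ACME</customer>
  <quantity>3</quantity>
</order>
'''

// The "letter X in a number field" case, which should fail.
def badSample = '''
<order>
  <customer>ACME</customer>
  <quantity>X</quantity>
</order>
'''

// Illustrative helper: returns a list of error messages,
// which is empty when the document is valid against the schema.
def validate = { String schemaText, String xmlText ->
    def factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    def schema = factory.newSchema(new StreamSource(new StringReader(schemaText)))
    def validator = schema.newValidator()
    try {
        validator.validate(new StreamSource(new StringReader(xmlText)))
        return []
    } catch (SAXException e) {
        // With no custom ErrorHandler, the first validation error is thrown.
        return [e.message]
    }
}

[good: sample, bad: badSample].each { name, doc ->
    def errors = validate(xsd, doc)
    println "${name}: " + (errors ? "INVALID (${errors[0]})" : 'valid')
}
```

Working TDD-style, you would edit the schema string first, run the script, and then fix the sample document until the error list comes back empty.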

