Programming, Software, Web Applications, Work

Names are like genders

One thing I slightly regret in the data modelling that is done for users in Wazoku is that I bowed to marketing pressure and “conventional wisdom” and created a pair of first and last name fields. If gender is a text field then how much more so is the unique indicator of identity that is a name?

The primary driver for the split was so that email communications could start “Hey Joe” rather than “Hey Joe Porridge Oats McGyvarri-Billy-Spaulding”. Interestingly as it turns out this is definitely the minority usage case and 95% of the time we actually put our fields back together to form a single string because we are displaying the name to someone other than the user. It would have been much easier to have a single name field and then extract the first “word” from the string for the rare case that we want to try and informally greet the user.

My more general lesson is that wherever I (or we more generally as a business) have tried to pre-empt the structure of a data entity we have generally gotten it wrong, however so far we have not had to turn a free text field into a stricter structure.


Data Modelling with Representations

One of my colleagues at ThoughtWorks, Ian Robinson,  is doing something really interesting with business and data modelling. His ideas are making me think again about some of my views on data modelling.

Data modelling is something that is criminally under-valued in software design. It is actual the crux of most IT problems and yet it often dealt with as an afterthought or from the perspective of making application development easier.

Unlike a lot of agile development tasks data modelling is often something that repays a lot of upfront thought and design. The point being that it is often easier to change a data model when it is still on the whiteboard or paper than when a schema has been constructed and is being managed. The cost of change is far greater the later it is made in the process than with most software development.

The key thing I have taken away from Ian’s work is that where the analysis goes too far is in specifying what data entities are composed of. This tips over into design and is where wasted and unnecessary work begins to appear. Instead if we can agree the language of the entities in our design we can actually have many different representations of the same entity in different parts of our system.

This may initially not seem that exciting but actually its quite subversive and challenges one of the ideas in Domain Driven Design that speaking in terms of data everything is symmetrical (i.e. the table has the same columns as the corresponding object has fields and so on…).

Saying that there is one conceptual element, say a Customer, but that there may be many Representations of a Customer is a really powerful tool particularly when trying to evolve already established systems. What this implies is that there is no single point of truth about what data a given Entity should hold and that instead we can construct representations that are suitable for the context we want to use.

The single conceptual framework allows us to introduce Ubiquitous Language but without the need to have symmetry. It also avoids making the database the single point of truth, which is the default position of most system designs, particularly where the database is also doing double-duty as an integration bus.

I think the language of Representations follows this pattern “the X as a Y”. So the Customer as a Participant in a Transaction can be different from the Customer as a Recipient of an Order.

If we think in set terms our different Representations are sets that intersect with our conceptual Customer. In most cases the Representations are going to be subsets, pairing down the information required to just that that is needed to fulfil the transaction the Representation participates in. In rarer cases we are going to have Representations that are supersets where maybe for a point the Representation carries information that will ultimately reside elsewhere once the transaction is complete but in my mind these are going to be quite rare.

So to summarise the advantages I think Representations will bring:

  • simplified data modelling diagrams, there is no need to record any information about what an entity contains, the only relevant information is its relation to other entities
  • a solution to Fat Objects that hoover up all possible functionality an entity can have over time
  • no need to produce a canonical version of an entity, the only relevant information is how one Representation gets mapped to another
  • a rational way to deal with existing data structures that cannot be changed in the current system
  • a framework for evolving data entities that operate in multiple roles

Who hard-coded the VAT?

With the recent change in UK VAT rates there has been an intriguing chance to see which retailers have a decent data model and those who just assumed that VAT would never change.

If you look at your receipt then the companies that have a decent data model will probably just have a very ordinary looking receipt. Maybe if they are showing off then they will have a little message telling you the current rate.

However the companies that have a bad data model will have an additional line on your receipt that reads something like “VAT Discount” that subtracts 2% from your net bill. This means they have happily hardcoded the VAT rate throughout their system in the mistaken belief that it was a constant, like gravity.

In fact, so mild have the economic climes been recently that it seemed a reasonable assumption. I used to work on a retail system and while we did correctly data drive the VAT rates the data model had no way of tracking the change in VAT rate over time. If we changed the standard VAT rate it would also change it for all historical invoices prior to the switchover. For this reason when an invoice line was calculated the VAT also had to be calculated and written at the time order was finalised. It was obscure but the VAT rate was always the current rate and if you wanted to figure out the historical rate applied you had to do the work yourself.

Who would think that such a simple tax would cause such problems? Or allow you to see so much of the internals of big firm’s systems?


Refactoring RDBMS: Adding a new column

So, the requirements change and you now need to record a new piece of information in one of your tables. That’s a pretty easy one, right? Just add a column and you’re laughing.

Well this is probably the way I see this problem solved 100% of the time (Rails migrations and Grails GORM for example) and I actually think it is the worst solution.

When you add a column to a table you should always consider whether you can add a sensible value. Adding a nullable column to a table is a really bad idea. Firstly, it isn’t relational (and it is almost guaranteed to reduce your normal form). Secondly, it is likely to introduce performance problems if the column needs to be searchable. Finally, nullable columns have a higher cost of maintenance as they imply logic (the null state) but don’t define it. The logic as to when the column may be null will always exist outside the schema.

If you can define a sensible default then you might be able to add the column to the table. For example boolean flag fields could be handled this way as long as the historic values can be accurately mapped to, say, false.

This method however falls down if your historical data cannot be truthfully mapped to a default value. If for example the new value represents something you have started to track rather than some brand new piece of information then you cannot truthfully map the historic data to the default value as you don’t know whether the default really applies or not.

For example if you send goods to customers using a regular or overnight delivery method and you need to start tracking the delivery method it would be wrong to map all historic data to the regular delivery. The truth is you don’t know whether a historical order was an overnight delivery or not. If you do map it incorrectly then you will end up skewing all your data and ultimately poison the very data you want to be tracking.

In this case it is far better to simply introduce a relational child table to hold the new information. Using child tables is a more accurate record of the data as a child row will only exist where the data is known. For unknown historic records there will be no child entry and your queries will not need any special cases.

When using child table data like this you can easily separate your data via the EXISTS predicate and in most databases EXISTS is very performant.

I think using child tables to record new information is relational way to solve the problem of adding new columns but it is usually turned down for one of two reasons (there naturally may be others but I’ve only ever heard these used in anger).

Firstly there is the argument that this technique leads to a proliferation of child tables. This is an accurate criticism but unfortunately if you want your database to be accurate and the new information is introduced piecemeal then you do not have a lot of choice. Pretending that you have historic data you don’t actually have doesn’t solve the problem.

One thing that can help this situation is to separate your database into its transaction processing element and its warehousing or archive aspect. In the warehouse the structure of the database might be quite involved but since the accuracy of the data is more important than the simplicity of the data model and the queries tend to be of a reporting or aggregating nature there tends to be less issues with having many tables. Often views can allow you to present a high-level view of the data and introducing null values into the heading of query, while ugly, is more acceptable than breaking normal form in the raw data itself (though god help you if you start joining aggregate views on the gappy data rather than the underlying true state of the data).

The transactional version of the database is then free to reflect the state of the current data model alone rather than the historical permutations. Here the model can be clean and reflect more of the Domain Model although again you want to retain the Relational aspect of the data.

Having separate datastores with different schemas often makes app developers lives easier as well. Usually an application is only concerned with current interactions and not historical transactions. As long as historical data can be obtained as a Service then you have no need to reflect historical schema concerns in your application.

However even if you cannot separate the data there is still little reason to introduce null value columns into the database. ORM often makes the simple assumption that one data entity is one object, but that is simply to make ORM easier. The application persistence layer shouldn’t be allowed to distort the underlying relational data model. If your data isn’t relational or you can’t define a fixed schema, don’t use an RDBMS. Look into things like CouchDB.

In fact some persistence schemes, like JPA, actually allow you to reflect the reality of the relational data by allowing embedded entities in objects. This is a leaky abstraction because from an object point of view there often is very little reason why, say, an Order object contains a Delivery Dispatch Type object.

If you want to avoid the leak then use the Repository pattern to create a true Domain object and be prepared to handle the complexity of transforming the relational data to the object-orientated data within the Repository. Remember that the Repository represents a true boundary in the system: you are hoping to derive real benefits from both the relational representation and the object representation, not screw one in favour of the other.

So please; don’t just wack in another nullable column and then wonder why queries on your database are slow.