Software

Nulls mean nothing, or something…

My last post touched on something that is a real issue for me. Using Null in a Relational Database, now I’m not Fabian Pascal but I have had enough problems with Nulls to feel that if you are going to use a Relational Database it is well worth making the effort to eliminate Nulls from your data model.

You really haven’t suffered until you’ve had to add a Decode in a query to add a layer of logic to a column which really never should have been Null in the first place. And then there is learning the special logic for Nulls usually when your predicate has failed to evaluate as expected. Could there be anything finer than saying X = Y OR X IS NULL?

However usually my problems with nulls come down to the following:

(,,,3,)

(,2,3,,5) intersection (2, 5)

What do these things mean? They make my head explode in data terms. They don’t really make senses as tuples, vectors or indeed anything other than lists. In the case of lists you can charitably interpret the empty values as being positions that have no assigned value.

Null in programming languages usually has a strong definition. In Java it means that a reference has no associated memory allocation, in Ruby it’s a type of object. In databases people always struggle to say what NULL is.

In the comments on the previous blog post the commentor came up with at least three interpretations of a Null value: unknown, unset and Not A Number (NAN). However really Null is an absence of information, it’s kind of the anti-fact. It’s presence tells you nothing and as soon as you try and interpret it the number of potential meanings multiples exponentially.

Trying to store NAN as a concept in a database is just misguided. If I can say a NUMERIC field is, in fact, NAN then why can’t I store the idea that a Number is not a Lemon? Or a roadsign? If it is really not a number, why are you trying to store it in a numeric field? If it’s the result of a calculation that is either not storable within the allocated storage or infinity then why don’t you store that information some other way?

Relational databases work best without nulls and usually the data model works best if it is seen as a collection of facts, that can be stored in a normalised form with strong relations between the sets of facts.

However the way they often get used is a surrogate (and usually underperforming) hash stores where each entry consists of a set of keys that may or may not have values. This is great for object serialisation and instead of relational queries you can introduce highly parallelisable search algorithms. However it kind of sucks at everything else. Firstly because the usual RDBMS like MySql or Oracle isn’t expecting to be used as a Hash Store so it is doing a lot of things that you aren’t going to need in this model. Secondly because the irregular nature of the hashed data means that trying to get sets of rows back out of the database can underperform because the system is often forced to brute force examine the data in a serial fashion.

The whole point of creating tuple arthimetic is so that you can optimise for relational processing and query big sets of data quickly. Completely ignoring it or, worse still, crippling it so that serialising objects is easy is like shooting yourself in the foot… with a shotgun… for no reason.

Standard
Software

Refactoring RDBMS: Adding a new column

So, the requirements change and you now need to record a new piece of information in one of your tables. That’s a pretty easy one, right? Just add a column and you’re laughing.

Well this is probably the way I see this problem solved 100% of the time (Rails migrations and Grails GORM for example) and I actually think it is the worst solution.

When you add a column to a table you should always consider whether you can add a sensible value. Adding a nullable column to a table is a really bad idea. Firstly, it isn’t relational (and it is almost guaranteed to reduce your normal form). Secondly, it is likely to introduce performance problems if the column needs to be searchable. Finally, nullable columns have a higher cost of maintenance as they imply logic (the null state) but don’t define it. The logic as to when the column may be null will always exist outside the schema.

If you can define a sensible default then you might be able to add the column to the table. For example boolean flag fields could be handled this way as long as the historic values can be accurately mapped to, say, false.

This method however falls down if your historical data cannot be truthfully mapped to a default value. If for example the new value represents something you have started to track rather than some brand new piece of information then you cannot truthfully map the historic data to the default value as you don’t know whether the default really applies or not.

For example if you send goods to customers using a regular or overnight delivery method and you need to start tracking the delivery method it would be wrong to map all historic data to the regular delivery. The truth is you don’t know whether a historical order was an overnight delivery or not. If you do map it incorrectly then you will end up skewing all your data and ultimately poison the very data you want to be tracking.

In this case it is far better to simply introduce a relational child table to hold the new information. Using child tables is a more accurate record of the data as a child row will only exist where the data is known. For unknown historic records there will be no child entry and your queries will not need any special cases.

When using child table data like this you can easily separate your data via the EXISTS predicate and in most databases EXISTS is very performant.

I think using child tables to record new information is relational way to solve the problem of adding new columns but it is usually turned down for one of two reasons (there naturally may be others but I’ve only ever heard these used in anger).

Firstly there is the argument that this technique leads to a proliferation of child tables. This is an accurate criticism but unfortunately if you want your database to be accurate and the new information is introduced piecemeal then you do not have a lot of choice. Pretending that you have historic data you don’t actually have doesn’t solve the problem.

One thing that can help this situation is to separate your database into its transaction processing element and its warehousing or archive aspect. In the warehouse the structure of the database might be quite involved but since the accuracy of the data is more important than the simplicity of the data model and the queries tend to be of a reporting or aggregating nature there tends to be less issues with having many tables. Often views can allow you to present a high-level view of the data and introducing null values into the heading of query, while ugly, is more acceptable than breaking normal form in the raw data itself (though god help you if you start joining aggregate views on the gappy data rather than the underlying true state of the data).

The transactional version of the database is then free to reflect the state of the current data model alone rather than the historical permutations. Here the model can be clean and reflect more of the Domain Model although again you want to retain the Relational aspect of the data.

Having separate datastores with different schemas often makes app developers lives easier as well. Usually an application is only concerned with current interactions and not historical transactions. As long as historical data can be obtained as a Service then you have no need to reflect historical schema concerns in your application.

However even if you cannot separate the data there is still little reason to introduce null value columns into the database. ORM often makes the simple assumption that one data entity is one object, but that is simply to make ORM easier. The application persistence layer shouldn’t be allowed to distort the underlying relational data model. If your data isn’t relational or you can’t define a fixed schema, don’t use an RDBMS. Look into things like CouchDB.

In fact some persistence schemes, like JPA, actually allow you to reflect the reality of the relational data by allowing embedded entities in objects. This is a leaky abstraction because from an object point of view there often is very little reason why, say, an Order object contains a Delivery Dispatch Type object.

If you want to avoid the leak then use the Repository pattern to create a true Domain object and be prepared to handle the complexity of transforming the relational data to the object-orientated data within the Repository. Remember that the Repository represents a true boundary in the system: you are hoping to derive real benefits from both the relational representation and the object representation, not screw one in favour of the other.

So please; don’t just wack in another nullable column and then wonder why queries on your database are slow.

Standard
Software, Work

Don’t hate the RDBMS; hate the implementation

I read through this post with the usual sinking feeling of despair. How can people get so muddled in their thinking? I am not sure I can even bear to go through the arguments again.

More programmers treat the database as a dumb store than there are situations where such treatment is appropriate. No-one is saying that Twitter data is deep, relational and worthy of data mining. However not all data is like Twitter micro blog posts.

The comments on the post were for the most part very good and say a lot of what I would have said. However looking at CouchDB documentation I noticed that the authors make far less dramatic claims for their product that the blog post does. A buggy alpha release of a hashtable datastore is not going to bring the enterprise RDBMS to its knees.

I actually set up and ran Couch DB but I will save my thoughts for another day, it’s an interesting application. What I actually want to talk about is how we can get more sophisticated with our datastores. It is becoming apparent to me that ORM technologies are really making a dreadful hash of data. The object representation is getting shafted because inheritance is never properly translated into a relational schema. The relational data is getting screwed by the fact the rules for object attributes is at odds with the normal forms.

The outcome is that you end up with the bastard hybrid worst of all worlds solution. What’s the answer?

Well the first thing is to admit that application programmers think the database is a big dumb datastore and will never stop thinking that. The second is that relational data is not the one true way to represent all data. They are the best tool we have at the moment for representing rich data sets that contain a lot of relational aspects. Customer orders in a supply system is the classic example. From a data mining point of view you are going to be dead on your feet if you do not have customers and their orders in a relational data store. You cannot operate if you cannot say who is buying how much of what.

If you let developers reimplement a data mining solution in their application for anything other than your very edge and niche interests then you are going to be wasting a lot of time and money for no good reason. You simply want a relational datastore, a metadata overlay to reinterpret the normalised data in terms of domain models and a standard piece of charting and reporting software.

However the application programmers have a point. The system that takes the order should not really have to decompose an order taken at a store level into its component parts. What the front end needs to do is take and confirm the order as quickly as possible. From this point of view the database is just a dumb datastore. Or rather what we need is a simple datastore that can do what is needed now and defer and delegate the processing into a richer data set in the future. From this point of view the application may store the data in something as transient as a message queue (although realistically we are talking about something like an object cache so the customer can view and adjust their order).

Having data distributed in different forms across different systems creates something of headache as it is hard to get an overall picture of what is happening on the system at any given moment. However creating a single datastore (implemented by an enterprise RDBMS) as a single point of reference is something of an anti-pattern. It is making one thing easier, the big picture. However to provide this data is being bashed by layering technologies into all kinds inappropriate shapes and various groups within the IT department are frequently in bitter conflict.

There needs to be a step back and IT people need to accept the complexity and start talking about the whole system comprising of many components. All of which need to be synced and queried if you want the total information picture. Instead of wasting effort in fitting however many square pegs into round holes we need to be thinking about how we use the best persistence solution for a given solution and how we report and coordinate these many systems.

It is the only way we can move forward.

Standard