My last post touched on something that is a real issue for me. Using Null in a Relational Database, now I’m not Fabian Pascal but I have had enough problems with Nulls to feel that if you are going to use a Relational Database it is well worth making the effort to eliminate Nulls from your data model.
You really haven’t suffered until you’ve had to add a Decode in a query to add a layer of logic to a column which really never should have been Null in the first place. And then there is learning the special logic for Nulls usually when your predicate has failed to evaluate as expected. Could there be anything finer than saying X = Y OR X IS NULL?
However usually my problems with nulls come down to the following:
(,2,3,,5) intersection (2, 5)
What do these things mean? They make my head explode in data terms. They don’t really make senses as tuples, vectors or indeed anything other than lists. In the case of lists you can charitably interpret the empty values as being positions that have no assigned value.
Null in programming languages usually has a strong definition. In Java it means that a reference has no associated memory allocation, in Ruby it’s a type of object. In databases people always struggle to say what NULL is.
In the comments on the previous blog post the commentor came up with at least three interpretations of a Null value: unknown, unset and Not A Number (NAN). However really Null is an absence of information, it’s kind of the anti-fact. It’s presence tells you nothing and as soon as you try and interpret it the number of potential meanings multiples exponentially.
Trying to store NAN as a concept in a database is just misguided. If I can say a NUMERIC field is, in fact, NAN then why can’t I store the idea that a Number is not a Lemon? Or a roadsign? If it is really not a number, why are you trying to store it in a numeric field? If it’s the result of a calculation that is either not storable within the allocated storage or infinity then why don’t you store that information some other way?
Relational databases work best without nulls and usually the data model works best if it is seen as a collection of facts, that can be stored in a normalised form with strong relations between the sets of facts.
However the way they often get used is a surrogate (and usually underperforming) hash stores where each entry consists of a set of keys that may or may not have values. This is great for object serialisation and instead of relational queries you can introduce highly parallelisable search algorithms. However it kind of sucks at everything else. Firstly because the usual RDBMS like MySql or Oracle isn’t expecting to be used as a Hash Store so it is doing a lot of things that you aren’t going to need in this model. Secondly because the irregular nature of the hashed data means that trying to get sets of rows back out of the database can underperform because the system is often forced to brute force examine the data in a serial fashion.
The whole point of creating tuple arthimetic is so that you can optimise for relational processing and query big sets of data quickly. Completely ignoring it or, worse still, crippling it so that serialising objects is easy is like shooting yourself in the foot… with a shotgun… for no reason.