Software

Refactoring RDBMS: Adding a new column

So, the requirements change and you now need to record a new piece of information in one of your tables. That’s a pretty easy one, right? Just add a column and you’re laughing.

Well this is probably the way I see this problem solved 100% of the time (Rails migrations and Grails GORM for example) and I actually think it is the worst solution.

When you add a column to a table you should always consider whether you can add a sensible value. Adding a nullable column to a table is a really bad idea. Firstly, it isn’t relational (and it is almost guaranteed to reduce your normal form). Secondly, it is likely to introduce performance problems if the column needs to be searchable. Finally, nullable columns have a higher cost of maintenance as they imply logic (the null state) but don’t define it. The logic as to when the column may be null will always exist outside the schema.

If you can define a sensible default then you might be able to add the column to the table. For example boolean flag fields could be handled this way as long as the historic values can be accurately mapped to, say, false.

This method however falls down if your historical data cannot be truthfully mapped to a default value. If for example the new value represents something you have started to track rather than some brand new piece of information then you cannot truthfully map the historic data to the default value as you don’t know whether the default really applies or not.

For example if you send goods to customers using a regular or overnight delivery method and you need to start tracking the delivery method it would be wrong to map all historic data to the regular delivery. The truth is you don’t know whether a historical order was an overnight delivery or not. If you do map it incorrectly then you will end up skewing all your data and ultimately poison the very data you want to be tracking.

In this case it is far better to simply introduce a relational child table to hold the new information. Using child tables is a more accurate record of the data as a child row will only exist where the data is known. For unknown historic records there will be no child entry and your queries will not need any special cases.

When using child table data like this you can easily separate your data via the EXISTS predicate and in most databases EXISTS is very performant.

I think using child tables to record new information is relational way to solve the problem of adding new columns but it is usually turned down for one of two reasons (there naturally may be others but I’ve only ever heard these used in anger).

Firstly there is the argument that this technique leads to a proliferation of child tables. This is an accurate criticism but unfortunately if you want your database to be accurate and the new information is introduced piecemeal then you do not have a lot of choice. Pretending that you have historic data you don’t actually have doesn’t solve the problem.

One thing that can help this situation is to separate your database into its transaction processing element and its warehousing or archive aspect. In the warehouse the structure of the database might be quite involved but since the accuracy of the data is more important than the simplicity of the data model and the queries tend to be of a reporting or aggregating nature there tends to be less issues with having many tables. Often views can allow you to present a high-level view of the data and introducing null values into the heading of query, while ugly, is more acceptable than breaking normal form in the raw data itself (though god help you if you start joining aggregate views on the gappy data rather than the underlying true state of the data).

The transactional version of the database is then free to reflect the state of the current data model alone rather than the historical permutations. Here the model can be clean and reflect more of the Domain Model although again you want to retain the Relational aspect of the data.

Having separate datastores with different schemas often makes app developers lives easier as well. Usually an application is only concerned with current interactions and not historical transactions. As long as historical data can be obtained as a Service then you have no need to reflect historical schema concerns in your application.

However even if you cannot separate the data there is still little reason to introduce null value columns into the database. ORM often makes the simple assumption that one data entity is one object, but that is simply to make ORM easier. The application persistence layer shouldn’t be allowed to distort the underlying relational data model. If your data isn’t relational or you can’t define a fixed schema, don’t use an RDBMS. Look into things like CouchDB.

In fact some persistence schemes, like JPA, actually allow you to reflect the reality of the relational data by allowing embedded entities in objects. This is a leaky abstraction because from an object point of view there often is very little reason why, say, an Order object contains a Delivery Dispatch Type object.

If you want to avoid the leak then use the Repository pattern to create a true Domain object and be prepared to handle the complexity of transforming the relational data to the object-orientated data within the Repository. Remember that the Repository represents a true boundary in the system: you are hoping to derive real benefits from both the relational representation and the object representation, not screw one in favour of the other.

So please; don’t just wack in another nullable column and then wonder why queries on your database are slow.

Standard

5 thoughts on “Refactoring RDBMS: Adding a new column

  1. dsmith says:

    > “Adding a nullable column to a table is a really bad idea. Firstly, it isn’t relational (and it is almost guaranteed to reduce your normal form).”

    One could argue that all optional attributes or attribute sets should be in separate tables, whether they were introduced as the result of a schema change or not. That would be one scary looking schema, not to mention all of the joins in the query code.

    > “Secondly, it is likely to introduce performance problems if the column needs to be searchable. ”

    If you have queries that heavily scan this new field with lots of nulls, you should be able to address this with proper indexing. Am I wrong on this? If I am, perhaps this would be better addressed as a database optimization. Optional attributes would be logically represented as a nullable field but physically stored as a child table.

    >”Finally, nullable columns have a higher cost of maintenance as they imply logic (the null state) but don’t define it. The logic as to when the column may be null will always exist outside the schema.”

    I agree with you here, but I don’t see how it supports your argument. In cases where nullable field are applied to application logic, you would have an outer join, which produces nulls that the same logic would need to be applied.

    To be more specific, an attribute SHOULD have a null value only if it needs to have meaning that is not otherwise expressible in the domain of the attribute. For example, in RDBMS’s that do not provide for the value NAN in the domain of a number field, null can be used to represent this values.

    Most commonly however, null is used to indicate that no person or process has explicitly set the value. If this is the case, the application is free to calculate an implied value for the display purposes (an implied default).

    Some designers would choose to store an empty string in the text domain to represent an unset field, but this would leave us unable to distinguish an unset text field from a explicitly set empty field.

    Some designers would use an arbitrary date such as ’01/01/01′ to represent an unset ship_date field on an order that is not yet shipped. I would also avoid this since this would imply meaning for an expression such as (ship_date<today), where there should be none.

    In this final example, if the ship_date attribute was added to the schema as the result of a schema change, and no meaningful historical data could be recovered to set the values, then we have the dilemma of the null having more than one meaning — either unset or unknown. Since the application might want to deal with these cases differently, we are stuck. However, your suggestion doesn’t address this.

    In my opinion RDBMS’s should allow designers to extend the data domains of attributes to add arbitrary atoms such as ‘unset’ and ‘unknown’.

    Sorry for the length of the reply (I really need a blog of my own.)

  2. For the record I am a fan of putting all optional values in child tables. I don’t regard it as scary if it results in an accurate description of the data being recorded. Is it really so difficult to say:

    SELECT *
    FROM orders
    JOIN dispatch_date ON (order_id)
    JOIN delivery_type ON (order_id)
    JOIN order_discount ON (order_id)… ?

    If so why not aggregate it into a view?

    I think I need to write a separate post about nulls and RDBMS. The short version is that a pure RDBMS wouldn’t allow them and therefore in the weakened versions of RDBMS that are commonly used today how do we minimise their impact?

    For example, using a index to counteract a bad nullable column choice often doesn’t work. An index can only help if the null rows represent a significant minority of the data. I have had to deal with issues in the past where having a NULL for every historic row in a table, where historic data was over 80% of the data meant that Oracle CBO incorrectly thought it had to check every row, regardless of the index. I’m not an expert on this but I do know pain when I experience it.

    The case I would make is that when using child tables you inner join and if the join is not made the heading consists of zero entries, not a tuple containing nulls. This would tend to be fast in most RDBMS implementations.

  3. John says:

    This blog entry is old, but it makes me shake my head.

    1. K.I.S.S. In the child table approach you end up having having to create an OUTER join and end up with NULLs anyway–unless you want to do an inner join and lose out on the historical data.

    There is no normalization gain and it creates more headaches. Also, while there are good exceptions, creating a one-to-one table should be considered a mistake unless shown otherwise.

    For your example, creating a delivery-type domain table that defines:
    -Historical Unknown (set as default, if you wish)
    -Express
    -Standard

    accomplishes all that’s needed without the ambiguity or confusion.

    2. Many people have said this and it needs to be restated: Relational data is NOT OO. Don’t approach data refactoring the same way!

  4. You don’t want it simple if it is also not right. Deceitful simplicity is its own kind of stupidity.

    I do think it is more acceptable for the projection to generate NULL values than to retain NULL in the tables.

    However the biggest mistake I think you’ve made is the idea that you have historic data when you don’t. You would obviously want an Inner Join in the example above because you want to exclude deliveries for which the delivery option was not recorded.

    My reading of your comment (and apologies if this is wrong) is that you regard “Unknown” to be some kind of fact or value that can be asserted. This is clearly paradoxical, unless you are a fan of the Rumsfeldian “Known Unknowns”, Unknown means that you do not know which of the options was chosen or indeed, more severely, whether the option was relevant to the order (maybe we were using our own delivery fleet at the time of that order).

    Allowing “Unknown” as a relational data variable seems to be to inject vast quantities of “ambiguity or confusion” into the schema for little gain. Giving it a default value instead is just putting lipstick on the pig.

    I am not sure what you mean by “normalization gain” clearly the child table would be of higher normal form than adding another data column would be. My observed experience is that having the historical data present also skews histogram and index generation and causes query plan pain in a way that child tables don’t. However this is entirely based on instinct and experience. In real life I would be willing to experiment with different solutions if performance was critical and denormalise if necessary.

  5. John says:

    Let’s take a step back: By “child” table, do you mean the DELIVERYTYPE domain table? Or do
    you mean an ORDER_DELIVERYTYPE join table?

    Maybe I’ve just been caught up in terminology…

    If you mean DELIVERYTYPE, then we’re mostly in full agreement and I owe you an apology (but do DB developers that bad really exist?) If it’s the latter, then I just don’t see it. What normalization rule would apply? Presumably a semantic relationship of some kind?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s