Software

Refactoring RDBMS: Adding a new column

So, the requirements change and you now need to record a new piece of information in one of your tables. That’s a pretty easy one, right? Just add a column and you’re laughing.

Well, adding a column is the way I see this problem solved pretty much 100% of the time (Rails migrations and Grails GORM both encourage it, for example), and I actually think it is the worst solution.

When you add a column to a table you should always consider whether you can add a sensible value. Adding a nullable column to a table is a really bad idea. Firstly, it isn’t relational (and it is almost guaranteed to reduce your normal form). Secondly, it is likely to introduce performance problems if the column needs to be searchable. Finally, nullable columns have a higher cost of maintenance as they imply logic (the null state) but don’t define it. The logic as to when the column may be null will always exist outside the schema.

If you can define a sensible default then you might be able to add the column to the table. For example, boolean flag fields can be handled this way as long as the historic values can be accurately mapped to, say, false.

This method, however, falls down if your historical data cannot be truthfully mapped to a default value. If, for example, the new value represents something you have only just started to track, rather than some brand-new piece of information, then you cannot truthfully map the historic data to the default value, because you don't know whether the default really applies or not.

For example, if you send goods to customers using either a regular or an overnight delivery method and you need to start tracking which was used, it would be wrong to map all the historic data to regular delivery. The truth is you don't know whether a historic order went out overnight or not. If you map it incorrectly you will skew all your data and ultimately poison the very data you want to be tracking.

In this case it is far better to simply introduce a relational child table to hold the new information. A child table is a more accurate record of the data because a child row only exists where the data is known. For unknown historic records there will be no child entry, and your queries will not need any special cases.

When using child table data like this you can easily separate your data with the EXISTS predicate, and in most databases EXISTS performs very well.
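
To make that concrete, here is a minimal sketch of the delivery example. The orders and order_delivery table and column names are hypothetical, as is the small JDBC wrapper; the point is that the delivery method lives in its own child table and EXISTS picks out the orders where it is actually known.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Hypothetical child table created alongside the existing orders table:
    //
    //   CREATE TABLE order_delivery (
    //     order_id        INTEGER     NOT NULL REFERENCES orders (order_id),
    //     delivery_method VARCHAR(16) NOT NULL,
    //     PRIMARY KEY (order_id)
    //   );
    //
    // Historic orders simply have no row here, so there are no nulls and no
    // special cases baked into the schema.
    public final class DeliveryQueries {

        // Only the orders whose delivery method is actually known.
        private static final String ORDERS_WITH_KNOWN_DELIVERY =
            "SELECT o.order_id FROM orders o WHERE EXISTS ("
          + "  SELECT 1 FROM order_delivery d WHERE d.order_id = o.order_id)";

        public void printKnownDeliveries(Connection connection) throws SQLException {
            try (PreparedStatement statement =
                     connection.prepareStatement(ORDERS_WITH_KNOWN_DELIVERY);
                 ResultSet results = statement.executeQuery()) {
                while (results.next()) {
                    System.out.println(results.getLong("order_id"));
                }
            }
        }
    }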

I think using child tables to record new information is the relational way to solve the problem of adding new columns, but it is usually turned down for one of two reasons (there may naturally be others, but these are the only two I've heard used in anger).

Firstly there is the argument that this technique leads to a proliferation of child tables. That is a fair criticism, but if you want your database to be accurate and the new information arrives piecemeal then you do not have a lot of choice. Pretending that you have historic data you don't actually have doesn't solve the problem.

One thing that can help this situation is to separate your database into its transaction-processing element and its warehousing or archive aspect. In the warehouse the structure of the database might be quite involved, but since the accuracy of the data matters more than the simplicity of the data model, and the queries tend to be of a reporting or aggregating nature, there tend to be fewer issues with having many tables. Often views can present a high-level picture of the data, and introducing null values into the heading of a query, while ugly, is more acceptable than breaking normal form in the raw data itself (though god help you if you start joining aggregate views on the gappy data rather than on the underlying true state of the data).

The transactional version of the database is then free to reflect the current data model alone rather than all its historical permutations. Here the model can be clean and reflect more of the Domain Model, although again you want to retain the Relational aspect of the data.

Having separate datastores with different schemas often makes application developers' lives easier as well. Usually an application is only concerned with current interactions, not historical transactions. As long as historical data can be obtained as a Service then there is no need to reflect historical schema concerns in your application.

However, even if you cannot separate the data, there is still little reason to introduce nullable columns into the database. ORM often makes the simple assumption that one data entity is one object, but that assumption exists purely to make ORM easier. The application persistence layer shouldn't be allowed to distort the underlying relational data model. If your data isn't relational, or you can't define a fixed schema, don't use an RDBMS; look into things like CouchDB.

In fact some persistence schemes, like JPA, actually allow you to reflect the reality of the relational data by allowing embedded entities in objects. This is a leaky abstraction, because from an object point of view there is often very little reason why, say, an Order object should contain a Delivery Dispatch Type object.
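
As a rough illustration of what that leak looks like (all the class, table and column names here are made up for the example), JPA's @Embeddable lets the dispatch details live in the Order's own rows while still surfacing as a separate nested object:

    // DeliveryDispatchType.java
    import javax.persistence.Column;
    import javax.persistence.Embeddable;

    // Hypothetical value object holding the dispatch details.
    @Embeddable
    public class DeliveryDispatchType {
        @Column(name = "DELIVERY_METHOD")
        private String method;   // e.g. "REGULAR" or "OVERNIGHT"
    }

    // Order.java
    import javax.persistence.Embedded;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    // The dispatch columns sit in the ORDERS table, but the object model sees
    // a nested object: the table layout leaks into the shape of the entity.
    @Entity
    @Table(name = "ORDERS")
    public class Order {
        @Id
        private Long id;

        @Embedded
        private DeliveryDispatchType dispatch;
    }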

If you want to avoid the leak then use the Repository pattern to create a true Domain object and be prepared to handle the complexity of transforming the relational data to the object-orientated data within the Repository. Remember that the Repository represents a true boundary in the system: you are hoping to derive real benefits from both the relational representation and the object representation, not screw one in favour of the other.
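
A minimal sketch of that boundary, again with hypothetical names: the interface speaks the domain language, and the implementation (JDBC, JPA, whatever) is where the relational-to-object transformation and all its complexity live.

    // Order.java
    // Hypothetical domain object, assembled by the repository from however many
    // tables it takes (the orders row plus the order_delivery child row, say)
    // rather than mapped field-for-field from a single table.
    public final class Order {
        private final long id;
        private final String deliveryMethod;

        public Order(long id, String deliveryMethod) {
            this.id = id;
            this.deliveryMethod = deliveryMethod;
        }

        public long id() { return id; }
        public String deliveryMethod() { return deliveryMethod; }
    }

    // OrderRepository.java
    // The boundary: callers see domain objects, never rows, and the relational
    // model stays intact on the other side of the line.
    public interface OrderRepository {
        Order findOrder(long orderId);
    }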

So please: don’t just whack in another nullable column and then wonder why queries on your database are slow.

Java

Ooo JPA, JPA

I normally have nothing good to say about ORM, so it was a pleasant surprise today when I was able to set up four One-to-Many object relationships in about four or five hours.

I’m currently using TopLink, although I try to stay inside the standard and avoid getting dragged into vendor extensions. I actually prefer OpenJPA from a testing point of view, as TopLink doesn’t seem to play ball unless there is a persistence.xml in a directory called META-INF at the root of your classpath.

The basic syntax is pretty simple: you create a field that is a Collection of the Entity that represents the other side of the relationship (I used a List, but then wondered why I had done that). You then annotate that field with @OneToMany and specify the join mechanism. Now, this is where it usually gets messy, and I have to say there is a certain amount of gnarl to this one. I had a relatively simple join table on the go, but thankfully the @JoinTable annotation needs to know only three things.

Firstly the name of the table (as per the @Column annotation), and then the joinColumns and inverseJoinColumns parameters. These names are correct but probably seem fiercer than they need to be. Both parameters take an array of @JoinColumn annotations (nested annotations! a first for me), so even if you just have one, wrap it in curly braces. The @JoinColumn syntax is the same as for @OneToOne and specifies which column to resolve the @Id lookup with.

The joinColumns identify the Id of the entity that has the Collection field (this class, in other words). The inverseJoinColumns identify the Ids of the other entities in the relationship.

So if Foo has a Collection of Bars, then joinColumns identifies the database columns that contain the Ids of Foo, and inverseJoinColumns identifies the columns with the Ids of the Bars.
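
Put together, the Foo side of the mapping ends up looking something like this. It's a sketch with made-up table and column names, sticking to the standard annotations rather than any vendor extensions.

    // Foo.java
    import java.util.Collection;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.JoinColumn;
    import javax.persistence.JoinTable;
    import javax.persistence.OneToMany;

    @Entity
    public class Foo {

        @Id
        private Long id;

        // FOO_BAR is the join table; FOO_ID holds this entity's Id and BAR_ID
        // holds the Ids of the Bars on the other side of the relationship.
        @OneToMany
        @JoinTable(
            name = "FOO_BAR",
            joinColumns = { @JoinColumn(name = "FOO_ID") },
            inverseJoinColumns = { @JoinColumn(name = "BAR_ID") })
        private Collection<Bar> bars;
    }

    // Bar.java
    import javax.persistence.Entity;
    import javax.persistence.Id;

    @Entity
    public class Bar {
        @Id
        private Long id;
    }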

To be honest TopLink has decent enough logging that you can usually muddle along by dumping the SQL that the annotations are building up. My worst mistake was mixing up the @Table annotation with that of @JoinTable and trying to select from my join table joined with, er, my join table.

The nice thing is that I was able to very quickly build up four foreign-key relationships, all using join tables (some of which had been badly denormalised), and all by annotating the POJOs that also serve a domain purpose. I had to add some setters for testing purposes, but apart from that, if I wasn’t using the data in the application I didn’t have to describe it in the configuration.

Good work JPA!

Java

ORM is evil

Object relational mapping (ORM) is evil because it is the great big programmer hammer driving the square peg into the round hole of relational data. ORM is a self-fulfilling prophecy: because it treats the RDBMS as a big dumb datasource, it ends up making the database a big dumb datasource. Worse, it allows people not to think of the database as a collection of, y’know, relational data, but instead as a list of entities. That would be true if you were using an OODBMS, but ORM doesn’t mean the same thing. ORM makes your RDBMS look a bit like an OODBMS by making the data model teh suck.

The biggest problem with programmers seeing the database through the ORM reality-distortion filter is that they start thinking that the entities they deal with are the data. In most cases that simply isn’t true, or, if it is, it is because the entities have started to become some kind of uber-object that contains everything about everything. The kind of object, in fact, that made people say EJB was slow.

If you take something like a sale in an online shop, that sale means different things to different people. To the accountant it means margin, to the sales referrer it means commission, to the dispatcher it means a weight, a volume and an address, and to the customer it’s just a credit card charge and a wheelbarrow (or whatever it is they’ve bought).

The amazing thing about the RDBMS is that it recognises this multi-faceted view of data and accurately decomposes it into its constituent elements. The dispatcher doesn’t have to see the sales commission, and the accountant doesn’t have to see the item volume. You use queries to constitute a coherent and logical projection of the underlying data, defined by the purpose the data is going to be used for.

ORM fanatics rarely see things this way; instead they want to denormalise here and there, aggregate bits of data directly into the record, and add a few nullable columns into the schema because, hey, the Java fields can be null. In short, they want to make the schema reflect their view of the world. Because they don’t have to make sure that the various departments get their reports, or that the auditors have their clear and independent trail, they tend to see themselves as the centre of the universe and whatever hellspawned application they are working on as the answer to all the “problems” in the big dumb datastores.

A lot of people complain if you propose splitting logic between layers. The trouble is that someone who bleats about splitting across layers will inevitably follow up their complaint with the proposal that all the logic should, in fact, shock horror, be located in application X, their latest wonder-project.

This proposal never works, for one simple reason: if you don’t want to split logic between layers then you are committing yourself to adding every last feature of Oracle, MySQL, DB2 or whatever into your application. Access control, query languages, caching, data intersections, transactions: you name it, it is in the database for a reason. Unless you can reproduce all of these features in your application, forget about avoiding split logic, because you should put function where it is best serviced.

ORM should see itself as just one consumer of data and learn to play better with others. It should be about the convenience of presenting one valid method of data modelling in a form that is more familiar to programmers. It should take away the impedance of having to comprehend the relational model (which I admit is hard), but it should leave it there and stop pretending that it is the true answer to complex data issues.
