Programming

Refactoring abuse and strong type compiler systems

“Refactoring” is one of the most abused terms in programming. It has a formal meaning, but in general use it tends to mean rewriting or restructuring code (or, as I like to refer to it, changing stuff). One interesting new use of refactoring I heard recently was to describe extracting common code into a new codebase, which is odd because creating a new codebase is perhaps the opposite of refactoring.

So refactoring tends to mean developers are just changing things they have already written. Real refactoring is, of course, done to code under test, so I was interested in a Stuart Halloway quote about compilation being the weakest form of unit testing. Scala is used a lot at the Guardian, and it has a more powerful type system and compiler than Java, which means that if you play along with the type system you actually get a lot of that weak unit testing. In fact, structuring your code to maximise the compiler's guarantees and adding the various assertion methods so that you fail fast at runtime are two of the things that help increase your productivity with Scala.
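To make that concrete, here is a minimal sketch of the sort of thing I mean (the Discount and applyDiscount names are invented for illustration): the explicit types give the compiler something to hold you to, and require and ensuring make bad states fail fast at runtime.

    // Hypothetical example: explicit types for the compiler to check,
    // assertions so that bad runtime states fail fast.
    case class Discount(percent: Int) {
      require(percent >= 0 && percent <= 100, s"bad discount: $percent")
    }

    def applyDiscount(pence: Long, discount: Discount): Long = {
      require(pence >= 0, s"negative price: $pence")
      val discounted = pence - (pence * discount.percent / 100)
      discounted.ensuring(_ <= pence, "a discount should never increase the price")
    }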

If you’ve seen the Coursera Scala videos, you will have seen Martin Odersky doing some of this “weak refactoring” in his example code, where he simplifies chained collection operations by moving or creating simple functionality in his types.
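I don’t have his exact example to hand, but the flavour is roughly this (Order and Item are invented types): rather than repeating a chain of collection operations at every call site, the small pieces of functionality move into the types themselves.

    case class Item(pence: Long, quantity: Int) {
      def total: Long = pence * quantity           // pulled out of the call sites
    }

    case class Order(items: List[Item]) {
      def total: Long = items.map(_.total).sum     // ditto
    }

    // Before: orders.flatMap(_.items).map(i => i.pence * i.quantity).sum
    // After:
    def grandTotal(orders: List[Order]): Long = orders.map(_.total).sum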

Of course, just like regular refactoring, there have to be a few rules to this. Firstly, weak refactoring absolutely requires that you use explicit function type declarations. Essentially, in a weak refactor what you are doing is changing the body of a function while retaining its parameters and return type. If the code still compiles after you’ve changed it, you are probably good.
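For example (User and activeUserEmails are made-up names), the explicit parameter and return types are the contract the compiler holds you to: the body changes, the signature does not, and if it still compiles the weak unit test has passed.

    case class User(email: String, active: Boolean)

    // Original body.
    def activeUserEmails(users: List[User]): Set[String] =
      users.filter(_.active).map(_.email).toSet

    // After a weak refactor: same parameters, same return type, different body.
    def activeUserEmailsRefactored(users: List[User]): Set[String] =
      users.collect { case u if u.active => u.email }.toSet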

However, the other critical thing is how much of the behaviour the return type actually pins down. A return type of Option, for example, is probably a bad candidate for weak refactoring, as it is probably critical whether your changed code still returns Some or None for a given set of parameters. Only conventional refactoring can determine whether that is still true.
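A contrived illustration (Customer and contactAddress are invented names): both bodies below satisfy the signature, so the compiler is equally happy with either, even though they return Some and None in different cases.

    case class Customer(email: Option[String], marketingOptIn: Boolean)

    // Only customers who have opted in should get a contact address.
    def contactAddress(c: Customer): Option[String] =
      c.email.filter(_ => c.marketingOptIn)

    // Still compiles after a careless "weak refactor", but now returns Some
    // for customers who have not opted in; only a real test would catch this.
    def contactAddressBroken(c: Customer): Option[String] =
      c.email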

Software

Refactoring RDBMS: Adding a new column

So, the requirements change and you now need to record a new piece of information in one of your tables. That’s a pretty easy one, right? Just add a column and you’re laughing.

Well, just adding a column is how I see this problem solved pretty much 100% of the time (Rails migrations and Grails GORM, for example), and I actually think it is the worst solution.

When you add a column to a table you should always consider whether you can give it a sensible value for every existing row. Adding a nullable column to a table is a really bad idea. Firstly, it isn’t relational (and it is almost guaranteed to reduce your normal form). Secondly, it is likely to introduce performance problems if the column needs to be searchable. Finally, nullable columns have a higher maintenance cost because they imply logic (the null state) but don’t define it: the logic as to when the column may be null will always exist outside the schema.
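To make that concrete (the Order and dispatchLabel names are invented for illustration), a nullable column typically surfaces in application code as an optional value, and the rule about when it may be null lives out there rather than in the schema.

    // Sketch: a nullable delivery_method column becomes an Option in the application,
    // and the "when can this be null?" logic ends up here, not in the schema.
    case class Order(id: Long, deliveryMethod: Option[String])

    def dispatchLabel(order: Order): String =
      order.deliveryMethod match {
        case Some(method) => method
        case None         => "unknown" // pre-tracking order, or missing data? The schema cannot say.
      }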

If you can define a sensible default then you might be able to add the column to the table. For example, boolean flag fields can often be handled this way, as long as the historic values can be accurately mapped to, say, false.

This method falls down, however, if your historical data cannot be truthfully mapped to a default value. If, for example, the new column represents something you have only just started to track, rather than some brand-new piece of information, then you cannot truthfully map the historic data to the default value, because you don’t know whether the default really applies or not.

For example, if you send goods to customers using either a regular or an overnight delivery method, and you need to start tracking the delivery method, it would be wrong to map all the historic data to regular delivery. The truth is you don’t know whether a historical order was an overnight delivery or not. If you map it incorrectly then you will skew all your data and ultimately poison the very data you want to be tracking.

In this case it is far better simply to introduce a relational child table to hold the new information. A child table is a more accurate record of the data, as a child row will only exist where the data is actually known. For unknown historic records there is simply no child entry, and your queries will not need any special cases.

When using child-table data like this you can easily separate the records you know about from the ones you don’t via the EXISTS predicate, and in most databases EXISTS performs very well.
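As a rough sketch of what that looks like (the orders and order_delivery tables and their columns are invented for illustration), here are the kind of DDL and EXISTS query I have in mind, held as plain SQL strings the way a migration or data-access layer might:

    // Hypothetical schema: the new information lives in a child table,
    // with a row present only where the delivery method is actually known.
    val createOrderDelivery: String =
      """CREATE TABLE order_delivery (
        |  order_id        BIGINT      NOT NULL REFERENCES orders(id),
        |  delivery_method VARCHAR(16) NOT NULL,
        |  PRIMARY KEY (order_id)
        |)""".stripMargin

    // Only orders whose delivery method is known are selected; historic orders
    // simply have no child row, so the query needs no special cases.
    val overnightOrders: String =
      """SELECT o.*
        |FROM orders o
        |WHERE EXISTS (
        |  SELECT 1
        |  FROM order_delivery d
        |  WHERE d.order_id = o.id
        |    AND d.delivery_method = 'overnight'
        |)""".stripMargin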

I think using child tables to record new information is the relational way to solve the problem of adding new columns, but it is usually turned down for one of two reasons (there may naturally be others, but these are the only ones I’ve ever heard used in anger).

Firstly, there is the argument that this technique leads to a proliferation of child tables. This is a fair criticism, but unfortunately, if you want your database to be accurate and the new information is introduced piecemeal, then you do not have a lot of choice. Pretending that you have historic data you don’t actually have doesn’t solve the problem.

One thing that can help this situation is to separate your database into its transaction-processing element and its warehousing or archive aspect. In the warehouse the structure of the database might be quite involved, but since the accuracy of the data is more important than the simplicity of the data model, and the queries tend to be of a reporting or aggregating nature, there tend to be fewer issues with having many tables. Views can often allow you to present a high-level picture of the data, and introducing null values into the heading of a query, while ugly, is more acceptable than breaking normal form in the raw data itself (though god help you if you start joining aggregate views on the gappy data rather than on the underlying true state of the data).

The transactional version of the database is then free to reflect the current data model alone rather than all its historical permutations. Here the model can be clean and closer to the Domain Model, although again you want to retain the Relational aspect of the data.

Having separate datastores with different schemas often makes app developers’ lives easier as well. Usually an application is only concerned with current interactions, not historical transactions. As long as historical data can be obtained as a Service, you have no need to reflect historical schema concerns in your application.

However, even if you cannot separate the data, there is still little reason to introduce nullable columns into the database. ORMs often make the simple assumption that one data entity is one object, but that is simply to make the ORM easier. The application persistence layer shouldn’t be allowed to distort the underlying relational data model. If your data isn’t relational, or you can’t define a fixed schema, don’t use an RDBMS; look into things like CouchDB instead.

In fact, some persistence schemes, like JPA, actually allow you to reflect the reality of the relational data by allowing embedded entities within objects. This is a leaky abstraction, because from an object point of view there is often very little reason why, say, an Order object should contain a Delivery Dispatch Type object.

If you want to avoid the leak then use the Repository pattern to create a true Domain object, and be prepared to handle the complexity of transforming the relational data into the object-orientated representation within the Repository. Remember that the Repository represents a true boundary in the system: you are hoping to derive real benefits from both the relational representation and the object representation, not screw one in favour of the other.
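A rough sketch of that boundary (every name here, OrderRow, DeliveryRow, OrderRepository and so on, is invented, and the constructor takes plain functions standing in for whatever data access you actually use):

    // Relational shape on one side...
    case class OrderRow(id: Long, customerId: Long)
    case class DeliveryRow(orderId: Long, method: String)

    // ...domain shape on the other.
    sealed trait Delivery
    case object Regular   extends Delivery
    case object Overnight extends Delivery
    case object Unknown   extends Delivery   // historic orders with no child row

    case class Order(id: Long, customerId: Long, delivery: Delivery)

    // The transformation from relational data to the domain object lives here,
    // at the boundary, and nowhere else.
    class OrderRepository(orderById: Long => Option[OrderRow],
                          deliveryByOrderId: Long => Option[DeliveryRow]) {

      def find(id: Long): Option[Order] =
        orderById(id).map { row =>
          val delivery = deliveryByOrderId(id).map(_.method) match {
            case Some("overnight") => Overnight
            case Some(_)           => Regular
            case None              => Unknown
          }
          Order(row.id, row.customerId, delivery)
        }
    }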

So please: don’t just whack in another nullable column and then wonder why queries on your database are slow.

Programming, ThoughtWorks

Aftermath of a Geek Night

Last night was my first Geek Night at ThoughtWorks. There have been many Geek Nights but this was the first under the management of Paul Nasrat and myself.

I was pretty happy with the evening, and I think both speakers and audience came away happy too, which is great. Paul and I will be reviewing the feedback from the event before the next one on the 12th of June.

I learned about state machine support in JMock, something I have never used despite having used the framework a lot. The audience also got to see how Nat and Steve think JMock tests should look. It made me realise that in the past I have tended to set assertions in my mocks rather than using stubs. Well, no more! Allowing is my new friend.

During the dojo I also got to do a segment on a classic refactoring: a block of code to a method, then to a method in a private class, then to a private class implementing an interface, and finally to a collaborator decoupled by an interface. It is a classic technique (and you will be able to see it in the dojo code when it gets posted this weekend), and seeing it put to use by Nat, who is a great developer and someone who really loves to code, was very cool. It was an experience I genuinely felt privileged to be part of, and I hope the other people in the pairs felt the same.
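The dojo code will show the real thing when it is posted, but the end state of that progression looks roughly like this (sketched in Scala with invented names; the trait plays the part of the interface and the private class sits behind it):

    // The collaborator is decoupled from its host by an interface (a trait here).
    trait PriceCalculator {
      def total(pence: Seq[Long]): Long
    }

    object PriceCalculator {
      // The original inline block, first extracted to a method, then moved into
      // a private class that implements the interface.
      private class Simple extends PriceCalculator {
        def total(pence: Seq[Long]): Long = pence.sum
      }
      def simple: PriceCalculator = new Simple
    }

    // The original host of the block now depends only on the interface,
    // so a test can hand it a stub or a mock instead.
    class CheckoutService(calculator: PriceCalculator) {
      def checkout(pence: Seq[Long]): Long = calculator.total(pence)
    }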

I would also hope that it illustrates how mocking should lead to changes in the design of objects rather than “pickling” their behaviour at a certain point in time.
