Programming

Learning to work atomically

I have been doing a lot of work with MongoDb recently and I have made a few noob mistakes despite being relatively well-grounded in the theory. One of the key mistakes I have made with the Java driver is not having it in the right mode: by default the driver will not block on an insert; you need to be in Safe Mode for that to happen.
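The post is about the Java driver, but here is a minimal sketch of the same idea in Python with pymongo (the database and collection names are invented). In today's drivers "Safe Mode" has become the write concern: w=0 is the old fire-and-forget behaviour, w=1 blocks until the server acknowledges the write.

```python
from pymongo import MongoClient, WriteConcern

db = MongoClient("mongodb://localhost:27017")["example"]

# w=0: the old default. insert_one() returns before the server has
# acknowledged the write, so an immediate follow-up query can miss it.
unsafe = db.get_collection("records", write_concern=WriteConcern(w=0))
unsafe.insert_one({"batch": 1, "status": "pending"})

# w=1: the equivalent of "Safe Mode". The call blocks until the server
# confirms the write, so a subsequent update will find the document.
safe = db.get_collection("records", write_concern=WriteConcern(w=1))
safe.insert_one({"batch": 1, "status": "pending"})
```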

What is the impact? Well, if you are trying to update a record that you have just inserted and the update neither fails nor is applied, then chances are the update failed to find the record because it wasn’t there yet when the update query ran. Of course, a few milliseconds later it appeared, and it is there at the end of the batch process.

Updates in Mongo consist of a query and a data-change operation, and there is an art in getting the query to work on the set of data you want it to. I find myself doing a conditional match in Scala, thinking “at this point is that still going to be valid?”, and then going back and tweaking the query so that the update is guaranteed to be valid at the point it happens.
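As a sketch of what that tweaking looks like (pymongo again, field names invented): rather than checking the condition in application code and hoping it still holds, fold it into the update’s query so the check and the change are atomic.

```python
from pymongo import MongoClient

records = MongoClient()["example"]["records"]
doc_id = 42  # stands in for a real document id

# Risky: the document can change between the find and the update.
doc = records.find_one({"_id": doc_id})
if doc and doc["stock"] > 0:
    records.update_one({"_id": doc_id}, {"$inc": {"stock": -1}})

# Safer: the condition is part of the update's query, so the change
# only happens if the condition still holds when the update runs.
result = records.update_one(
    {"_id": doc_id, "stock": {"$gt": 0}},
    {"$inc": {"stock": -1}},
)
if result.modified_count == 0:
    pass  # the document no longer matched; handle the miss here
```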

Today I spent a lot of time buggering about trying to avoid writing keys in the document that held no data. After doing it I realised that I could have just written a single remove statement that stripped the empty keys in one big cleanup after the data had been stored.
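Something like this one-pass cleanup is what I mean; the field name is invented and the predicate would need to match whatever “no data” means in your documents:

```python
from pymongo import MongoClient

records = MongoClient()["example"]["records"]

# After the batch load, strip the key from any document where it ended
# up empty, instead of guarding every individual write against it.
records.update_many(
    {"optional_field": {"$in": [None, "", []]}},
    {"$unset": {"optional_field": ""}},
)
```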

Atomic independence also means losing some things that we take for granted, like sequence ids. People like numbers, but guaranteeing even ascending values can rapidly become a nightmare if you want to avoid contention and a single point of failure.
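The standard workaround in Mongo is an atomic $inc against a counters document, sketched below with pymongo (names invented). It does give you ascending ids, but every writer now serialises through one document, which is exactly the contention and single point of failure just described.

```python
from pymongo import MongoClient, ReturnDocument

counters = MongoClient()["example"]["counters"]

def next_id(sequence):
    # An atomic increment-and-fetch on a single counter document.
    # Correct, but all writers contend on this one document.
    doc = counters.find_one_and_update(
        {"_id": sequence},
        {"$inc": {"value": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return doc["value"]

print(next_id("orders"))  # 1, then 2, then 3, across all clients
```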

Cursors are similarly tricksy. I have a long-running batch job and I realised today that it runs long enough that you cannot guarantee a known state by the time it finishes. Instead you have to use these kinds of “loop until there’s nothing left to do” constructs, where the loop condition expresses the state of the store you are trying to achieve and you stop when you get at least one cursor that has no entries.
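In pymongo the construct looks something like this (field names invented); the query expresses the end state you want, and the empty cursor is the signal that you have reached it:

```python
from pymongo import MongoClient

records = MongoClient()["example"]["records"]

# Loop until a query for unprocessed documents comes back empty. The
# loop condition describes the store state we are trying to achieve.
while True:
    batch = list(records.find({"processed": {"$ne": True}}).limit(100))
    if not batch:
        break  # at least one empty cursor: the work is done
    for doc in batch:
        # ... do the real work here, then mark the document so it
        # drops out of the next iteration's query ...
        records.update_one({"_id": doc["_id"]},
                           {"$set": {"processed": True}})
```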

There’s a lot of stuff about datastores that is ingrained deeper than you realise and it takes more than one difficult experience to start genuinely thinking differently about things.

Programming

Redis 2, worth the hype

So I’ve flirted a little bit with Redis but never really had anything that fitted its solution profile, until this week!

I needed to cross-reference postcodes with their corresponding longitudes and latitudes. I tried a few other solutions, but in the end I decided that a postcode was a nice normalisable key (it just needs the country details added) and that, since I had thousands of records to relate, I really valued speed above everything else.

Redis (in conjunction with the Python client library) tore through the data set, inserting approximately 1.7M records and reading roughly 300K entries. I also liked using the hash functionality to store the longitude and latitude under the same key rather than having to write my own logic to create the pairing.
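A sketch of that pairing with redis-py; the key scheme (country code plus normalised postcode) is a guess at the sort of normalisation described:

```python
import redis

r = redis.Redis()  # assumes a local Redis server

# One hash per postcode keeps the pair together under a single key,
# so no hand-rolled pairing logic is needed.
key = "postcode:GB:SW1A1AA"  # invented key scheme
r.hset(key, mapping={"lat": "51.501", "lng": "-0.142"})

# Both values come back in one round trip.
lat, lng = r.hmget(key, "lat", "lng")
print(lat, lng)  # b'51.501' b'-0.142'
```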

Redis really helped solve my problem and lived up to its promises. It definitely has a place in my toolkit from now on.

Programming

Why do you think Riak is a column-store?

Someone is wrong on the internet; the unfortunate thing is that the someone is me. In one version of my NoSql article for ThoughtWorks I put Riak under the heading of a column store. Twitter feels this is wrong and that it is better classified as a distributed key-value store. I agree this description is more accurate than mine, but it feels like calling limes, lemons, oranges and grapefruit citrus fruits.

When I was putting together the article I knew I would have to look at the BigTable and Dynamo derivatives that were generally available. Now BigTable self-identifies as a column store whereas Dynamo describes itself as a distributed key-value store. However both systems have a lot of properties in common:

  • Peer-aware
  • Fault-tolerant
  • Distributed
  • Scalable to demand

Now for me these common characteristics also explain why something like Redis isn’t the same as Riak, and why it’s an over-simplification to just call Riak and Cassandra distributed key-value stores.

So I saw this similarity between the situations (and here’s where the mistake occurs) and thought: well, if you look at things like Riak that allow metadata on their buckets, you can consider them to be a column store, because the columns are the keys on the metadata and the bucket is actually just a key of a special type. Each key can have a variable number of columns but must have one, the bucket. Then when you query the metadata you are kind of just doing a column sort and search. Brilliant! I can just lump all these different systems under this term.

Now I don’t want to jump to another label; I want to try and get the description of the group more or less correct. Looking at the things that are common, I feel that something along the lines of “P2P clustered datastores” might be more accurate.

Programming, Web Applications

Can you use NoSql?

I think the answer is yes. The reason is that traditionally relational datastores have ended up being the dumping ground for data. Everything has ended up there, and with the advent of new data storage technology there is a chance to rummage around the various piles of data and ask whether things are in the right home or not.

One thing I’ve been doing a lot recently is data-driving HTML form components. That’s a lot easier when you are just reading the data out of documents and lists rather than out of tables. The first advantage is that you don’t have to size your option text, for example. Variable-length text labels? No problem. The second is that you can move away from numeric values to text-based slug keys, or even use existing conventions like ISO language short codes.
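As an illustration (document shape and names invented), the options for a select box can live in a single document, keyed by slug, and be rendered straight from the store:

```python
from pymongo import MongoClient

options = MongoClient()["example"]["form_options"]

# One document per form field: slug values, labels of any length.
options.replace_one(
    {"_id": "preferred-language"},
    {"_id": "preferred-language",
     "choices": [{"value": "en", "label": "English"},
                 {"value": "cy", "label": "Welsh (Cymraeg)"}]},
    upsert=True,
)

doc = options.find_one({"_id": "preferred-language"})
for choice in doc["choices"]:
    print('<option value="{value}">{label}</option>'.format(**choice))
```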

You don’t have to use numbers with relational data of course but it tends to happen due to leaky ORM solutions that are orientated around the Long Primary Key.

Another area where you can probably take advantage of a NoSql store is the small bits of text that occur around your site but which should be maintained by business owners rather than the front-end team. Think of those straplines, boxed text and success stories. Maybe they are stored in CLOBs somewhere in the database, perhaps in a table called something cryptic like user_text. Let’s liberate that data into a key-value store!

I find myself using a lot of Textile and Markdown text in my sites and it is an almost trivial exercise to process and display it from a NoSql database. I would encourage you to give it a go: it’s low risk, but it should illustrate some of the benefits of the new stores and suggest the kinds of other problems in your application that some NoSql could solve.
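For example, with redis-py and the Python markdown package (the key name is invented), serving business-owned copy out of a key-value store is a few lines:

```python
import markdown  # pip install markdown
import redis

r = redis.Redis()

# The strapline lives in the store, where an admin screen can let the
# business owner edit it, rather than in a cryptic CLOB table.
r.set("copy:homepage:strapline", "Lemonade, *fresh* today")

raw = r.get("copy:homepage:strapline").decode("utf-8")
html = markdown.markdown(raw)
print(html)  # <p>Lemonade, <em>fresh</em> today</p>
```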

Programming

The Cathedral and the Lemonade Stand

Software is big, hard and complicated. It has also traditionally been long-lived, often lasting beyond all the reasonable expectations of its creators.

These realities have driven a lot of software “best practices” in the last few years: suites of tests to make sure the software is easy to change with confidence, the understanding that software needs to be complete in itself rather than accompanied by auto-generated documentation files or 200-page manuals, and the understanding that software is written to be read, not just to do things.

It has also led to more difficult arguments that have yet to be won: things like the fact that software needs to be considered more like infrastructure, with ongoing costs. Most people realise that if they don’t clean their buildings they get dirty, and if they don’t service their cars then eventually they stop working. Most people, though, are happy to pay large sums creating software only to avoid paying anything further, thereby creating a system that slowly slips into weed-ridden obsolescence.

It is this tendency to regard software as something you purchase once rather than an ongoing investment that I want to talk about here.

An interesting thing is starting to happen just now with the arrival of high-productivity languages such as Python and Ruby and flexible NoSql data solutions like CouchDb and key-value stores. Software is probably easier to create now (without compromising the practices we’ve come to understand are important) than ever before. It is also getting easier and easier to deploy applications, with cloud services like Google App Engine and Heroku for the web and EC2 for raw machines.

In short, we can now turn around software very quickly if we choose to. This creates an interesting avenue for tackling the maintenance problem by caving in to what budget holders tend to do naturally. Budgets are generally spent on specific items of functionality; maintaining that functionality as a service tends to come out of a generalised pot, if at all. Normally arguments about solving this problem have focused on trying to include the true cost of the product in the initial budget.

But why? Why don’t we just create a piece of software and then, later, when we want it to do something else, throw it away and start again? Take a lemonade stand. You don’t build a lemonade stand out of stone and marble with a team of master masons. When the hot weather comes round you grab some wood and cardboard and make a stand that will be good enough for the weekend or evening that you need it for. If next week it is still hot and you want to sell lemonade again, you just make it again.

Websites are a lot like this: they get old very quickly, and on the front end they probably have a six-month lifecycle before design trends change, new browser standards are implemented or some new way of using the web is discovered. Building a website to last four years (or the 10 or 20 years that key infrastructure software tends to be used for) is a waste of time; it’s very unlikely to repay your effort over its actual lifetime of, maybe, two years.

Even with software that does last 10 years there is often a feeling of regret that, due to the tremendous sunk costs of developing it, it is not viable to move it to commodity hardware or the cloud or, in the case of banking mainframe code, to make any change to it of any significance.

We can now develop really powerful applications in a timeframe of four to ten weeks. If that application lasts three months with no additional effort and we then spend another four to ten weeks replacing it, are we not actually better off? We are able to implement our lessons learnt sooner, solutions are much more flexible, and mistakes do not come with long-running costs attached. A bad idea can just be left as is.

I’m not arguing for a complete rewrite every three months; just like the lemonade stand, we can reuse the good bits, our sign or the good piece of wood for the counter perhaps. Things like Guerilla SOA are taking us in these directions anyway, with loosely coupled services using standards-based protocols and interchange formats. If we write a great authentication service we can keep it, or we could replace it with OAuth or OpenId. Our options are wider when we play in the short term rather than the long term.

Programming

Continuous Testing

Continuous testing is one of those things that has crept up on me slowly. About two years ago I was aware of people using a trigger on their TextMate save to run their tests and, if green, commit to git. At the time it felt like too much effort for too little gain, but it was a cool trick.

Now as we forage out into the post-Java world we are starting to get some pretty cool revisions of familiar tools, and among the most engaging for me are the continuous build tools. The daddy is clearly SBT, which while simple is also tremendously sophisticated. Adding a REPL to a build is a simple change but has all kinds of nice consequences, my favourite of which currently is the continuous test (~test) target, which detects changes in your source and test files, compiles them and runs your tests. SBT cuts out the whole compile-link-run cycle for you: you just make a change, hit save, see the consequences and code again. It’s very fast and far more effective at giving feedback than any of the current IDEs (all of which need to get on this bandwagon fast if they want to stay relevant).

Clojure by comparison has been suffering in this regard, with Leiningen becoming an unfortunate early de facto standard despite standing shoulder to shoulder with the benighted Maven. The key thing that Leiningen does wrong is to stay at the command line and force you to cold-boot a JVM with each new command (the second is dependency resolution; SBT favours Ivy). Fortunately Lazytest by Stuart Sierra may save us here. Although still alpha, Lazytest is an awesome way of developing Clojure, and it’s hard to beat that feeling of smug satisfaction as the tests go green.

It is these kinds of step-change enhancements in development that are going to carry us forward, more than the shopping lists of features that the new languages have or lack.

Programming, Work

Code Coverage 90% meaningless

I have always been quite sceptical about the value of code coverage metrics. Most of the teams I have seen that were productive and produced code with few defects in production did not use code coverage as a metric. Where code coverage is obsessively tracked it seems to be more as a management reporting metric (often linked to “Quality”) and rarely seems to correlate with lower defects or malleable software; instead it often appears in low-collaboration or low-trust environments.

Code coverage has most benefit in an immature unit testing environment or in a “test-after” team. With test-after you have to have code coverage to ensure that people remembered to test all the possible execution paths. My personal preference is to push TDD as a practice in preference to code coverage, because a side-effect of TDD is that you get 100% code coverage.

Code coverage is also quite a different beast from static or complexity analysis of code bases. Static analysis is a useful tool, and some complexity measures actually make good indicators of the “quality” of a code base. It is also not the same as instrumented code, which is invaluable when dealing with code you’ve inherited or for discovering how much of the codebase actually gets used in production.

Groovy, Programming

Low Expectations for the Build

I attended the talk on Gradle by Hans Dockter tonight, and while I found myself agreeing that Maven is wholly unsatisfactory, I did end up thinking that our expectations of build tools in the Java space are really low. What kind of things does Gradle offer us? Proper event interception, genuine integration with the build lifecycle, build targets dynamically defined at runtime, a directed acyclic dependency graph.

Looking at the list, some of these things you can’t believe are not part of our standard build package. We should be able to know when a build starts and stops and be able to attach code to those events. We should have decent target resolution that avoids duplication of tasks.

Gradle is head and shoulders above the morass that is Maven and is clearly superior to the ageing but faithful Ant, but that it manages to be so with so little functionality is a shame.

Programming, Software

The One True Layer Anti-Pattern

A common SQL database anti-pattern is the One True Lookup Table (OTLT). Though laughable, the same anti-pattern often occurs at the application development layer. It commonly appears during the mid-life crisis phase of an application.

Initially all objects and representations are coded as needed and to fit the circumstances at hand. Of course the dynamics of the Big Ball of Mud anti-pattern are such that soon you will have many varying descriptions of the same concept and data. Before long you get the desire to clean up and rationalise all these repetitions, which is a good example of refactoring for simplicity. However, at this point danger looms.

Someone will eventually point out that having one clean data model works so well that perhaps there should be one shared data model that all applications use. This is superficially appealing and is almost inevitably implemented, with a lot of fighting and fussing to ensure that everyone is using the one true data model (incidentally, I’m talking about data models here, but it might be services or anything else where several applications are meant to drive through a single component).

How happy are we then? We have created a consistent component that is used across all our applications in a great horizontal band. The people who proposed it get promoted and everyone is using the One True Way.

What we have actually done is recreate the n-tier application architecture. Hurrah! Now what is the problem with that? Why does no-one talk about n-tier application architecture anymore? Well, the issue is Middleware, and the One True Layer will inevitably hit the same rocks that Middleware did and get dashed to pieces.

The problem with the One True Layer is the fundamental fact that you cannot be all things to all men. From the moment it is introduced, the OTL must either bloat and expand to cover all possible Use Cases or hideously hamstring development of the application. If there were a happy medium between the two, someone would have written a library to do the job by now.

There is no telling which of the two choices will be made; I have seen both, and neither has a happy outcome. Either way, from this point on the layer is doomed: it becomes unusable, and before long the developers will be trying to work around the OTL as much as possible, using it only when threatened with dismissal.

If the codebase lives long enough, what usually happens is that the OTL sprouts a number of wrappers around its objects that allow the various consumers of its data to do what they need to. When the initial creators of the OTL are eventually unable to force the teams to use the layer, the wrappers tend to suck up all the functionality of the OTL and the library dependency is removed.

In some ways this may seem regressive; we are back at the anarchy of objects. In fact what has been created is a set of vertical slices that represent the data in the way that makes sense for the context in which it appears. These slices then collaborate via external APIs that are usually presented via platform-neutral data transfer standards (HTTP/JSON, for example) rather than via binary compatibility.
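A minimal sketch of that kind of collaboration (names and URL invented, using Flask for brevity): each slice owns its own representation, and the only coupling is HTTP and JSON.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# The "accounts" slice owns its own representation of a customer and
# publishes it as JSON. Consumers depend on this contract, not on a
# shared binary library.
CUSTOMERS = {"42": {"id": "42", "display_name": "A. Customer"}}

@app.route("/customers/<customer_id>")
def customer(customer_id):
    return jsonify(CUSTOMERS.get(customer_id, {}))

if __name__ == "__main__":
    app.run(port=5000)

# A consuming slice needs nothing but an HTTP client, e.g.:
#   import requests
#   requests.get("http://localhost:5000/customers/42").json()
```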

My advice is to avoid binary-dependent interactions between components and to avoid creating very broad layers of software. Tiers are fine, but keep them narrow and try to avoid any tier reaching across more than a few slices (this particularly applies to databases).
