Keep the focus on the read

One of the interesting things about Wazoku‘s Startup Challenge app is that a lot of the functionality is created via “out of the box” CouchDB features. In fact it is often where we haven’t lent heavily on the features of our store and frameworks where we have issues.

One of the interesting things we decided to do in the app relatively late in the day was provide little encouragements to say how many more votes an entry needed to get to the next place in the ladder. As this was a late feature we didn’t really think through where this feature would sit. We had code that re-ranks entries when their vote ordering changes and so when an entry was being re-ranked it also acquired the target to beat at the same time.

With a store like CouchDB you are really aiming to keep on reading data and minimising writes. That’s via denormalisation and also about strategies to generate related and derived data when you are changing the parent data.

So this placement made sense from that point of view. It was only later that I have begun to realise that we were choosing the wrong point to read. With hindsight it is actually only necessary to calculate the target entry when someone looks at the entry. This is because the views of the entries are distributed unequally and the vote totals already exist as a CouchDB view and therefore we can do a key lookup to find all entries with more votes than the current entry when needed.

If we wanted to cache that result to avoid needless recalculation we would be better off storing the information in front-side cache like Memcached or Redis but in practice key reads in CouchDB are pretty damn fast and low load.

So we thought we were saving ourselves problems by denormalising derived data but in fact we were creating a lot more work at a point where it is uncertain that the additional data will ever be consumed.

Sometimes it can be hard to pick the right point to read!



How do I query data with CouchDB?

This question comes up a lot when dealing with Couch and I have given various answers before but my latest answer is simply that you don’t. In reality what you want to do in Couch, like a lot of the NoSql databases, is look for key lookups.

Now the key lookup may be a range of keys you are interested in but in reality there is nothing in Couch that is similar to the SQL “WHERE” clause.

So if you cannot do queries then how do you relate data? Well that’s the thing about storing documents instead of rows, if you have related data then you have to ask whether that data has any meaningful existence outside of its parent. In relational terms it is like asking whether you ever access the content of table outside of JOIN with its parent.

Initially you might think: of course I do! But often data is often explicitly related to its parent’s primary key by things like ORDER and GROUP BY. In these kind of cases then you move the related data into the parent record, effectively denormalising to avoid a lookup.

If the data does have a meaningful existence outside the parent (for example in Wazoku comments are an example of a piece of data that exists separately from the thing they are a comment on) then you have a few options but essentially instead of querying you are still trying to do a direct key lookup.

The first simple case is to include a reference to key of the related data in the associated document. Then from one key lookup you can go direct to the next. As an example we store a list of comment document ids on any document that can be commented on and then we can load the comments as needed (often the count of the comments can be as relevant as the full content). I describe the ids used this way as “forward references” as they lead you on to the related document.

The second, slightly more involved approach, is the creation of a view that allows the document to be looked up via an alternative key. For example if we store the document id of the thing being commented on in the comment document under the key comment_on we can then create a mapping view of all comment documents to their comment_on key. Then given any document we can simply do a direct lookup on the key in the view to determine whether it has any associated comments.

The final common technique I use is something I refer to as “unrolling” of collections. So again we create a CouchDB view that consists just of a map job and in it we take each item in an array of “forward references” (related document ids) and emit a document in each view mapping the id to the current document id.

So if an idea document has five comment forward references the resulting view will have five documents, each relating a comment document id to the idea document id.

If things get more complicated then I also have the Couch databases indexed in Elasticsearch and in Neo4J and these alternative views of the data give me powerful adhoc queries on properties or relationships in the data.

In general though I am always trying to think ahead as to how my documents relate and then express that in terms of a key lookup so that I am always working with the simplest case.

Programming, Python

Django and JSON stores: a match in heaven

My current project is using CouchDB as its store and Django to implement the web frontend. When you have a JSON store such as CouchDB then Python is a natural complement due to its brilliant parsing of JSON into native data structures and its powerful dictionary data type that makes it really easy to work with maps.

In a previous project using Python and Mongo we used Presentation objects to provide domain logic on top of the raw data maps but this time around I wanted to try and cut out a layer and just work with maps as much as possible (perhaps the influence of the Clojure programming I’ve been doing on the side).

However this still leaves two problems that need addressing. Firstly Django templates generally handle dictionaries like a dream allowing to address them with the standard dot syntax. However both Mongo and Couch use leading underscores to indicate “special” variables and this clashes with the Python convention of having a leading underscore indicate a private member of the class. The most immediate time you encounter this is when you want to use the id of a document in a url and the naive doc._id does not work.

The other problem is the issue of legitimate domain logic. In Wazoku we want to use people’s names if they have supplied them and fallback to their email if they haven’t supplied their name.

The answer to both of these problems (without resorting to an intermediary object) is Django’s filters. The necessary logic can be written in a few lines of Python that simply examines the dictionary and does the necessary transformation to derive the id or user’s name. This is much lighter than the corresponding Presentation solution.

Web Applications

Replicating data in Cloudant and Heroku

Heroku allows you to use CouchDB via the Cloudant cloud service which is great but compared to the documentation for the relational stores it is not clear how you are meant to deal with backups and importing of data. I also couldn’t find a way to use Futon on the Heroku instance (which comes from the Heroku account, you can’t use your own Cloudant account with the plugin) or share the database instance with my personal Cloudant account.

This post from Cloudant helps a lot, essentially you can get your Heroku instance URL and then the cool thing about Couch’s painless replication is that once you have a Couch URL you can replicate that database to a local instance or even back into Cloudant.

heroku config --long

curl CLOUDANT_URL/_replicate -H 'Content-Type: application/json' -d '{"source" : "CLOUDANT_URL", "target" : "TARGET_URL"}'

You can edit the database locally and then replicate back to the Heroku instance by just swapping the URLs in the Curl above.

That seems to pretty much be it. I’ve replicated my data out of Cloudant and then back into it, which feels bizarre but it’s all symmetrical with Couch and it’s a handy cloud-based backup mechanism.


Get your own Couch

At Erlounge in London I recently had the chance to catch up with the guys J Chris Anderson and Jan Lehnardt. The conversation was interesting as ever and I was intrigued by JChris’s ideas that CouchDb has twisted conventional data storage logic on it’s head by making it easy to create many databases with relatively small amounts of information; the one database per user concept.

More importantly though I discovered it was okay to talk about the hosted Couch instances that (who also do CouchDb consulting if you need some help and advice) are offering. The free service offers you a Basic authentication account with 100Mb of storage to play around with Couch to your heart’s content. Paying brings more space but also more sophisticated authentication options.

The service is the perfect way to play around with Couch and learn how you could use it go get an account today! It’s on the Cloud as well: schema-less data on the Cloud, how buzzword compliant is that?!

On a more serious note, this is an excellent service and one I have been asking for as it allows people who have no desire to build and maintain Erlang and Couch to use the datastore as a web service rather than as a managed infrastructure component.

Web Applications

Nicely Couched is the new wiki application on the block. I got pointed towards it because it is running on top of CouchDB. But having mucked around with it for a bit I have to say that it’s not just a good example of the kind of thing you can do with Couch, it is also the most exciting take on Wiki software I’ve seen for a while.

What is there to like? Well firstly there is the idea of the anonymous wiki that you can create on the fly. Products like GitHub have significently reduced the barrier to setting up an open source project. Just choose a name and that’s it you are away coding and sharing your code. is the same, got some text that you need to organise? Got some ideas you want to get down? Just create a wiki, write some stuff and then if it is working out claim it and give it a decent name. just feels totally spontaneous.

It also totally resolves the issue between using a wiki-syntax (in this case Markdown) and having the pretty WYSIWYG interface. The near-realtime preview is fantastic and makes you wonder why you have ever had to click on another tab or button to see your preview.

The site has a really unfussy design (except for the goddamn rounded corners (stop it, this isn’t 2006)) but gives you a few customisation options.

The support has also been great and turnaround has been good on issues.

The subdomain naming is really easy to use, compare how easy it is to mucking around with settings in something like Google Sites.

If you are looking for a wiki or you want to see some CouchDB in action I totally recommend this site.


CouchDB: Pros and Cons

I have been looking at a lot of the different databases that are working outside the traditional SQL style RDBMS (and I want to point out that having a true RDBMS might actually be something very different to what we have now). I have spent a lot of time with CouchDB compared to the others because it seems to occupy a special niche in data storage. It also builds in a REST interface right off the bat which effectively decouples the storage mechanism from the delivery and makes the whole system very language neutral.

CouchDB represents the first serious contact I have had with JSON and in deciding whether CouchDB is right for your project you really need to understand what JSON does and doesn’t offer. The first important thing is that JSON offers a kind of compromise from the heavy data definition in SQL or XML Schema and totally typeless data of something like flat files or Amazon SimpleDB. However to achieve that simplicity there are two consequences. Firstly data integrity is palmed off onto client applications and you will need to check data going in and coming out. I don’t personally believe that there is ever an application that doesn’t have some explicit or implicit data integrity. Second JSON documents can be very rich and complex but they cannot have detailed field structures, a telephone number is simply going to decompose into two numbers and its hard to get much finer grain description of the data. As a rule of thumb if you would normally create a well-formed XML document for the data or you would define a SQL data definition that would simply declare all the types of the column using all the defaults and never specifying a NOT NULL column then you probably have a good match.

JSON also favours dynamic languages over static types like Java. You need to decide if you are going to be able to take advantage of the full set of features in your chosen client language. XML might remain a better choice for static languages as you describe document instances declaratively.

Other good features about CouchDB include the fact that its incubating for Apache and will be a good addition to the projects there. It also leverages existing Javascript skills which are probably more common than people who know XPath, XQuery or even XML Schema and Relax NG. It has excellent support for incremental data and also fits well with highly irregular data like CMS page content. It also has an excellent UI that makes it easy to interact with the data. Finally it seems to have a good solution for scalability without doing anything too esoteric.

The negative features are tricky because we obviously have an alpha project here that is probably closer to the start of the public testing phase than the end. Some of these cons may well be addressed before the first release candidate. The first obvious omission for a server product is that there is no security built into the server. Just supporting optional HTTP Authentication would be enough to make it practical to start running some experimental servers for something more than sandbox exercises. One thing that is a major difference to the feature set in the standard pseudo-RDBMS is that there is no way of interacting with sets of data. There is a method for changing multiple documents in one pass but what you really want to be able to do is apply a set of changes to documents identified by a query, the equivalent of SQL’s UPDATE.

Related to this is the fact that data in CouchDB is currently heavily based on silos. If I have a set of data referring to authors and a set of data relating to books then I currently need to duplicate data in both. One of the problems that the developers are going to face in evolving CouchDB is how to address this without introducing a solution so complex that you ask why you are not using an RDBMS? Similarly I notice in the roadmap there is an item about data validation. If you start introducing data validation and rules for validating data then before long there is going to be a question as to why you don’t simply use one of the existing document systems as all the current simplicity will have gone.

One thing that definitely needs improvement is error logging and reporting. Often the only error feedback you have is a Javascript popup that says “undefined” and a log message that tells you that the Erlang process terminated. There needs to be some more human-readable issue logging that points you towards what is going wrong.