What is CouchDB?
CouchDB is a dedicated document based database that kind of puts it in the same space as Exist, Xindice and Oracle Berkeley XML Db. What makes it very different is that instead of building around XML it is built around Javascript. The document storage format is JSON and the query language is Javascript. Something like Exist uses a minimum of well-formed XML for data and XPath or XQuery for querying.
Now the merits and flaws of the various markup languages is really holy war and I don’t want to get into it too much. I think it is enough to say that CouchDB aims to be lightweight in implementation without being lightweight in features or performance. JSON is very popular in the web space and by focussing on making common cases easier it is less work to use than XML. The more complex the data or specification requirements though the less viable it is as a solution.
Using JSON means that stored data is extremely flexible in structure, unlike relational data you can have very gappy or bitty information and not have a problem. Take something like a customer relationship management system. This is often a poor fit to relational data because you tend to discover information in small tranches. Initially a lead might be a first name and a telephone number. As the relationship develops you add a surname, a company name, a position, an collection of issues or queries and so on. JSON is a really good way of capturing this evolving picture of things.
Using Javascript to query this data surprised me initially, I feel pretty comfortable with my limited Javascript skills and therefore didn’t have a problem with it as a choice. I’ll come back to this later in the post but actually the idea sounds strange but is a natural fit when you use it. Again it is about the right tool for the right job but in the context of the data you are using in CouchDB creating a query by combing iterators, if statements and the functional programming map function is a better fit than trying to adapt SQL.
The final component of CouchDb is the functional language Erlang which is used to service the HTTP REST interface that CouchDB uses to provide an interface to the database. Erlang provides very cheap, lightweight concurrency that provides a good fit for handling lots of HTTP requests.
The interesting thing is that CouchDB automatically provides the REST front end that you tend to build on top of Java object stores like JBoss Cache or Coherence (which I have also been looking at recently). With those products you tend to stay native until you need to share data with other languages or systems and then you tend to REST serve it out as XML with a lightweight HTTP server. If you can see that need ahead of time then CouchDB might well be compelling due to the built-in support.
Installing
After reading about CouchDB I wanted to take a look at this new document database in a bit more detail. I managed to install the base tarball from the Google Code page easily. I then cut and pasted the Ruby sample REST code from the CouchDb Wiki into irb and on running the server I found that I could store and retrieve documents really easily. The combination of interactive Ruby and the REST interface made it easy to play around with data objects.
However when trying to run the tests via the administration web page only four of the tests would pass. The only error message was just an Error Code of 136. Since I seemed able to store and pull documents (and could confirm that the datafile had been generated and had data) I put it down to alpha flakiness and pushed on.
It was only when I tried executing ad-hoc queries and kept having them fail that I realised what was linking my problems. Every time the main Erlang server handed off a query to the Javascript interpreter the process died. Erlang is really resilient to thread death so there was zero impact from this.
CouchDB uses the Mozilla Spidermonkey libraries to construct its own View Server program which is configured via the couch.ini file. Having double-checked that the interpreter was there, was executable and could read the main.js script it was passed I was baffled. I decided to pull down the code base from SVN and then rebuild it. Building from SVN is a bit more involved than the tarfile and its worth double-checking the documentation on the wiki before you get stuck in. Having built and installed myself a new copy I still had the same problem and was feeling pretty cheesed off. With no other source of help I headed to IRC and checked out the Freenode #couchdb channel. There Christopher Lenz gave me a good steer that he had had a problem with Erlang’s HIPE (high performance virtual machine) on Debian.
I had been using Erlang 5.5.5 which I think I used apt-get to obtain. Downloading the latest Erlang release (5.6.1, they release pretty frequently) I tried building that from source and hung when configure tried to check for floating-point exceptions. Running configure with disable-hipe allowed configure to complete and then Erlang was straight-forward to build and install.
Restarting the CouchDB server I found my queries were working and all but one test (“conflicts” fails on an error assertion and this is apparantly a known issue) passing, hurrah!
The machine I had been working on was a Feisty Fawn Ubuntu instance so I loaded up Gutsy Gibbon in VM Ware Player and tried building Erlang and CouchDb there. Erlang was fine, even with Hipe. For CouchDB I used the latest Erlang I had built and used apt-get to obtain the Spidermonkey and ICU depdencies. Everything worked out of the box with this combination.
This problem highlights a few of CouchDB’s weaknesses. Firstly it is all bleeding edge stuff, Erlang is bounding along and so is CouchDB. Secondly there is no test suite in the source distribution that can help troubleshoot issues. Finally the error logging is tricky to understand for a neonate. I’m certain that having an EUnit test suite would have directed me towards the fact that the Javascript was failing much quicker than using trial and error and IRC.
The final issue was where CouchDB stores its logs and datafiles by default. These default to /usr/local/var which is not really where I want to store data like this. /var is a more natural choice and FHS seems to agree. You can change this when building in the etc/couch.ini.tpl file of the source directory but it would be nicer to have a more natural choice by default.
I also tried to install CouchDB on OSX without using a packaging mechanism but instead referring to an instance of Spidermonkey 1.7 I had on my system. While I could get configure to accept the Javascript library it wouldn’t recognise the jsapi.h header, maybe because configure doesn’t define the right Macros when it tries to build the test file.
Working with CouchDB
My first experience of CouchDB’s web interface was that it was good-looking but that the error messages, unresponsive links, Firebug warnings and occasional Javascript pop-ups were all signs that it was a work in progress. On fixing my passthrough to Javascript I had a very different experience of a slick and good looking interface that uses amazingly responsive AJAX to provide a really good experience.
Learning how to generate queries for example is easy with the in-built query browser (I’ll save the actual syntax for a later post) as the feedback from the system is very quick. Similarly creating new databases, documents and fields is actually not torturous but slick and quick.
I’m certain this is because Erlang and AJAX form a natural partnership in creating and servicing small requests that need to be handled quickly. I may be wrong but this is certainly my most positive AJAX webservice experience to date.
The CouchDB wiki contains information on getting started with various languages and I choose Ruby. It was a simple cut and paste into irb and then it was straight-forward to interact with database from the shell and the web client.
One of CouchDB’s more unusual feature seems to stem from Erlang’s concurrency model. Data is assigned a revision and in addition to applying multiple changes in order you can also view previous revisions. That’s a pretty weird feature compared to most of the current datastores. You also need to refer to a revision if you want to update a record. If the revision of the target of the update has changed when you submit your changes your changes get refused. The revision mechanism is also the basis of the replication mechanism although there isn’t enough documentation to understand when the replication pairs check their revision ids.
Incremental data-construction is easy via the web-gui but you need to fiddle a little bit to get the revision number to target if doing it programatically. Presumably a library would make that easier going. For information that has little inherent structure or is very gappy or incremental then CouchDB is a great data store and is currently occupying a niche as far as I know.