Programming

Keep the focus on the read

One of the interesting things about Wazoku’s Startup Challenge app is that a lot of the functionality is created via “out of the box” CouchDB features. In fact, it is often where we haven’t leaned heavily on the features of our store and frameworks that we have issues.

One of the things we decided to do in the app relatively late in the day was to provide little encouragements saying how many more votes an entry needed to get to the next place in the ladder. As this was a late feature we didn’t really think through where it would sit. We had code that re-ranks entries when their vote ordering changes, so when an entry was being re-ranked it also acquired the target to beat at the same time.

With a store like CouchDB you are really aiming to keep reads simple and to minimise writes. You do that via denormalisation and via strategies that generate related and derived data at the point where you change the parent data.

So this placement made sense from that point of view. It was only later that I began to realise that we were choosing the wrong point to read. With hindsight it is only necessary to calculate the target entry when someone looks at the entry. This is because page views of the entries are distributed unequally, and because the vote totals already exist as a CouchDB view, so we can do a key lookup to find all entries with more votes than the current entry when needed.
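A minimal sketch of doing that lookup at read time, assuming a view keyed on vote total whose rows come back sorted by key. The row shape, the `nextTarget` name and the sample data are all illustrative assumptions rather than the app's real code; against a live database the rows would come from a view query starting just above the entry's own vote count.

```javascript
// Rows as a votes-keyed view might return them: sorted ascending by key (votes).
// The ids and totals are made-up sample data.
const voteRows = [
  { key: 3, id: "entry-c" },
  { key: 7, id: "entry-a" },
  { key: 7, id: "entry-d" },
  { key: 12, id: "entry-b" },
];

// Find the nearest entry with strictly more votes than `votes`, and how many
// votes are needed to draw level with it. Returns null for the leader.
function nextTarget(rows, votes) {
  const target = rows.find((row) => row.key > votes);
  if (!target) return null;
  return { id: target.id, votesNeeded: target.key - votes };
}

console.log(nextTarget(voteRows, 7));
console.log(nextTarget(voteRows, 12));
```

Nothing is written back to the entry here, so an entry that nobody looks at never costs a calculation.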

If we wanted to cache that result to avoid needless recalculation we would be better off storing the information in a front-side cache like Memcached or Redis, but in practice key reads in CouchDB are pretty damn fast and low load.

So we thought we were saving ourselves problems by denormalising derived data but in fact we were creating a lot more work at a point where it is uncertain that the additional data will ever be consumed.

Sometimes it can be hard to pick the right point to read!

 

Programming

How do I query data with CouchDB?

This question comes up a lot when dealing with Couch and I have given various answers before, but my latest answer is simply that you don’t. In reality what you want to do in Couch, as in a lot of the NoSQL databases, is key lookups.

Now the key lookup may be a range of keys you are interested in but in reality there is nothing in Couch that is similar to the SQL “WHERE” clause.

So if you cannot do queries then how do you relate data? Well, that’s the thing about storing documents instead of rows: if you have related data then you have to ask whether that data has any meaningful existence outside of its parent. In relational terms it is like asking whether you ever access the content of a table outside of a JOIN with its parent.

Initially you might think: of course I do! But data is often explicitly related to its parent’s primary key by things like ORDER and GROUP BY. In these kinds of cases you move the related data into the parent record, effectively denormalising to avoid a lookup.

If the data does have a meaningful existence outside the parent (for example in Wazoku comments are an example of a piece of data that exists separately from the thing they are a comment on) then you have a few options but essentially instead of querying you are still trying to do a direct key lookup.

The first simple case is to include a reference to the key of the related data in the associated document. Then from one key lookup you can go direct to the next. As an example we store a list of comment document ids on any document that can be commented on and then we can load the comments as needed (often the count of the comments can be as relevant as the full content). I describe the ids used this way as “forward references” as they lead you on to the related document.

The second, slightly more involved approach, is the creation of a view that allows the document to be looked up via an alternative key. For example if we store the document id of the thing being commented on in the comment document under the key comment_on we can then create a mapping view of all comment documents to their comment_on key. Then given any document we can simply do a direct lookup on the key in the view to determine whether it has any associated comments.
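A rough sketch of such a mapping view, run locally against made-up documents. The `emit` stand-in mimics the function CouchDB hands to map functions; the `type` field, the function names and the sample data are assumptions for illustration only.

```javascript
// Stand-in for CouchDB's view engine: collects whatever the map function emits.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function: index every comment under the id of the thing it comments on.
function commentsByTarget(doc) {
  if (doc.type === "comment" && doc.comment_on) {
    emit(doc.comment_on, doc._id);
  }
}

// Made-up sample documents.
const docs = [
  { _id: "idea-1", type: "idea" },
  { _id: "comment-1", type: "comment", comment_on: "idea-1" },
  { _id: "comment-2", type: "comment", comment_on: "idea-2" },
  { _id: "comment-3", type: "comment", comment_on: "idea-1" },
];
docs.forEach(commentsByTarget);

// The direct lookup: every row whose key is the document we care about.
function lookup(key) {
  return rows.filter((row) => row.key === key).map((row) => row.value);
}

console.log(lookup("idea-1"));
```

An empty result for a key simply means the document has no comments, so the same lookup answers both "does it have any?" and "which ones?".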

The final common technique I use is something I refer to as “unrolling” of collections. So again we create a CouchDB view that consists just of a map job, and in it we take each item in an array of “forward references” (related document ids) and emit an entry into the view for each one, mapping the related id to the current document id.

So if an idea document has five comment forward references the resulting view will have five entries, each relating a comment document id to the idea document id.
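The unrolling can be sketched like this, again with a local `emit` stand-in. The `comments` field name and the ids are hypothetical.

```javascript
// Stand-in for CouchDB's view engine: collects whatever the map function emits.
const unrolled = [];
function emit(key, value) {
  unrolled.push({ key: key, value: value });
}

// Map function: one view entry per forward reference, keyed by the referenced id.
function unrollComments(doc) {
  (doc.comments || []).forEach(function (commentId) {
    emit(commentId, doc._id);
  });
}

// A made-up idea document holding five comment forward references.
const idea = { _id: "idea-1", comments: ["c1", "c2", "c3", "c4", "c5"] };
unrollComments(idea);

console.log(unrolled.length);
console.log(unrolled[0]);
```

Given any comment id, a key lookup on this view leads back to the idea that refers to it.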

If things get more complicated then I also have the Couch databases indexed in Elasticsearch and in Neo4j, and these alternative views of the data give me powerful ad hoc queries on properties or relationships in the data.

In general though I am always trying to think ahead as to how my documents relate and then express that in terms of a key lookup so that I am always working with the simplest case.

Programming, Python

Django and JSON stores: a match in heaven

My current project is using CouchDB as its store and Django to implement the web frontend. When you have a JSON store such as CouchDB then Python is a natural complement due to its brilliant parsing of JSON into native data structures and its powerful dictionary data type that makes it really easy to work with maps.

In a previous project using Python and Mongo we used Presentation objects to provide domain logic on top of the raw data maps but this time around I wanted to try and cut out a layer and just work with maps as much as possible (perhaps the influence of the Clojure programming I’ve been doing on the side).

However this still leaves two problems that need addressing. Firstly, Django templates generally handle dictionaries like a dream, allowing you to address them with the standard dot syntax. However both Mongo and Couch use leading underscores to indicate “special” variables and this clashes with the Python convention of having a leading underscore indicate a private member of the class. The most immediate time you encounter this is when you want to use the id of a document in a url and the naive doc._id does not work.

The other problem is the issue of legitimate domain logic. In Wazoku we want to use people’s names if they have supplied them and fall back to their email if they haven’t.

The answer to both of these problems (without resorting to an intermediary object) is Django’s filters. The necessary logic can be written in a few lines of Python that simply examine the dictionary and do the necessary transformation to derive the id or the user’s name. This is much lighter than the corresponding Presentation solution.

Web Applications

Replicating data in Cloudant and Heroku

Heroku allows you to use CouchDB via the Cloudant cloud service which is great but compared to the documentation for the relational stores it is not clear how you are meant to deal with backups and importing of data. I also couldn’t find a way to use Futon on the Heroku instance (which comes from the Heroku account, you can’t use your own Cloudant account with the plugin) or share the database instance with my personal Cloudant account.

This post from Cloudant helps a lot. Essentially you can get your Heroku instance URL, and then the cool thing about Couch’s painless replication is that once you have a Couch URL you can replicate that database to a local instance or even back into Cloudant.


heroku config --long

curl CLOUDANT_URL/_replicate -H 'Content-Type: application/json' -d '{"source" : "CLOUDANT_URL", "target" : "TARGET_URL"}'

You can edit the database locally and then replicate back to the Heroku instance by just swapping the URLs in the curl command above.
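To make the symmetry concrete, here is a small sketch that builds the JSON body for both directions. The `replicationBody` helper is hypothetical, and the two URL values are the same placeholders used in the curl command, not real addresses.

```javascript
// Build the JSON body for a POST to a CouchDB _replicate endpoint.
function replicationBody(source, target) {
  return JSON.stringify({ source: source, target: target });
}

// Placeholders, exactly as in the curl command above.
const CLOUDANT_URL = "CLOUDANT_URL";
const TARGET_URL = "TARGET_URL";

const backup = replicationBody(CLOUDANT_URL, TARGET_URL); // Heroku instance to target
const restore = replicationBody(TARGET_URL, CLOUDANT_URL); // target back to Heroku

console.log(backup);
console.log(restore);
```

The only difference between backing up and restoring is which URL sits in `source` and which in `target`.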

That seems to pretty much be it. I’ve replicated my data out of Cloudant and then back into it, which feels bizarre but it’s all symmetrical with Couch and it’s a handy cloud-based backup mechanism.

Software

Get your own Couch

At Erlounge in London I recently had the chance to catch up with the Couch.io guys J Chris Anderson and Jan Lehnardt. The conversation was interesting as ever and I was intrigued by JChris’s idea that CouchDB has turned conventional data storage logic on its head by making it easy to create many databases with relatively small amounts of information: the one database per user concept.

More importantly though I discovered it was okay to talk about the hosted Couch instances that Couch.io (who also do CouchDB consulting if you need some help and advice) are offering. The free service offers you a Basic authentication account with 100MB of storage to play around with Couch to your heart’s content. Paying brings more space but also more sophisticated authentication options.

The service is the perfect way to play around with Couch and learn how you could use it, so go get an account today! It’s on the Cloud as well: schema-less data on the Cloud, how buzzword compliant is that?!

On a more serious note, this is an excellent service and one I have been asking for as it allows people who have no desire to build and maintain Erlang and Couch to use the datastore as a web service rather than as a managed infrastructure component.

Web Applications

Nicely Couched

Couch.it is the new wiki application on the block. I got pointed towards it because it is running on top of CouchDB. But having mucked around with it for a bit I have to say that it’s not just a good example of the kind of thing you can do with Couch, it is also the most exciting take on Wiki software I’ve seen for a while.

What is there to like? Well firstly there is the idea of the anonymous wiki that you can create on the fly. Products like GitHub have significantly reduced the barrier to setting up an open source project. Just choose a name and that’s it, you are away coding and sharing your code.

Couch.it is the same, got some text that you need to organise? Got some ideas you want to get down? Just create a wiki, write some stuff and then if it is working out claim it and give it a decent name. Couch.it just feels totally spontaneous.

It also totally resolves the issue between using a wiki-syntax (in this case Markdown) and having the pretty WYSIWYG interface. The near-realtime preview is fantastic and makes you wonder why you have ever had to click on another tab or button to see your preview.

The site has a really unfussy design (except for the goddamn rounded corners (stop it, this isn’t 2006)) but gives you a few customisation options.

The support has also been great and turnaround has been good on issues.

The subdomain naming is really easy to use; compare that with mucking around with settings in something like Google Sites.

If you are looking for a wiki or you want to see some CouchDB in action I totally recommend this site.

Programming

CouchDB: Pros and Cons

I have been looking at a lot of the different databases that are working outside the traditional SQL style RDBMS (and I want to point out that having a true RDBMS might actually be something very different to what we have now). I have spent a lot of time with CouchDB compared to the others because it seems to occupy a special niche in data storage. It also builds in a REST interface right off the bat which effectively decouples the storage mechanism from the delivery and makes the whole system very language neutral.

CouchDB represents the first serious contact I have had with JSON, and in deciding whether CouchDB is right for your project you really need to understand what JSON does and doesn’t offer. The first important thing is that JSON offers a kind of compromise between the heavy data definition of SQL or XML Schema and the totally typeless data of something like flat files or Amazon SimpleDB. However, achieving that simplicity has two consequences. Firstly, data integrity is palmed off onto client applications and you will need to check data going in and coming out. I don’t personally believe that there is ever an application that doesn’t have some explicit or implicit data integrity. Secondly, JSON documents can be very rich and complex but they cannot have detailed field structures: a telephone number is simply going to decompose into two numbers and it’s hard to get a much finer-grained description of the data. As a rule of thumb, if you would normally create a well-formed XML document for the data, or you would write a SQL data definition that simply declares the types of the columns using all the defaults and never specifies a NOT NULL column, then you probably have a good match.
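As a sketch of what checking data on the way in might look like on the client, here is a small validation function. The field names, the rules and the phone format are illustrative assumptions, not anything CouchDB prescribes.

```javascript
// Return a list of problems; an empty list means the document passes.
function validateContact(doc) {
  const problems = [];
  if (typeof doc.name !== "string" || doc.name.trim() === "") {
    problems.push("name must be a non-empty string");
  }
  // JSON has no phone-number type, so the format has to be enforced here.
  if (doc.phone !== undefined && !/^\+?[0-9 ]{6,20}$/.test(String(doc.phone))) {
    problems.push("phone must be 6-20 digits or spaces, optionally starting with +");
  }
  return problems;
}

console.log(validateContact({ name: "Ada", phone: "+44 20 7946 0000" }));
console.log(validateContact({ name: "", phone: "call me" }));
```

The same function can be run on documents coming back out of the store, since nothing in the store guarantees that every writer applied it.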

JSON also favours dynamic languages over statically typed ones like Java. You need to decide if you are going to be able to take advantage of the full set of features in your chosen client language. XML might remain a better choice for static languages as you describe document instances declaratively.

Other good features of CouchDB include the fact that it’s incubating at Apache and will be a good addition to the projects there. It also leverages existing Javascript skills, which are probably more common than knowledge of XPath, XQuery or even XML Schema and Relax NG. It has excellent support for incremental data and also fits well with highly irregular data like CMS page content. It also has an excellent UI that makes it easy to interact with the data. Finally it seems to have a good solution for scalability without doing anything too esoteric.

The negative features are tricky because we obviously have an alpha project here that is probably closer to the start of the public testing phase than the end. Some of these cons may well be addressed before the first release candidate. The first obvious omission for a server product is that there is no security built into the server. Just supporting optional HTTP Authentication would be enough to make it practical to start running some experimental servers for something more than sandbox exercises. One thing that is a major difference to the feature set in the standard pseudo-RDBMS is that there is no way of interacting with sets of data. There is a method for changing multiple documents in one pass but what you really want to be able to do is apply a set of changes to documents identified by a query, the equivalent of SQL’s UPDATE.
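In the absence of a server-side equivalent of UPDATE, the client has to select the documents itself, change them and send them back in one pass. A rough sketch, assuming the documents have already been fetched; the `buildBulkUpdate` helper and the sample books are hypothetical, and the `{ docs: [...] }` shape is the body CouchDB's `_bulk_docs` endpoint accepts.

```javascript
// Pick the documents matching a predicate, apply a change to each, and wrap
// the results in a bulk-update payload. The input documents are not mutated.
function buildBulkUpdate(docs, matches, change) {
  return {
    docs: docs.filter(matches).map(function (doc) {
      return Object.assign({}, doc, change(doc));
    }),
  };
}

// Made-up sample data.
const books = [
  { _id: "b1", _rev: "1-a", publisher: "Acme", status: "draft" },
  { _id: "b2", _rev: "1-b", publisher: "Other", status: "draft" },
  { _id: "b3", _rev: "1-c", publisher: "Acme", status: "draft" },
];

const payload = buildBulkUpdate(
  books,
  function (doc) { return doc.publisher === "Acme"; },
  function () { return { status: "published" }; }
);

console.log(JSON.stringify(payload));
```

The selection and the change both live in client code here, which is exactly the work an UPDATE statement would push down to the server.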

Related to this is the fact that data in CouchDB is currently heavily based on silos. If I have a set of data referring to authors and a set of data relating to books then I currently need to duplicate data in both. One of the problems that the developers are going to face in evolving CouchDB is how to address this without introducing a solution so complex that you ask why you are not using an RDBMS? Similarly I notice in the roadmap there is an item about data validation. If you start introducing data validation and rules for validating data then before long there is going to be a question as to why you don’t simply use one of the existing document systems as all the current simplicity will have gone.

One thing that definitely needs improvement is error logging and reporting. Often the only error feedback you have is a Javascript popup that says “undefined” and a log message that tells you that the Erlang process terminated. There needs to be some more human-readable issue logging that points you towards what is going wrong.

Programming, Scripting

CouchDB: Querying data

CouchDB allows you to pass a map function to a special view URL to query the data in an ad-hoc way. Views can also be stored as JSON documents with a convention URL (_design on the server, accessed as _view by the client). These can then be obtained via an HTTP request.

My functional and Javascript programming are weak but this is what I understand of writing queries in CouchDB. Let’s take an example of a set of library cards: each card represents a book but the amount of information I have on each book varies.

The basic find all function is this:

function(book) {
  map(null, book);
}

This defines an anonymous function that takes one parameter, the target document, in this case a book, and returns an array of values. What is in the value list is controlled by the second parameter; in this query I return the entire document. The first parameter controls the sorting or ordering. So if I wanted to return the title of all the books in my database then I would use:


function(book) {
  map(book.name, book.name);
}

Sorting them by ISBN would go like this:

function(book) {
  map(book.isbn, book.name);
}

One important thing to note is that if an object doesn’t have a value it doesn’t respond to the function and will not be included. So if I created some of my entries with a value title instead of name, anything with a title and not a name will not be in the query. However if I use a non-existent entry as an ordering criterion the value will count as null and be sorted.

Because I can include any valid Javascript in my function I can actually put a lot of complexity into my queries. For example:

function(book) {
  if (book.isbn != null) {
    map(book.name, {"Name": book.name, "ISBN": book.isbn});
  } else {
    map(book.name, book.name);
  }
}

So I suspect this will either make you cheer or puke. What this function does is return a JSON object containing the Name and ISBN of the book if they are known or just the Book name as a String otherwise. Unlike SQL the heading of my query is almost completely arbitrary as long as the value on the right of my map function translates to a valid JSON object.

Now at work there are often a lot of debates as to whether things are “rigid” or “structured” or whether they are “flexible” or “formless”. It is a bit like the old meat and poison adage. CouchDB allows a client to construct an almost arbitrarily rich response to a query with almost no restriction on what data should be included in that response. In some cases this is going to allow you to easily interact with very complex unstructured data; in other cases it is going to be an invitation to create a sprawling dataset with no value. There is no inherent right or wrong choice here but for a particular problem being solved there is probably going to be a wrong and a right choice. SQL is powerful because of the restrictions and rules it builds into its grammar. Using Javascript is powerful because it relaxes those restrictions. Programmers and IT folks in general often fall into using the laxest possible implementation for reasons of “flexibility” but then either have to impose order themselves or lose the power of the more restrictive choice.

So putting that into a concrete example, if I write a view with SQL I am going to have to follow a set of rules to get the data I want (for example my heading is going to have to be a set of tuples of equal size), whereas using an arbitrary script and JSON means I am going to be able to get exactly the data I want in the form I want it. However since that return structure is customised to my query I might possibly be reducing my reuse by being over-specific or by building too much logic into my view code.

That’s quite a diversion just so I can say it’s horses for courses, so let’s wrap up this quick look at CouchDB views. All of CouchDB’s views are effectively JSON objects that are passed to a separate view server. This is a separate process that interacts with the main server via STDOUT and STDIN pipes. By default this is the view server that is built from the Spidermonkey library (it is called couchjs). However you can write a view parser for any language and plug it into CouchDB by creating an executable and mapping it to a MIME type in the couch.ini file. The view server essentially parses and readies the query function that is associated with the view and is then sent every document in the database as a JSON string. The view server picks up the results of reading every document and sends that back to the query request.
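The loop described above can be imitated in a few lines. This is only a toy, not how couchjs is implemented: the `runView` and `byIsbn` names and the sample documents are hypothetical, `map(key, value)` follows the examples in this post, and the plain comparison used for ordering is a simplification of whatever ordering the server really applies.

```javascript
// Rows emitted by the view function currently being run.
let collected = [];

// `map(key, value)` is the emit call used by the examples in this post.
function map(key, value) {
  collected.push({ key: key, value: value });
}

// The view function from earlier: order book names by ISBN.
function byIsbn(book) {
  map(book.isbn, book.name);
}

// Toy view server loop: receive each document as a JSON string, run the
// view function on it, then hand back the collected rows ordered by key.
function runView(viewFn, jsonDocs) {
  collected = [];
  jsonDocs.forEach(function (json) {
    viewFn(JSON.parse(json));
  });
  return collected.slice().sort(function (a, b) {
    if (a.key < b.key) return -1;
    if (a.key > b.key) return 1;
    return 0;
  });
}

// Made-up sample documents, serialised the way the server would send them.
const library = [
  '{"name": "Second Book", "isbn": "0002"}',
  '{"name": "First Book", "isbn": "0001"}',
];

console.log(runView(byIsbn, library));
```

Every document passes through the function on every run, which is why the complexity you put into a view function matters.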

It is a pretty simple system and it works well for the relatively flat documents I have been trying with it. However I suspect that in a project with multiple developers some ground-rules for writing consistent query code would be a must.

Programming, Ruby, Scripting

CouchDB

What is CouchDB?

CouchDB is a dedicated document-based database, which kind of puts it in the same space as Exist, Xindice and Oracle Berkeley XML Db. What makes it very different is that instead of being built around XML it is built around Javascript. The document storage format is JSON and the query language is Javascript. Something like Exist uses a minimum of well-formed XML for data and XPath or XQuery for querying.

Now the merits and flaws of the various markup languages are really a holy war and I don’t want to get into it too much. I think it is enough to say that CouchDB aims to be lightweight in implementation without being lightweight in features or performance. JSON is very popular in the web space and by focussing on making common cases easier it is less work to use than XML. The more complex the data or specification requirements, though, the less viable it is as a solution.

Using JSON means that stored data is extremely flexible in structure; unlike relational data you can have very gappy or bitty information and not have a problem. Take something like a customer relationship management system. This is often a poor fit to relational data because you tend to discover information in small tranches. Initially a lead might be a first name and a telephone number. As the relationship develops you add a surname, a company name, a position, a collection of issues or queries and so on. JSON is a really good way of capturing this evolving picture of things.

Using Javascript to query this data surprised me initially, but I feel pretty comfortable with my limited Javascript skills and therefore didn’t have a problem with it as a choice. I’ll come back to this later in the post, but the idea sounds strange and yet is a natural fit when you use it. Again it is about the right tool for the right job, but in the context of the data you are using in CouchDB, creating a query by combining iterators, if statements and the functional programming map function is a better fit than trying to adapt SQL.

The final component of CouchDB is the functional language Erlang, which is used to service the HTTP REST interface that CouchDB exposes to the database. Erlang provides very cheap, lightweight concurrency that is a good fit for handling lots of HTTP requests.

The interesting thing is that CouchDB automatically provides the REST front end that you tend to build on top of Java object stores like JBoss Cache or Coherence (which I have also been looking at recently). With those products you tend to stay native until you need to share data with other languages or systems and then you tend to REST serve it out as XML with a lightweight HTTP server. If you can see that need ahead of time then CouchDB might well be compelling due to the built-in support.

Installing

After reading about CouchDB I wanted to take a look at this new document database in a bit more detail. I managed to install the base tarball from the Google Code page easily. I then cut and pasted the Ruby sample REST code from the CouchDB wiki into irb and on running the server I found that I could store and retrieve documents really easily. The combination of interactive Ruby and the REST interface made it easy to play around with data objects.

However when trying to run the tests via the administration web page only four of the tests would pass. The only error message was just an Error Code of 136. Since I seemed able to store and pull documents (and could confirm that the datafile had been generated and had data) I put it down to alpha flakiness and pushed on.

It was only when I tried executing ad-hoc queries and kept having them fail that I realised what was linking my problems. Every time the main Erlang server handed off a query to the Javascript interpreter the process died. Erlang is really resilient to thread death so there was zero impact from this.

CouchDB uses the Mozilla Spidermonkey libraries to construct its own View Server program which is configured via the couch.ini file. Having double-checked that the interpreter was there, was executable and could read the main.js script it was passed, I was baffled. I decided to pull down the code base from SVN and then rebuild it. Building from SVN is a bit more involved than building from the tarfile and it’s worth double-checking the documentation on the wiki before you get stuck in. Having built and installed myself a new copy I still had the same problem and was feeling pretty cheesed off. With no other source of help I headed to IRC and checked out the Freenode #couchdb channel. There Christopher Lenz gave me a good steer: he had had a problem with Erlang’s HiPE (High Performance Erlang, its native-code compiler) on Debian.

I had been using Erlang 5.5.5 which I think I used apt-get to obtain. Downloading the latest Erlang release (5.6.1, they release pretty frequently) I tried building that from source and it hung when configure tried to check for floating-point exceptions. Running configure with --disable-hipe allowed configure to complete and then Erlang was straightforward to build and install.

Restarting the CouchDB server I found my queries were working and all but one test passing (“conflicts” fails on an error assertion and this is apparently a known issue), hurrah!

The machine I had been working on was a Feisty Fawn Ubuntu instance so I loaded up Gutsy Gibbon in VMware Player and tried building Erlang and CouchDB there. Erlang was fine, even with HiPE. For CouchDB I used the latest Erlang I had built and used apt-get to obtain the Spidermonkey and ICU dependencies. Everything worked out of the box with this combination.

This problem highlights a few of CouchDB’s weaknesses. Firstly it is all bleeding edge stuff, Erlang is bounding along and so is CouchDB. Secondly there is no test suite in the source distribution that can help troubleshoot issues. Finally the error logging is tricky to understand for a neonate. I’m certain that having an EUnit test suite would have directed me towards the fact that the Javascript was failing much quicker than using trial and error and IRC.

The final issue was where CouchDB stores its logs and datafiles by default. These default to /usr/local/var which is not really where I want to store data like this. /var is a more natural choice and FHS seems to agree. You can change this when building in the etc/couch.ini.tpl file of the source directory but it would be nicer to have a more natural choice by default.

I also tried to install CouchDB on OSX without using a packaging mechanism but instead referring to an instance of Spidermonkey 1.7 I had on my system. While I could get configure to accept the Javascript library it wouldn’t recognise the jsapi.h header, maybe because configure doesn’t define the right macros when it tries to build the test file.

Working with CouchDB

My first experience of CouchDB’s web interface was that it was good-looking but that the error messages, unresponsive links, Firebug warnings and occasional Javascript pop-ups were all signs that it was a work in progress. On fixing my passthrough to Javascript I had a very different experience of a slick and good looking interface that uses amazingly responsive AJAX to provide a really good experience.

Learning how to generate queries for example is easy with the in-built query browser (I’ll save the actual syntax for a later post) as the feedback from the system is very quick. Similarly creating new databases, documents and fields is actually not torturous but slick and quick.

I’m certain this is because Erlang and AJAX form a natural partnership in creating and servicing small requests that need to be handled quickly. I may be wrong but this is certainly my most positive AJAX webservice experience to date.

The CouchDB wiki contains information on getting started with various languages and I chose Ruby. It was a simple cut and paste into irb and then it was straightforward to interact with the database from the shell and the web client.

One of CouchDB’s more unusual features seems to stem from Erlang’s concurrency model. Data is assigned a revision and in addition to applying multiple changes in order you can also view previous revisions. That’s a pretty weird feature compared to most of the current datastores. You also need to refer to a revision if you want to update a record. If the revision of the target of the update has changed by the time you submit your changes, they get refused. The revision mechanism is also the basis of the replication mechanism although there isn’t enough documentation to understand when the replication pairs check their revision ids.
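The update rule can be illustrated with a tiny in-memory stand-in. This is a sketch of the idea only: the `put` function and the `store` map are hypothetical, and plain integers are used as revisions for simplicity, which is an assumption rather than CouchDB's actual revision format.

```javascript
// In-memory stand-in for a database, keyed by document id.
const store = new Map();

// Accept a write only if it names the revision currently stored.
// A brand new document carries no _rev at all.
function put(doc) {
  const current = store.get(doc._id);
  const currentRev = current ? current._rev : undefined;
  if (doc._rev !== currentRev) {
    return { ok: false, error: "conflict" };
  }
  const saved = Object.assign({}, doc, { _rev: (currentRev || 0) + 1 });
  store.set(doc._id, saved);
  return { ok: true, rev: saved._rev };
}

const first = put({ _id: "lead-1", name: "Ann" });             // new document
const second = put({ _id: "lead-1", _rev: 1, name: "Ann B" });  // names the current revision
const stale = put({ _id: "lead-1", _rev: 1, name: "Anne" });    // revision has moved on

console.log(first, second, stale);
```

The third write is refused because it was prepared against a revision that is no longer current; the client has to re-read the document and try again.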

Incremental data-construction is easy via the web-gui but you need to fiddle a little bit to get the revision number to target if doing it programmatically. Presumably a library would make that easier going. For information that has little inherent structure or is very gappy or incremental, CouchDB is a great data store and, as far as I know, currently occupies a niche of its own.
