Programming, Scripting

CouchDB: Querying data

CouchDB allows you to pass a map function to a special view URL to query the data in an ad-hoc way. Views can also be stored as JSON documents with a convention URL (_design on the server, accessed as _view by the client). These can then be obtained via an HTTP request.

My functional and Javascript programming are weak but this is what I understand of writing queries in CouchDB. Let’s take an example of a set of library cards: each card represents a book, but the amount of information I have on each book varies.

The basic find all function is this:

function(book) { map(null, book) }

This defines an anonymous function that takes one parameter, the target document (in this case a book), and contributes a row to the result by calling map. What ends up in the result values is controlled by the second parameter to map; in this query I return the entire document. The first parameter to map controls the sorting or ordering. So if I wanted to return the titles of all the books in my database I would use:

function(book) { map(book.name, book.name) }

Sorting them by ISBN would go like this:

function(book) { map(book.isbn, book.name) }

One important thing to note is that if an object doesn’t have a value it doesn’t respond to the function and will not be included. So if I created some of my entries with a title field instead of name, anything with a title and not a name will not appear in the query results. However, if I use a non-existent field as an ordering criterion, the key will count as null and the row will still be sorted.
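
To make the key and value behaviour concrete, here is a toy stand-in for a view run, written as plain Javascript that can be run outside CouchDB. It is a sketch and not CouchDB’s implementation: runView, compareRows, the sample books and the rule that a row with an undefined value is dropped are all my assumptions, modelled on the behaviour described above. Later CouchDB releases name the collector emit rather than map.

```javascript
// Toy model of running a view over a set of documents (not CouchDB itself).
let collected = [];

// Stand-in for the map(key, value) collector used by the view functions above.
function map(key, value) {
  // Assumption: a row whose value is missing is left out of the result.
  if (value === undefined) return;
  // Assumption: a missing key is treated as null.
  collected.push({ key: key === undefined ? null : key, value: value });
}

// Order rows by key, with null keys first.
function compareRows(a, b) {
  if (a.key === b.key) return 0;
  if (a.key === null) return -1;
  if (b.key === null) return 1;
  return a.key < b.key ? -1 : 1;
}

// Hypothetical helper: feed every document to the view function, then sort.
function runView(docs, viewFn) {
  collected = [];
  docs.forEach((doc) => viewFn(doc));
  return collected.slice().sort(compareRows);
}

// Made-up library cards with uneven information.
const books = [
  { name: "Dune", isbn: "222" },
  { title: "Filed under title, not name" },
  { name: "Emma" },
];

const byIsbn = function (book) { map(book.isbn, book.name); };
const sortedByIsbn = runView(books, byIsbn);
console.log(sortedByIsbn);
```

In this sketch the card that only has a title is missing from the result, while the book without an ISBN is still listed and sorts first with a null key.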

Because I can include any valid Javascript in my function I can actually put a lot of complexity into my queries. For example:

function(book) {
  if (book.isbn != null) {
    map(book.name, {"Name": book.name, "ISBN": book.isbn})
  } else {
    map(book.name, book.name)
  }
}

So I suspect this will either make you cheer or puke. What this function does is return a JSON object containing the Name and ISBN of the book if the ISBN is known, or just the book name as a String otherwise. Unlike SQL, the heading of my query is almost completely arbitrary as long as the value on the right of my map function translates to a valid JSON object.

Now at work there are often a lot of debates as to whether things are “rigid” or “structured”, or whether they are “flexible” or “formless”. It is a bit like the old adage that one man’s meat is another man’s poison. CouchDB allows a client to construct an almost arbitrarily rich response to a query, with almost no restriction on the data that should be included in that response. In some cases this is going to allow you to easily interact with very complex unstructured data; in others it is going to be an invitation to create a sprawling dataset with no value. There is no inherent right or wrong choice here, but for a particular problem being solved there is probably going to be a wrong and a right choice. SQL is powerful because of the restrictions and rules it builds into its grammar. Using Javascript is powerful because it relaxes those restrictions. Programmers and IT folks in general often fall into using the laxest possible implementation for reasons of “flexibility” but then either have to impose order themselves or lose the power of the more restrictive choice.

So putting that into a concrete example: if I write a view with SQL I am going to have to follow a set of rules to get the data I want (for example my heading is going to have to be a set of tuples of equal size); using an arbitrary script and JSON means I am going to be able to get exactly the data I want in the form I want it. However, since that return structure is customised to my query, I might be reducing my reuse by being over-specific or by building too much logic into my view code.

That’s quite a diversion just so I can say it’s horses for courses, so let’s wrap up this quick look at CouchDB views. All of CouchDB’s views are effectively JSON objects that are passed to a separate view server. This is a separate process that interacts with the main server via STDOUT and STDIN pipes. By default this is the view server that is built from the Spidermonkey library (it is called couchjs). However you can write a view parser for any language and plug it into CouchDB by creating an executable and mapping it to a MIME type in the couch.ini file. The view server essentially parses and readies the query function that is associated with the view and is then sent every document in the database as a JSON string. The view server picks up the results of reading every document and sends that back to the query request.
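
The hand-off described above can be sketched as a function that takes one line of JSON and returns one line of JSON, which is roughly what a process talking over STDIN and STDOUT would do. This is a toy and not couchjs: createViewServer is a hypothetical name, and the command names reset, add_fun and map_doc are assumptions borrowed from later CouchDB query server documentation, not taken from the release discussed here.

```javascript
// Toy view server: illustrative only, not the real couchjs program.
function createViewServer() {
  let viewFns = [];   // compiled view functions
  let rows = [];      // rows collected for the document currently being mapped

  // Collector handed to each view function under the name `map`.
  const map = (key, value) => rows.push([key === undefined ? null : key, value]);

  // Handle one JSON-encoded request line and return one JSON-encoded reply line.
  return function handle(line) {
    const [command, payload] = JSON.parse(line);
    switch (command) {
      case "reset":
        viewFns = [];
        return "true";
      case "add_fun":
        // Compile the function source with `map` in scope.
        viewFns.push(new Function("map", "return (" + payload + ");")(map));
        return "true";
      case "map_doc":
        // Run every stored function against the document, one result list each.
        return JSON.stringify(viewFns.map((fn) => {
          rows = [];
          fn(payload);
          return rows;
        }));
      default:
        return JSON.stringify({ error: "unknown_command" });
    }
  };
}

const handle = createViewServer();
handle(JSON.stringify(["reset"]));
handle(JSON.stringify(["add_fun", "function(book) { map(book.isbn, book.name) }"]));
const reply = handle(JSON.stringify(["map_doc", { name: "Dune", isbn: "222" }]));
console.log(reply);
```

A real view server would wrap handle in a loop that reads lines from STDIN and writes each reply to STDOUT; keeping it as a plain function just makes the sketch easy to try out.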

It is a pretty simple system and it works well for the relatively flat documents I have been trying with it. However I suspect that in a project with multiple developers some ground-rules for writing consistent query code would be a must.

Programming, Ruby, Scripting

CouchDB

What is CouchDB?

CouchDB is a dedicated document-based database, which puts it in much the same space as eXist, Xindice and Oracle Berkeley DB XML. What makes it very different is that instead of building around XML it is built around Javascript. The document storage format is JSON and the query language is Javascript. Something like eXist uses a minimum of well-formed XML for data and XPath or XQuery for querying.

Now the merits and flaws of the various markup languages are really a holy war and I don’t want to get into it too much. I think it is enough to say that CouchDB aims to be lightweight in implementation without being lightweight in features or performance. JSON is very popular in the web space and, by focussing on making common cases easier, it is less work to use than XML. The more complex the data or specification requirements, though, the less viable it is as a solution.

Using JSON means that stored data is extremely flexible in structure: unlike relational data, you can have very gappy or bitty information and not have a problem. Take something like a customer relationship management system. This is often a poor fit to relational data because you tend to discover information in small tranches. Initially a lead might be a first name and a telephone number. As the relationship develops you add a surname, a company name, a position, a collection of issues or queries and so on. JSON is a really good way of capturing this evolving picture of things.
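
As a rough illustration of that evolving picture, here is the same lead at two points in time. Every field name and value is made up for the example; the point is only that the later document adds fields without anything having been declared up front.

```javascript
// A lead as first captured: just a first name and a telephone number (made-up values).
const firstContact = { firstName: "Ann", phone: "555-0100" };

// The same lead later on: new fields are simply added to the document.
const laterOn = Object.assign({}, firstContact, {
  surname: "Example",
  company: "Example Ltd",
  position: "Buyer",
  issues: ["Asked about delivery times"],
});

console.log(JSON.stringify(firstContact));
console.log(JSON.stringify(laterOn, null, 2));
```

Both objects serialise to valid JSON documents, so the sparse early version and the fuller later version could sit side by side in the same database.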

Using Javascript to query this data surprised me initially, but I feel pretty comfortable with my limited Javascript skills and therefore didn’t have a problem with it as a choice. I’ll come back to this later in the post, but the idea sounds strange and yet is a natural fit when you use it. Again it is about the right tool for the right job, but in the context of the data you are using in CouchDB, creating a query by combining iterators, if statements and the functional programming map function is a better fit than trying to adapt SQL.

The final component of CouchDB is the functional language Erlang, which is used to service the HTTP REST interface that CouchDB uses to provide access to the database. Erlang offers very cheap, lightweight concurrency that is a good fit for handling lots of HTTP requests.

The interesting thing is that CouchDB automatically provides the REST front end that you tend to build on top of Java object stores like JBoss Cache or Coherence (which I have also been looking at recently). With those products you tend to stay native until you need to share data with other languages or systems and then you tend to REST serve it out as XML with a lightweight HTTP server. If you can see that need ahead of time then CouchDB might well be compelling due to the built-in support.

Installing

After reading about CouchDB I wanted to take a look at this new document database in a bit more detail. I managed to install the base tarball from the Google Code page easily. I then cut and pasted the Ruby sample REST code from the CouchDB wiki into irb and on running the server I found that I could store and retrieve documents really easily. The combination of interactive Ruby and the REST interface made it easy to play around with data objects.

However when trying to run the tests via the administration web page only four of the tests would pass. The only error message was just an Error Code of 136. Since I seemed able to store and pull documents (and could confirm that the datafile had been generated and had data) I put it down to alpha flakiness and pushed on.

It was only when I tried executing ad-hoc queries and kept having them fail that I realised what was linking my problems. Every time the main Erlang server handed off a query to the Javascript interpreter the process died. Erlang is really resilient to thread death so there was zero impact from this.

CouchDB uses the Mozilla Spidermonkey libraries to construct its own View Server program which is configured via the couch.ini file. Having double-checked that the interpreter was there, was executable and could read the main.js script it was passed, I was baffled. I decided to pull down the code base from SVN and then rebuild it. Building from SVN is a bit more involved than the tarfile and it’s worth double-checking the documentation on the wiki before you get stuck in. Having built and installed myself a new copy I still had the same problem and was feeling pretty cheesed off. With no other source of help I headed to IRC and checked out the Freenode #couchdb channel. There Christopher Lenz gave me a good steer: he had had a problem with Erlang’s HiPE (High Performance Erlang, the native-code compiler) on Debian.

I had been using Erlang 5.5.5, which I think I used apt-get to obtain. Downloading the latest Erlang release (5.6.1, they release pretty frequently) I tried building that from source, but the build hung when configure tried to check for floating-point exceptions. Running configure with --disable-hipe allowed configure to complete and then Erlang was straightforward to build and install.

Restarting the CouchDB server I found my queries were working and all but one test passing (“conflicts” fails on an error assertion and this is apparently a known issue), hurrah!

The machine I had been working on was a Feisty Fawn Ubuntu instance so I loaded up Gutsy Gibbon in VMware Player and tried building Erlang and CouchDB there. Erlang was fine, even with HiPE. For CouchDB I used the latest Erlang I had built and used apt-get to obtain the Spidermonkey and ICU dependencies. Everything worked out of the box with this combination.

This problem highlights a few of CouchDB’s weaknesses. Firstly it is all bleeding edge stuff, Erlang is bounding along and so is CouchDB. Secondly there is no test suite in the source distribution that can help troubleshoot issues. Finally the error logging is tricky to understand for a neonate. I’m certain that having an EUnit test suite would have directed me towards the fact that the Javascript was failing much quicker than using trial and error and IRC.

The final issue was where CouchDB stores its logs and datafiles by default. These default to /usr/local/var which is not really where I want to store data like this. /var is a more natural choice and FHS seems to agree. You can change this when building in the etc/couch.ini.tpl file of the source directory but it would be nicer to have a more natural choice by default.

I also tried to install CouchDB on OSX without using a packaging mechanism but instead referring to an instance of Spidermonkey 1.7 I had on my system. While I could get configure to accept the Javascript library it wouldn’t recognise the jsapi.h header, maybe because configure doesn’t define the right Macros when it tries to build the test file.

Working with CouchDB

My first experience of CouchDB’s web interface was that it was good-looking but that the error messages, unresponsive links, Firebug warnings and occasional Javascript pop-ups were all signs that it was a work in progress. On fixing my passthrough to Javascript I had a very different experience of a slick and good looking interface that uses amazingly responsive AJAX to provide a really good experience.

Learning how to generate queries for example is easy with the in-built query browser (I’ll save the actual syntax for a later post) as the feedback from the system is very quick. Similarly creating new databases, documents and fields is actually not torturous but slick and quick.

I’m certain this is because Erlang and AJAX form a natural partnership in creating and servicing small requests that need to be handled quickly. I may be wrong but this is certainly my most positive AJAX webservice experience to date.

The CouchDB wiki contains information on getting started with various languages and I chose Ruby. It was a simple cut and paste into irb and then it was straightforward to interact with the database from the shell and the web client.

One of CouchDB’s more unusual features seems to stem from Erlang’s concurrency model. Data is assigned a revision and in addition to applying multiple changes in order you can also view previous revisions. That’s a pretty weird feature compared to most of the current datastores. You also need to refer to a revision if you want to update a record: if the revision of the target has changed by the time you submit your changes, they get refused. The revision mechanism is also the basis of the replication mechanism, although there isn’t enough documentation to understand when the replication pairs check their revision ids.

Incremental data-construction is easy via the web-gui but you need to fiddle a little bit to get the revision number to target if doing it programmatically. Presumably a library would make that easier going. For information that has little inherent structure or is very gappy or incremental, CouchDB is a great data store and, as far as I know, currently has that niche to itself.
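
The update rule described above is a form of optimistic concurrency, and a toy in-memory version shows why you have to fetch the current revision before writing. This is a sketch under my own assumptions and not CouchDB’s API: RevisionedStore, put and the integer revision numbers are all hypothetical simplifications.

```javascript
// Toy in-memory store illustrating revision-checked updates (not CouchDB's API).
class RevisionedStore {
  constructor() {
    this.docs = new Map(); // id -> { rev, body }
  }

  get(id) {
    return this.docs.get(id);
  }

  // An update must quote the revision it was based on; a new document quotes none.
  put(id, body, basedOnRev) {
    const current = this.docs.get(id);
    const currentRev = current ? current.rev : undefined;
    if (basedOnRev !== currentRev) {
      return { ok: false, error: "conflict" };
    }
    const rev = current ? current.rev + 1 : 1;
    this.docs.set(id, { rev: rev, body: body });
    return { ok: true, rev: rev };
  }
}

const store = new RevisionedStore();
// First write: no revision exists yet.
const created = store.put("lead-1", { firstName: "Ann" });
// Second write quotes the revision it read, so it is accepted.
const updated = store.put("lead-1", { firstName: "Ann", surname: "Example" }, created.rev);
// Third write is based on the old revision, so it is refused.
const stale = store.put("lead-1", { firstName: "Anne" }, created.rev);
console.log(created, updated, stale);
```

The refused write is the case a client library would normally handle by re-reading the document, picking up the new revision and trying again.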

Software, Work

Don’t hate the RDBMS; hate the implementation

I read through this post with the usual sinking feeling of despair. How can people get so muddled in their thinking? I am not sure I can even bear to go through the arguments again.

More programmers treat the database as a dumb store than there are situations where such treatment is appropriate. No-one is saying that Twitter data is deep, relational and worthy of data mining. However not all data is like Twitter micro blog posts.

The comments on the post were for the most part very good and say a lot of what I would have said. However, looking at the CouchDB documentation I noticed that the authors make far less dramatic claims for their product than the blog post does. A buggy alpha release of a hashtable datastore is not going to bring the enterprise RDBMS to its knees.

I actually set up and ran CouchDB but I will save my thoughts for another day; it’s an interesting application. What I actually want to talk about is how we can get more sophisticated with our datastores. It is becoming apparent to me that ORM technologies are really making a dreadful hash of data. The object representation is getting shafted because inheritance is never properly translated into a relational schema. The relational data is getting screwed by the fact that the rules for object attributes are at odds with the normal forms.

The outcome is that you end up with the bastard hybrid worst of all worlds solution. What’s the answer?

Well the first thing is to admit that application programmers think the database is a big dumb datastore and will never stop thinking that. The second is that relational data is not the one true way to represent all data. Relational databases are the best tool we have at the moment for representing rich data sets that contain a lot of relational aspects. Customer orders in a supply system are the classic example. From a data mining point of view you are going to be dead on your feet if you do not have customers and their orders in a relational data store. You cannot operate if you cannot say who is buying how much of what.

If you let developers reimplement a data mining solution in their application for anything other than your very edge and niche interests then you are going to be wasting a lot of time and money for no good reason. You simply want a relational datastore, a metadata overlay to reinterpret the normalised data in terms of domain models and a standard piece of charting and reporting software.

However the application programmers have a point. The system that takes the order should not really have to decompose an order taken at a store level into its component parts. What the front end needs to do is take and confirm the order as quickly as possible. From this point of view the database is just a dumb datastore. Or rather what we need is a simple datastore that can do what is needed now and defer and delegate the processing into a richer data set in the future. From this point of view the application may store the data in something as transient as a message queue (although realistically we are talking about something like an object cache so the customer can view and adjust their order).

Having data distributed in different forms across different systems creates something of a headache, as it is hard to get an overall picture of what is happening on the system at any given moment. However creating a single datastore (implemented by an enterprise RDBMS) as a single point of reference is something of an anti-pattern. It is making one thing easier, the big picture. However, to provide this, data is being bashed by layering technologies into all kinds of inappropriate shapes and various groups within the IT department are frequently in bitter conflict.

There needs to be a step back and IT people need to accept the complexity and start talking about the whole system as comprising many components, all of which need to be synced and queried if you want the total information picture. Instead of wasting effort in fitting however many square pegs into round holes we need to be thinking about how we use the best persistence solution for a given problem and how we report on and coordinate these many systems.

It is the only way we can move forward.

Ruby, Scripting

Dumb Ruby Gems mistake

So on and off over the last week I have been trying to get RedCloth to work in both Ruby and JRuby. Despite getting the gem I kept getting a failure when I tried to require ‘redcloth’. When I finally got Textile support working in Java before I had managed to weave the two or three lines of Textile magic the tutorials promised, I shrugged and took the task off my GTD list.

Except that today my stubborn side told me to take a look at the Ruby Gems documentation after seeing an answer to a similar problem on a mailing list. Sure enough there in Chapter 3 was the explanation that you need to require Ruby Gems before the actual Gem you want to use.

Heading to irb and sure enough:

require 'rubygems'
require 'redcloth'

Does the trick and I am finally appreciating the library. Now maybe I’m slow, or perhaps the sample code on the RedCloth page should have included a full script that would work regardless of what was in your RUBYOPT environment variable.

Scripting

Discovering Scala

Over the last two days I have been giving Scala a quick whirl and I have to say out of all the new languages (functional and otherwise) I have been looking at recently only Erlang is anywhere near as exciting. There is a small fifteen page tutorial and after finishing it I immediately thought: “I want to see more of that”.

I can’t put my finger on it at the moment but I imagine it’s the fact that it seems perhaps more familiar due to the Java background. The functional syntax is as confusing as any other functional language.

P.S. if you go through the short 15 page tutorial then there is a slight blip in the tutorial when case classes are introduced. All the code for the calculator example should be wrapped in an object block if you want it to work. I.e.

object Calculator { ... tutorial code }

Web Applications

Google Sites: What I would like

I am not going to deny that the basic Wiki functionality is all there but there are a few things that I would like to see. Now I know it is a free service but I would actually be prepared to pay for some of this on a WordPress model (i.e. buy what you want when you need it).

Categories or page tagging
The ability to send a page to Google Docs or alternatively the ability to export to external formats as you can for a Google Document.
The comments section should collapse if there are no comments; ditto the attachments. If something does not apply to my page then I just need a simple text link to add it.
Most of all, proper HTML markup rather than a massive paragraph consisting of my whole page, with a BR tag… if I’m lucky. If I want to reuse my content then I cannot export it and it’s not even decent HTML. Blogger is no better but crazily Google Pages does the right thing!

As with Blogger I think Google is having some major integration issues with all the companies it has bought up. If one of the applications or systems has a cool feature then I think it is natural to assume that it will be available in all Google branded applications.

Programming

Compiling under Leopard

Since I upgraded my monster MacBook Pro I really haven’t looked back. I’ve suddenly started getting software by checking out the SVN repository and compiling it because it would be quicker than downloading an archive. Surely this is what a 64-bit operating system is meant to be like. It is giving me a warped view of the world and I now wish I hadn’t foolishly opted for a Dell at work. I could probably run a Virtual Vista quicker under Leopard than I could if it were running natively.

The MacBook Pro genuinely feels like a desktop replacement. I’m not sure I would go back to the box and screen for development now. Even the battery life feels like it’s been extended. It might even be responsible for these cartoon birds that keep singing all over the place.

OSX 10.5.2? An OS that actually seems to reflect its codename.

Groovy, Programming, Scripting, Work

Using Groovy to create XML Schemas

I have previously found XML Schema to be completely invaluable in defining interface points between systems. Normally file interfaces between systems are done in formats that are deceptively simple: CSV, structured text files. However in nearly all cases the initial simplicity tends to lead to a lot of problems. There is the issue of how to escape characters within your fields, particularly the field separator. The free text field is often used as exactly that, free text. Something is supposed to be a number field… until the letter X appears in it. Historical CSV is the worst as often the exact meaning and origin of each of the fields is undocumented and the meaning lost. I have even come across CSV generators that map meaningless constants to the output just to keep the number of fields the same. The receiving systems ignore those same fields or sometimes even hinge workflow off a value that will never vary in practice. The whole thing ends up being a nightmare.

Introducing an XML Schema reduces that nightmare but does bring in a lot more complexity. Being able to specify the type and order of the fields comes at a price. Previously when I have wanted to develop a new schema I have simply used the Xerces tools at the command line and an XML editor to generate both the Schema and a sample datafile. It works but it is quite laborious. Speeding this up would be great as often the point of capturing the complexity in the data transfer is so that the business or the architects can see the complexity of the integration and decide that they really want to do it before a lot of code gets written to integrate the systems.

Looking through the Groovy website I came across this example of how to validate an XML document and an idea was sparked. The multi-line indicator is a neat feature (borrowed from Python I think) and is (to my mind) a more elegant solution than the Ruby/Perl here-document syntax. It would allow me to define my schema, my sample document and my validation code in the same file. During iterations I would be more productive and when the interface is captured I just publish the final schema.

So I’ve knocked together a simple PoC and it seems to work pretty soundly. The easiest way to work with it is from the sample document to the Schema but the TDD approach is to define the Schema and work back from the validation errors. The latter approach tends to avoid the situation where you’re validating your test document rather than your document template.

Software

Running Virtual Ubuntu

So at work I need to be able to have access to a personal UNIX playground and the form that you have to fill in to get a licensed VMware Fusion instance is a nightmare, so I decided to look at the alternatives. I already had Parallels installed on my MacBook Pro but I had not done anything with it. I also decided to try and get Ubuntu running on my Windows Vista machine using the free (to download) VMware Player.

VMware Player requires a special image (I used this one); however, once the software and the image were downloaded (the images are sensibly torrented although the player software itself does not seem to be), getting the system running was extremely easy. You just click on the image, it loads up and you update within Ubuntu as normal.

Getting Parallels working was not as easy. I tried a standard DVD from a Linux magazine; that failed with an X error where the X window could not be started. So I downloaded a text based installer and ran through that. It had the same problem, and after reading this item in the Parallels Knowledge Base I took a guess at the problem and set the resolution during the text installation to be 1024 by 768. That sorted the issue and after that the major problem was networking. The Parallels installation did not seem able to share my wireless connection. Once I connected my Ethernet cable the instance updated fine. Oddly, once I set the VM to use Shared networking I could use the wireless connection but, counter-intuitively, only by setting the Ubuntu instance to use the wired connection. I guess at that point Parallels was able to weave a little magic and make the connection available, and the issue of whether the physical hardware was Ethernet or wireless was completely irrelevant.

Both systems run their virtual machines very quickly but VM Player seems to be the better suited to rapidly stopping and starting the machine. It works pretty much like a normal application, you fire it up and close the window when you are done. The Parallels application is much less seamless. Both applications use a similar amount of space to save their state, VM Player perhaps runs a bit fatter from my experience.

VM Player is pretty amazing for a product that is offered for free and is definitely a well-done teaser product. If you have never run a virtual machine before I would definitely recommend giving it a spin. Parallels is a slick and excellent program but its focus on running Windows under OS X seems to have led it to not being able to create a trouble-free installation experience for the leading desktop Linux distribution of today. That is a big mistake and even Parallels’ relatively low price tag of £50 to £60 does not excuse it. Some things should just work. After all at some point you are going to appreciate having the flexibility to install an OS how you like and at that point you may be more tempted to upgrade your existing solution than switch to a new application altogether.

Web Applications

Try our new, new services!

So on Friday not one but two long awaited beta service invitations arrived. The first was the announcement of the addition of Jotspot to Google Apps (finally) and the other about the Amazon Simple DB service. Typical buses…

I didn’t have a lot of time this weekend so I plumped for signing up for Google Apps and trying the new wiki functionality as I was hoping for a beefed up version of Pages. The Simple DB service also needs me to beef up my Web Service scripting fu.

It is too early to say much about either service, but after signing up for a Google Apps account (apparently you cannot simply drive one off your regular Google Account) I was slightly underwhelmed by the new Google Sites service. It has taken how long to make a basic and acceptable wiki service available?

Still you can have a lot of separate wiki sites and you have a lot of flexibility on how you share and collaborate on them so maybe I need to build up some content first and then try to share it around. I would like to know whether you can hook Analytics up to some Sites content. That would be useful for some of the content that otherwise would go on something like a WordPress page.

Echo One

Sequentially arranged sentences composed of words (and punctuation)

Author Archives: rrees