Java, Python, Scripting

Sun recruits Jython leads

I know this isn’t exactly breaking news but it is further evidence that Sun is aiming to turn the JVM in a language agnostic platform. It’s also good news for the Jython project which has suffered a long period of hibernation and which has fallen far behind CPython in terms of its compatibility.

I like Python as a language and its clarity is great for one time scripts (which never really are). I would really like it to be a full member of the JVM-compatibles.

Standard
Java, Programming

Depressing times for Java programmers

Joshua Bloch of Effective Java and Collections fame has given a powerful and in many ways depressing talk. He has probably nailed the coffin shut on BGGA with this almost clinical dissection of the flaws of grafting Closures into Java. However he has also indicated how Java is effectively at the end of the line. Sun, for perfectly good reasons, wants to maintain backwards compatibility and not throw the kitchen sink into the language. The JCP does not have to stick with that approach but this talk does effectively say that there is a limit to what can be done while keeping the language recognisably Java.

The mention of Scala is significant because a lot of the things that people want in Java are available there and other features are in JRuby. Trying to create one language with universal appeal is going to be impossible. The reason I feel depressed about all of this is because I have working with Java for years now. I really had to fight to switch to 100% Java and now its time to move on again!

It would be easier to do that if there was something that was obviously better but currently all of the candidates for a potential successor have lacked that Eureka factor where you see something that is going to make programming easier  and better and your working life a whole lot more fun. So far only Scala has really come close.

Standard
Programming

CouchDB: Pros and Cons

I have been looking at a lot of the different databases that are working outside the traditional SQL style RDBMS (and I want to point out that having a true RDBMS might actually be something very different to what we have now). I have spent a lot of time with CouchDB compared to the others because it seems to occupy a special niche in data storage. It also builds in a REST interface right off the bat which effectively decouples the storage mechanism from the delivery and makes the whole system very language neutral.

CouchDB represents the first serious contact I have had with JSON and in deciding whether CouchDB is right for your project you really need to understand what JSON does and doesn’t offer. The first important thing is that JSON offers a kind of compromise from the heavy data definition in SQL or XML Schema and totally typeless data of something like flat files or Amazon SimpleDB. However to achieve that simplicity there are two consequences. Firstly data integrity is palmed off onto client applications and you will need to check data going in and coming out. I don’t personally believe that there is ever an application that doesn’t have some explicit or implicit data integrity. Second JSON documents can be very rich and complex but they cannot have detailed field structures, a telephone number is simply going to decompose into two numbers and its hard to get much finer grain description of the data. As a rule of thumb if you would normally create a well-formed XML document for the data or you would define a SQL data definition that would simply declare all the types of the column using all the defaults and never specifying a NOT NULL column then you probably have a good match.

JSON also favours dynamic languages over static types like Java. You need to decide if you are going to be able to take advantage of the full set of features in your chosen client language. XML might remain a better choice for static languages as you describe document instances declaratively.

Other good features about CouchDB include the fact that its incubating for Apache and will be a good addition to the projects there. It also leverages existing Javascript skills which are probably more common than people who know XPath, XQuery or even XML Schema and Relax NG. It has excellent support for incremental data and also fits well with highly irregular data like CMS page content. It also has an excellent UI that makes it easy to interact with the data. Finally it seems to have a good solution for scalability without doing anything too esoteric.

The negative features are tricky because we obviously have an alpha project here that is probably closer to the start of the public testing phase than the end. Some of these cons may well be addressed before the first release candidate. The first obvious omission for a server product is that there is no security built into the server. Just supporting optional HTTP Authentication would be enough to make it practical to start running some experimental servers for something more than sandbox exercises. One thing that is a major difference to the feature set in the standard pseudo-RDBMS is that there is no way of interacting with sets of data. There is a method for changing multiple documents in one pass but what you really want to be able to do is apply a set of changes to documents identified by a query, the equivalent of SQL’s UPDATE.

Related to this is the fact that data in CouchDB is currently heavily based on silos. If I have a set of data referring to authors and a set of data relating to books then I currently need to duplicate data in both. One of the problems that the developers are going to face in evolving CouchDB is how to address this without introducing a solution so complex that you ask why you are not using an RDBMS? Similarly I notice in the roadmap there is an item about data validation. If you start introducing data validation and rules for validating data then before long there is going to be a question as to why you don’t simply use one of the existing document systems as all the current simplicity will have gone.

One thing that definitely needs improvement is error logging and reporting. Often the only error feedback you have is a Javascript popup that says “undefined” and a log message that tells you that the Erlang process terminated. There needs to be some more human-readable issue logging that points you towards what is going wrong.

Standard
Programming, Scripting

CouchDB: Querying data

CouchDB allows you to pass a map function to a special view URL to query the data in an ad-hoc way. Views can also be stored as JSON documents with a convention URL (_design on the server, accessed as _view by the client). These can then be obtained via a HTTP request.My functional and Javascript programming are weak but this is what I understand of writing queries in CouchDB. Let’s take an example of a set of library cards, each card represents a book but the amount of information I have on each book varies.

The basic find all function is this:

function(book) {
map(null, book)
}

This defines an anonymous function that takes one parameter, the target document, in this case a book, and returns an array of values. What is in the value list is controlled by the second parameter, in this query I return the entire document. The first parameter controls the sorting or ordering. So I wanted to return the title of all the books in my database then I would use:


function(book) {
map(book.name, book.name)
}

Sorting them by ISBN would go like this:

function(book) {
map(book.isbn, book.name)
}

One important thing to note is that if an object doesn’t have a value it doesn’t respond to the function and will not be included. So if I created some of my entries with a value title instead of name anything with a title and not a name will not be in the query. However if I use a non-existent entry as an ordering criteria the value will count as null and be sorted.

Because I can include any valid Javascript in my function I can actually put a lot of complexity into my queries. For example:

function(book) {
if(book.isbn != null) { map(book.name, {"Name": book.name, "ISBN": book.isbn})
} else { map(book.name, book.name) }
}

So I suspect this will either make you cheer or puke. What this function does is return a JSON object containing the Name and ISBN of the book if they are known or just the Book name as a String otherwise. Unlike SQL the heading of my query is almost completely arbitrary as long as the value on the right of my map function translates to a valid JSON object.

Now at work there are often a lot of debates as to whether things are “rigid” or “structured” or whether they are “flexible” or “formless”. It is a bit like the old meat and poison adage. CouchDB allows a client to construct an almost arbitrarily rich response to a query with almost no restriction on how the data that should be included in that response. In some cases this is going to allow you to easily interact with very complex unstructured data in some cases it is going to be an invitation to create a sprawling dataset with no value. There is no inherent right or wrong choice here but for a particular problem being solved there is probably going to be a wrong and right choice. SQL is powerful because of the restrictions and rules it builds into its grammar. Using Javascript is powerful because it relaxes those restrictions. Programmers and IT folks in general often fall into using the laxest possible implementation for reason of “flexibility” but then either have to impose order themselves or lose the power of the more restrictive choice.

So putting that into a concrete example, if a write a view with SQL I am going to have to follow a set of rules to get the data I want (for example my heading is going to have to be a set of tuples of equal size), using an arbitrary script and JSON means I am going to be able to get exactly the data I want in the form I want it. However since that return structure is customised to my query I might possibly be reducing my reuse by being over-specific or by building too much logic into my view code.

That’s quite a diversion just so I can say it’s horses for courses, so let’s wrap up this quick look at CouchDB views. All of CouchDB’s views are effectively JSON objects that are passed to a separate view server. This is a separate process that interacts with the main server via STDOUT and STDIN pipes. By default this is the view server that is built from the Spidermonkey library (it is called couchjs). However you can write a view parser for any language and plug it into CouchDB by creating an executable and mapping it to a MIME type in the couch.ini file. The view server essentially parses and readies the query function that is associated with the view and is then sent every document in the database as a JSON string. The view server picks up the results of reading every document and sends that back to the query request.

It is a pretty simple system and it works will for the relatively flat documents I have been trying with it. However I suspect that in a project with multiple developers some ground-rules for writing consistent query code would be a must.

Standard
Programming, Ruby, Scripting

CouchDB

What is CouchDB?

CouchDB is a dedicated document based database that kind of puts it in the same space as Exist, Xindice and Oracle Berkeley XML Db. What makes it very different is that instead of building around XML it is built around Javascript. The document storage format is JSON and the query language is Javascript. Something like Exist uses a minimum of well-formed XML for data and XPath or XQuery for querying.

Now the merits and flaws of the various markup languages is really holy war and I don’t want to get into it too much. I think it is enough to say that CouchDB aims to be lightweight in implementation without being lightweight in features or performance. JSON is very popular in the web space and by focussing on making common cases easier it is less work to use than XML. The more complex the data or specification requirements though the less viable it is as a solution.

Using JSON means that stored data is extremely flexible in structure, unlike relational data you can have very gappy or bitty information and not have a problem. Take something like a customer relationship management system. This is often a poor fit to relational data because you tend to discover information in small tranches. Initially a lead might be a first name and a telephone number. As the relationship develops you add a surname, a company name, a position, an collection of issues or queries and so on. JSON is a really good way of capturing this evolving picture of things.

Using Javascript to query this data surprised me initially, I feel pretty comfortable with my limited Javascript skills and therefore didn’t have a problem with it as a choice. I’ll come back to this later in the post but actually the idea sounds strange but is a natural fit when you use it. Again it is about the right tool for the right job but in the context of the data you are using in CouchDB creating a query by combing iterators, if statements and the functional programming map function is a better fit than trying to adapt SQL.

The final component of CouchDb is the functional language Erlang which is used to service the HTTP REST interface that CouchDB uses to provide an interface to the database. Erlang provides very cheap, lightweight concurrency that provides a good fit for handling lots of HTTP requests.

The interesting thing is that CouchDB automatically provides the REST front end that you tend to build on top of Java object stores like JBoss Cache or Coherence (which I have also been looking at recently). With those products you tend to stay native until you need to share data with other languages or systems and then you tend to REST serve it out as XML with a lightweight HTTP server. If you can see that need ahead of time then CouchDB might well be compelling due to the built-in support.

Installing

After reading about CouchDB I wanted to take a look at this new document database in a bit more detail. I managed to install the base tarball from the Google Code page easily. I then cut and pasted the Ruby sample REST code from the CouchDb Wiki into irb and on running the server I found that I could store and retrieve documents really easily. The combination of interactive Ruby and the REST interface made it easy to play around with data objects.

However when trying to run the tests via the administration web page only four of the tests would pass. The only error message was just an Error Code of 136. Since I seemed able to store and pull documents (and could confirm that the datafile had been generated and had data) I put it down to alpha flakiness and pushed on.

It was only when I tried executing ad-hoc queries and kept having them fail that I realised what was linking my problems. Every time the main Erlang server handed off a query to the Javascript interpreter the process died. Erlang is really resilient to thread death so there was zero impact from this.

CouchDB uses the Mozilla Spidermonkey libraries to construct its own View Server program which is configured via the couch.ini file. Having double-checked that the interpreter was there, was executable and could read the main.js script it was passed I was baffled. I decided to pull down the code base from SVN and then rebuild it. Building from SVN is a bit more involved than the tarfile and its worth double-checking the documentation on the wiki before you get stuck in. Having built and installed myself a new copy I still had the same problem and was feeling pretty cheesed off. With no other source of help I headed to IRC and checked out the Freenode #couchdb channel. There Christopher Lenz gave me a good steer that he had had a problem with Erlang’s HIPE (high performance virtual machine) on Debian.

I had been using Erlang 5.5.5 which I think I used apt-get to obtain. Downloading the latest Erlang release (5.6.1, they release pretty frequently) I tried building that from source and hung when configure tried to check for floating-point exceptions. Running configure with disable-hipe allowed configure to complete and then Erlang was straight-forward to build and install.

Restarting the CouchDB server I found my queries were working and all but one test (“conflicts” fails on an error assertion and this is apparantly a known issue) passing, hurrah!

The machine I had been working on was a Feisty Fawn Ubuntu instance so I loaded up Gutsy Gibbon in VM Ware Player and tried building Erlang and CouchDb there. Erlang was fine, even with Hipe. For CouchDB I used the latest Erlang I had built and used apt-get to obtain the Spidermonkey and ICU depdencies. Everything worked out of the box with this combination.

This problem highlights a few of CouchDB’s weaknesses. Firstly it is all bleeding edge stuff, Erlang is bounding along and so is CouchDB. Secondly there is no test suite in the source distribution that can help troubleshoot issues. Finally the error logging is tricky to understand for a neonate. I’m certain that having an EUnit test suite would have directed me towards the fact that the Javascript was failing much quicker than using trial and error and IRC.

The final issue was where CouchDB stores its logs and datafiles by default. These default to /usr/local/var which is not really where I want to store data like this. /var is a more natural choice and FHS seems to agree. You can change this when building in the etc/couch.ini.tpl file of the source directory but it would be nicer to have a more natural choice by default.

I also tried to install CouchDB on OSX without using a packaging mechanism but instead referring to an instance of Spidermonkey 1.7 I had on my system. While I could get configure to accept the Javascript library it wouldn’t recognise the jsapi.h header, maybe because configure doesn’t define the right Macros when it tries to build the test file.

Working with CouchDB

My first experience of CouchDB’s web interface was that it was good-looking but that the error messages, unresponsive links, Firebug warnings and occasional Javascript pop-ups were all signs that it was a work in progress. On fixing my passthrough to Javascript I had a very different experience of a slick and good looking interface that uses amazingly responsive AJAX to provide a really good experience.

Learning how to generate queries for example is easy with the in-built query browser (I’ll save the actual syntax for a later post) as the feedback from the system is very quick. Similarly creating new databases, documents and fields is actually not torturous but slick and quick.

I’m certain this is because Erlang and AJAX form a natural partnership in creating and servicing small requests that need to be handled quickly. I may be wrong but this is certainly my most positive AJAX webservice experience to date.

The CouchDB wiki contains information on getting started with various languages and I choose Ruby. It was a simple cut and paste into irb and then it was straight-forward to interact with database from the shell and the web client.

One of CouchDB’s more unusual feature seems to stem from Erlang’s concurrency model. Data is assigned a revision and in addition to applying multiple changes in order you can also view previous revisions. That’s a pretty weird feature compared to most of the current datastores. You also need to refer to a revision if you want to update a record. If the revision of the target of the update has changed when you submit your changes your changes get refused. The revision mechanism is also the basis of the replication mechanism although there isn’t enough documentation to understand when the replication pairs check their revision ids.

Incremental data-construction is easy via the web-gui but you need to fiddle a little bit to get the revision number to target if doing it programatically. Presumably a library would make that easier going. For information that has little inherent structure or is very gappy or incremental then CouchDB is a great data store and is currently occupying a niche as far as I know.

Standard