I have been looking at a lot of the different databases that work outside the traditional SQL-style RDBMS (and I want to point out that a true RDBMS might actually be something very different from what we have now). I have spent more time with CouchDB than with the others because it seems to occupy a special niche in data storage. It also builds in a REST interface right off the bat, which effectively decouples the storage mechanism from the delivery and makes the whole system very language neutral.
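To make that concrete, here is roughly what reading a document looks like from browser Javascript (the database name and document id are invented for the example, and I am assuming the page is served from the same origin as CouchDB, as Futon's pages are):

```javascript
// Sketch: reading a document straight over CouchDB's HTTP interface.
// The database ("books") and document id are invented; 5984 is CouchDB's
// default port. Any language with an HTTP client can do the same, which
// is what makes the store language neutral.
var request = new XMLHttpRequest();
request.open("GET", "http://localhost:5984/books/book-dune", false); // synchronous, for brevity
request.send(null);
var doc = JSON.parse(request.responseText); // the document arrives as plain JSON
```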
CouchDB represents the first serious contact I have had with JSON, and in deciding whether CouchDB is right for your project you really need to understand what JSON does and doesn’t offer. The first important thing is that JSON offers a kind of compromise between the heavy data definition of SQL or XML Schema and the totally typeless data of something like flat files or Amazon SimpleDB. To achieve that simplicity, though, there are two consequences. Firstly, data integrity is palmed off onto client applications, and you will need to check data going in and coming out. I don’t personally believe that there is ever an application that doesn’t have some explicit or implicit data integrity. Secondly, JSON documents can be very rich and complex but they cannot have detailed field structures: a telephone number simply decomposes into two numbers, and it’s hard to get a much finer-grained description of the data. As a rule of thumb, if you would normally create a well-formed XML document for the data, or you would write a SQL data definition that declares the column types using all the defaults and never specifies a NOT NULL column, then you probably have a good match.
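To illustrate, here is the kind of document you might end up storing (the fields are invented for the example); notice that the telephone number is just a couple of bare numbers with no richer description attached:

```javascript
// A hypothetical contact document. JSON gives us objects, arrays,
// strings, numbers, booleans and null, but no way to declare that
// "number" must actually be a well-formed telephone number.
var contact = {
  "_id": "contact-jane-doe",
  "type": "contact",              // an informal, application-level type tag
  "name": "Jane Doe",
  "telephone": {
    "areaCode": 20,               // just a number; nothing constrains its range
    "number": 79460000            // likewise, no format or length rule
  }
};
```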
JSON also favours dynamic languages over statically typed ones like Java. You need to decide whether you are going to be able to take advantage of the full set of features in your chosen client language. XML might remain a better choice for static languages, as you can describe document instances declaratively.
Other good features of CouchDB include the fact that it is incubating at Apache and will be a good addition to the projects there. It also leverages existing Javascript skills, which are probably more common than knowledge of XPath, XQuery or even XML Schema and Relax NG. It has excellent support for incremental data and also fits well with highly irregular data like CMS page content. It also has an excellent UI that makes it easy to interact with the data. Finally it seems to have a good solution for scalability without doing anything too esoteric.
The negative features are tricky because we obviously have an alpha project here that is probably closer to the start of the public testing phase than the end. Some of these cons may well be addressed before the first release candidate. The first obvious omission for a server product is that there is no security built into the server. Just supporting optional HTTP Authentication would be enough to make it practical to start running some experimental servers for something more than sandbox exercises. One major difference from the feature set of the standard pseudo-RDBMS is that there is no way of interacting with sets of data. There is a method for changing multiple documents in one pass, but what you really want is to apply a set of changes to documents identified by a query, the equivalent of SQL’s UPDATE.
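To make the gap concrete, here is a sketch of what a client currently has to do instead, assuming the _bulk_docs endpoint for the multi-document write and with the document fields invented for the example:

```javascript
// Sketch: emulating SQL's UPDATE against CouchDB. There is no
// server-side equivalent, so the client has to fetch the documents,
// change them locally and write the whole set back in one request.
var docs = [
  // ...previously fetched (e.g. from a view), each carrying its _id
  // and current _rev so the server can spot conflicting edits...
  { "_id": "book-1", "_rev": "946B7D1C", "title": "Dune", "inPrint": true },
  { "_id": "book-2", "_rev": "3A91F2D0", "title": "Emma", "inPrint": true }
];

// Apply the "UPDATE ... SET inPrint = false" locally.
for (var i = 0; i < docs.length; i++) {
  docs[i].inPrint = false;
}

// Serialise and POST the whole set to http://localhost:5984/books/_bulk_docs
var bulkRequestBody = JSON.stringify({ docs: docs });
```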
Related to this is the fact that data in CouchDB is currently heavily siloed. If I have a set of data referring to authors and a set of data relating to books then I currently need to duplicate data in both. One of the problems the developers are going to face in evolving CouchDB is how to address this without introducing a solution so complex that you ask why you are not using an RDBMS. Similarly I notice there is an item in the roadmap about data validation. If you start introducing data validation and rules for validating data, then before long there is going to be a question as to why you don’t simply use one of the existing document systems, as all the current simplicity will have gone.
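A quick sketch of the silo problem, again with invented fields; the author’s name has to be copied into each book document because there is nothing like a join to bring it in at query time:

```javascript
// Two silos of documents. The author's name is duplicated into the
// book document because there is no join to pull it in when reading.
var author = {
  "_id": "author-herbert",
  "type": "author",
  "name": "Frank Herbert",
  "born": 1920
};

var book = {
  "_id": "book-dune",
  "type": "book",
  "title": "Dune",
  "authorName": "Frank Herbert"   // copied from the author document
};
```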
One thing that definitely needs improvement is error logging and reporting. Often the only error feedback you have is a Javascript popup that says “undefined” and a log message that tells you that the Erlang process terminated. There needs to be some more human-readable issue logging that points you towards what is going wrong.
Heya,
I seem to have missed your series about CouchDB when it came out, so here’s a late reply:
Thanks :) Thanks for having a look at CouchDB and giving it a fair evaluation. I think you nail most of the pros and cons; we are certainly working on the cons :)
Hope you make it back to #couchdb on Freenode.
Cheers
Jan
—
I’m planning to do some more with CouchDB soon; I think the critical thing is some kind of basic authorisation mechanism.
rrees:
I am answering a thread where you said: “I think the commonest answer would be that as the capabilities of our hardware platform rise so do our expectations of what can be achieved. Modern web architectures have the possibility of serving data around the entire globe. It is an intriguing possibility which takes us a long way onwards from what one physical machine can do, no matter how powerful.”
After studying the Unified Modeling Language and Use Case Modeling, it appears to me that we could reduce the program storage in any given computer by building a global object library. In that, a cascade could store a hierarchy of objects. The schema would be global. This would be a smaller, more compact type of computer, where any one computer could extract and run global objects for any given use. New objects would be automatically tested (compared) and located in the library according to use. The new objects would be analyzed and converted to cross-platform compatible objects as they are submitted to the library, again automatically. Objects extracted for use are dumped on task completion or program exit.
An example of something consuming hard drive space is the Java objects obtained when downloading Java. Those objects could be stored in a global library for extraction and use on demand by anyone using the new type of computer.
Programs could be much smaller, executing line-by-line object acquisition statements. Then the new type of computer becomes an object in the global library and, if so authorized, could be an object in a smaller local group, collectively extracting and using objects from the object library while performing interactive computations.
The only hitch in the idea is bandwidth.
What you are describing is essentially a distributed object cache. This is a hot topic at the moment with several implementations competing (links below).
I think CouchDB makes a good choice in selecting JSON for its cross-platform data serialisation format. However JSON is hard to map to languages that expect their object definitions to be fixed at compile time.
A better fit for languages like Java might be an XML format where the Schema definition of the document and the object definition can be kept in synch. I do plan to look at Exist at some point.
http://www.gigaspaces.com/
http://www.terracotta.org/
http://www.oracle.com/technology/products/coherence/index.html
I agree. However, globally, as long as there are a multitude of spoken languages, the number of objects will grow exponentially.
Thanks, great article.
I think you would still need to create a defined data structure when using CouchDB. This data definition would be created using JavaScript rather than a DTD or an SQL schema?
I’m also curious about the risk of malicious code being executed when data is “parsed” by clients.
You don’t really need to define a structure; it is implied by the data you write. Consumers have the responsibility of checking and responding to the data they read.
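As a very rough sketch of what that consumer-side checking might look like (the expected fields here are my own assumption, nothing CouchDB knows about):

```javascript
// Minimal consumer-side check for the contact documents sketched earlier.
// Nothing in CouchDB enforces this shape; each reader has to verify it.
function isValidContact(doc) {
  return !!doc &&
         doc.type === "contact" &&
         typeof doc.name === "string" &&
         !!doc.telephone &&
         typeof doc.telephone.number === "number";
}
```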
JSON should always be parsed (I presume your quotation marks refer to eval’ing the code instead) to remove the risk of malicious code. This is true even with Javascript; in fact browsers now have built-in JSON parsers, so it is not only easy but fast.
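For example, assuming a browser with the native parser (or a library such as json2.js standing in for it):

```javascript
var text = '{"name": "Jane Doe", "admin": false}';

// Unsafe: eval will execute any code smuggled into the string.
// var data = eval("(" + text + ")");

// Safer, and fast where the native parser exists: only literal JSON
// data is accepted, anything else raises an error.
var data = JSON.parse(text);
```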