Software, Work

Don’t hate the RDBMS; hate the implementation

I read through this post with the usual sinking feeling of despair. How can people get so muddled in their thinking? I am not sure I can even bear to go through the arguments again.

More programmers treat the database as a dumb store than there are situations where such treatment is appropriate. No-one is saying that Twitter data is deep, relational and worthy of data mining. However not all data is like Twitter micro blog posts.

The comments on the post were for the most part very good and say a lot of what I would have said. However looking at CouchDB documentation I noticed that the authors make far less dramatic claims for their product that the blog post does. A buggy alpha release of a hashtable datastore is not going to bring the enterprise RDBMS to its knees.

I actually set up and ran Couch DB but I will save my thoughts for another day, it’s an interesting application. What I actually want to talk about is how we can get more sophisticated with our datastores. It is becoming apparent to me that ORM technologies are really making a dreadful hash of data. The object representation is getting shafted because inheritance is never properly translated into a relational schema. The relational data is getting screwed by the fact the rules for object attributes is at odds with the normal forms.

The outcome is that you end up with the bastard hybrid worst of all worlds solution. What’s the answer?

Well the first thing is to admit that application programmers think the database is a big dumb datastore and will never stop thinking that. The second is that relational data is not the one true way to represent all data. They are the best tool we have at the moment for representing rich data sets that contain a lot of relational aspects. Customer orders in a supply system is the classic example. From a data mining point of view you are going to be dead on your feet if you do not have customers and their orders in a relational data store. You cannot operate if you cannot say who is buying how much of what.

If you let developers reimplement a data mining solution in their application for anything other than your very edge and niche interests then you are going to be wasting a lot of time and money for no good reason. You simply want a relational datastore, a metadata overlay to reinterpret the normalised data in terms of domain models and a standard piece of charting and reporting software.

However the application programmers have a point. The system that takes the order should not really have to decompose an order taken at a store level into its component parts. What the front end needs to do is take and confirm the order as quickly as possible. From this point of view the database is just a dumb datastore. Or rather what we need is a simple datastore that can do what is needed now and defer and delegate the processing into a richer data set in the future. From this point of view the application may store the data in something as transient as a message queue (although realistically we are talking about something like an object cache so the customer can view and adjust their order).

Having data distributed in different forms across different systems creates something of headache as it is hard to get an overall picture of what is happening on the system at any given moment. However creating a single datastore (implemented by an enterprise RDBMS) as a single point of reference is something of an anti-pattern. It is making one thing easier, the big picture. However to provide this data is being bashed by layering technologies into all kinds inappropriate shapes and various groups within the IT department are frequently in bitter conflict.

There needs to be a step back and IT people need to accept the complexity and start talking about the whole system comprising of many components. All of which need to be synced and queried if you want the total information picture. Instead of wasting effort in fitting however many square pegs into round holes we need to be thinking about how we use the best persistence solution for a given solution and how we report and coordinate these many systems.

It is the only way we can move forward.

Standard

10 thoughts on “Don’t hate the RDBMS; hate the implementation

  1. Well, you seem to have made the same mistake a lot of people do when talking about CouchDB. It is not, in any way, meant to be a replacement for RDBMSes. No way, not in the slightest.

    “A buggy alpha release of a hashtable datastore is not going to bring the enterprise RDBMS to its knees.”

    It’s a shame you made this statement because it detracts from your other points. Firstly, CouchDB is not particularly buggy, you just made that up. Secondly, it’s not a hashtable, it supports views, indexes, ACID and replication, you made that bit up too, or perhaps didn’t read the documentation. Thirdly, again, I will point out that CouchDB is not a replacement for RDBMSes.

  2. To justify my original remark. When I ran the unit tests after installing CouchDB only four of them passed. I also tried to run the database with a different configuration file and it failed. I tried to install completely locally and run it but it seemed to get tangled up with the global instance.

    Even you use the term “not particularly buggy”. It has bugs, I don’t mind, it’s not done yet. But it does have issues.

    I also used the term “hashtable datastore”, CouchDB seems to use a UUID for storing documents and then associates a JSON deserialised instance of a data object. That seems very like a hashing implementation. Whether it supports views, indexing, transactions and replication doesn’t change that. Those are all properties of how you use that data and details of the datastore implementation.

    Everyone agrees that CouchDB is not a replacement for an RDBMs, except some people posting comments on the original post.

  3. There’s a fantastically obscure book by William Kent, “Data and Reality” that makes the central point that data modelling is hard. In my experience, the more reality you try to include in one model, the harder it gets, no matter how you are modelling the data. A data model for two apps will always be more complex than a data model for one app, because all kinds of assumptions about entities will have be thrashed out.

    From my decrepit, long-in-the-tooth viewpoint, it seems that we now have a generation of coders for whom an app is a web service and a data store. Since all of the uses of the data are encoded in the app, the data store, whether an RDBMS or something else, serves as little more than a collection of text files.

    If you *never* need to do anything with your data than what is coded in your app, then a simple storage solution is probably best. On the other hand, if you have data access needs beyond OLTP, or have far more data than memory combined with random access patterns, or need to share data between multiple apps, then an RDBMS is an easy win.

    But despite the ease with which we all picked up 3NF in our DB classes at university, large-scale, multi-app data modelling is *hard*.

  4. I look forward to the day when a young developer looking to setup their own blog doesn’t have to set up a full MySQL server installation just to run a copy of wordpress. No matter how easy it is, even on shared hosting, it’s a bad fit and it’s overkill. I’ve recently moved a couple projects from Postgres to SQLite as its ease of management FAR outweighs any potential performance issues (of which I’ve had none). I love being able to copy a single directory of files to a disk and have a complete copy of that web application, database and all. So while I would wholeheartedly agree that there will be a continuing need for the monolithic RDBMS, there is a large, open gap in our current data persistence practices that covers the ground between flat files on a disk and a cumbersome relation database server.

  5. “When I ran the unit tests after installing CouchDB only four of them passed.”

    You must have taken a cut from trunk then as we are currently working on new SpiderMonkey integration. If you try with the last official release this is not the case.

    “Even you use the term “not particularly buggy”. It has bugs, I don’t mind, it’s not done yet. But it does have issues.”

    I said this because all software has bugs, even the big players like Oracle or MySQL. If I had said “bug free” you would have, rightly, taken me to task for it, no matter what software I was describing.

    “I also used the term “hashtable datastore”, CouchDB seems to use a UUID for storing documents and then associates a JSON deserialised instance of a data object. That seems very like a hashing implementation. Whether it supports views, indexing, transactions and replication doesn’t change that. Those are all properties of how you use that data and details of the datastore implementation.”

    I could say that a RDBMS is also a hash table because it associates rows or tuples with UIDs in the same way. But this would be false and a grave oversimplification, as is your original statement.

    “Everyone agrees that CouchDB is not a replacement for an RDBMs, except some people posting comments on the original post.”

    Some people are mistaken though, that is not a reason to attack CouchDB though. Even a brief read of the the CouchDB wiki overview would have shown you this.

  6. I agree that having to get an instance of MySQL running for comparatively lightweight applications is a totally drag.

    However I do think you’re confusing the RDBMS and the implementation. SQLite (which I agree is great) is still a relational database, it’s just a very lightweight one. Derby and HSQL are equally relational and only slightly more heavyweight (in fact in comparision to MySQL they have a lot of punch for the size/complexity). There’s nothing in relational data making MySQL heavier, just a desire to service a different market.

    What document databases like Exist and CouchDB are doing are addressing a different data problem. One that probably does suit problems like CMS and Blogs.

  7. Noah, I’m sorry I called your baby ugly. I *like* CouchDB, I’m writing a post about it now.

    I did have failing test cases, in the post I’ll explain why that was. I also read the wiki more than briefly. I was still left with the impression I described above.

    This post is about data modelling and architecture issues. For now can you accept that the original comment is not an attack on CouchDB but a reality check on people who confuse RDBMS and heavyweight solutions and save any more comments you have for the CouchDB specific post (which will be done in the next couple of days).

  8. stuffduff says:

    GT.M (http://sourceforge.net/projects/fis-gtm/) is another alternative to RDBMS implementations. The main issue I have with RDBMS has to do with flexability. In situations where the data to be managed is well established I think that the RDBMS is a reasonable approach. What I keep running into are situations where collections of dissimilar objects are required. In this kind of environment the RDBMS can be demonstrated to be to restrictive. GT.M’s predecessors were managing 100’s of users on systems with less than 1 GB of memory, and hard drives under 1 GB. Unfortunately, those with a strong RDBMS education/worldview often can’t wrap their minds around the ‘array’ concepts and seldom learn to appreciate them or to use them effectively.

  9. Dissimilar sets of data are the classic example of where document style databases should be preferred to Relational Models. My favourite example is search terms.

    I’m not sure that relational data modellers can’t grok array ideas. After all they think of their data as collection of tuples.

    I think the main problem is what James says above that data modelling is hard and often identifying whether data is relational or not can be a lot harder than anyone realises. After all I tend to see a lot more of the opposite situation where people tend to regard their data as not having relational qualities and therefore having poorly structured data that is ridiculously expensive to query.

    I hadn’t heard of MUMPS before and had to look at Wikipedia for information. It looks interesting but the example syntax looks like it may be a step beyond the pale for me. How does GT.M supply data to non-MUMPS programs? One really appealing aspect of CouchDB is that it presents data via a REST and JSON interface.

  10. Dissimilar sets of data are the classic example of where document style databases should be preferred to Relational Models. My favourite example is search terms.

    I’m not sure that relational data modellers can’t grok array ideas. After all they think of their data as collection of tuples.

    I think the main problem is what James says above, that data modelling is hard and often identifying whether data is relational or not can be a lot harder than anyone realises. After all I tend to see a lot more of the opposite situation where people tend to regard their data as not having relational qualities and therefore having poorly structured data that is ridiculously expensive to query.

    I hadn’t heard of MUMPS before and had to look at Wikipedia for information. It looks interesting but the example syntax looks like it may be a step beyond the pale for me. How does GT.M supply data to non-MUMPS programs? One really appealing aspect of CouchDB is that it presents data via a REST and JSON interface.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s