Software, Work

Don’t hate the RDBMS; hate the implementation

I read through this post with the usual sinking feeling of despair. How can people get so muddled in their thinking? I am not sure I can even bear to go through the arguments again.

More programmers treat the database as a dumb store than there are situations where such treatment is appropriate. No one is saying that Twitter data is deep, relational and worthy of data mining. However, not all data is like Twitter microblog posts.

The comments on the post were for the most part very good and said a lot of what I would have said. However, looking at the CouchDB documentation, I noticed that the authors make far less dramatic claims for their product than the blog post does. A buggy alpha release of a hashtable datastore is not going to bring the enterprise RDBMS to its knees.

I actually set up and ran CouchDB, but I will save my thoughts on that for another day; it's an interesting application. What I actually want to talk about is how we can get more sophisticated with our datastores. It is becoming apparent to me that ORM technologies are making a dreadful hash of data. The object representation gets shafted because inheritance is never properly translated into a relational schema, and the relational data gets screwed because the rules for object attributes are at odds with the normal forms.

The outcome is that you end up with a bastard hybrid that is the worst of all worlds. What's the answer?

Well, the first thing is to admit that application programmers think the database is a big dumb datastore and will never stop thinking that. The second is that relational data is not the one true way to represent all data. Relational databases are, however, the best tool we have at the moment for representing rich data sets that contain a lot of relational aspects. Customer orders in a supply system are the classic example. From a data mining point of view you are going to be dead on your feet if you do not have customers and their orders in a relational datastore. You cannot operate if you cannot say who is buying how much of what.
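
To put that concretely: with a hypothetical schema of customers, orders, order lines and products (all the table and column names here are invented for illustration), that question is a single query.

-- Who is buying how much of what?
SELECT c.name AS customer, p.name AS product, SUM(l.quantity) AS units
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_lines l ON l.order_id = o.id
JOIN products p ON p.id = l.product_id
GROUP BY c.name, p.name;

Try expressing that over a pile of serialised objects and the point rather makes itself.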

If you let developers reimplement a data mining solution in their application for anything other than edge and niche cases then you are going to waste a lot of time and money for no good reason. You simply want a relational datastore, a metadata overlay to reinterpret the normalised data in terms of domain models, and a standard piece of charting and reporting software.

However the application programmers have a point. The system that takes the order should not really have to decompose an order taken at a store level into its component parts. What the front end needs to do is take and confirm the order as quickly as possible. From this point of view the database is just a dumb datastore. Or rather what we need is a simple datastore that can do what is needed now and defer and delegate the processing into a richer data set in the future. From this point of view the application may store the data in something as transient as a message queue (although realistically we are talking about something like an object cache so the customer can view and adjust their order).

Having data distributed in different forms across different systems creates something of a headache, as it is hard to get an overall picture of what is happening on the system at any given moment. However, creating a single datastore (implemented by an enterprise RDBMS) as a single point of reference is something of an anti-pattern. It makes one thing, the big picture, easier. But to provide it, data is being bashed by layering technologies into all kinds of inappropriate shapes, and various groups within the IT department are frequently in bitter conflict.

There needs to be a step back: IT people need to accept the complexity and start talking about the whole system as comprising many components, all of which need to be synced and queried if you want the total information picture. Instead of wasting effort fitting however many square pegs into round holes, we need to be thinking about how to use the best persistence solution for a given problem, and how to report on and coordinate these many systems.

It is the only way we can move forward.

Java, Software

Setting up a Derby Database

Okay, so I am currently setting up a number of databases on my own V-Server and I thought it would be helpful to make a few notes about the current Java databases and how easy they are to run in server mode. The two biggest are Derby and HSQLDB; this post is about Derby. Both databases are really geared towards the embedded space and generally focus their documentation and examples around it. In their default setups both are pretty insecure for running on a public network; neither, for example, defaults to using passwords to authenticate logins.

All the databases are being set up the same basic way. Each database runs in the userspace of a dedicated UNIX user, that user is accessed via SSH and SSH keys, and all the datafiles reside within the user's space. I'm not going to go into that, as it is fairly straightforward and probably better covered by the various UNIX admin guides you can find around the internet.

So, Derby first then. I have played around with Derby before, so I hoped this would be the easiest. One thing you need to know about Derby is that its documentation is good but it is organised into several different documents… by a madman. There is no apparent logic, rhyme or reason to the way information is divided between one document and another. When setting up the server I had to jump between the Developer's Guide, the Tuning Guide, the Server Guide and, more unusually, the Tools Guide. If you have the same experience then don't worry, you're probably doing it right.

Having downloaded the Derby package, unpack it into the dedicated user's home directory. I find it helps with upgrades to create a symbolic link, ~/derby, to the actual installation directory. You can also create a profile environment variable, DERBY_HOME, that points to $HOME/derby.
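
For example (the version-numbered directory name below is just an illustration; use whatever your release unpacks to, and put the export line in the user's profile):

ln -s ~/db-derby-10.14.2.0-bin ~/derby
export DERBY_HOME=$HOME/derby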

With that all set up you can simply run ~/derby/bin/startNetworkServer and you should be in the basic network server business. This is where it gets a bit more complicated. Derby looks for its configuration in the directory where you start it, not in DERBY_HOME (strictly, in derby.system.home, which defaults to the current working directory). So you need to create a derby.properties file in $HOME to configure the behaviour of the server.

To require authentication you need two properties in the properties file. Remember that the format of this file is a standard Java properties file. The two properties are:

  • derby.connection.requireAuthentication=true
  • derby.authentication.provider=BUILTIN

You should now restart the server. From now on, when you want to interact with the database you should need to provide a username and password. You have to check this, though, to make sure it actually works.

Run the ij tool (~/derby/bin/ij) from the command line and try to create a new database with something like the following:

connect 'jdbc:derby:test;create=true';

You should get a message saying something like ‘Invalid authentication’. If the database is actually created then Derby is probably not picking up your properties file, which in turn is likely to be an installation issue.

All being well, we can now add a user to the properties file. User information is also a property, so to keep the file manageable you probably want to use comments to divide it into relevant sections and try to keep all the user setup in the same place. The format of a user entry is derby.user.USERNAME=PASSWORD. So for example:

derby.user.test=changeme
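
Putting the pieces so far together, the whole properties file might look something like this:

# Authentication
derby.connection.requireAuthentication=true
derby.authentication.provider=BUILTIN

# Users
derby.user.test=changeme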

Okay, now restart the server and log in via ij again. This time use a connection URL like the one below.

connect 'jdbc:derby:test;create=true;user=test;password=changeme';

This should now create a new database under $HOME and give you access to it as normal. Drop out of ij and confirm that the test directory has been created.

That is pretty much that; from here on in you can connect to your server database instance as you would any other database server.
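
One thing worth noting: a remote client talks to the network server with a different URL form from the embedded one used above. Assuming the default port (1527), the test user from earlier and, given the SSH-only setup, a tunnel forwarding the port to your own machine, the ij connect looks like this:

connect 'jdbc:derby://localhost:1527/test;user=test;password=changeme';

The same URL works from any JDBC client, provided derbyclient.jar is on the classpath.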

Written like this the setup must seem very basic and quick to perform, which is good. However, when I was finding my way with it, it took over an hour, and most of that time was taken up with looking for things in the documentation. Nothing is logically organised with Derby; for example, the configuration properties are all described in the Tuning Guide. Except the network control properties, which are described in the Admin Guide.

Unlike in code, DRY is not a good idea in documentation unless you are going to modularise it very rigorously. Even then, why not repeat something if it means that all the information relevant to a task is in one place?

Derby has a great feature list but I wonder how many users are ignorant of what it can do because of the poor documentation?

Java

ORM is evil

Object relational mapping (ORM) is evil because it is the great big programmer hammer driving the square peg into the relational data round hole. ORM is a self-fulfilling prophecy: because it treats the RDBMS as a big dumb datasource, it makes the database a big dumb datasource. Worse, it allows people not to think of the database as a collection of, y’know, relational data, but instead as a list of entities. That would be true if you were using an OODBMS, but an RDBMS is not the same thing. ORM makes your RDBMS look a bit like an OODBMS by making the data model teh suck.

The biggest problem with programmers seeing the database through the ORM reality-distortion filter is that they start thinking that the entities they deal with are the data. In most cases that simply isn’t true, or if it is, it is because the entities have started to become some kind of uber-object that contains everything about everything. The kind of object, in fact, that made people say EJB is slow.

If you take something like a sale in an online shop, that sale means different things to different people. To the accountant it means margin, to the sales referrer it means commission, to the dispatcher it means a weight, a volume and an address, and to the customer it’s just a credit card charge and a wheelbarrow (or whatever it is they’ve bought).

The amazing thing about the RDBMS is that it recognises this multi-faceted view of data and accurately decomposes it into its constituent elements. The dispatcher doesn’t have to see the sales commission and the accountant doesn’t have to see the item volume. You use queries to constitute a coherent and logical projection of the underlying data, defined by the purpose the data is going to be used for.
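
For instance, sticking with the sale above (the schema here is invented for illustration), each consumer can be given its own projection of the same underlying rows:

-- The dispatcher's projection: weight, volume and address, no money.
CREATE VIEW dispatch_note AS
SELECT o.order_id, c.delivery_address,
       SUM(i.weight) AS total_weight, SUM(i.volume) AS total_volume
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.order_id, c.delivery_address;

The accountant’s margin view and the referrer’s commission view sit over exactly the same tables, and none of them has to see the others’ columns.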

ORM fanatics rarely see things this way; instead they want to denormalise here and there, aggregate bits of data directly into the record, and add a few nullable columns to the schema because, hey, the Java fields can be null. In short, they want to make the schema reflect their view of the world. Because they don’t have to make sure that the various departments get their reports, or that the auditors have their clear and independent trail, they tend to see themselves as the centre of the universe and whatever hellspawned application they are working on as the answer to all the “problems” of the big dumb datastores.

A lot of people complain if you propose splitting logic between layers. The trouble is that someone who bleats about splitting across layers will inevitably follow up their complaint with the proposal that all the logic should, in fact, shock horror, be located in application X, their latest wonder-project.

This proposal never works, for one simple reason: if you don’t want to split logic between layers then you are committing yourself to adding every last feature of Oracle, MySQL, DB2 or whatever into your application. Access control, query languages, caching, data intersections, transactions: you name it, it is in the database for a reason. Unless you can reproduce all of those features in your application, forget about avoiding the split; you should put function where it is best serviced.

ORM should see itself as just one consumer of data and learn to play better with others. It should be about the convenience of presenting one valid method of data modelling in a form that is more familiar to programmers. It should take away the impedance of having to comprehend the relational model (which, I admit, is hard), but it should leave it at that and stop presenting itself as the true answer to complex data issues.
