Work

February 2024 month notes

Postgres

Cool thing of the month is pg-mem, a NodeJS in-memory database with a Postgres-compatible API. It makes it easy to create very complete integration or unit tests covering both statement behaviour and object definitions. So far everything that has worked with pg-mem has also worked flawlessly against both Docker-ised Postgres instances and CloudSQL Postgres.

The library readme says that containers for testing are overkill and it has delivered on that claim for me. Highly recommended.
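
To show the sort of test this enables, here is a minimal sketch assuming the package’s newDb API (the table and data are invented for illustration):

const { newDb } = require('pg-mem');

// Build an in-memory "Postgres" and define the objects under test.
const db = newDb();
db.public.none(`
  CREATE TABLE books (
    id   serial PRIMARY KEY,
    name text NOT NULL
  );
  INSERT INTO books (name) VALUES ('The Mythical Man-Month');
`);

// Statements run against the same SQL the real database will see,
// with no container or external server involved.
const rows = db.public.many('SELECT name FROM books ORDER BY name');
console.log(rows); // [ { name: 'The Mythical Man-Month' } ]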

Less good has been my adventures in CloudSQL’s IAM world. A set of overlapping work requirements means that the conventional practice of using roles and superuser permissions is effectively impossible, so I’ve been diving deeper than I ever expected to go into the world of Postgres’s permission model.

My least favourite discovery this month has been that it is possible to grant a set of permissions to a set of users in a way that generates no errors (admittedly via a Terraform module; I need to check whether Postgres complains directly about this) but still gets denied by the permission system when used.

The heart of the problem seems to be that the owner of the database objects defines the superset of permissions that other users can actually exercise, but you can happily grant other users permissions outside of that superset without error; the error only appears when they try to use the permission.

The error thrown was reported against a table providing a foreign key constraint, so more than a few hours were spent wondering why the user could read that table directly but still get permission denied on it. The answer seems to be that the insert into the child table is what fails: validating the constraint against the constraining table is what trips the permission system.

I’m not sure any of this knowledge will ever be useful again because this setup is so atypical. I might try to write a Dev.to article to give a future me something to Google, but I’m not quite sure how to phrase it so it matches the query.

Eager initialisation

I learnt something very strange about FakerJS, the Javascript test data generation library, this month, but it is just a specific example of libraries that don’t make an effort to lazy load their functionality. I’ve come across this issue in Python, where it affected start times in on-demand code; in Java, where the assumption that initialisation is a one-time cost fell apart once multiple deployments a day meant the price was never amortised; and now I’ve encountered it in Javascript.
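
A sketch of the difference in Javascript, with the package name and API being my assumptions (@faker-js/faker is the current package name as far as I know): a top-level require pays the initialisation cost on every cold start, whereas a dynamic import() inside the function defers it until the data is actually needed.

// Eager: the whole library (and its locale data) is initialised as soon as
// this module is loaded, even if generateFixture() is never called.
// const { faker } = require('@faker-js/faker');

// Lazy: defer the cost until the first call that actually needs it.
let fakerModule;
async function generateFixture() {
  fakerModule ??= await import('@faker-js/faker'); // loaded at most once
  const { faker } = fakerModule;
  return { name: faker.person.fullName() };
}

module.exports = { generateFixture };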

My takeaway is that it is important to [set aggressive timeouts](https://nodejs.org/api/cli.html#--test-timeout) on your testing suite rather than take the default of no timeouts. This only surfaced because some fairly trivial tests using the Faker data couldn’t run in under a second, which seemed very odd behaviour.

Setting timeouts also helps surface broken asynchronous tests and spares you the tedium of waiting for the test suite to hang before failing.
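
With the built-in node:test runner that looks something like this (the helper is a made-up stand-in and the one-second budget is just the value that would have caught my case):

const { test } = require('node:test');
const assert = require('node:assert');

// Hypothetical helper standing in for code that pulls in Faker data.
async function buildFixture() {
  return { name: 'example' };
}

// Fails fast instead of sitting on a slow eager import or a hung promise.
// A suite-wide budget can also be set on the CLI: node --test --test-timeout=1000
test('builds a fixture quickly', { timeout: 1000 }, async () => {
  const fixture = await buildFixture();
  assert.ok(fixture.name);
});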

Standard
Web Applications

Replicating data in Cloudant and Heroku

Heroku allows you to use CouchDB via the Cloudant cloud service, which is great, but compared to the documentation for the relational stores it is not clear how you are meant to deal with backups and importing data. I also couldn’t find a way to use Futon on the Heroku instance (which comes from the Heroku account; you can’t use your own Cloudant account with the plugin) or to share the database instance with my personal Cloudant account.

This post from Cloudant helps a lot. Essentially you get your Heroku instance URL, and then the cool thing about Couch’s painless replication is that once you have a Couch URL you can replicate that database to a local instance or even back into Cloudant.


heroku config --long

curl CLOUDANT_URL/_replicate -H 'Content-Type: application/json' -d '{"source" : "CLOUDANT_URL", "target" : "TARGET_URL"}'

You can edit the database locally and then replicate back to the Heroku instance by just swapping the URLs in the curl above.
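
If you would rather script it than keep the curl incantation around, the same request can be made from Node; a sketch assuming a local CouchDB on the default port (the URLs are placeholders, just like the ones above):

// Node 18+ has fetch built in; run as an ES module for top-level await.
// CLOUDANT_URL is the value reported by `heroku config --long`.
const CLOUDANT_URL = 'https://user:pass@example.cloudant.com/app1234567';
const LOCAL_COUCH = 'http://localhost:5984';

// Ask the local CouchDB to pull the Cloudant database down; swap source
// and target to push local edits back up again.
const res = await fetch(`${LOCAL_COUCH}/_replicate`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    source: CLOUDANT_URL,
    target: 'heroku-backup',
    create_target: true, // create the local copy if it does not exist yet
  }),
});
console.log(await res.json()); // { ok: true, ... } once replication finishes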

That seems to pretty much be it. I’ve replicated my data out of Cloudant and then back into it, which feels bizarre but it’s all symmetrical with Couch and it’s a handy cloud-based backup mechanism.

Standard
Programming

Who hard-coded the VAT?

With the recent change in UK VAT rates there has been an intriguing chance to see which retailers have a decent data model and which just assumed that VAT would never change.

If you look at your receipt, then for the companies that have a decent data model it will probably just look very ordinary. Maybe if they are showing off they will have a little message telling you the current rate.

However the companies that have a bad data model will have an additional line on your receipt, reading something like “VAT Discount”, that subtracts 2% from your net bill. This means they have happily hard-coded the VAT rate throughout their system in the mistaken belief that it was a constant, like gravity.

In fact, so mild have the economic climes been recently that it seemed a reasonable assumption. I used to work on a retail system, and while we did correctly data-drive the VAT rates, the data model had no way of tracking the change in VAT rate over time. If we changed the standard VAT rate it would also change for all historical invoices prior to the switchover. For this reason the VAT on an invoice line had to be calculated and written at the time the order was finalised. It was obscure, but the stored VAT rate was always the current rate, and if you wanted to figure out the historical rate that had been applied you had to do the work yourself.
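
What the model needed was for the rate itself to be a dated fact rather than a single constant; a sketch of the idea (the rates and dates here are illustrative rather than a definitive history):

// Each standard-rate change becomes a new entry with the date it came into
// force, instead of overwriting a single "current rate" value.
const vatRates = [
  { rate: 0.175, effectiveFrom: new Date('1991-04-01') },
  { rate: 0.15, effectiveFrom: new Date('2008-12-01') },
];

// Pick the rate that was in force when the order was finalised.
function vatRateAt(orderDate) {
  const applicable = vatRates
    .filter((entry) => entry.effectiveFrom <= orderDate)
    .sort((a, b) => b.effectiveFrom - a.effectiveFrom);
  if (applicable.length === 0) throw new Error('No VAT rate recorded for that date');
  return applicable[0].rate;
}

console.log(vatRateAt(new Date('2008-11-30'))); // 0.175
console.log(vatRateAt(new Date('2008-12-02'))); // 0.15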

Who would think that such a simple tax would cause such problems? Or allow you to see so much of the internals of big firms’ systems?

Standard
Software

Nulls mean nothing, or something…

My last post touched on something that is a real issue for me: using Null in a relational database. Now, I’m not Fabian Pascal, but I have had enough problems with Nulls to feel that if you are going to use a relational database it is well worth making the effort to eliminate Nulls from your data model.

You really haven’t suffered until you’ve had to add a Decode in a query to add a layer of logic to a column which really never should have been Null in the first place. And then there is learning the special logic for Nulls, usually when your predicate has failed to evaluate as expected. Could there be anything finer than saying X = Y OR X IS NULL?
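
To make that concrete, here is a sketch of the usual trap (run through pg-mem, the in-memory Postgres mentioned in the month notes above, purely as a scratchpad for standard SQL semantics): the NULL row matches neither the equality nor its negation, which is why that extra IS NULL clause keeps turning up.

const { newDb } = require('pg-mem');

const db = newDb();
db.public.none(`
  CREATE TABLE orders (id int, status text);
  INSERT INTO orders VALUES (1, 'shipped'), (2, 'pending'), (3, NULL);
`);

// Three-valued logic: the row with a NULL status matches neither predicate...
console.log(db.public.many("SELECT id FROM orders WHERE status = 'shipped'"));  // [ { id: 1 } ]
console.log(db.public.many("SELECT id FROM orders WHERE status <> 'shipped'")); // [ { id: 2 } ]  (row 3 vanishes)

// ...so it has to be asked for explicitly.
console.log(db.public.many(
  "SELECT id FROM orders WHERE status <> 'shipped' OR status IS NULL"
)); // [ { id: 2 }, { id: 3 } ]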

However usually my problems with nulls come down to the following:

(,,,3,)

(,2,3,,5) intersection (2, 5)

What do these things mean? They make my head explode in data terms. They don’t really make sense as tuples, vectors or indeed anything other than lists. In the case of lists you can charitably interpret the empty values as being positions that have no assigned value.

Null in programming languages usually has a strong definition. In Java it means that a reference has no associated memory allocation, in Ruby it’s a type of object. In databases people always struggle to say what NULL is.

In the comments on the previous blog post the commenter came up with at least three interpretations of a Null value: unknown, unset and Not A Number (NAN). However Null is really an absence of information; it’s kind of the anti-fact. Its presence tells you nothing, and as soon as you try to interpret it the number of potential meanings multiplies exponentially.

Trying to store NAN as a concept in a database is just misguided. If I can say a NUMERIC field is, in fact, NAN then why can’t I store the idea that a Number is not a Lemon? Or a roadsign? If it is really not a number, why are you trying to store it in a numeric field? If it’s the result of a calculation that is either not storable within the allocated storage or infinity then why don’t you store that information some other way?

Relational databases work best without nulls, and usually the data model works best if it is seen as a collection of facts that can be stored in a normalised form with strong relations between the sets of facts.

However the way they often get used is as surrogate (and usually underperforming) hash stores, where each entry consists of a set of keys that may or may not have values. This is great for object serialisation, and instead of relational queries you can introduce highly parallelisable search algorithms. However it kind of sucks at everything else. Firstly because the usual RDBMS like MySQL or Oracle isn’t expecting to be used as a hash store, so it is doing a lot of things that you aren’t going to need in this model. Secondly because the irregular nature of the hashed data means that trying to get sets of rows back out of the database can underperform, because the system is often forced to examine the data by brute force in a serial fashion.

The whole point of creating tuple arithmetic is so that you can optimise for relational processing and query big sets of data quickly. Completely ignoring it or, worse still, crippling it so that serialising objects is easy is like shooting yourself in the foot… with a shotgun… for no reason.

Standard
Software

Refactoring RDBMS: Adding a new column

So, the requirements change and you now need to record a new piece of information in one of your tables. That’s a pretty easy one, right? Just add a column and you’re laughing.

Well, adding a column is the way I see this problem solved pretty much 100% of the time (Rails migrations and Grails GORM for example), and I actually think it is the worst solution.

When you add a column to a table you should always consider whether you can add a sensible value. Adding a nullable column to a table is a really bad idea. Firstly, it isn’t relational (and it is almost guaranteed to reduce your normal form). Secondly, it is likely to introduce performance problems if the column needs to be searchable. Finally, nullable columns have a higher cost of maintenance as they imply logic (the null state) but don’t define it. The logic as to when the column may be null will always exist outside the schema.

If you can define a sensible default then you might be able to add the column to the table. For example boolean flag fields could be handled this way as long as the historic values can be accurately mapped to, say, false.

This method however falls down if your historical data cannot be truthfully mapped to a default value. If for example the new value represents something you have started to track rather than some brand new piece of information then you cannot truthfully map the historic data to the default value as you don’t know whether the default really applies or not.

For example if you send goods to customers using a regular or overnight delivery method and you need to start tracking the delivery method it would be wrong to map all historic data to the regular delivery. The truth is you don’t know whether a historical order was an overnight delivery or not. If you do map it incorrectly then you will end up skewing all your data and ultimately poison the very data you want to be tracking.

In this case it is far better to simply introduce a relational child table to hold the new information. Using child tables is a more accurate record of the data as a child row will only exist where the data is known. For unknown historic records there will be no child entry and your queries will not need any special cases.

When using child table data like this you can easily separate your data via the EXISTS predicate and in most databases EXISTS is very performant.
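
A minimal sketch of that shape, using the delivery example (table and column names invented, and wrapped in the pg-mem in-memory Postgres purely so it runs without a server):

const { newDb } = require('pg-mem');
const db = newDb();

// Orders existed before the delivery method was tracked; the new fact lives
// in a child table and a row only exists where the value is actually known.
db.public.none(`
  CREATE TABLE orders (
    id        int PRIMARY KEY,
    placed_on date NOT NULL
  );
  CREATE TABLE order_delivery_method (
    order_id int PRIMARY KEY REFERENCES orders(id),
    method   text NOT NULL
  );
  INSERT INTO orders VALUES (1, '2008-01-10'), (2, '2008-06-01');
  INSERT INTO order_delivery_method VALUES (2, 'overnight');
`);

// Orders whose delivery method is known, with no invented history...
const known = db.public.many(`
  SELECT o.id, d.method
  FROM orders o
  JOIN order_delivery_method d ON d.order_id = o.id
`);

// ...and the older orders where it is honestly unknown, via NOT EXISTS.
const unknown = db.public.many(`
  SELECT o.id FROM orders o
  WHERE NOT EXISTS (SELECT 1 FROM order_delivery_method d WHERE d.order_id = o.id)
`);
console.log(known, unknown);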

I think using child tables to record new information is the relational way to solve the problem of adding new columns, but it is usually turned down for one of two reasons (there may naturally be others but I’ve only ever heard these two used in anger).

Firstly there is the argument that this technique leads to a proliferation of child tables. This is an accurate criticism but unfortunately if you want your database to be accurate and the new information is introduced piecemeal then you do not have a lot of choice. Pretending that you have historic data you don’t actually have doesn’t solve the problem.

One thing that can help this situation is to separate your database into its transaction processing element and its warehousing or archive aspect. In the warehouse the structure of the database might be quite involved, but since the accuracy of the data is more important than the simplicity of the data model, and the queries tend to be of a reporting or aggregating nature, there tend to be fewer issues with having many tables. Often views can present a high-level picture of the data, and introducing null values into the heading of a query, while ugly, is more acceptable than breaking normal form in the raw data itself (though god help you if you start joining aggregate views on the gappy data rather than the underlying true state of the data).

The transactional version of the database is then free to reflect the state of the current data model alone rather than the historical permutations. Here the model can be clean and reflect more of the Domain Model although again you want to retain the Relational aspect of the data.

Having separate datastores with different schemas often makes app developers’ lives easier as well. Usually an application is only concerned with current interactions and not historical transactions. As long as historical data can be obtained as a Service then you have no need to reflect historical schema concerns in your application.

However even if you cannot separate the data there is still little reason to introduce null value columns into the database. ORM often makes the simple assumption that one data entity is one object, but that is simply to make ORM easier. The application persistence layer shouldn’t be allowed to distort the underlying relational data model. If your data isn’t relational or you can’t define a fixed schema, don’t use an RDBMS. Look into things like CouchDB.

In fact some persistence schemes, like JPA, actually allow you to reflect the reality of the relational data by allowing embedded entities in objects. This is a leaky abstraction, because from an object point of view there is often very little reason why, say, an Order object should contain a Delivery Dispatch Type object.

If you want to avoid the leak then use the Repository pattern to create a true Domain object and be prepared to handle the complexity of transforming the relational data to the object-orientated data within the Repository. Remember that the Repository represents a true boundary in the system: you are hoping to derive real benefits from both the relational representation and the object representation, not screw one in favour of the other.

So please: don’t just whack in another nullable column and then wonder why queries on your database are slow.

Standard
Programming

CouchDB: Pros and Cons

I have been looking at a lot of the different databases that are working outside the traditional SQL style RDBMS (and I want to point out that having a true RDBMS might actually be something very different to what we have now). I have spent a lot of time with CouchDB compared to the others because it seems to occupy a special niche in data storage. It also builds in a REST interface right off the bat which effectively decouples the storage mechanism from the delivery and makes the whole system very language neutral.

CouchDB represents the first serious contact I have had with JSON, and in deciding whether CouchDB is right for your project you really need to understand what JSON does and doesn’t offer. The first important thing is that JSON offers a kind of compromise between the heavy data definition of SQL or XML Schema and the totally typeless data of something like flat files or Amazon SimpleDB. However, to achieve that simplicity there are two consequences. Firstly, data integrity is palmed off onto client applications and you will need to check data going in and coming out; I don’t personally believe that there is ever an application that doesn’t have some explicit or implicit data integrity. Secondly, JSON documents can be very rich and complex but they cannot have detailed field structures: a telephone number is simply going to decompose into two numbers, and it’s hard to get a much finer-grained description of the data. As a rule of thumb, if you would normally create a well-formed XML document for the data, or you would define a SQL data definition that simply declares all the column types using the defaults and never specifies a NOT NULL column, then you probably have a good match.

JSON also favours dynamic languages over statically typed languages like Java. You need to decide whether you are going to be able to take advantage of the full set of features in your chosen client language. XML might remain a better choice for static languages as you describe document instances declaratively.

Other good features about CouchDB include the fact that it is incubating at Apache and will be a good addition to the projects there. It also leverages existing Javascript skills, which are probably more common than knowledge of XPath, XQuery or even XML Schema and Relax NG. It has excellent support for incremental data and also fits well with highly irregular data like CMS page content. It also has an excellent UI that makes it easy to interact with the data. Finally it seems to have a good solution for scalability without doing anything too esoteric.

The negative features are tricky because we obviously have an alpha project here that is probably closer to the start of the public testing phase than the end. Some of these cons may well be addressed before the first release candidate. The first obvious omission for a server product is that there is no security built into the server. Just supporting optional HTTP Authentication would be enough to make it practical to start running some experimental servers for something more than sandbox exercises. One thing that is a major difference to the feature set in the standard pseudo-RDBMS is that there is no way of interacting with sets of data. There is a method for changing multiple documents in one pass but what you really want to be able to do is apply a set of changes to documents identified by a query, the equivalent of SQL’s UPDATE.

Related to this is the fact that data in CouchDB is currently heavily based on silos. If I have a set of data referring to authors and a set of data relating to books then I currently need to duplicate data in both. One of the problems that the developers are going to face in evolving CouchDB is how to address this without introducing a solution so complex that you ask why you are not using an RDBMS? Similarly I notice in the roadmap there is an item about data validation. If you start introducing data validation and rules for validating data then before long there is going to be a question as to why you don’t simply use one of the existing document systems as all the current simplicity will have gone.

One thing that definitely needs improvement is error logging and reporting. Often the only error feedback you have is a Javascript popup that says “undefined” and a log message that tells you that the Erlang process terminated. There needs to be some more human-readable issue logging that points you towards what is going wrong.

Standard
Programming, Scripting

CouchDB: Querying data

CouchDB allows you to pass a map function to a special view URL to query the data in an ad-hoc way. Views can also be stored as JSON documents under a convention URL (_design on the server, accessed as _view by the client) and then obtained via an HTTP request. My functional and Javascript programming are weak, but this is what I understand of writing queries in CouchDB. Let’s take an example of a set of library cards: each card represents a book but the amount of information I have on each book varies.

The basic find all function is this:

function(book) {
  map(null, book)
}

This defines an anonymous function that takes one parameter, the target document, in this case a book, and returns an array of values. What is in the value list is controlled by the second parameter; in this query I return the entire document. The first parameter controls the sorting or ordering. So if I wanted to return the title of all the books in my database I would use:


function(book) {
  map(book.name, book.name)
}

Sorting them by ISBN would go like this:

function(book) {
  map(book.isbn, book.name)
}

One important thing to note is that if a document doesn’t have the value being mapped it doesn’t respond to the function and will not be included. So if I created some of my entries with a field called title instead of name, anything with a title and not a name will not be in the query results. However if I use a non-existent entry as an ordering criterion the value will count as null and be sorted.

Because I can include any valid Javascript in my function I can actually put a lot of complexity into my queries. For example:

function(book) {
  if (book.isbn != null) {
    map(book.name, {"Name": book.name, "ISBN": book.isbn})
  } else {
    map(book.name, book.name)
  }
}

So I suspect this will either make you cheer or puke. What this function does is return a JSON object containing the Name and ISBN of the book if they are known or just the Book name as a String otherwise. Unlike SQL the heading of my query is almost completely arbitrary as long as the value on the right of my map function translates to a valid JSON object.

Now at work there are often a lot of debates as to whether things are “rigid” or “structured”, or whether they are “flexible” or “formless”. It is a bit like the old meat and poison adage. CouchDB allows a client to construct an almost arbitrarily rich response to a query, with almost no restriction on what data should be included in that response. In some cases this is going to allow you to easily interact with very complex unstructured data; in other cases it is going to be an invitation to create a sprawling dataset with no value. There is no inherent right or wrong choice here, but for a particular problem being solved there is probably going to be a wrong and a right choice. SQL is powerful because of the restrictions and rules it builds into its grammar. Using Javascript is powerful because it relaxes those restrictions. Programmers and IT folks in general often fall into using the laxest possible implementation for reasons of “flexibility” but then either have to impose order themselves or lose the power of the more restrictive choice.

So putting that into a concrete example: if I write a view with SQL I am going to have to follow a set of rules to get the data I want (for example my heading is going to have to be a set of tuples of equal size), whereas using an arbitrary script and JSON means I am going to be able to get exactly the data I want in the form I want it. However, since that return structure is customised to my query, I might be reducing my reuse by being over-specific or by building too much logic into my view code.

That’s quite a diversion just so I can say it’s horses for courses, so let’s wrap up this quick look at CouchDB views. All of CouchDB’s views are effectively JSON objects that are passed to a separate view server. This is a separate process that interacts with the main server via STDOUT and STDIN pipes. By default this is the view server that is built from the Spidermonkey library (it is called couchjs). However you can write a view parser for any language and plug it into CouchDB by creating an executable and mapping it to a MIME type in the couch.ini file. The view server essentially parses and readies the query function that is associated with the view and is then sent every document in the database as a JSON string. The view server picks up the results of reading every document and sends that back to the query request.

It is a pretty simple system and it works well for the relatively flat documents I have been trying it with. However I suspect that in a project with multiple developers some ground rules for writing consistent query code would be a must.

Standard
Programming, Ruby, Scripting

CouchDB

What is CouchDB?

CouchDB is a dedicated document-based database, which kind of puts it in the same space as eXist, Xindice and Oracle Berkeley DB XML. What makes it very different is that instead of building around XML it is built around Javascript: the document storage format is JSON and the query language is Javascript. Something like eXist uses a minimum of well-formed XML for data and XPath or XQuery for querying.

Now the merits and flaws of the various markup languages are really a holy war and I don’t want to get into it too much. I think it is enough to say that CouchDB aims to be lightweight in implementation without being lightweight in features or performance. JSON is very popular in the web space, and by focussing on making common cases easier it is less work to use than XML. The more complex the data or specification requirements, though, the less viable it is as a solution.

Using JSON means that stored data is extremely flexible in structure; unlike relational data you can have very gappy or bitty information and not have a problem. Take something like a customer relationship management system. This is often a poor fit for relational data because you tend to discover information in small tranches. Initially a lead might be a first name and a telephone number. As the relationship develops you add a surname, a company name, a position, a collection of issues or queries and so on. JSON is a really good way of capturing this evolving picture of things.
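
A sketch of what that evolving lead might look like as a document (the field names are invented):

// First contact: almost nothing is known.
{
  "type": "lead",
  "first_name": "Jane",
  "telephone": "020 7946 0000"
}

// The same lead a few weeks later: fields appear as the relationship
// develops, with no schema change and no columns full of NULLs.
{
  "type": "lead",
  "first_name": "Jane",
  "surname": "Doe",
  "company": "Example Ltd",
  "position": "CTO",
  "telephone": "020 7946 0000",
  "issues": ["pricing query", "data import"]
}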

Using Javascript to query this data surprised me initially. I feel pretty comfortable with my limited Javascript skills and therefore didn’t have a problem with it as a choice; I’ll come back to this later in the post, but the idea sounds strange and yet is a natural fit when you use it. Again it is about the right tool for the right job, but in the context of the data you are using in CouchDB, creating a query by combining iterators, if statements and the functional programming map function is a better fit than trying to adapt SQL.

The final component of CouchDB is the functional language Erlang, which is used to serve the HTTP REST interface that CouchDB provides to the database. Erlang provides very cheap, lightweight concurrency that is a good fit for handling lots of HTTP requests.

The interesting thing is that CouchDB automatically provides the REST front end that you tend to build on top of Java object stores like JBoss Cache or Coherence (which I have also been looking at recently). With those products you tend to stay native until you need to share data with other languages or systems and then you tend to REST serve it out as XML with a lightweight HTTP server. If you can see that need ahead of time then CouchDB might well be compelling due to the built-in support.

Installing

After reading about CouchDB I wanted to take a look at this new document database in a bit more detail. I managed to install the base tarball from the Google Code page easily. I then cut and pasted the Ruby sample REST code from the CouchDB wiki into irb, and on running the server I found that I could store and retrieve documents really easily. The combination of interactive Ruby and the REST interface made it easy to play around with data objects.

However when trying to run the tests via the administration web page only four of the tests would pass. The only error message was just an Error Code of 136. Since I seemed able to store and pull documents (and could confirm that the datafile had been generated and had data) I put it down to alpha flakiness and pushed on.

It was only when I tried executing ad-hoc queries and kept having them fail that I realised what was linking my problems. Every time the main Erlang server handed off a query to the Javascript interpreter the process died. Erlang is really resilient to thread death so there was zero impact from this.

CouchDB uses the Mozilla Spidermonkey libraries to construct its own View Server program, which is configured via the couch.ini file. Having double-checked that the interpreter was there, was executable and could read the main.js script it was passed, I was baffled. I decided to pull down the code base from SVN and rebuild it. Building from SVN is a bit more involved than the tarfile and it’s worth double-checking the documentation on the wiki before you get stuck in. Having built and installed myself a new copy I still had the same problem and was feeling pretty cheesed off. With no other source of help I headed to IRC and checked out the Freenode #couchdb channel. There Christopher Lenz gave me a good steer: he had had a problem with Erlang’s HiPE (the high-performance native-code compiler) on Debian.

I had been using Erlang 5.5.5, which I think I obtained via apt-get. Downloading the latest Erlang release (5.6.1, they release pretty frequently) I tried building that from source, and it hung when configure tried to check for floating-point exceptions. Running configure with --disable-hipe allowed configure to complete and then Erlang was straightforward to build and install.

Restarting the CouchDB server I found my queries were working and all but one test (“conflicts” fails on an error assertion, which is apparently a known issue) passing, hurrah!

The machine I had been working on was a Feisty Fawn Ubuntu instance, so I loaded up Gutsy Gibbon in VMware Player and tried building Erlang and CouchDB there. Erlang was fine, even with HiPE. For CouchDB I used the latest Erlang I had built and used apt-get to obtain the Spidermonkey and ICU dependencies. Everything worked out of the box with this combination.

This problem highlights a few of CouchDB’s weaknesses. Firstly it is all bleeding edge stuff; Erlang is bounding along and so is CouchDB. Secondly there is no test suite in the source distribution that can help troubleshoot issues. Finally the error logging is tricky to understand for a neonate. I’m certain that having an EUnit test suite would have directed me towards the fact that the Javascript was failing much quicker than trial and error and IRC did.

The final issue was where CouchDB stores its logs and datafiles by default. These default to /usr/local/var which is not really where I want to store data like this. /var is a more natural choice and FHS seems to agree. You can change this when building in the etc/couch.ini.tpl file of the source directory but it would be nicer to have a more natural choice by default.

I also tried to install CouchDB on OSX without using a packaging mechanism but instead referring to an instance of Spidermonkey 1.7 I had on my system. While I could get configure to accept the Javascript library it wouldn’t recognise the jsapi.h header, maybe because configure doesn’t define the right Macros when it tries to build the test file.

Working with CouchDB

My first experience of CouchDB’s web interface was that it was good-looking, but the error messages, unresponsive links, Firebug warnings and occasional Javascript pop-ups were all signs that it is a work in progress. On fixing my passthrough to Javascript I had a very different experience: a slick, good-looking interface that uses amazingly responsive AJAX to provide a really good experience.

Learning how to generate queries for example is easy with the in-built query browser (I’ll save the actual syntax for a later post) as the feedback from the system is very quick. Similarly creating new databases, documents and fields is actually not torturous but slick and quick.

I’m certain this is because Erlang and AJAX form a natural partnership in creating and servicing small requests that need to be handled quickly. I may be wrong but this is certainly my most positive AJAX webservice experience to date.

The CouchDB wiki contains information on getting started with various languages and I chose Ruby. It was a simple cut and paste into irb and then it was straightforward to interact with the database from the shell and the web client.

One of CouchDB’s more unusual features seems to stem from Erlang’s concurrency model. Data is assigned a revision, and in addition to applying multiple changes in order you can also view previous revisions. That’s a pretty weird feature compared to most of the current datastores. You also need to refer to a revision if you want to update a record: if the revision of the target has changed by the time you submit your changes, your changes get refused. The revision mechanism is also the basis of the replication mechanism, although there isn’t enough documentation to understand when the replication pairs check their revision ids.
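
Over the REST interface that flow looks roughly like this (a sketch in Javascript using fetch; the database, document and field names are made up):

// Run as an ES module against a local CouchDB; names are illustrative.
const BASE = 'http://localhost:5984/books';

// Every read of a document includes its current _rev.
const doc = await (await fetch(`${BASE}/some-book`)).json();

// An update must carry the revision it was based on (kept via ...doc here)...
const res = await fetch(`${BASE}/some-book`, {
  method: 'PUT',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ ...doc, isbn: 'isbn-goes-here' }),
});

// ...and is refused if someone else updated the document first.
console.log(res.status); // 201 on success, 409 Conflict if the _rev is stale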

Incremental data construction is easy via the web GUI, but you need to fiddle a little bit to get the revision number to target if doing it programmatically. Presumably a library would make that easier going. For information that has little inherent structure or is very gappy or incremental, CouchDB is a great data store, and as far as I know it is currently occupying a niche of its own.

Standard
Software, Work

Don’t hate the RDBMS; hate the implementation

I read through this post with the usual sinking feeling of despair. How can people get so muddled in their thinking? I am not sure I can even bear to go through the arguments again.

More programmers treat the database as a dumb store than there are situations where such treatment is appropriate. No-one is saying that Twitter data is deep, relational and worthy of data mining. However not all data is like Twitter micro blog posts.

The comments on the post were for the most part very good and say a lot of what I would have said. However, looking at the CouchDB documentation I noticed that the authors make far less dramatic claims for their product than the blog post does. A buggy alpha release of a hashtable datastore is not going to bring the enterprise RDBMS to its knees.

I actually set up and ran CouchDB, but I will save my thoughts on that for another day; it’s an interesting application. What I actually want to talk about is how we can get more sophisticated with our datastores. It is becoming apparent to me that ORM technologies are really making a dreadful hash of data. The object representation is getting shafted because inheritance is never properly translated into a relational schema. The relational data is getting screwed by the fact that the rules for object attributes are at odds with the normal forms.

The outcome is that you end up with the bastard hybrid worst of all worlds solution. What’s the answer?

Well the first thing is to admit that application programmers think the database is a big dumb datastore and will never stop thinking that. The second is that relational data is not the one true way to represent all data. Relational databases are the best tool we have at the moment for representing rich data sets that contain a lot of relational aspects. Customer orders in a supply system are the classic example. From a data mining point of view you are going to be dead on your feet if you do not have customers and their orders in a relational data store. You cannot operate if you cannot say who is buying how much of what.

If you let developers reimplement a data mining solution in their application for anything other than your very edge and niche interests then you are going to be wasting a lot of time and money for no good reason. You simply want a relational datastore, a metadata overlay to reinterpret the normalised data in terms of domain models and a standard piece of charting and reporting software.

However the application programmers have a point. The system that takes the order should not really have to decompose an order taken at a store level into its component parts. What the front end needs to do is take and confirm the order as quickly as possible. From this point of view the database is just a dumb datastore. Or rather what we need is a simple datastore that can do what is needed now and defer and delegate the processing into a richer data set in the future. From this point of view the application may store the data in something as transient as a message queue (although realistically we are talking about something like an object cache so the customer can view and adjust their order).

Having data distributed in different forms across different systems creates something of a headache, as it is hard to get an overall picture of what is happening on the system at any given moment. However creating a single datastore (implemented by an enterprise RDBMS) as a single point of reference is something of an anti-pattern. It makes one thing easier, the big picture; however, to provide it the data is being bashed by layering technologies into all kinds of inappropriate shapes, and various groups within the IT department are frequently in bitter conflict.

There needs to be a step back: IT people need to accept the complexity and start talking about the whole system comprising many components, all of which need to be synced and queried if you want the total information picture. Instead of wasting effort fitting however many square pegs into round holes, we need to be thinking about how we use the best persistence solution for a given problem and how we report on and coordinate these many systems.

It is the only way we can move forward.

Standard
Java, Software

Setting up a Derby Database

Okay, so I am currently setting up a number of databases on my own V-Server and I thought it would be helpful to make a few notes about the current Java databases and how easy they are to run in server mode. The two biggest databases are Derby and HSQLDB; this post is about Derby. Both databases are really geared towards the embedded space and generally focus their documentation and tasks around this. In their default setups they are both pretty insecure for running on a public network; neither, for example, defaults to using passwords to authenticate logins.

All the databases are getting setup in the same basic way. Each database is going to be run in the userspace of a dedicated UNIX user and that user is meant to be accessed via SSH and SSH Keys. All the datafiles will reside within the user’s space. I’m not going to go into that as it is fairly straightforward and is probably better covered by the various UNIX admin guides you can find around the internet.

So Derby first then, well I have played around with Derby before so I hoped this would be the easiest. One thing that you need to know about Derby is that its documentation is good but it is organised into several different documents… by a madman. There is no apparent logic, rhyme or reason to the way information is included in one document or another. When setting up the server I had to jump between the Developer’s Guide, the Tuning Guide, the Server Guide and more unusually the Tool Guide. If you have the same experience then don’t worry, you’re probably doing it right.

Having downloaded the Derby package, unpack it in the root of the user’s home directory. I find it helps with upgrades to create a symbolic link ~/derby to the actual installation directory. You can also create a profile environment variable DERBY_HOME that points to $HOME/derby.

With that all set up you can simply run ~/derby/startNetworkServer and you should be in the basic network server business. This is where it gets a bit more complicated: Derby will look for its configuration in the directory where you start it, not in DERBY_HOME, so you need to create a derby.properties file in $HOME to configure the behaviour of the server.

To require authentication you need two properties in the properties file. Remember that the format of this file is a standard Java properties file. The two properties are:

  • derby.connection.requireAuthentication=true
  • derby.authentication.provider=BUILTIN

You should now restart the server. Now when you want to interact with the database you should need to provide a username and password. You have to check this though to make sure it works.

Run the ij tool from the command line: try to create a new database with something like the following

connect 'jdbc:derby:test;create=true';

You should get a message saying something like ‘Invalid authentication’. If the database is actually created then the database is probably not picking up your properties file, which in turn is likely to be an installation issue.

All being well we can now add a user to the properties file. User information is also a property so to make the file manageable you probably want to use comments to divide up the file into relevant sections and try and keep all the user setup in the same place. The format of the user entry is derby.user.USERNAME=PASSWORD. So for example

derby.user.test=changeme
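
Pulled together, the relevant part of derby.properties ends up looking something like this (using the example user above):

# Require username/password authentication using the built-in provider
derby.connection.requireAuthentication=true
derby.authentication.provider=BUILTIN

# Users: one derby.user.USERNAME=PASSWORD property per user
derby.user.test=changeme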

Okay, now restart the server and log in via ij again. This time use a connection URL like the one below.

connect 'jdbc:derby:test;create=true;user=test;password=changeme';

This should now create a new database under $HOME and give you access to it as normal. Drop out of ij and confirm that the test directory has been created.

That is pretty much that now, from here on in you can connect to your server database instance as you would any other database server.

Written like this the setup must seem very basic and quick to perform, which is good. However when I was finding my way with it, it took over an hour and most of that time was taken up with looking for things in the documentation. Nothing is logically organised with Derby, for example the configuration properties are all described in the Tuning Guide. Except the network control properties which are described in the Admin’s Guide.

Unlike with code, DRY is not a good idea in documentation unless you are going to very rigorously modularise it. Even then, why not repeat something if it means that all the information relevant to a task is in one place?

Derby has a great feature list but I wonder how many users are ignorant of what it can do because of the poor documentation?

Standard