Software, Work

Don’t hate the RDBMS; hate the implementation

I read through this post with the usual sinking feeling of despair. How can people get so muddled in their thinking? I am not sure I can even bear to go through the arguments again.

More programmers treat the database as a dumb store than there are situations where such treatment is appropriate. No one is saying that Twitter data is deep, relational and worthy of data mining. However, not all data is like Twitter microblog posts.

The comments on the post were for the most part very good and say a lot of what I would have said. However, looking at the CouchDB documentation I noticed that the authors make far less dramatic claims for their product than the blog post does. A buggy alpha release of a hashtable datastore is not going to bring the enterprise RDBMS to its knees.

I actually set up and ran CouchDB, but I will save my thoughts for another day; it’s an interesting application. What I actually want to talk about is how we can get more sophisticated with our datastores. It is becoming apparent to me that ORM technologies are making a dreadful hash of data. The object representation gets shafted because inheritance is never properly translated into a relational schema, and the relational data gets screwed because the rules for object attributes are at odds with the normal forms.

The outcome is that you end up with a bastard hybrid, worst-of-all-worlds solution. What’s the answer?

Well, the first thing is to admit that application programmers think the database is a big dumb datastore and will never stop thinking that. The second is that relational data is not the one true way to represent all data. Relational databases are, however, the best tool we have at the moment for representing rich data sets that contain a lot of relational aspects. Customer orders in a supply system are the classic example. From a data mining point of view you are going to be dead on your feet if you do not have customers and their orders in a relational datastore. You cannot operate if you cannot say who is buying how much of what.

If you let developers reimplement a data mining solution in their application for anything other than very narrow, niche interests, then you are going to waste a lot of time and money for no good reason. You simply want a relational datastore, a metadata overlay to reinterpret the normalised data in terms of domain models, and a standard piece of charting and reporting software.

However the application programmers have a point. The system that takes the order should not really have to decompose an order taken at a store level into its component parts. What the front end needs to do is take and confirm the order as quickly as possible. From this point of view the database is just a dumb datastore. Or rather what we need is a simple datastore that can do what is needed now and defer and delegate the processing into a richer data set in the future. From this point of view the application may store the data in something as transient as a message queue (although realistically we are talking about something like an object cache so the customer can view and adjust their order).

Having data distributed in different forms across different systems creates something of a headache, as it is hard to get an overall picture of what is happening on the system at any given moment. However, creating a single datastore (implemented by an enterprise RDBMS) as a single point of reference is something of an anti-pattern. It makes one thing easier: the big picture. To provide it, however, data is being bashed by layering technologies into all kinds of inappropriate shapes, and various groups within the IT department are frequently in bitter conflict.

There needs to be a step back: IT people need to accept the complexity and start talking about the whole system as comprising many components, all of which need to be synced and queried if you want the total information picture. Instead of wasting effort fitting however many square pegs into round holes, we need to think about how to use the best persistence solution for a given problem and how to report on and coordinate these many systems.

It is the only way we can move forward.

Standard
Ruby, Scripting

Dumb Ruby Gems mistake

So on and off over the last week I have been trying to get RedCloth to work in both Ruby and JRuby. Despite installing the gem I kept getting a failure when I tried to require ‘redcloth’. When I eventually got Textile support working in Java before I had managed to weave the two or three lines of Textile magic the tutorials promised, I shrugged and took the task off my GTD list.

Except that today my stubborn side told me to take a look at the RubyGems documentation after seeing an answer to a similar problem on a mailing list. Sure enough, there in Chapter 3 was the explanation that you need to require RubyGems before the actual gem you want to use.

Heading to irb and sure enough:

require 'rubygems'
require 'redcloth'

Does the trick, and I am finally appreciating the library. Now maybe I’m slow, or perhaps the sample code on the RedCloth page should have included a full script that would work regardless of what was in your RUBYOPT environment variable.

Standard
Scripting

Discovering Scala

Over the last two days I have been giving Scala a quick whirl and I have to say that, out of all the new languages (functional and otherwise) I have been looking at recently, only Erlang is anywhere near as exciting. There is a small fifteen-page tutorial, and after finishing it I immediately thought: “I want to see more of that”.

I can’t put my finger on it at the moment, but I imagine it’s the fact that Scala seems more familiar due to my Java background. The functional syntax is as confusing as in any other functional language.

P.S. If you go through the short fifteen-page tutorial, there is a slight blip when case classes are introduced. All the code for the calculator example should be wrapped in an object block if you want it to work, i.e.

object Calculator {
... tutorial code
}

Standard
Web Applications

Google Sites: What I would like

I am not going to deny that the basic Wiki functionality is all there but there are a few things that I would like to see. Now I know it is a free service but I would actually be prepared to pay for some of this on a WordPress model (i.e. buy what you want when you need it).

  • Categories or page tagging
  • The ability to send a page to Google Docs or alternatively the ability to export to external formats as you can for a Google Document.
  • The comments section should collapse if there are no comments; ditto the attachments. If something does not apply to my page then I just need a simple text link to add it.
  • Most of all, proper HTML markup rather than a massive paragraph consisting of my whole page, with a BR tag… if I’m lucky. If I want to reuse my content then I cannot export it, and it’s not even decent HTML. Blogger is no better but, crazily, Google Pages does the right thing!

As with Blogger I think Google is having some major integration issues with all the companies it has bought up. If one of the applications or systems has a cool feature then I think it is natural to assume that it will be available in all Google branded applications.

Standard
Programming

Compiling under Leopard

Since I upgraded my monster MacBook Pro I really haven’t looked back. I’ve suddenly started getting software by checking out the SVN repository and compiling it because it would be quicker than downloading an archive. Surely this is what a 64-bit operating system is meant to be like. It is giving me a warped view of the world and I now wish I hadn’t foolishly opted for a Dell at work. I could probably run a Virtual Vista quicker under Leopard than I could if it were running natively.

The MacBook Pro genuinely feels like a desktop replacement. I’m not sure I would go back to the box and screen for development now. Even the battery life feels like it’s been extended. It might even be responsible for these cartoon birds that keep singing all over the place.

OSX 10.5.2? An OS that actually seems to reflect its codename.

Standard
Groovy, Programming, Scripting, Work

Using Groovy to create XML Schemas

I have previously found XML Schema to be completely invaluable in defining interface points between systems. Normally file interfaces between systems are done in formats that are deceptively simple: CSV, structured text files. However in nearly all cases the initial simplicity tends to lead to a lot of problems. There is the issue of how to escape characters within your fields, particularly the field separator. The free text field is often used as exactly that: free text. Something is supposed to be a number field… until the letter X appears in it. Historical CSV is the worst, as often the exact meaning and origin of each of the fields is undocumented and the meaning lost. I have even come across CSV generators that map meaningless constants to the output just to keep the number of fields the same. The receiving systems ignore those same fields, or sometimes even hinge workflow off a value that will never vary in practice. The whole thing ends up being a nightmare.

Introducing an XML Schema reduces that nightmare but does bring in a lot more complexity. Being able to specify the type and order of the fields comes at a price. Previously when I have wanted to develop a new schema I have simply used the Xerces tools at the command line and an XML editor to generate both the Schema and a sample datafile. It works but it is quite laborious. Speeding this up would be great as often the point of capturing the complexity in the data transfer is so that the business or the architects can see the complexity of the integration and decide that they really want to do it before a lot of code gets written to integrate the systems.

Looking through the Groovy website I came across this example of how to validate an XML document, and an idea was sparked. The multi-line string syntax is a neat feature (borrowed from Python, I think) and is (to my mind) a more elegant solution than the Ruby/Perl heredoc syntax. It would allow me to define my schema, my sample document and my validation code in the same file. During iterations I would be more productive, and when the interface is captured I just publish the final schema.

So I’ve knocked together a simple PoC and it seems to work pretty soundly. The easiest way to work with it is from the sample document to the Schema, but the TDD approach is to define the Schema and work back from the validation errors. The latter approach tends to avoid the situation where you end up validating your test document rather than your document template.
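The Groovy PoC itself isn’t shown here, but the underlying JVM machinery it relies on is the standard javax.xml.validation API, so the same idea can be sketched in plain Java, with text blocks standing in for Groovy’s multi-line strings. The order/quantity schema below is a hypothetical example, not the real interface:

```java
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.StringReader;

public class SchemaSketch {

    // The schema lives in the same file as the sample document,
    // which is the whole point of the multi-line string approach.
    public static final String XSD = """
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <xs:element name="order">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="quantity" type="xs:integer"/>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:schema>
        """;

    public static final String SAMPLE = """
        <order><quantity>3</quantity></order>
        """;

    // Compile the schema and validate the document against it;
    // any validation failure surfaces as a SAXException.
    public static boolean validates(String xsd, String doc) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(doc)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(validates(XSD, SAMPLE)); // true
        // The classic CSV failure mode: a letter X in a number field.
        System.out.println(validates(XSD, "<order><quantity>X</quantity></order>")); // false
    }
}
```

Working TDD-style, you would tighten the schema first and let the validation errors drive corrections to the sample document, rather than the other way round.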

Standard
Software

Running Virtual Ubuntu

So at work I need access to a personal UNIX playground, and the form that you have to fill in to get a licensed VMware Fusion instance is a nightmare, so I decided to look at the alternatives. I already had Parallels installed on my MacBook Pro but I had not done anything with it. I also decided to try to get Ubuntu running on my Windows Vista machine using the free (to download) VMware Player.

VMware Player requires a special image (I used this one); however, once the software and the image were downloaded (the images are sensibly torrented, although the player software itself does not seem to be), getting the system running was extremely easy. You just click on the image, it loads up, and you update within Ubuntu as normal.

Getting Parallels working was not as easy. I tried a standard DVD from a Linux magazine; that failed with an X error where the X window could not be started. So I downloaded a text-based installer and ran through that. It had the same problem, and after reading this item in the Parallels Knowledge Base I took a guess at the cause and set the resolution during the text installation to 1024 by 768. That sorted the issue, and after that the major problem was networking. The Parallels installation did not seem able to share my wireless connection. Once I connected my Ethernet cable the instance updated fine. Oddly, once I set the VM to use Shared networking I could use the wireless connection, but only by counter-intuitively setting the Ubuntu instance to use the wired connection. I guess at that point Parallels was able to weave a little magic and make the connection available, and the question of whether the physical hardware was Ethernet or wireless was completely irrelevant.

Both systems run their virtual machines very quickly, but VMware Player seems better suited to rapidly stopping and starting the machine. It works pretty much like a normal application: you fire it up and close the window when you are done. The Parallels application is much less seamless. Both applications use a similar amount of space to save their state, though VMware Player perhaps runs a bit fatter in my experience.

VMware Player is pretty amazing for a product that is offered for free and is definitely a well-done teaser product. If you have never run a virtual machine before I would definitely recommend giving it a spin. Parallels is a slick and excellent program, but its focus on running Windows under OS X seems to have left it unable to provide a trouble-free installation experience for the leading desktop Linux distribution of today. That is a big mistake, and even Parallels’ relatively low price tag of £50 to £60 does not excuse it. Some things should just work. After all, at some point you are going to appreciate having the flexibility to install an OS how you like, and at that point you may be more tempted to upgrade your existing solution than switch to a new application altogether.

Standard
Web Applications

Try our new, new services!

So on Friday not one but two long-awaited beta service invitations arrived. The first was the announcement of the addition of JotSpot to Google Apps (finally) and the other was about the Amazon SimpleDB service. Typical buses…

I didn’t have a lot of time this weekend, so I plumped for signing up for Google Apps and trying the new wiki functionality, as I was hoping for a beefed-up version of Pages. The SimpleDB service also needs me to beef up my web service scripting fu.

It is too early to say much about either service, but after signing up for a Google Apps account (apparently you cannot simply drive one off your regular Google Account), I was slightly underwhelmed by the new Google Sites service. It has taken how long to make a basic and acceptable wiki service available?

Still you can have a lot of separate wiki sites and you have a lot of flexibility on how you share and collaborate on them so maybe I need to build up some content first and then try to share it around. I would like to know whether you can hook Analytics up to some Sites content. That would be useful for some of the content that otherwise would go on something like a WordPress page.

Standard
Computer Games

Audiosurf

Can you imagine a game that combines Guitar Hero, Trackmania, Wipeout and some surreal media player style music visualisation? Well if you can’t then you could just download Audiosurf to see just what that would look like. Combined with Steam this game is just all kinds of awesome.

In addition to the fantastic visualisation techniques there are also two important things going on here. There’s the social networking aspect of who is playing what music and how well they are doing. There is also the extreme personalisation as you are listening to your own music.

As a tech note, if your game keeps saying that the demo has expired when you have actually purchased it, just restart the Steam client.

It is also another example of how well Steam Achievements can be used to set challenges and keep people playing. I expect all games to ape HL2 E2’s achievements strategy in the coming years.

Standard