Programming

Keep the focus on the read

One of the interesting things about Wazoku‘s Startup Challenge app is that a lot of the functionality is created via “out of the box” CouchDB features. In fact it is often where we haven’t lent heavily on the features of our store and frameworks where we have issues.

One of the interesting things we decided to do in the app relatively late in the day was provide little encouragements to say how many more votes an entry needed to get to the next place in the ladder. As this was a late feature we didn’t really think through where this feature would sit. We had code that re-ranks entries when their vote ordering changes and so when an entry was being re-ranked it also acquired the target to beat at the same time.

With a store like CouchDB you are really aiming to keep on reading data and minimising writes. That’s via denormalisation and also about strategies to generate related and derived data when you are changing the parent data.

So this placement made sense from that point of view. It was only later that I have begun to realise that we were choosing the wrong point to read. With hindsight it is actually only necessary to calculate the target entry when someone looks at the entry. This is because the views of the entries are distributed unequally and the vote totals already exist as a CouchDB view and therefore we can do a key lookup to find all entries with more votes than the current entry when needed.

If we wanted to cache that result to avoid needless recalculation we would be better off storing the information in front-side cache like Memcached or Redis but in practice key reads in CouchDB are pretty damn fast and low load.

So we thought we were saving ourselves problems by denormalising derived data but in fact we were creating a lot more work at a point where it is uncertain that the additional data will ever be consumed.

Sometimes it can be hard to pick the right point to read!

 

Standard
Programming

How do I query data with CouchDB?

This question comes up a lot when dealing with Couch and I have given various answers before but my latest answer is simply that you don’t. In reality what you want to do in Couch, like a lot of the NoSql databases, is look for key lookups.

Now the key lookup may be a range of keys you are interested in but in reality there is nothing in Couch that is similar to the SQL “WHERE” clause.

So if you cannot do queries then how do you relate data? Well that’s the thing about storing documents instead of rows, if you have related data then you have to ask whether that data has any meaningful existence outside of its parent. In relational terms it is like asking whether you ever access the content of table outside of JOIN with its parent.

Initially you might think: of course I do! But often data is often explicitly related to its parent’s primary key by things like ORDER and GROUP BY. In these kind of cases then you move the related data into the parent record, effectively denormalising to avoid a lookup.

If the data does have a meaningful existence outside the parent (for example in Wazoku comments are an example of a piece of data that exists separately from the thing they are a comment on) then you have a few options but essentially instead of querying you are still trying to do a direct key lookup.

The first simple case is to include a reference to key of the related data in the associated document. Then from one key lookup you can go direct to the next. As an example we store a list of comment document ids on any document that can be commented on and then we can load the comments as needed (often the count of the comments can be as relevant as the full content). I describe the ids used this way as “forward references” as they lead you on to the related document.

The second, slightly more involved approach, is the creation of a view that allows the document to be looked up via an alternative key. For example if we store the document id of the thing being commented on in the comment document under the key comment_on we can then create a mapping view of all comment documents to their comment_on key. Then given any document we can simply do a direct lookup on the key in the view to determine whether it has any associated comments.

The final common technique I use is something I refer to as “unrolling” of collections. So again we create a CouchDB view that consists just of a map job and in it we take each item in an array of “forward references” (related document ids) and emit a document in each view mapping the id to the current document id.

So if an idea document has five comment forward references the resulting view will have five documents, each relating a comment document id to the idea document id.

If things get more complicated then I also have the Couch databases indexed in Elasticsearch and in Neo4J and these alternative views of the data give me powerful adhoc queries on properties or relationships in the data.

In general though I am always trying to think ahead as to how my documents relate and then express that in terms of a key lookup so that I am always working with the simplest case.

Standard
Programming, Python

Django and JSON stores: a match in heaven

My current project is using CouchDB as its store and Django to implement the web frontend. When you have a JSON store such as CouchDB then Python is a natural complement due to its brilliant parsing of JSON into native data structures and its powerful dictionary data type that makes it really easy to work with maps.

In a previous project using Python and Mongo we used Presentation objects to provide domain logic on top of the raw data maps but this time around I wanted to try and cut out a layer and just work with maps as much as possible (perhaps the influence of the Clojure programming I’ve been doing on the side).

However this still leaves two problems that need addressing. Firstly Django templates generally handle dictionaries like a dream allowing to address them with the standard dot syntax. However both Mongo and Couch use leading underscores to indicate “special” variables and this clashes with the Python convention of having a leading underscore indicate a private member of the class. The most immediate time you encounter this is when you want to use the id of a document in a url and the naive doc._id does not work.

The other problem is the issue of legitimate domain logic. In Wazoku we want to use people’s names if they have supplied them and fallback to their email if they haven’t supplied their name.

The answer to both of these problems (without resorting to an intermediary object) is Django’s filters. The necessary logic can be written in a few lines of Python that simply examines the dictionary and does the necessary transformation to derive the id or user’s name. This is much lighter than the corresponding Presentation solution.

Standard
Programming

Thoughts on Neo4J REST server 1.3

So the new Neo4J GPL release 1.3 (with added REST) is the release that kind of moves Neo4J out of its niche and into the wider arena. There are less barriers to experimenting with it in production situations and now it is no longer necessary to embed it you don’t have any language barriers, it can just be another piece of infrastructure.

Last week I used the new release to prototype an application and these are my observations. The big thing is that the REST server is easy to use out of the box and the API is well-thought out and practical and it looks like there’s some nice wrappers already for various languages (I used this one for Python).

However compared to other web-based stores there are still a few wrinkles. The first is that it would be nice to clearly be able to mark nodes that you consider “roots” in the graph. It’s almost the first thing you want to do and there should be a facility to support it in the API.

Indexing is still cronky, I think there is a reasonable expectation that Neo4J should be able to have a dynamic index that indexes the properties of nodes as they change without having to explicitly add the indexable fields to the index. I don’t mind if I need to do some explicit work to switch on this mode but having to manage it all explicitly is a huge pain. Particularly because you don’t have any root node support so you are almost inevitably going to want to search before moving over to relationship following.

It would be nice to have list and map properties on nodes, at the moment we’re getting around that by using JSON in the property string but again it feels like a value add that’s necessary if you want to use Neo4J for a big chunk of your data processing.

This is a personal peeve but I do think it would be nice to be able to override the id of a node. One usecase I can definitely think of is when you’re translating a document database into a graph form and you want to cross-reference the two (particularly if you’re using the richer data on the document rather than tranferring it into the node).

One other feature that is a bit strange (but I might be doing this wrong) is that the server seems to correspond to just one database. It would seem pretty easy to incorporate a database-style data-partition name into the REST scheme and allow one server to serve many different databases.

Now that the GPL version is available I am also looking forward to Neo4J entering the Linux package distributions in the same way that CouchDB has become a ubiquitous deployment choice.

So in short Neo4J has left it’s niche but it still hasn’t fully left the ghetto but I have high hopes for the next release.

Standard
Programming

Learning to work atomically

I have been doing a lot of work with MongoDb recently and I have made a few noob mistakes despite being relatively well-grounded in the theory. One of the key mistakes I have made using the Java driver is to not have the driver in the right mode. By default the driver will not block on an insert, you need to be in Safe Mode for that to happen.

What is the impact? Well if you are trying to update a record that you have just inserted and the update neither fails nor is applied then chances are that the update failed to find the record you had just inserted because it wasn’t there when the update query ran. Of course a few milliseconds later it appeared and is there are the end of the batch process.

Updates in Mongo consist of a query and a data change operation and there is an art in getting the query to work on the set of data you want it to. I find myself doing a conditional match in Scala and then thinking “at this point is that still going to be valid?” and then going back tweaking the query so that the update is guaranteed to be valid at the point it happens.

Today I spent a lot of time buggering about trying to avoid writing keys in the document that held no data, after doing it I realised that I could have just written a single remove statement that would have removed the empty keys in one big cleanup after the data had been stored.

Atomic independence also means losing some things that we take for granted like sequence ids. People like numbers but guaranteeing even ascending values can rapidly become a nightmare if you want to avoid contention and single point of failure.

Cursors are similarly tricksy, I have a long-running batch job and I realised today that it runs long enough that you cannot guarantee a known state by the time it finishes. Instead you have to do these kind of “loop until there’s nothing left to do” constructs where the loop condition expresses the state of the store you are trying to achieve and you get at least one cursor that has no entries.

There’s a lot of stuff about datastores that is ingrained deeper than you realise and it takes more than one difficult experience to start genuinely thinking differently about things.

Standard
Programming

Redis 2, worth the hype

So I’ve flirted a little bit with Redis but never really had anything that fitted it’s solution profile, until this week!

I needed to cross reference postcodes and their corresponding longitude and latitudes. I tried a few other solutions but in the end I decided that a postcode was a nice normalisable key (it just needs the country details adding in) and that since I had thousands of records to relate, I really valued speed above everything else.

Redis (in conjunction with the Python client library) tore through the data set in terms of both inserting approximately 1.7M records and reading roughly 300K entries. I also liked using the hash functionality to store the long and lat under the same key rather than having to write my own logic to create the pairing.

Redis really helped solved my problem and lived up to its promises. It definitely has a place in my toolkit from now on.

Standard
Programming

Why do you think Riak is a column-store?

Someone is wrong on the internet, the unfortunate thing is, that person is me. In one version of my NoSql article for ThoughtWorks I put Riak under the heading of a column store. Twitter feels this is wrong and that it is better classified as a distributed key-value store. I agree this description is more accurate than mine but it feels like calling limes, lemons, oranges and grapefruit citrus fruits.

When I was putting together the article I knew I would have to look at BigTable and Dynamo derivatives that were generally available. Now BigTable self-identifies as a column store where as Dynamo describes itself as a distributed key-value store. However both systems have a lot of properties in common:

  • They are peer-aware
  • Fault-tolerant
  • Distributed
  • Can be scaled to demand

Now for me these common characteristics also explain why something like Redis isn’t the same as Riak and it’s an over-simplification to just call Riak and Cassandra distributed key-value stores.

So I saw this similarity between the situations (and here’s where the mistake occurs) and thought: well you know if you look at things like Riak that allow metadata on their bucket you can consider them to be a column-store because the columns are the keys on the metadata and the bucket is actually just a key of a special type. Each key can have a variable number of columns but must have one, the bucket. Then when you query the metadata you are kind of just doing a column sort and search. Brilliant! I can just lump all these different systems under this term.

Now I don’t want to jump to another label, I want to try and get the description of the group more or less correct. Looking at the things that are common I feel that something along the lines of “P2P clustered datastores” might be more accurate.

Standard