Data wrangling with Clojure

Clojure is a great language for wrangling data that is either awkwardly-sized or where data needs to be drawn from and stored in different locations.

What does awkward-sized data mean?

I am going to attribute the term “awkward-sized data” to Henry Garner and Bruce Durling. Awkward-sized data is neither big data nor small data; to avoid defining it purely by what it is not, I would say it is bigger than would fit comfortably into a spreadsheet and irregular enough that it does not map easily onto a relational schema.

It is hundreds of thousands of data points rather than millions: data sets that fit into memory on a reasonably specified laptop.

It also covers data that needs to be reconciled between multiple datastores, something that is more common in a microservice or scalable-service world, where formerly monolithic data is distributed across more systems.

What makes Clojure a good fit for the problem?

Clojure picks up a lot of good data-processing traits from its inheritance as a LISP. A LISP, after all, is a “list processor”: the fundamental structures of the language are data, and its key functionality is parsing and processing those data structures into operations. You can serialise data structures to a flat file and back into memory purely through the reader, without the need for parsing libraries.
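As a minimal sketch of that round trip, here is a data structure serialised to text with the printer and read straight back with the EDN reader (the record contents are invented for illustration):

```clojure
;; Round-trip a Clojure data structure through plain text using
;; only the printer and the reader -- no parsing library required.
(require '[clojure.edn :as edn])

(def records [{:id 1 :name "Ada"} {:id 2 :name "Grace"}])

;; Serialise to a string (spit this to a flat file if you like) ...
(def serialised (pr-str records))

;; ... and read it straight back into an identical structure.
(def restored (edn/read-string serialised))

(= records restored) ;; => true
```

In practice you would `spit` the string to a file and `slurp` it back, but the principle is the same.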

Clojure has great immutable data structures with great performance, a robust set of data-processing functions in its core library (along with parallel versions of many of them), and well-defined transactions on shared state. It is, unusually, lazy by default, which means it can do powerful calculations with a minimal amount of memory usage. It also has a lot of great community libraries, plus Java compatibility if you want to use an existing Java library.
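Laziness is worth a small illustration. In this sketch a pipeline is declared over ten million numbers, but only the handful of elements that `take` demands are ever realised:

```clojure
;; A lazy pipeline over ten million numbers: nothing is computed
;; until `take` asks for it, so memory use stays minimal.
(def results
  (->> (range 10000000)
       (map #(* % %))      ;; square everything (lazily)
       (filter even?)      ;; keep the even squares (lazily)
       (take 5)))          ;; only now do five values get realised

results ;; => (0 4 16 36 64)
```

Swapping `map` for `pmap` is often all it takes to spread a CPU-heavy step across cores, though for a cheap function like this the coordination overhead would outweigh the gain.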

Clojure also has an awesome REPL which means you have a powerful way of directly interacting with your data and getting immediate feedback on the work you are doing.

Why not use a DSL or a specific datastore?

I will leave the argument for why you need a general-purpose programming language to Tommy Hall; his talk about cloud-infrastructure DSLs is equally relevant here. There are things you will reasonably want to do, and you can either add them all to a DSL until it has every feature of a poorly thought-out programming language, or you can start directly with the programming language.

For me the key thing that I always want to do is read or write data, either from a datastore, file or HTTP/JSON API. I haven’t come across a single data DSL that makes it easier to read from one datastore and write to another.

Where can I find out more?

If you are interested in statistical analysis a good place to start is Bruce Durling’s talk on Incanter which he gave relatively early in his use of it.

Henry Garner’s talk Expressive Parallel Analytics with Clojure has a name that might scare the hell out of you but, trust me, this is actually a pretty good step-by-step guide to how you do data transformations and aggregations in Clojure and then make them run in parallel to improve performance.

Libraries I like

In my own work I lean on the following libraries a lot.

JSON is the lingua franca of computing, and you are going to need a decent JSON parser and serialiser. I like Cheshire because it does everything I need, which is primarily to produce sensible native data structures that stay as close to the original JSON structures as possible.
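A quick sketch of what that looks like, assuming Cheshire is on your classpath (the JSON content here is invented):

```clojure
(require '[cheshire.core :as json])

;; JSON text becomes plain Clojure maps and vectors; the optional
;; `true` argument keywordises the keys.
(json/parse-string "{\"name\": \"Ada\", \"tags\": [\"pioneer\"]}" true)
;; => {:name "Ada", :tags ["pioneer"]}

;; And native data structures serialise straight back out.
(json/generate-string {:name "Ada"})
;; => "{\"name\":\"Ada\"}"
```

Because the parsed result is an ordinary map, everything in clojure.core is immediately available to transform it.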

After JSON, the other thing I always need is access to HTTP. When you are mucking around with dirty data, the biggest frustration I have found is libraries that throw exceptions whenever they get a status code other than 200. clj-http is immensely powerful, but you will want to switch off exceptions. clj-http-lite uses only what is in the JDK, which makes for easier dependencies; again, you will need to switch off exceptions. Most of the time the lite library is perfectly usable, and if you are just using well-behaved public APIs I would not bother with anything more complicated. For an asynchronous client there is http-kit. If you want to make simultaneous requests, async can be a great choice, but most of the time it adds a level of complexity and indirection that I don’t think you need. You don’t need to worry about exceptions with it, but do remember to add a basic error handler to avoid debugging heartache.
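Switching off exceptions is a one-line option in clj-http. A sketch, with a made-up URL:

```clojure
(require '[clj-http.client :as client])

;; With :throw-exceptions false a 404 or 500 comes back as an
;; ordinary response map instead of blowing up mid-pipeline.
(def resp
  (client/get "https://example.com/maybe-missing"
              {:throw-exceptions false}))

;; Now you can inspect the status rather than catch an exception.
(case (:status resp)
  200 (:body resp)
  nil)
```

clj-http-lite takes the same option, so the pattern carries over unchanged.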

For SQL I love yesql because it doesn’t do crazy things: it lets you write and test normal SQL and then use it inside Clojure programs. In my experience this is what you want to do 100% of the time, rather than using some weird abstraction layer. While I will admit to being lazy and frequently loading the queries into the default namespace, it is far more sensible to load them via the require-sql syntax.
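The shape of this is roughly as follows; the file name, query name and connection details are hypothetical:

```clojure
(require '[yesql.core :refer [defqueries]])

;; Hypothetical sql/queries.sql on the classpath:
;;   -- name: films-by-year
;;   SELECT title, year FROM films WHERE year = :year;

(def db-spec {:connection-uri "jdbc:postgresql://localhost/films"})

;; defqueries generates a films-by-year function in this namespace
;; from the named query in the SQL file.
(defqueries "sql/queries.sql" {:connection db-spec})

;; (films-by-year {:year 1979})
```

The require-sql form does the same job but loads the generated functions under an alias, which keeps the current namespace clean.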

One thing I have had to do a fair bit of is parsing and cleaning HTML, and I love the Hickory library for this. One of the nice things is that because it produces a standard Clojure map for the content, you can use a lot of completely vanilla Clojure techniques to do interesting things with it.
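For instance, once Hickory has turned the HTML into a nested map, a plain recursive walk is all you need to pull out text; the markup and the helper function here are invented for illustration:

```clojure
(require '[hickory.core :as h])

(def doc
  (-> "<ul><li>Alien</li><li>Blade Runner</li></ul>"
      h/parse        ;; parse the HTML
      h/as-hickory)) ;; convert it to a nested Clojure map

;; The result is ordinary data, so vanilla Clojure works on it:
;; walk the tree and collect the text of every :li element.
(defn li-text [node]
  (if (= :li (:tag node))
    [(apply str (filter string? (:content node)))]
    (mapcat li-text (filter map? (:content node)))))

(li-text doc) ;; => ["Alien" "Blade Runner"]
```

No selector language required: `filter`, `mapcat` and friends do the traversal.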

Example projects

I created a simple film data API that reads content from an Oracle database and simply publishes it as JSON. It uses yesql and is really just a trivial data transform that makes the underlying data much more usable by other consumers.

id-to-url is a straightforward piece of data munging, but it requires internal-tier access to the Guardian Content API. Given a bunch of internal id numbers from an Oracle database, it checks the publication status of the content and extracts the public URL, and ultimately, in the REPL, I write the URLs to a flat file.

Asynchronous and Parallel processing

My work has generally been IO-bound so I haven’t really needed to use much parallel processing.

However, if you need it, Rich Hickey gives the best explanation of reducers and of why reduce is the only function you really need in data processing. For transducers (in Clojure core from 1.7) I like Kyle Kingsbury’s talk a lot; he also talks about Tesser, which seems to be the ultimate library for multicore processing.
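A minimal sketch of the two approaches side by side: the same sum computed sequentially with a transducer and in parallel with the reducers library:

```clojure
(require '[clojure.core.reducers :as r])

(def numbers (vec (range 1000000)))

;; Sequential: an ordinary transduce with a composed transducer
;; (keep the evens, increment them, sum the result).
(def sequential-sum
  (transduce (comp (filter even?) (map inc)) + numbers))

;; Parallel: r/fold splits the vector into chunks, reduces them
;; across cores, and combines the partial sums with +.
(def parallel-sum
  (r/fold + (r/map inc (r/filter even? numbers))))

(= sequential-sum parallel-sum) ;; => true
```

Note that r/fold only parallelises over vectors (and other foldable collections); on a lazy sequence it quietly falls back to a sequential reduce.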

For async work Rich, again, does the best explanation of core.async. For IO, async is ironically probably the best approach for making the most of your resources, but I haven’t yet been in a situation where I have needed it.


Data Modelling with Representations

One of my colleagues at ThoughtWorks, Ian Robinson, is doing something really interesting with business and data modelling. His ideas are making me think again about some of my views on data modelling.

Data modelling is something that is criminally under-valued in software design. It is actually the crux of most IT problems, and yet it is often dealt with as an afterthought or from the perspective of making application development easier.

Unlike a lot of agile development tasks, data modelling often repays a lot of upfront thought and design. It is far easier to change a data model while it is still on the whiteboard or paper than once a schema has been constructed and is being managed; the cost of change grows much more steeply late in the process than it does in most software development.

The key thing I have taken away from Ian’s work is that analysis goes too far when it starts specifying what data entities are composed of. This tips over into design, and it is where wasted and unnecessary work begins to appear. Instead, if we can agree on the language of the entities in our design, we can have many different representations of the same entity in different parts of our system.

This may not initially seem that exciting, but actually it’s quite subversive, and it challenges the idea in Domain-Driven Design that, in data terms, everything is symmetrical (i.e. the table has the same columns as the corresponding object has fields, and so on).

Saying that there is one conceptual element, say a Customer, but that there may be many Representations of a Customer is a really powerful tool, particularly when trying to evolve already established systems. What this implies is that there is no single point of truth about what data a given Entity should hold, and that instead we can construct representations suitable for the context in which we want to use them.

The single conceptual framework allows us to introduce Ubiquitous Language but without the need to have symmetry. It also avoids making the database the single point of truth, which is the default position of most system designs, particularly where the database is also doing double-duty as an integration bus.

I think the language of Representations follows this pattern “the X as a Y”. So the Customer as a Participant in a Transaction can be different from the Customer as a Recipient of an Order.
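In Clojure this idea is almost trivially direct, since a Representation is just a narrower map derived from the conceptual entity. A sketch, with an invented Customer and invented field names:

```clojure
;; A hypothetical conceptual Customer, and two Representations of
;; it, each shaped for one context ("the Customer as a ...").
(def customer
  {:id               42
   :name             "Ada Lovelace"
   :email            "ada@example.com"
   :shipping-address "12 St James's Square, London"
   :payment-token    "tok-1234"})

;; The Customer as a Participant in a Transaction.
(defn as-transaction-participant [c]
  (select-keys c [:id :name :payment-token]))

;; The Customer as a Recipient of an Order.
(defn as-order-recipient [c]
  (select-keys c [:id :name :shipping-address]))

(as-order-recipient customer)
;; => {:id 42, :name "Ada Lovelace",
;;     :shipping-address "12 St James's Square, London"}
```

Each Representation is a subset of the conceptual Customer, and no single map has to serve as the canonical version.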

If we think in set terms, our different Representations are sets that intersect with our conceptual Customer. In most cases the Representations will be subsets, paring down the information to just what is needed to fulfil the transaction the Representation participates in. In rarer cases we will have Representations that are supersets, where, perhaps temporarily, the Representation carries information that will ultimately reside elsewhere once the transaction is complete; but in my mind these will be quite rare.

So to summarise the advantages I think Representations will bring:

  • simplified data modelling diagrams, there is no need to record any information about what an entity contains, the only relevant information is its relation to other entities
  • a solution to Fat Objects that hoover up all possible functionality an entity can have over time
  • no need to produce a canonical version of an entity, the only relevant information is how one Representation gets mapped to another
  • a rational way to deal with existing data structures that cannot be changed in the current system
  • a framework for evolving data entities that operate in multiple roles