Data wrangling with Clojure

Clojure is a great language for wrangling data that is either awkwardly-sized or where data needs to be drawn from and stored in different locations.

What does awkward-sized data mean?

I am going to attribute the term “awkward-sized data” to Henry Garner and Bruce Durling. Awkward-sized data is neither big data nor small data and to avoid defining something by what it is not I would define it as bigger than would fit comfortably into a spreadsheet and irregular enough that it is not easy to map onto a relational schema.

It is about hundreds of thousands of data points and not millions, it is data sets that fit into the memory on a reasonably specified laptop.

It also means data where you need to reconcile data between multiple datastores, something that is more common in a microservice or scalable service world where monolithic data is distributed between more systems.

What makes Clojure a good fit for the problem?

Clojure picks up a lot of good data processing traits from its inheritance as a LISP. A LISP after all is a “list processor”, the fundamental structures of the language are data and its key functionality is parsing and processing those data structures into operations. You can serialise data structures to a flat-file and back into memory purely through the reader macro and without the need for parsing libraries.

Clojure has great immutable data structures with great performance, a robust set of data processing functions in its core library, along with parallel execution versions, it has well-defined transactions on data. It is, unusually, lazy be default which means it can do powerful calculations with a minimal amount of memory usage. It has a lot of great community libraries written and also Java compatibility if you want to use an existing Java library.

Clojure also has an awesome REPL which means you have a powerful way of directly interacting with your data and getting immediate feedback on the work you are doing.

Why not use a DSL or a specify datastore?

I will leave the argument as to why you need a general purpose programming language to Tommy Hall, his talk about cloud infrastructure DSLs is equally relevant here. There are things you reasonably want to do and you can either add them all to a DSL until it has every feature of poorly thought-out programming language or you can start directly with the programming language.

For me the key thing that I always want to do is read or write data, either from a datastore, file or HTTP/JSON API. I haven’t come across a single data DSL that makes it easier to read from one datastore and write to another.

Where can I find out more?

If you are interested in statistical analysis a good place to start is Bruce Durling’s talk on Incanter which he gave relatively early in his use of it.

Henry Garner’s talk Expressive Parallel Analytics with Clojure has a name that might scare the hell out of you but, trust me, this is actually a pretty good step-by-step guide to how you do data transformations and aggregations in Clojure and then make them run in parallel to improve performance.

Libraries I like

In my own work I lean on the following libraries a lot.

JSON is the lingua franca of computing and you are going to need a decent JSON parser and serialiser, I like Cheshire because it does everything I need, which is primarily produce sensible native data structures that are as close to native JSON structures as possible.

After JSON the other thing that I always need is access to HTTP. When you are mucking around with dirty data the biggest thing I’ve found frustrating are libraries that throw exceptions whenever you get something other than a status code of 200. clj-http is immensely powerful but you will want to switch off exceptions. clj-http-lite only uses what is in the JDK so makes for easier dependencies, you need to switch off exceptions again. Most of the time the lite library is perfectly usable, if you are just using well-behaved public APIs I would not bother with anything more complicated. For an asynchronous client there is http-kit, if you want to make simultaneous requests async can be a great choice but most of the time it adds a level of complexity and indirection that I don’t think you need. You don’t need to worry about exceptions but do remember to add a basic error handler to avoid debugging heartache.

For SQL I love yesql because it doesn’t do crazy things and instead lets you write and test normal SQL and then use inside Clojure programs. In my experience this is what you want to do 100% of the time and not use some weird abstraction layer. While I will admit to being lazy and frequently loading the queries into the default namespace it is far more sensible to load them via the require-sql syntax.

One thing I have had to do a bit of is parsing and cleaning HTML and I love the library Hickory for this. One of the nice things is that because it produces a standard Clojure map for the content you can use a lot of completely vanilla Clojure techniques to do interesting things with the content.

Example projects

I created a simple film data API that reads content from an Oracle database and simply publishes it as a JSON. This use Yesql and is really just a trivial data transform that makes the underlying data much more usable by other consumers.

id-to-url is a straight-forward piece of data munging but requires internal tier access to the Guardian Content API. Given a bunch of internal id numbers from an Oracle databases we need to check the publication status of the content and then extract the public url for the content and ultimately in the REPL I write the URLs to a flat file.

Asynchronous and Parallel processing

My work has generally been IO-bound so I haven’t really needed to use much parallel processing.

However if you need it then Rich Hickey does the best explanation of reducers and why reduce is the only function you really need in data processing. For transducers (in Clojure core from 1.7) I like Kyle Kingsbury’s talk a lot and he talks about Tesser which seems to be the ultimate library for multicore processing.

For async work Rich, again, does the best explanation of core.async. For IO async ironically is probably the best approach for making the most of your resources but I haven’t yet been in a situation where