Many cores, many threads?

There are roughly two approaches to solving the multicore problem: better ways of managing threads and making multiple processes easier to invoke and manage. There are a few pros and cons, threads for example are going to easier to work with if you are managing shared data or you have expensive resource creation costs that you would prefer to confine to a single initialisation event rather than going through a version of it for every job you want to create.

However if you can make your job a parallel process, avoid the need for co-ordination and have a lightweight spin-up then there is actually an interesting question as to whether you should favour processes rather than threads to boost throughput.

Recently I have been working a lot with various REST-like webservices and these have some interesting idempotent and horizontally scalable properties. Essentially you can partition the data and at that point you can execute the HTTP requests in parallel.

I’ve used two techniques for doing this, the most general is the use of GNU Parallel which is an awesome tool (which I would recommend you incorporate into your general shell use) combined with Curl which is also awesome. The other is using Python’s multiprocessing library, which was very helpful when I actually wanted to use a datasource like MongoDB to generate my initial dataset. Even here I could make a system call to Curl if I didn’t want to use Python’s Httplib.

So why use processes to do parallel work rather than threads? The first answer is very UNIX-specific in that processes actually get a tremendous amount of OS support for managing and running, compared to tools for analysing thread activity and usage.

The second is conceptual, the creation of a process to do a task is involves a simple lifecycle and the separation between the resources used by processes is absolute compared to the case with threads. Once a process has gone you know it has completed either successfully or unsuccessfully and upon completion that process no longer affects any other running process.

The third is practical, I think it is easier to construct pipelines of work that divide into parallel streams through a pipeline approach rather than trying to put that code into a thread manager.

So if the conditions apply take a good look at the multi-process approach because it might be a lot easier to implement.


One thought on “Many cores, many threads?

  1. Pingback: Many cores, many threads? | InternetRSSFeeds

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s