Scale Summit 2015: Testing in production session

One of the most interesting sessions I went to at Scale Summit 2015 was one about testing in production. It was not that well attended compared to the other sessions so I don't know if there was implied agreement with the topic.

One of the questions was why it is important to test in production. For me the biggest thing is that you can only really get realistically distributed traffic from genuine traffic. Most load-testing or replay strategies fail for me at the first hurdle by only creating load from a few points of presence, usually in the big Amazon availability zones. You also have to be careful that traffic is routed outside of Amazon's internal data connections if you want to get realistic numbers. Dealing with load from a few different locations with large data pipelines between them is very different from distributed clients on the public network.

Replay strategies allow for "realistic" traffic patterns and behaviours but one of the more interesting ideas discussed was to generate fake load during off-peak periods. This is generated alongside the genuine user traffic. The fake load exercises key revenue generating pathways with some procedural randomisation. Injecting this additional fake load allows capacity planning and scaling strategies to be tested to a known excess capacity.

Doing testing in production means being responsible so we talked about how to identify fake test traffic (HTML headers with verification seemed sensible) so that you can do things like circuit-break that traffic and also segment it in reporting.

During the conversation I realised that the Guardian's practice of asking native app users to join the beta programme was also an example of testing in production. Most users who enter the scheme don't leave so you are creating a large segment of users who are validating releases and features ahead of the wider user base.

In the past we've also used the Facebook trick of duplicating user requests into multiple systems to make sure that systems that are being developed can deal with production load. If you don't like doing that client-side you can do it server-side by using a simple proxy that queues up work with a variety of systems but offloads everything that isn't the user's genuine request. Essentially you throw away the additional responses but the services will still do the work.

We also talked about the concept of having advanced healthchecks that report on the status of things like the availability of dependencies. I've used this technique before but interestingly I've made the machines go into failure mode if their mandatory dependencies aren't available where as other people were simply dashboarding the failures (and presumably alerting on them).

At the end of the session I was pretty convinced that testing in production is not only sensible but that actually there are a number of weaknesses in pre-production testing approaches. The key one being that you should assume that pre-production testing represents the best case scenario. You are testing your assumed scenario in a controlled environment.

There is also a big overlap between good monitoring and production testing. You have to have the first before you can reasonably do the second. The monitoring needs to be freely accessible to everyone as well. There's no good reason to hide monitoring away in an operations group and developers and non-technical team members need to be able to see and understand what is actually happening in production if they are to have the same conversation.


The state of microservices

One of the liveliest sessions at Scale Summit was the one on microservices where opinions flowed free and fast in a rapidly rotating fishbowl.

There were several points of interest that I took away and I thought I would write them up here as some of the key questions. In another post I’ll talk about the problems with microservices that haven’t been solved yet.

Are we doing microservices yet?

Some people in the room had already adopted microservices, the reasons given included: breaking down or trying to change functionality monolith codebases, trying to scale up an existing applications or architectures or actually as an existing best practice that should be applied more generally.

What is a microservice?

A few people wanted to get back to this. Showing that while it is a handy term it isn’t universally understood or agreed.

I personally think a microservice is a body of code whose purpose and implementation is easy to understand when studied, adheres to the Unix philosophy of doing one thing well and which can be re-implemented quickly if needed.

Are microservices just good practice or righteous SOA?

Well naturally you can regard every new concept as being a distillation of previous good practice. However the point of naming something is to try and make it easy to triangulate on a meaning in conversation.

Won’t microservices just be corrupted by consultants and architects?

Yes, everything popular gets corrupted in time. I’m okay with that because in the interval we have a handy term to describe some useful patterns of solution design.

Don’t microservices complicate operations?

One attendee put it well: microservices are recursive so if the operations team are going to support them then they should be in the business of creating a service that deploys services.

Some people felt that for microservices to work developers and teams had to have full stack ownership and responsibility but I felt that was trying to smuggle devops in under the microservices banner.

I think microservices are best deployed on a platform and that platform defines what a deployable service is and can be responsible for shutting down misbehaving services.

Such a scheme allows for other aspects of the Unix way to be applied such as man pages, responding to –help and other useful conventions.

The platform can check whether these conventions have been met before deploying the service.

Isn’t microservices just a way to discuss the granularity of a service?

Yes in a way, although there are a few other practices that make up a successful application of microservices you can just think about it as being a way of checking the responsibility boundaries of your service and how easy it would be to replace.

If your service has multiple responsibilities and is difficult to replace easily then it might have problems.

A lot of people wanted to use AWS an an example of good services without microservices. I think AWS is a good example of service implementation: each part has good boundaries and there are a few implementations of the key APIs. Things like the AWS security functionality is a good example of how you have to work hard to avoid having services rely on other services, and the result isn’t elegant as a result.

I would argue that public-facing APIs are probably where you want to composite microservices to provide a consistent facade onto the functionality though.

As other delegates pointed out, isolating services makes them easier to scale. If starting a server in the EC2 API requires more resources than shutting it down you might prefer to scale up just the creation service rather than many instances of the whole API which are unlikely to be used or consume resources.

As ever, horses for courses, you’re going to know your domain better than I do.

Don’t microservices cause problems as well as solve them?

Absolutely, choosing the wrong solution at the wrong time is going to cause problems. Zealously over-applying microservices or applying the idea to absurd levels is not going to have a happy outcome.

A guess a good point is that we know our problems with existing service implementations. We don’t know what problems there are with microservices or whether they have logical and simple solutions. However they are helping us solve some of our known problems.

Aren’t microservices simply REST-ful services done right?

The most common form of microservice today is probably one implemented via HTTP and JSON. However this form isn’t prescriptive. ProtocolBuffers might be a better choice for an internal exchange format and ZeroMQ might be a better choice for a transport.

I also think that message queues are a good basis for microservices with micro consumers and producers focussing on tight message types.

See also my mini-list of microservice myths which has more on this subject.

Should we be doing microservices?

I suspect that doing microservices for the sake of ticking a solution buzzword is bad. However I think microservices seem a pretty good solution to a problem I know I’ve seen in a fast-moving domain where you are trying to innovate without creating a maintenance burden or large legacy.


Scale Summit 2014

Scale Summit is the new Scale Camp, an unconference aimed at bringing the same kind of topics as you might expect at Velocity.

This was the first Scale Summit, the venue was excellent as was the food (especially the bacon rolls, from Eden apparently) and supply of drink. Scale Summit happens under Chatham House rules so there’s no attribution of what is said which allows the attendees to be really frank and also for people to be free with what they really know rather than hedging and trying to be “on message”. It makes for a fascinating gathering.

The sessions varied in their organisation but all focussed on discussion between the participants. I managed to go to the Elasticsearch session, which was interesting for the practical boundaries that people were finding and also the operational knowledge. On the subject of using ES as the primary application store, the feeling seemed to be “not yet”, but there was also some words of wisdom about separating out document stores and search functionality and not finding a superficial unity in the two purposes.

The microservices session was a fast and furious fishbowl, easily the liveliest event and one that is going to require a post in its own right. It was interesting to see that the room split into practitioners and people who were sceptical that microservices were a thing or held value over conventional service development.

After lunch I sat in on what can be done to get frontend testing off the critical path to production (not much now but clearly more effort needs to be made), distributed DOS attacks on transactional sites (not as scary as I imagined but again we have to be thinking about how this works), distributed data stores (good war stories, felt better informed for going), getting ops and developers to work together and Linux containers (definitely going to try Docker now).

I had quite a few questions going into the event and while I didn’t get all the answers I hoped for I did at least establish that smart people don’t have simple answers to them either which is reassuring. It’s hard to tell in the heat of it all whether you’re on the edge of things doing things that are pushing the boundaries or simply over-complicating your situation.

The attendees were nicely mixed and from a range of backgrounds, ops, architecture and developers were all well-represented so you felt you were seeing a rounded situation.

The unconference format left me wanting more rather than feeling I had had enough. The openess was amazing and I am planning on being there next year.