Scale Summit 2015: Testing in production session

One of the most interesting sessions I went to at Scale Summit 2015 was one about testing in production. It was not that well attended compared to the other sessions so I don't know if there was implied agreement with the topic.

One of the questions was why it is important to test in production. For me the biggest thing is that you can only really get realistically distributed traffic from genuine traffic. Most load-testing or replay strategies fail for me at the first hurdle by only creating load from a few points of presence, usually in the big Amazon availability zones. You also have to be careful that traffic is routed outside of Amazon's internal data connections if you want to get realistic numbers. Dealing with load from a few different locations with large data pipelines between them is very different from distributed clients on the public network.

Replay strategies allow for "realistic" traffic patterns and behaviours but one of the more interesting ideas discussed was to generate fake load during off-peak periods. This is generated alongside the genuine user traffic. The fake load exercises key revenue generating pathways with some procedural randomisation. Injecting this additional fake load allows capacity planning and scaling strategies to be tested to a known excess capacity.

Doing testing in production means being responsible so we talked about how to identify fake test traffic (HTML headers with verification seemed sensible) so that you can do things like circuit-break that traffic and also segment it in reporting.

During the conversation I realised that the Guardian's practice of asking native app users to join the beta programme was also an example of testing in production. Most users who enter the scheme don't leave so you are creating a large segment of users who are validating releases and features ahead of the wider user base.

In the past we've also used the Facebook trick of duplicating user requests into multiple systems to make sure that systems that are being developed can deal with production load. If you don't like doing that client-side you can do it server-side by using a simple proxy that queues up work with a variety of systems but offloads everything that isn't the user's genuine request. Essentially you throw away the additional responses but the services will still do the work.

We also talked about the concept of having advanced healthchecks that report on the status of things like the availability of dependencies. I've used this technique before but interestingly I've made the machines go into failure mode if their mandatory dependencies aren't available where as other people were simply dashboarding the failures (and presumably alerting on them).

At the end of the session I was pretty convinced that testing in production is not only sensible but that actually there are a number of weaknesses in pre-production testing approaches. The key one being that you should assume that pre-production testing represents the best case scenario. You are testing your assumed scenario in a controlled environment.

There is also a big overlap between good monitoring and production testing. You have to have the first before you can reasonably do the second. The monitoring needs to be freely accessible to everyone as well. There's no good reason to hide monitoring away in an operations group and developers and non-technical team members need to be able to see and understand what is actually happening in production if they are to have the same conversation.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s