Last week, I called out Dilbert for avoiding the dreaded load test – it is hard, expensive and takes too much time. Yeah, but. I use my stuff to make money and if it is slow then nobody will like it and I will die in the gutter, gasping for air. So, I simply must enter through the gates of hell and get some testing done – no way around it.
So, I checked with NBC and, despite the recent loss of Olympic coverage, they're still not desperate enough to consider my idea of testing directly in a production environment newsworthy. They mumbled something about boring and too geeky. Pfsst. I'll press on.
Beyond the obligatory warning, "Don't blame me if you screw up and spend your weekend discovering that your disaster recovery plan has a few holes that you'll need to fix on the fly", let me bullet point the typical test-in-your-production-environment advice here (I will try to add some value later in this post):
- Make sure you have a disaster recovery plan and have a backup or two handy
- Conduct your test during the lowest traffic periods
- (Optional) Redirect users to a “Temporarily Out of Service” page, if you don’t want to include any real users in your experiment
- Increase load volumes gradually to avoid complete system crashes (see the ramp-up sketch just after this list)
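On that last point, here is a rough Python sketch of what a gradual ramp-up could look like. The target URL, step sizes, dwell time and latency threshold are all stand-ins, not anything from my actual setup; tune them to a small fraction of your real peak before pointing this at anything live.

```python
# A minimal sketch of a gradual ramp-up against a production endpoint.
# TARGET, STEPS, DWELL_SECONDS and the 2.0s latency cutoff are placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET = "https://www.example.com/health"   # hypothetical endpoint
STEPS = [5, 10, 20, 40]                     # concurrent workers per step
DWELL_SECONDS = 60                          # how long to hold each step

def hit(url):
    """Issue one request and return its latency in seconds (None on error)."""
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=10):
            return time.time() - start
    except Exception:
        return None

for workers in STEPS:
    deadline = time.time() + DWELL_SECONDS
    latencies = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while time.time() < deadline:
            latencies.extend(pool.map(hit, [TARGET] * workers))
    ok = [l for l in latencies if l is not None]
    errors = len(latencies) - len(ok)
    avg = sum(ok) / len(ok) if ok else float("nan")
    print(f"{workers} workers: avg {avg:.3f}s, errors {errors}")
    # Bail out before you tip the whole site over.
    if errors or (ok and avg > 2.0):
        print("Backing off: error or latency threshold reached.")
        break
```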
A quick sidebar, perhaps a good subject for a later post: disaster recovery plans that haven't been validated through execution are virtually useless. Not that I think combining these complicated and risky test scenarios is optimal, but I certainly think that validating your disaster recovery plan must precede any type of live load testing in your production environment. So if you haven't done this, stop here – now!
Many technology platforms today can be vertically subdivided front to back, in particular to enable maintenance and upgrades. This is most often accomplished with a load balancer rule change that sends traffic down one path or another. I can use this technique to send folks to the out-of-service page, or to keep serving real users while I use either all or part of my production gear to test. Further back in the stack – caches, queues, indexers, databases – it gets a bit more complicated and depends highly on your architecture.
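For the front end split, here is a rough sketch of what that rule flip might look like when scripted against a load balancer admin API. Everything here (the endpoint, pool names, server names) is hypothetical; substitute whatever your balancer actually exposes, whether that's HAProxy, nginx, an F5 or a cloud load balancer.

```python
# A minimal sketch of flipping a load balancer rule so real users stay on one
# slice of the production stack while the other slice takes test traffic.
# The admin URL, pool names and payload shape are entirely hypothetical.
import json
import urllib.request

LB_ADMIN = "https://lb-admin.internal.example.com/api/pools"  # hypothetical

def set_pool_members(pool, members):
    """Point the named pool at a specific set of backend servers."""
    body = json.dumps({"members": members}).encode()
    req = urllib.request.Request(f"{LB_ADMIN}/{pool}", data=body, method="PUT",
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

# Send live users to half the front ends, keep the rest for the test run.
set_pool_members("live", ["fe01", "fe02", "fe03", "fe04", "fe05"])
set_pool_members("test", ["fe06", "fe07", "fe08", "fe09", "fe10"])
```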
But this common, well-covered approach only solves the "where do I get a test environment" problem. I still have all the data problems that I mentioned last post, and I still have the load generation problems.
A technique that may be a little less obvious than just carving out a test environment within my production gear is what I'll call load testing through reduction. Basically, the idea is to test various parts of my infrastructure under load by removing a member of a component pool and measuring the increase in load served by the remaining production environment.
For example (I'll use the easiest component here), let's say I want to measure front end server performance under load and I have 10 front end servers. I already know that under "normal" traffic my array of front end servers is performing well within the acceptable range, at 25% of peak capacity – so removing one front end should not tip everything over. The starting measurement is taken with all 10 servers working and measurable "normal" traffic flowing. Removing one front end server should increase traffic to the remaining 9 servers. It is also worth measuring the impact on other components in the stack – caches, queues, indexers, databases, etc… Some components will have an initial reaction to the change in load and then should settle down to a new normal performance under load. In a perfectly scalable world, each of the remaining 9 front end servers should simply pick up about 1/9 more traffic, roughly an 11% bump in load per server. In the real world, mileage varies, and knowing the details is super valuable in forecasting and capacity planning.
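To put some numbers on that, here is a quick back-of-the-envelope check in Python. The utilization figures come from the example above; the "measured" value is made up purely to show the comparison you'd want to make.

```python
# Back-of-the-envelope math for the reduction example: 10 front ends at 25%
# of peak, then one is pulled from the pool. Numbers are illustrative only.
servers_before = 10
servers_after = 9
utilization_before = 0.25          # each server at 25% of its peak capacity

# If traffic redistributes evenly, each remaining server picks up 1/9 more work.
expected_utilization_after = utilization_before * servers_before / servers_after
print(f"Expected per-server utilization: {expected_utilization_after:.1%}")  # ~27.8%

# Compare with what you actually measure on the remaining 9 servers; the gap
# between expected and measured is the interesting part for capacity planning.
measured_utilization_after = 0.30  # hypothetical observation
overhead = measured_utilization_after / expected_utilization_after - 1
print(f"Measured {measured_utilization_after:.1%}, "
      f"{overhead:.1%} worse than perfectly linear scaling")
```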
I can continue using this technique, gathering multiple data points, until I reach various stress points. A stress point is usually defined as the point at which performance becomes unacceptable or, worse, failures emerge. I typically stop before catastrophic failure and usually have plenty to work on well before hitting many stress points. There is a ton of learning available with this technique – developing a deep understanding of my product's scalability on a component by component basis is incredibly useful in guiding future development investment strategies, building consensus around "acceptable" performance metrics and even budgeting future infrastructure spend.
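Scripted out, the reduction loop might look something like the rough sketch below. drain_server, restore_server and sample_latency are placeholders for whatever your load balancer and monitoring stack actually provide, and the thresholds are examples, not recommendations.

```python
# A sketch of the reduction loop: drain one member at a time, let the system
# settle, record a data point, and stop well before catastrophe.
import time

FRONT_ENDS = [f"fe{i:02d}" for i in range(1, 11)]   # hypothetical server names
LATENCY_BUDGET_SECONDS = 1.0    # the agreed-on "unacceptable" threshold
SETTLE_SECONDS = 300            # give the new normal time to establish itself

def drain_server(name):
    """Hypothetical: pull this member out of the load balancer pool."""
    print(f"draining {name}")

def restore_server(name):
    """Hypothetical: put this member back into the pool."""
    print(f"restoring {name}")

def sample_latency():
    """Hypothetical: read p95 latency (seconds) from your monitoring system."""
    return 0.4  # placeholder value so the sketch runs end to end

removed = []
data_points = []
try:
    for server in FRONT_ENDS:
        drain_server(server)
        removed.append(server)
        time.sleep(SETTLE_SECONDS)          # let things settle before measuring
        latency = sample_latency()
        data_points.append((len(FRONT_ENDS) - len(removed), latency))
        if latency > LATENCY_BUDGET_SECONDS:
            print(f"Stress point reached with {len(removed)} server(s) removed.")
            break
finally:
    for server in removed:                  # always put the pool back as found
        restore_server(server)

print(data_points)   # (servers remaining, latency) pairs for capacity planning
```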
This technique allows me to avoid time-consuming environment building, expensive and inaccurate approximations, as well as the nearly impossible and risky data movement and cleansing for testing purposes.
No data movement, cleaning, backups/restores, load generation needed. Know thy product performance and scalability.
2 Responses
Chip, nice article. I agree that a disaster recovery configuration that has not really been tested is completely useless, which always seems to be the case. Buying hardware for DR is easy; testing it is very difficult.
Probably a good idea for a different post. Mostly, I've built practical DR plans around fault tolerance and failover strategies – in other words, bringing up the cold spares, etc…