Stress testing with Google Cloud Platform
September 3rd, 2015 by Energyworx
Google points out that getting all cloudy gives you a tremendous amount: agility, scalability, cost savings and more. The scales weigh heavily in favor of embracing cloud goodness. On the other side of that scale, however, getting all cloudy means giving up a degree of control. Google Cloud users don’t control the infrastructure and, in certain cases, they don’t know the implementation behind the APIs that we at Energyworx also rely on. This is especially true of managed services such as databases and message queues, and those APIs and their associated SLAs are central to the operation of our Energyworx platform, and of others.
Google’s solution architect Corrie Elsthon puts it this way: there is nothing surprising, bad or wrong about this situation, and there are far more pros than cons with the cloud. But Google’s engineers, whose reputation (and need for a night’s sleep uninterrupted by a 3am wake-up call) relies on the stability and scalability of the systems they build, still have to ask: what do we do? Google follows the age-old maxim: trust but verify, and verify by testing!
Testing comes in many forms, but broadly there are two types: functional testing and stress testing. Functional tests check for correctness. When I register for your service, does my email address get encrypted and correctly persisted? Stress tests check for robustness. Does our service handle 100,000 users registering in the fifteen minutes after it’s mentioned in the news? As an aside, Corrie was tempted as she wrote this to phrase everything in terms of “we all know this…” and “of course we all do that…” when it comes to testing. We do all know it’s a good thing to do, and we all do it to one extent or another. But the number of scalability problems that good engineers run into is proof that the importance of stress testing isn’t a universally held truth, or at least not a universally practiced one. The remainder of this post focuses on a set of best practices Corrie distilled from a stress testing exercise Google did with us on Google Cloud Platform as part of our Energyworx platform go-live.
Energyworx and Google used the existing Energyworx REST APIs together with Grinder to stress test the system. Grinder allows the calls to the REST APIs to be scaled up and down as required, depending on the type and degree of stress to be applied. Test scenarios were based around scaling the number of smart meters uploading data, the amount of work performed by the meters and the physical locations of the meters. For example, we knew a single meter worked correctly, so let’s try several hundred thousand meters working at the same time, or let’s have meters running in Europe accessing the system in the US, or let’s have thousands of meters do an end-of-day upload at the same time. Following the best practices below, we ran extended 200-core tests for approximately $10 a time and proved that our system was ready for millions of meters flooding the grid daily with billions of values. Google were right and our launch went off without a hitch. Stress testing is a blast…
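To make the "many meters at once" scenario concrete, here is a minimal sketch in plain Python. It is not a Grinder script and not our actual test code. The names fake_upload and run_scenario are hypothetical, and the upload is a stand-in that sleeps for a few milliseconds instead of calling a real REST API, so the sketch runs without any network access.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def fake_upload(meter_id, readings):
    """Hypothetical stand-in for a REST upload call.

    Sleeps briefly instead of touching the network (an assumption made
    purely so the sketch is self-contained).
    """
    time.sleep(random.uniform(0.001, 0.005))
    return {"meter_id": meter_id, "accepted": len(readings)}


def run_scenario(meter_count, readings_per_meter, workers=50, upload=fake_upload):
    """Let every simulated meter upload once, concurrently, and collect timings."""

    def one_meter(meter_id):
        readings = [random.random() for _ in range(readings_per_meter)]
        start = time.perf_counter()
        result = upload(meter_id, readings)
        return time.perf_counter() - start, result["accepted"]

    # The worker pool plays the role of the load generator: more workers
    # means more simulated meters uploading at the same moment.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(one_meter, range(meter_count)))

    latencies = sorted(latency for latency, _ in outcomes)
    return {
        "meters": meter_count,
        "values_uploaded": sum(accepted for _, accepted in outcomes),
        "median_latency_s": latencies[len(latencies) // 2],
        "max_latency_s": latencies[-1],
    }


if __name__ == "__main__":
    # 96 readings per meter is an illustrative figure, not a platform constant.
    print(run_scenario(meter_count=200, readings_per_meter=96))
```

Swapping fake_upload for a function that performs a real HTTP request, and raising meter_count and workers, turns the same shape into an end-of-day upload burst. In a real test the load generators would run on many machines rather than in one thread pool.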
The first best practice is to use Google Cloud Platform to provide the resources for the stress test. Simulating hundreds of thousands of smart meters (or users, or game sessions, or other stimuli) takes resources, and Google Cloud Platform allows you to spin these up on demand, in very little time, and pay by the minute for them. That’s a great deal for stress testing.
The second best practice is to use stress testing to probe the behavior of your system and of the infrastructure and services it relies upon. Systems are often complex, with different tiers and services interacting, and it can be tough to predict how they will behave under stress. Be creative with your scenarios and you’ll learn a lot about your system’s behavior.
The third best practice is to test the rate of change of the load we apply as well as the maximum load. It’s great to know our system can handle a load of 100K transactions per second, but it’s still not a useful system if it can only get there in increments of 10K per minute over 10 minutes, when a single news article from the right expert can bring you that much traffic in the web equivalent of the blink of an eye.
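The difference between a gentle ramp and a sudden spike can be written down as two load profiles. The sketch below is illustrative only: stepped_ramp, spike and max_rate_of_change are hypothetical helper names, and each profile is simply a list holding the target transactions per second for every second of the test.

```python
def stepped_ramp(peak_tps, steps, step_seconds):
    """Target rate per second when load rises in equal steps up to peak_tps."""
    return [
        peak_tps * (second // step_seconds + 1) // steps
        for second in range(steps * step_seconds)
    ]


def spike(peak_tps, duration_seconds):
    """Target rate per second when the full load arrives immediately."""
    return [peak_tps] * duration_seconds


def max_rate_of_change(profile):
    """Largest one-second jump in target rate, counting the start from zero."""
    previous = 0
    biggest = 0
    for rate in profile:
        biggest = max(biggest, rate - previous)
        previous = rate
    return biggest


# The example from the text: 100K TPS reached in ten one-minute steps of 10K,
# versus the same peak arriving all at once.
ramp = stepped_ramp(peak_tps=100_000, steps=10, step_seconds=60)
burst = spike(peak_tps=100_000, duration_seconds=600)

print(max_rate_of_change(ramp))   # jump per step in the gentle ramp
print(max_rate_of_change(burst))  # jump when everything arrives at once
```

Both profiles reach the same maximum load, but the spike asks the system to absorb ten times the jump in a single second. A system can pass the first profile and still fail the second, which is why both belong in the test plan.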
The fourth best practice is to test regularly. We release each Friday and bugfix on demand, and we don’t need to stress test every time we release, but we do stress test the entire system every 2-4 weeks to ensure that performance is not degrading over time.
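One simple way to spot degradation between runs is to compare a latency percentile from the latest stress test against an earlier baseline. The sketch below assumes you record per-request latencies for each run. The names percentile and has_regressed, the 95th percentile and the 10% tolerance are all illustrative choices, not values taken from our platform.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def has_regressed(baseline, latest, pct=95, tolerance=0.10):
    """True if the latest run's percentile exceeds the baseline by more than tolerance."""
    return percentile(latest, pct) > percentile(baseline, pct) * (1 + tolerance)


# Made-up latencies in milliseconds, for illustration only.
baseline_run = [float(ms) for ms in range(1, 101)]
slightly_slower = [ms * 1.05 for ms in baseline_run]
much_slower = [ms * 1.20 for ms in baseline_run]

print(has_regressed(baseline_run, slightly_slower))  # within tolerance
print(has_regressed(baseline_run, much_slower))      # beyond tolerance
```

Keeping the baseline from a known-good run and repeating the comparison after each periodic stress test gives an early warning before slow drift turns into a production problem.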