Performance testing has traditionally been relegated to the late phases of software development. Even Agile approaches fail to define clearly how to handle this time-consuming activity, as teams fail to incorporate performance conditions into their Definition of Done. At Telestax we strive to get the best performance from our software, and during the last year we have made progress on incorporating this activity into our Continuous Delivery strategy.
Performance testing is hard to get right. Controlling the runtime environment and collecting data in a way that yields stable, comparable results are both difficult. This activity has traditionally been conducted by senior engineers with experience at several levels (HW, OS, application, scripting…).
This activity is usually composed of several steps, each one introducing a risk of misconfiguration that could invalidate the final results. These steps are part of a long-running cycle, since the process usually has to be exercised for a meaningful amount of time (normally more than one hour at system level). Sometimes those misconfigurations are detected very late, and the whole process needs to be restarted. We could try to enumerate the different steps:
- Recreate the runtime environment: This is old school, yet still used: the tester creates the environment from scratch on the hardware provided. The OS needs to be installed and tuned, along with the tooling and the rest of the artifacts needed to conduct the testing.
- Start the monitored process: This involves executing whatever process (or set of processes) is under test.
- Start monitoring tools: This should include OS resource monitoring (CPU, memory, network, IO…) and process-specific resource monitoring (such as GC for Java processes). These tools should sample at a suitable frequency while keeping the monitoring overhead low, so the actual system performance is not affected.
- Start injecting the defined load stimulus: This may involve testing tools (sipp, soapui, jmeter…), ideally injecting the load with a ramp-up curve so the monitored process is not hit with a burst that reflects an unrealistic traffic profile. The testing tool should also contribute data to the final results, since information such as response time and latency is needed for a complete performance model.
- Maintain load for a period of time: After the ramp-up period, the load is maintained “constantly”, and data collection continues in the background.
- Stop, and settle down: Once the defined load scenario has been completed, we stop injecting load and let system resource consumption settle down. This phase is particularly interesting to monitor, because it can reveal resource leaks introduced by the monitored process.
- Stop data collection: Stop all the processes associated with monitoring.
- Analyse resulting data: Hopefully the monitoring tools have collected enough data to reach a conclusion. This data may be interpreted in different ways, and graphs are usually generated to help with the interpretation.
- Assert on final results: Finally the data is aggregated into meaningful metrics, and those are asserted against the initial performance goals (CPU under a certain threshold, call failure ratio under a certain threshold…).
- Compare results with previous runs.
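The ramp-up, constant-load, and settle-down phases in the steps above can be pictured as a simple load profile. The following is only a minimal sketch, assuming a linear ramp; the function name `target_rate` and its parameters are hypothetical and are not taken from any of the tools mentioned.

```python
def target_rate(t, ramp_up_s, hold_s, peak_rate):
    """Return the target load (calls per second) at second t of the test.

    Assumes a linear ramp-up, a constant hold phase, and a settle-down
    phase in which no load is injected while monitoring keeps running.
    """
    if t < 0:
        return 0.0
    if t < ramp_up_s:
        # Ramp-up: grow linearly from 0 to the peak rate.
        return peak_rate * t / ramp_up_s
    if t < ramp_up_s + hold_s:
        # Hold: the load is maintained at the peak rate.
        return float(peak_rate)
    # Settle-down: injection has stopped.
    return 0.0


if __name__ == "__main__":
    # Illustrative values: 60 s ramp-up, one hour hold, 200 calls per second.
    for second in (0, 30, 60, 3659, 3660):
        print(second, target_rate(second, 60, 3600, 200))
```

Sampling the resources during the last phase, when the function returns zero, is what makes leaks visible: consumption that does not come back down after the load stops deserves a closer look.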
The activity may be implemented with different levels of automation:
- Totally manual: Everything is done manually, from environment preparation to collecting and analysing results. This effort is usually made once per release cycle.
- Reserved Environment, but manually conducted: The runtime environment is prepared and reserved for later executions. Most of the tooling and process preparation is already in place, and only small updates are necessary to trigger a new execution. The tester manually ensures each step is completed.
- Reserved Environment, scripted, but manually triggered: Same as before, but the steps are run by scripts, which hopefully check conditions after each step to ensure the whole run is valid. The execution is still manually triggered by the tester when the team decides to do performance testing.
- Reserved Environment, scripted, automatically triggered: Same as the previous level, but the execution of the scripts is hooked into the software development lifecycle. The scripts need to be designed to be invoked remotely, and ideally configured through meaningful parameters.
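To illustrate what "scripted, with checks after each step" can look like, here is a minimal sketch in Python. It is not how our scripts are written; the names `run_steps` and `StepFailed`, and the step names in the usage example, are hypothetical. The point is only that a failed check aborts the run immediately, instead of being discovered after an hour of load.

```python
class StepFailed(Exception):
    """Raised when the check that follows a step does not pass."""


def run_steps(steps):
    """Run (name, action, check) triples in order.

    Each action is executed and then its check is evaluated. The run
    aborts at the first failed check, so a misconfiguration cannot
    silently invalidate the remaining steps.
    """
    completed = []
    for name, action, check in steps:
        action()
        if not check():
            raise StepFailed(f"step '{name}' failed; completed so far: {completed}")
        completed.append(name)
    return completed


if __name__ == "__main__":
    state = {}
    # Hypothetical steps: real actions would start processes and tools.
    steps = [
        ("start_process", lambda: state.update(process=True), lambda: state.get("process", False)),
        ("start_monitoring", lambda: state.update(monitor=True), lambda: state.get("monitor", False)),
    ]
    print(run_steps(steps))
```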
We think we need the highest level of automation to ensure that every potential release meets certain minimum performance goals. That involves having a predefined hardware environment for performance testing, and scripts that cover all the steps and can be hooked into the software development lifecycle.
Again, we have made progress on this vision during this year. Much of this progress comes from our performance tool, PerfCorder.
This tool, as its name suggests, started with the specific purpose of helping us collect and record data in a uniform way, and store performance results in a standardized format.
Later we incorporated a basic analysis tool that takes that stored file and provides statistics over the collected data. Basing our performance analysis on statistics, rather than on manual graph interpretation, is key to achieving a consistent and repeatable process.
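As an illustration of the kind of statistics involved, the sketch below summarizes a list of samples with the Python standard library. It is not PerfCorder code: the function names are hypothetical, and the nearest-rank percentile definition is an assumption, since other definitions exist and the tool may use a different one.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("no samples to analyze")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return a few statistics for one measurement."""
    return {
        "Mean": statistics.mean(samples),
        "Median": statistics.median(samples),
        "Percentile95": percentile(samples, 95),
        "Sum": sum(samples),
    }


if __name__ == "__main__":
    print(summarize([1, 2, 3, 4]))
```

Because the same function is applied to every run, two runs can be compared number against number, without anyone having to judge two graphs by eye.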
Finally, we incorporated a performance goal tester that runs over the results of the analysis. This tool allows us to express different performance goals and generate a standardized report reusing the JUnit format.
It was only natural to bring this whole process into our CI/CD environment. We use CloudBees, backed by Amazon EC2 instances, so it was fairly easy to achieve all our goals with a few tricks.
Let’s see it in Action
All the previous explanation is quite reasonable, and we could probably reach a consensus in the community, but as the saying goes, “A picture is worth a thousand words”. So, let’s see some screenshots of our current CI/CD environment. I will be showing my work for the SIPServlets project, but the same strategy has been followed by the rest of the projects.
This is the view of our CD environment:
We can see a “release” job, several functional test jobs (TCK, test-suite), and finally some performance jobs. All these jobs are interrelated using basic Jenkins chaining features, so they are executed one after the other, composing the CD pipeline.
Let’s focus now on the performance jobs. In our case, we have three different scenarios implemented, exercising different parts of our software: the three basic UAS, Proxy, and B2BUA tests. Each job is parameterized as in the following picture:
The idea of a parameterized performance job is to be able to configure both the load scenario and the conditions that affect the performance of the monitored process. For example, we can tune the performance job to create different loads by setting the CALL_RATE, CALL_LENGTH, and TEST_DURATION parameters. At the same time, we can tune JAVA_OPTIONS, LOG_LEVEL, and JAIN_SIP_PROPS, which greatly impact the performance of the monitored process. Finally, we use a build selector, which is handy if we want to retest a previous release with different parameters and compare with the current version's results.
Each performance job is configured to run a shell script that runs the actual test and collects the data. The following picture shows this configuration:
As we can see, there is an initial part where environment variables are prepared, taking job parameters into account, so the final script is properly configured. The scripts are saved in Git for proper version control.
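Our job does this preparation in a shell script, which is not reproduced here. Purely as an illustration of the idea, the Python sketch below reads some of the job parameters named above from the environment and validates them before anything is started. The function name, the default values, and the assumption that CALL_LENGTH and TEST_DURATION are expressed in seconds are all hypothetical.

```python
import os


def load_job_params(env):
    """Build a validated settings dict from a mapping of environment variables."""

    def positive_int(name, default):
        value = int(env.get(name, default))
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
        return value

    return {
        # Defaults below are illustrative assumptions, not the job's real defaults.
        "call_rate": positive_int("CALL_RATE", 200),
        "call_length": positive_int("CALL_LENGTH", 60),
        "test_duration": positive_int("TEST_DURATION", 3600),
        "java_options": env.get("JAVA_OPTIONS", ""),
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }


if __name__ == "__main__":
    # In a CI job the parameters arrive as environment variables.
    print(load_job_params(os.environ))
```

Failing fast on an invalid parameter is cheap insurance: the alternative is finding out at the end of a long run that the scenario was not the one intended.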
The following picture shows the artifacts archived when a performance job run is completed:
We can see the PerfCorder artifacts there, including the ZIP file with all the data collected, the analysis saved as an XML file, an HTML view generated from the analysis, and finally a JUnit test report with the assertion results.
The following snippet shows a sample of the analysis file generated from the collected data:
This is a simplified version of the analysis file. We can see that it contains metadata about the test (duration, settings…) and a measurement map with statistics for the different collected data. The “samplesToStripratio” parameter configures the range of samples to analyze, discarding the ramp-up and settle-down phases. So, a value of “10” means discarding 10% of the samples at the beginning and 10% at the end of the collected CSV files.
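The effect of stripping samples can be sketched in a few lines of Python. This is an illustration rather than the PerfCorder implementation: the function name is hypothetical, and rounding the number of discarded samples down is an assumption.

```python
def strip_samples(samples, strip_ratio_percent):
    """Discard strip_ratio_percent of the samples at each end of the list."""
    if not 0 <= strip_ratio_percent < 50:
        raise ValueError("strip ratio must be in the range [0, 50)")
    count = len(samples)
    # Number of samples dropped at each end, rounded down (assumption).
    to_drop = int(count * strip_ratio_percent / 100)
    return samples[to_drop:count - to_drop]


if __name__ == "__main__":
    collected = list(range(100))
    kept = strip_samples(collected, 10)
    # With 100 samples and a ratio of 10, samples 10 to 89 are kept.
    print(len(kept), kept[0], kept[-1])
```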
The following picture shows the HTML generated view of the analysis file:
The green vertical bars indicate which range of the samples was taken into account for the statistics.
The following snippet shows how we express our performance goals in an XSLT template that will be applied to the analysis file:
We can see basic “lessThan” assertions with configured thresholds, and more complex “ratioLessThan” assertions that take two different measurements into account. We have total freedom to select the most meaningful statistic for each measurement. For example, we may use Percentile95 for ResponseTime, since we are mainly interested in the near-worst-case behavior. On the other hand, we may use Median/Mean/Mode for Memory to get a central tendency indicator of that resource's consumption.
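The logic behind the two kinds of assertion is simple enough to sketch in Python. Our goals are written in XSLT, so this is not the template itself: the function names are hypothetical, the 1% threshold in the usage example is made up, and treating a zero denominator as a failed goal is an assumption.

```python
def less_than(value, threshold):
    """Goal passes when a single statistic stays under its threshold."""
    return value < threshold


def ratio_less_than(numerator, denominator, threshold):
    """Goal passes when the ratio of two statistics stays under a threshold."""
    if denominator == 0:
        # Assumption: a ratio over zero cannot be evaluated, so the goal fails.
        return False
    return numerator / denominator < threshold


if __name__ == "__main__":
    # A 95th percentile response time of 503.29 against a limit of 500 fails.
    print(less_than(503.29, 500))
    # Hypothetical example: 5 failed calls out of 1000 against a 1% limit passes.
    print(ratio_less_than(5, 1000, 0.01))
```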
Finally this is the test report generated when the performance goal XSLT template is applied to the analysis XML file:
We can see how the different assertions worked. It is worth noting that we use a trick here: we insert the actual measurement value by reusing the JUnit time attribute. Even though this is not strictly a time value, it allows us to create some interesting graphs in the CI/CD environment by reusing existing plugins.
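To make the trick concrete, the following Python sketch builds a small JUnit-style report in which each test case carries the measured value in its time attribute. Our real report is produced by the XSLT template, so this is only an illustration: the function name, the suite name, and the set of attributes emitted are assumptions, and a real report may contain more fields.

```python
import xml.etree.ElementTree as ET


def build_junit_report(suite_name, results):
    """Build a JUnit-style XML string.

    results is a list of (goal_name, measured_value, passed, message) tuples.
    The measured value is stored in the 'time' attribute of each test case.
    """
    failures = sum(1 for _, _, passed, _ in results if not passed)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for goal_name, value, passed, message in results:
        case = ET.SubElement(suite, "testcase", name=goal_name, time=str(value))
        if not passed:
            ET.SubElement(case, "failure", message=message)
    return ET.tostring(suite, encoding="unicode")


if __name__ == "__main__":
    report = build_junit_report(
        "PerformanceGoals",  # hypothetical suite name
        [("SIPResponseTime1Percentile95", 503.29, False, "less than 500")],
    )
    print(report)
```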
Since the performance goal assertions are provided as a JUnit report, we just need our Jenkins job to collect this report and present the results. This automatically affects the job completion status, marking our runs as stable or unstable. The following picture shows how it works:
In this case, we can see that this run has violated one of our performance goals, SIPResponseTime1Percentile95. The following picture shows how it looks if we click that particular test:
We can see all the information provided in the appropriate fields, including the actual value that caused the violation (503.29) and the test condition that was evaluated (less than 500). The additional standard output XML content allows us to compare this test with previous Jenkins runs, visualizing the history of this measurement. The following picture shows the historical evolution graph displayed when clicking the link:
This graph is generated by a Jenkins plugin, so for us it is an out-of-the-box feature. With this graph we can see how our software evolves in terms of performance across versions and releases. We have a fine-grained view of each measurement, and we can identify trends and trigger action points if performance is degrading severely.
Results of this practice on latest SIPServlets release
Performance testing is not only about meeting certain goals, but also about knowing how your system will behave, by defining a performance model. This model will help you anticipate how the system will scale under bigger loads, and identify the hotspots in your system that need improvement.
As a result of this practice during the last year, we have improved SIPServlets performance, and the latest 7.0.4 release delivers those benefits.
First let me explain the load scenario. We run tests at 200 CAPS and at 400 CAPS, with a call length of 60 seconds, for one hour. The aim is to compare how the system behaves under different loads, and to try to anticipate how it will scale. The tests are run on Amazon EC2 slaves, with the c3.2xlarge instance type as the hardware reference. We use the SIPp tool to inject UDP traffic into the SIPServlets container. We use a simple Proxy application, which represents the average scenario for a SIPServlet application.
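To get a feel for what these numbers mean, the arithmetic can be worked out directly. The sketch below assumes the call rate is constant for the whole hour (it ignores the ramp-up) and that every call lasts exactly the configured call length; the function names are hypothetical.

```python
def steady_state_concurrent_calls(caps, call_length_s):
    """Calls in progress at steady state: arrival rate times call duration."""
    return caps * call_length_s


def total_call_attempts(caps, duration_s):
    """Call attempts injected over the whole test at a constant rate."""
    return caps * duration_s


if __name__ == "__main__":
    for caps in (200, 400):
        print(
            caps,
            steady_state_concurrent_calls(caps, 60),
            total_call_attempts(caps, 3600),
        )
```

Under those assumptions, 200 CAPS corresponds to 12,000 simultaneous calls and 720,000 call attempts per hour, and 400 CAPS doubles both figures to 24,000 and 1,440,000.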
The following table shows the results and compares the 7.0.4 and 7.0.3 releases (CPUMedian, MemMedian, GCPercentile95, ResponseTimePercentile95, FailedCallsSum/TotalCallsSum, RetransSum/SuccessCallsSum):
We can see that release 7.0.4 retains much less memory, which reduces GC work and in turn frees some CPU. SIP parameters stay basically the same in terms of failed calls, response time, and retransmissions. The following Java options were used:
-Xms4048m -Xmx4048m -Xmn256m -XX:PermSize=512m -XX:MaxPermSize=1024m -XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycle=50 -XX:CMSIncrementalDutyCycleMin=50 -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseStringCache -XX:+OptimizeStringConcat -XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=128 -XX:+UseParNewGC -XX:+UseCompressedOops -XX:CMSInitiatingOccupancyFraction=50 -XX:+CMSParallelRemarkEnabled -Djava.net.preferIPv4Stack=true -Dorg.jboss.resolver.warning=true -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djboss.modules.system.pkgs=org.jboss.byteman -Djava.awt.headless=true
In addition to tracking performance, we have explored new tuning options. As part of this effort we tried the new Java garbage collector, G1 (compared to the regular CMS collector), and here are the results:
We can see G1 does a beautiful job of saving CPU during GC, at the expense of maintaining an overall higher memory consumption. In this case the Java options were:
-Xms4048m -Xmx4048m -Xmn256m -XX:PermSize=512m -XX:MaxPermSize=1024m -XX:+UseStringCache -XX:+OptimizeStringConcat -XX:+UseCompressedOops -Djava.net.preferIPv4Stack=true -Dorg.jboss.resolver.warning=true -Djboss.modules.system.pkgs=org.jboss.byteman -Djava.awt.headless=true -XX:+UseG1GC -XX:ParallelGCThreads=8 -XX:ConcGCThreads=8 -XX:G1RSetUpdatingPauseTimePercent=10 -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=4m -XX:G1HeapWastePercent=5 -XX:InitiatingHeapOccupancyPercent=85 -XX:+UnlockExperimentalVMOptions -XX:G1MixedGCLiveThresholdPercent=85 -XX:+AlwaysPreTouch
I hope you enjoyed this demo of our performance testing strategy. We think it is a simple, yet effective, toolkit to track your performance and get the best out of it.
PerfCorder is growing fast, as the team finds new places to incorporate it. The next step will be to provide a WAR deployment to monitor local processes in production environments. Check out the current issue list on GitHub, and feel free to contact us if you are interested in contributing.