A script to compare two Macrobenchmarks runs

In Statistically Rigorous Android Macrobenchmarks, I laid out a methodology for rigorously comparing the outcome of two Jetpack Macrobenchmark runs. To summarize the article:

  • Remove sources of variation until the measurements fit a normal distribution with a stable standard deviation.
  • Then compute the confidence interval for the difference between the two means.
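
As a rough illustration of the second step, here is a minimal Kotlin sketch of a confidence interval for the difference between two means. It assumes the large-sample normal approximation (z = 1.96 for a 95% confidence level) and uses hypothetical names (`Interval`, `meanDifferenceInterval`). It is not taken from the script shared below, which may compute things differently.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper type, for illustration only.
data class Interval(val low: Double, val high: Double) {
    // When the interval contains 0, no significant change can be claimed.
    val crossesZero: Boolean get() = low <= 0.0 && high >= 0.0
}

fun mean(xs: List<Double>): Double = xs.sum() / xs.size

// Sample variance (divides by n - 1).
fun sampleVariance(xs: List<Double>): Double {
    val m = mean(xs)
    return xs.sumOf { (it - m) * (it - m) } / (xs.size - 1)
}

// Confidence interval for mean(after) - mean(before).
// Assumption: both samples are large enough (30+ iterations) and close
// enough to normal for the z-based approximation to hold.
fun meanDifferenceInterval(
    before: List<Double>,
    after: List<Double>,
    z: Double = 1.96 // 95% confidence level
): Interval {
    val diff = mean(after) - mean(before)
    val standardError = sqrt(
        sampleVariance(before) / before.size + sampleVariance(after) / after.size
    )
    return Interval(diff - z * standardError, diff + z * standardError)
}
```

If the resulting interval crosses 0, the two runs cannot be told apart at that confidence level.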

When I published the article, I also shared a Google Spreadsheet template that did some of that work.

Later on, a colleague (thanks Aaron!) shared a GitHub repo of kscripts from Kaushik Gopal, and I realized I could easily turn my spreadsheet into a small Kotlin script to make it easier for other Square Android developers to play with comparing Macrobenchmark runs. So I did that.

Then I went on paternity leave and forgot about this, until recently when Saket Narayan reminded me that it could be worth sharing with the community.

Without further ado, here's the script.

You first need to install Kotlin, download the script, and make it executable:

# Install kotlin
brew install kotlin

# Download the comparison script
curl -O https://gist.githubusercontent.com/pyricau/07fd9598c5cdec0bc9f62505b6329df7/raw/977b2a84532758fd614f6cc44dab43a242922cdb/compare.benchmarks.main.kts
chmod u+x compare.benchmarks.main.kts

Then you can run the comparison script:

# Compare the JSON output in the run1 and run2 folders.
./compare.benchmarks.main.kts run1/com.example.macrobenchmark-benchmarkData.json run2/com.example.macrobenchmark-benchmarkData.json

###########################################################################
Results for com.example.InteractionLatencyBenchmarks#openHomeScreen
##################################################
NavigationMs
#########################
DATA CHECKS
✓ All checks passed, the comparison conclusion is meaningful.

Data checks for Benchmark 1
- ✓ At least 30 iterations (100)
- ✓ CV (5.26) <= 6%
- ✓ Latencies pass normality test

Data checks for Benchmark 2
- ✓ At least 30 iterations (100)
- ✓ CV (4.32) <= 6%
- ✓ Latencies pass normality test

- ✓ Variance less than doubles (0.66)
#########################
RESULT
Mean difference confidence interval at 95% confidence level:
The change yielded no statistical significance (the mean difference confidence interval crosses 0): from -6 ms (-2.36%) to 1 ms (0.3%).
#########################
MEDIANS
The median went from 259 ms to 231 ms.
DO NOT REPORT THE DIFFERENCE IN MEDIANS.
This data helps contextualize results but is not statistically meaningful.
#########################
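
To make the data checks in that output more concrete, here is a minimal Kotlin sketch of three of them: iteration count, coefficient of variation (CV), and variance ratio. The names are hypothetical, the thresholds are read off the output above, and the accepted range for the variance ratio is an assumption. The normality test is left out. The actual script may implement these checks differently.

```kotlin
import kotlin.math.sqrt

// Thresholds read off the sample output; the real script may use others.
val minIterations = 30
val maxCvPercent = 6.0

// Sample variance (divides by n - 1).
fun List<Double>.variance(): Double {
    val m = average()
    return sumOf { (it - m) * (it - m) } / (size - 1)
}

// Coefficient of variation: standard deviation as a percentage of the mean.
fun List<Double>.cvPercent(): Double = sqrt(variance()) / average() * 100

fun hasEnoughIterations(latencies: List<Double>): Boolean =
    latencies.size >= minIterations

fun hasStableSpread(latencies: List<Double>): Boolean =
    latencies.cvPercent() <= maxCvPercent

// Variance of run 2 relative to run 1.
fun varianceRatio(run1: List<Double>, run2: List<Double>): Double =
    run2.variance() / run1.variance()

// Assumption: the check passes while neither variance is more than
// double the other, i.e. the ratio stays within 0.5..2.0.
fun variancesAreComparable(run1: List<Double>, run2: List<Double>): Boolean =
    varianceRatio(run1, run2) in 0.5..2.0
```

A low CV and comparable variances are what make the difference in means worth interpreting. When a check fails, the usual next step is to remove more sources of variation and rerun the benchmark.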

While I'd love to get feedback and ideas for improvements (hit me up!), I'm providing this script as is, with no guarantees and no intention to maintain it. Do whatever you want with it!