// Copyright 2024 The Chromium Authors
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

#ifndef TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_
#define TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_

// This class provides construction of ratios between two measured stochastic
// quantities, including confidence intervals. The intended use is performance
// benchmarking (so that one can say with reasonable certainty that “binary A
// is 5% faster than binary B”), but it could in theory be used for other
// things as well.
//
// This is not a simple problem. The naïve way is to assume normality
// and use the standard rule-of-thumb that a 95% CI is +/- 1.96 sigma,
// but benchmarks are frequently very non-normal (and we frequently
// do not have enough samples for the central limit theorem to come
// into play); in particular, measurements tend to have a long tail
// into slowness. The division makes it even worse, of course (dividing
// two normal distributions by each other does not yield another normal
// distribution, even in the limit). Pinpoint tries to sidestep this using
// the Mann-Whitney U test, which has hardly any assumptions at all
// (its inputs do not even need to be numbers, just anything that
// can be ordered), but it has very low power, cannot produce confidence
// intervals, and fundamentally measures something different from
// what people typically expect (the question “which one would most often
// win a race” is distinct from “which one has the lowest average”,
// giving a potentially problematic advantage to distributions with
// longer tails).
//
// This class uses the nonparametric bootstrap, a randomized method.
// A full introduction to the bootstrap is out of scope, but the basic
// idea is that instead of just doing r = sum(A) / sum(B), we resample the
// input data lots of times and compute r over those resamples.
// The distribution of r then becomes more amenable to normal statistical
// methods. (This feels like cheating by just drawing more numbers out of
// thin air from the data we have, but it has a sound statistical basis.)
// We use the “bias-corrected and accelerated” (BCa) method for computing
// the confidence intervals, which is the generally recommended method
// these days; it automatically corrects for second-order bias and skew
// effects, with only fairly reasonable assumptions about the underlying
// distribution. We follow the exposition in this 2020 paper (which is
// basically an explanation of the original 1987 paper):
//
//   Efron, Narasimhan: “The automatic construction of bootstrap confidence
//   intervals”
//
// Even though the bootstrap is a very flexible method, it still makes
// some assumptions, in particular that the samples are independent and
// identically distributed. In practice, this may be violated because e.g.:
//
//  a) Something happened during one or more of the runs that impacted
//     their times (not identically distributed), e.g. a heavy cron job
//     or thermal throttling after the first N samples. This can cause
//     outliers, which bootstrapping is not inherently immune to.
//     If you have an uncontrolled measurement environment, consider
//     filtering outliers or adding warm-up runs to reduce the impact.
//  b) One or both of the sides had very lucky or unlucky code or data
//     placement, unrelated to the patch itself, causing a bias
//     (samples are not independent).
//     This is generally hard to do anything
//     about, but one thing to keep in mind is to always run with
//     binary names of the same length (i.e., don't call one of your binaries
//     _old, since argv[0] is put on the stack and this can perturb the
//     layout if you are unlucky).
//
// It also does not solve the problem of multiple comparisons (you could
// consider e.g. Bonferroni correction), and it does not do well with extremely
// high confidence levels unless you have lots of samples (e.g., CI=99.9999%
// with 100 samples will quickly hit the edge of the empirical distribution,
// which is not correct).
//
// I've spot-checked the results against the bcajack R package,
// which is the authors' own implementation of the methods in the paper;
// of course, in a system based on randomness, it's impossible to get
// exactly the same numbers. We don't implement the block-based jackknife
// optimization for computing the acceleration (which takes it down from
// O(n²) to O(n)), as we generally only have a couple hundred samples.
// We also don't implement the calculation of error bars stemming from
// the resampling randomness (so-called internal error).
//
// This class is generally written to be independent of Blink, for slightly
// more general reuse (though the test is part of blink_unittests). It is
// not intended to be part of the main Chromium build, only used in
// auxiliary utilities.

#include <stdint.h>

#include <random>
#include <vector>

#include "third_party/blink/renderer/core/core_export.h"

class RatioBootstrapEstimator {
 public:
  explicit RatioBootstrapEstimator(uint64_t seed) : gen_(seed) {}

  struct Sample {
    // These are generally assumed to be time spent,
    // but if you have a higher-is-better score, you can just
    // invert the returned ratios below to get the improvement.
    double before, after;
  };
  struct Estimate {
    // The most likely single value (just sum(before) / sum(after),
    // without any further statistical computation). E.g., if the
    // new code is 5% faster, this will return 1.05 (which means
    // it is capable of computing 1.05 times as many items in a given time;
    // 5% higher throughput).
    double point_estimate;

    // The xx% confidence interval.
    double lower, upper;
  };

  // NOTE: The slowest part of this code is generally drawing the random
  // numbers. To counteract this, we allow computing multiple series at the
  // same time in parallel, reusing the randomness (the computations are
  // independent). However, they should be the same length, possibly +/- 1;
  // if not, the resampling will use the shortest one for all values, which
  // could return wider confidence intervals than would be ideal if there is
  // a large discrepancy.
  //
  // We've generally found num_resamples = 2000 to give reasonable accuracy;
  // generally enough that the percentage values fluctuate only in the
  // first digit after the decimal point.
  //
  // Confidence level is typically a number like 0.95 (95% CI) or 0.99.
  //
  // If compute_geometric_mean is set to true, you will get an extra
  // estimate at the end, estimating the geometric mean of all the
  // other ratios.
  std::vector<Estimate> ComputeRatioEstimates(
      const std::vector<std::vector<Sample>>& data,
      unsigned num_resamples,
      double confidence_level,
      bool compute_geometric_mean);

  // Pulled out for unit testing only.
  static double InverseNormalCDF(double p);

 private:
  static double EstimateRatioExcept(const std::vector<Sample>& x,
                                    int skip_index);
  static double EstimateGeometricMeanExcept(
      const std::vector<std::vector<Sample>>& x,
      int skip_index);

  // mt19937 isn't a great PRNG (for one, it has huge state), but
  //
  //  a) It makes it easier to port this code to a non-Chromium context, and
  //  b) We need the determinism for unit testing purposes (otherwise,
  //     we are almost certain to make a test that is flaky to some degree),
  //     and e.g. base::RandomBitGenerator does not support seeding.
  std::mt19937 gen_;
};

#endif  // TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_