// Copyright 2024 The Chromium Authors
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

#ifndef TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_
#define TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_

// This class provides construction of ratios between two measured stochastic
// quantities, including confidence intervals. The intended use is performance
// benchmarking (so that one can say with reasonable certainty that “binary A
// is 5% faster than binary B”), but it could in theory be used for other
// things as well.
//
// This is not a simple problem. The naïve way is to assume normality
// and use the standard rule-of-thumb that 95% CI is +/- 1.96 sigma,
// but benchmarks are frequently very non-normal (and we frequently
// do not have enough samples for the central limit theorem to come
// into play); in particular, measurements tend to have a long tail
// into slowness. The division makes it even worse, of course (the ratio
// of two normal distributions is not itself normally distributed, even
// in the limit). Pinpoint tries to sidestep this using
// the Mann-Whitney U test, which has hardly any assumptions at all
// (its inputs do not even need to be numbers, just anything that
// can be ordered), but it has very low power, cannot produce confidence
// intervals, and fundamentally measures something different from
// what people typically expect (the question “which one would most often
// win a race” is distinct from “which one has the lowest average”,
// giving a potentially problematic advantage to distributions with
// longer tails).
//
// This class uses the nonparametric bootstrap, a randomized method.
// A full introduction to the bootstrap is out of scope, but the basic
// idea is that instead of just computing r = sum(A) / sum(B) once, we
// resample the input data lots of times and compute r over each of those
// resamples. The distribution of r then becomes more amenable to standard
// statistical methods. (This feels like cheating by just drawing more
// numbers out of thin air from the data we have, but it has a sound
// statistical basis.) We use the “bias-corrected and accelerated” (BCa)
// method for computing the confidence intervals, which is the generally
// recommended method these days; it automatically corrects for second-order
// bias and skew effects, under only fairly mild assumptions about the
// underlying distribution. We follow the exposition in this 2020 paper
// (which is basically an explanation of the original 1987 paper):
//
//   Efron, Narasimhan: “The automatic construction of bootstrap confidence
//   intervals”
//
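// For intuition, here is a hedged sketch of the core resampling idea (using
// the simpler “percentile interval” rather than the BCa adjustment this
// class actually implements; the names below are illustrative only and not
// part of this header):
//
//   // Ratio of total before-time to total after-time for one (re)sample.
//   double RatioOf(const std::vector<Sample>& samples) {
//     double before = 0.0, after = 0.0;
//     for (const Sample& s : samples) {
//       before += s.before;
//       after += s.after;
//     }
//     return before / after;
//   }
//
//   // Resample the pairs with replacement, recompute the ratio each time,
//   // and read the interval off the empirical distribution of those ratios.
//   std::mt19937 gen(seed);
//   std::uniform_int_distribution<size_t> pick(0, data.size() - 1);
//   std::vector<double> ratios;
//   for (unsigned i = 0; i < num_resamples; ++i) {
//     std::vector<Sample> resample;
//     for (size_t j = 0; j < data.size(); ++j) {
//       resample.push_back(data[pick(gen)]);
//     }
//     ratios.push_back(RatioOf(resample));
//   }
//   std::sort(ratios.begin(), ratios.end());
//   double lower = ratios[static_cast<size_t>(0.025 * ratios.size())];
//   double upper = ratios[static_cast<size_t>(0.975 * ratios.size())];
//
// BCa refines this by reading the bootstrap distribution off at adjusted
// percentile levels, roughly Phi(z0 + (z0 + z_alpha) / (1 - a * (z0 +
// z_alpha))), where z0 is a bias correction estimated from the fraction of
// bootstrap ratios below the point estimate and a is the acceleration
// estimated with a jackknife; see the paper above for the details.
//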
// Even though the bootstrap is a very flexible method, it still makes
// some assumptions, in particular that the samples are independent and
// identically distributed. In practice, this may be violated because e.g.:
//
//   a) Something happened during one or more of the runs that impacted
//      their times (not identically distributed), e.g. a heavy cron job
//      or thermal throttling after the first N samples. This can cause
//      outliers, to which bootstrapping is not inherently immune.
//      If you have an uncontrolled measurement environment, consider
//      filtering out outliers or warm-up runs to reduce the impact.
//   b) One or both of the sides had very lucky or unlucky code or data
//      placement, unrelated to the patch itself, causing a bias
//      (samples are not independent). This is generally hard to do
//      anything about, but at least one thing to keep in mind is to always
//      run with binary names of the same length (i.e., don't call one of
//      your binaries _old, since argv[0] is put on the stack and this can
//      perturb the layout if you are unlucky).
//
// It also does not solve the problem of multiple comparisons (you could
// consider e.g. Bonferroni correction; see the example below), and it does
// not do well with extremely high confidence levels unless you have lots of
// samples (e.g., CI=99.9999% with 100 samples will quickly hit the edge of
// the empirical distribution, which is not correct).
//
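// For example (a hedged sketch of the Bonferroni correction mentioned above,
// assuming roughly independent comparisons): if you estimate k = 10 ratios
// and want a 95% family-wise confidence level, you would pass a per-ratio
// confidence_level of 1.0 - 0.05 / 10 = 0.995 to ComputeRatioEstimates().
//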
// I've spot-checked the results against the bcajack R package,
// which is the authors' own implementation of the methods in the paper;
// of course, in a system based on randomness, it's impossible to get
// exactly the same numbers. We don't implement the optimization of
// block-based jackknife for computing the acceleration (which takes it
// down from O(n²) to O(n)), as we generally only have a couple hundred
// samples. We also don't implement the calculation of error bars stemming
// from the resampling randomness (so-called internal error).
//
// This class is generally written to be independent of Blink, for slightly
// more general reuse (though the test is part of blink_unittests). It is
// not intended to be part of the main Chromium build, only used in
// auxiliary utilities.

#include <stdint.h>

#include <random>
#include <vector>

#include "third_party/blink/renderer/core/core_export.h"

class RatioBootstrapEstimator {
 public:
  explicit RatioBootstrapEstimator(uint64_t seed) : gen_(seed) {}

  struct Sample {
    // These are generally assumed to be time spent,
    // but if you have a higher-is-better score, you can just
    // invert the returned ratios below to get the improvement.
    double before, after;
  };
  struct Estimate {
    // The most likely single value (just sum(before) / sum(after),
    // without any further statistical computation). E.g., if the
    // new code is 5% faster, this will return 1.05 (which means
    // it is capable of computing 1.05 times as many items in a given
    // time; 5% higher throughput).
    double point_estimate;

    // The xx% confidence interval.
    double lower, upper;
  };

  // NOTE: The slowest part of this code is generally drawing the random
  // numbers. To counteract this, we allow computing multiple series at the
  // same time in parallel, reusing the randomness (the computations are
  // independent). However, they should be the same length, possibly +/- 1;
  // if not, the resampling will use the shortest one for all values, which
  // could return wider confidence intervals than would be ideal if there is
  // a large discrepancy.
  //
  // We've generally found num_resamples = 2000 to give reasonable accuracy;
  // typically enough that the percentage values fluctuate only in the
  // first digit after the decimal point.
  //
  // Confidence level is typically a number like 0.95 (95% CI) or 0.99.
  //
  // If compute_geometric_mean is set to true, you will get an extra
  // estimate at the end, estimating the geometric mean of all the
  // other ratios.
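  //
  // Illustrative usage (a sketch; the timing variables below are
  // hypothetical, not part of this header):
  //
  //   RatioBootstrapEstimator estimator(/*seed=*/12345);
  //   std::vector<std::vector<RatioBootstrapEstimator::Sample>> data(1);
  //   for (size_t i = 0; i < before_ms.size(); ++i) {
  //     data[0].push_back({before_ms[i], after_ms[i]});
  //   }
  //   std::vector<RatioBootstrapEstimator::Estimate> estimates =
  //       estimator.ComputeRatioEstimates(data, /*num_resamples=*/2000,
  //                                       /*confidence_level=*/0.95,
  //                                       /*compute_geometric_mean=*/false);
  //   printf("Ratio %.3f, 95%% CI [%.3f, %.3f]\n",
  //          estimates[0].point_estimate, estimates[0].lower,
  //          estimates[0].upper);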
  std::vector<Estimate> ComputeRatioEstimates(
      const std::vector<std::vector<Sample>>& data,
      unsigned num_resamples,
      double confidence_level,
      bool compute_geometric_mean);

  // Pulled out for unit testing only.
  static double InverseNormalCDF(double p);

 private:
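  // Leave-one-out estimates (the ratio, or geometric mean of ratios, over
  // all samples except the one at skip_index); this is the jackknife used
  // for computing the BCa acceleration described above.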
  static double EstimateRatioExcept(const std::vector<Sample>& x,
                                    int skip_index);
  static double EstimateGeometricMeanExcept(
      const std::vector<std::vector<Sample>>& x,
      int skip_index);

  // mt19937 isn't a great PRNG (for one, it has huge state), but
  //
  //   a) It makes it easier to port this code to a non-Chromium context, and
  //   b) We need the determinism for unit testing purposes (otherwise,
  //      we are almost certain to make a test that is flaky to some degree),
  //      and e.g. base::RandomBitGenerator does not support seeding.
  std::mt19937 gen_;
};

#endif  // TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_