# User Guide

## Command Line

[Output Formats](#output-formats)

[Output Files](#output-files)

[Running Benchmarks](#running-benchmarks)

[Running a Subset of Benchmarks](#running-a-subset-of-benchmarks)

[Result Comparison](#result-comparison)

[Extra Context](#extra-context)

## Library

[Runtime and Reporting Considerations](#runtime-and-reporting-considerations)

[Setup/Teardown](#setupteardown)

[Passing Arguments](#passing-arguments)

[Custom Benchmark Name](#custom-benchmark-name)

[Calculating Asymptotic Complexity](#asymptotic-complexity)

[Templated Benchmarks](#templated-benchmarks)

[Fixtures](#fixtures)

[Custom Counters](#custom-counters)

[Multithreaded Benchmarks](#multithreaded-benchmarks)

[CPU Timers](#cpu-timers)

[Manual Timing](#manual-timing)

[Setting the Time Unit](#setting-the-time-unit)

[Random Interleaving](random_interleaving.md)

[User-Requested Performance Counters](perf_counters.md)

[Preventing Optimization](#preventing-optimization)

[Reporting Statistics](#reporting-statistics)

[Custom Statistics](#custom-statistics)

[Memory Usage](#memory-usage)

[Using RegisterBenchmark](#using-register-benchmark)

[Exiting with an Error](#exiting-with-an-error)

[A Faster `KeepRunning` Loop](#a-faster-keep-running-loop)

## Benchmarking Tips

[Disabling CPU Frequency Scaling](#disabling-cpu-frequency-scaling)

[Reducing Variance in Benchmarks](reducing_variance.md)

<a name="output-formats" />

## Output Formats

The library supports multiple output formats. Use the
`--benchmark_format=<console|json|csv>` flag (or set the
`BENCHMARK_FORMAT=<console|json|csv>` environment variable) to set
the format type. `console` is the default format.

The Console format is intended to be a human readable format. By default
the format generates color output.
Context is output on stderr and the
tabular data on stdout. Example tabular output looks like:

```
Benchmark                               Time(ns)    CPU(ns) Iterations
----------------------------------------------------------------------
BM_SetInsert/1024/1                        28928      29349      23853  133.097kB/s   33.2742k items/s
BM_SetInsert/1024/8                        32065      32913      21375  949.487kB/s   237.372k items/s
BM_SetInsert/1024/10                       33157      33648      21431  1.13369MB/s   290.225k items/s
```

The JSON format outputs human readable json split into two top level attributes.
The `context` attribute contains information about the run in general, including
information about the CPU and the date.
The `benchmarks` attribute contains a list of every benchmark run. Example json
output looks like:

```json
{
  "context": {
    "date": "2015/03/17-18:40:25",
    "num_cpus": 40,
    "mhz_per_cpu": 2801,
    "cpu_scaling_enabled": false,
    "build_type": "debug"
  },
  "benchmarks": [
    {
      "name": "BM_SetInsert/1024/1",
      "iterations": 94877,
      "real_time": 29275,
      "cpu_time": 29836,
      "bytes_per_second": 134066,
      "items_per_second": 33516
    },
    {
      "name": "BM_SetInsert/1024/8",
      "iterations": 21609,
      "real_time": 32317,
      "cpu_time": 32429,
      "bytes_per_second": 986770,
      "items_per_second": 246693
    },
    {
      "name": "BM_SetInsert/1024/10",
      "iterations": 21393,
      "real_time": 32724,
      "cpu_time": 33355,
      "bytes_per_second": 1199226,
      "items_per_second": 299807
    }
  ]
}
```

The CSV format outputs comma-separated values. The `context` is output on stderr
and the CSV itself on stdout.
Example CSV output looks like:

```
name,iterations,real_time,cpu_time,bytes_per_second,items_per_second,label
"BM_SetInsert/1024/1",65465,17890.7,8407.45,475768,118942,
"BM_SetInsert/1024/8",116606,18810.1,9766.64,3.27646e+06,819115,
"BM_SetInsert/1024/10",106365,17238.4,8421.53,4.74973e+06,1.18743e+06,
```

<a name="output-files" />

## Output Files

Write benchmark results to a file with the `--benchmark_out=<filename>` option
(or set `BENCHMARK_OUT`). Specify the output format with
`--benchmark_out_format={json|console|csv}` (or set
`BENCHMARK_OUT_FORMAT={json|console|csv}`). Note that the 'csv' reporter is
deprecated and the saved `.csv` file
[is not parsable](https://github.com/google/benchmark/issues/794) by csv
parsers.

Specifying `--benchmark_out` does not suppress the console output.

<a name="running-benchmarks" />

## Running Benchmarks

Benchmarks are executed by running the produced binaries. Benchmark binaries,
by default, accept options that may be specified either through their command
line interface or by setting environment variables before execution. For every
`--option_flag=<value>` CLI switch, a corresponding environment variable
`OPTION_FLAG=<value>` exists and is used as the default if set (CLI switches
always prevail). A complete list of CLI options is available by running the
benchmarks with the `--help` switch.

<a name="running-a-subset-of-benchmarks" />

## Running a Subset of Benchmarks

The `--benchmark_filter=<regex>` option (or `BENCHMARK_FILTER=<regex>`
environment variable) can be used to only run the benchmarks that match
the specified `<regex>`.
For example:

```bash
$ ./run_benchmarks.x --benchmark_filter=BM_memcpy/32
Run on (1 X 2300 MHz CPU )
2016-06-25 19:34:24
Benchmark              Time           CPU Iterations
----------------------------------------------------
BM_memcpy/32          11 ns         11 ns   79545455
BM_memcpy/32k       2181 ns       2185 ns     324074
BM_memcpy/32          12 ns         12 ns   54687500
BM_memcpy/32k       1834 ns       1837 ns     357143
```

## Disabling Benchmarks

It is possible to temporarily disable benchmarks by renaming the benchmark
function to have the prefix "DISABLED_". This will cause the benchmark to
be skipped at runtime.

<a name="result-comparison" />

## Result Comparison

It is possible to compare the benchmarking results.
See [Additional Tooling Documentation](tools.md).

<a name="extra-context" />

## Extra Context

Sometimes it's useful to add extra context to the content printed before the
results. By default this section includes information about the CPU on which
the benchmarks are running. If you want to add more context, you can use
the `benchmark_context` command line flag:

```bash
$ ./run_benchmarks --benchmark_context=pwd=`pwd`
Run on (1 x 2300 MHz CPU)
pwd: /home/user/benchmark/
Benchmark              Time           CPU Iterations
----------------------------------------------------
BM_memcpy/32          11 ns         11 ns   79545455
BM_memcpy/32k       2181 ns       2185 ns     324074
```

You can get the same effect with the API:

```c++
  benchmark::AddCustomContext("foo", "bar");
```

Note that attempts to add a second value with the same key will fail with an
error message.

<a name="runtime-and-reporting-considerations" />

## Runtime and Reporting Considerations

When the benchmark binary is executed, each benchmark function is run serially.
The number of iterations to run is determined dynamically by running the
benchmark a few times and measuring the time taken and ensuring that the
ultimate result will be statistically stable. As such, faster benchmark
functions will be run for more iterations than slower benchmark functions, and
the number of iterations is thus reported.

In all cases, the number of iterations for which the benchmark is run is
governed by the amount of time the benchmark takes. Concretely, the number of
iterations is at least one, not more than 1e9, until CPU time is greater than
the minimum time, or the wallclock time is 5x minimum time. The minimum time is
set per benchmark by calling `MinTime` on the registered benchmark object.

Furthermore, warming up a benchmark might be necessary in order to get
stable results because of, e.g., caching effects of the code under benchmark.
Warming up means running the benchmark for a given amount of time before
results are actually taken into account. The amount of time for which
the warmup should be run can be set per benchmark by calling
`MinWarmUpTime` on the registered benchmark object or for all benchmarks
using the `--benchmark_min_warmup_time` command-line option. Note that
`MinWarmUpTime` will overwrite the value of `--benchmark_min_warmup_time`
for the single benchmark. How many iterations the warmup run of each
benchmark takes is determined the same way as described in the paragraph
above. By default the warmup phase is set to 0 seconds and is therefore
disabled.

Average timings are then reported over the iterations run. If multiple
repetitions are requested using the `--benchmark_repetitions` command-line
option, or at registration time, the benchmark function will be run several
times and statistical results across these repetitions will also be reported.
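
The stopping rule for the iteration count described above can be read as a
simple predicate. The following is a minimal standalone sketch of that rule as
stated in this section; it is not the library's actual implementation, and
`RunSoFar` and `EnoughIterations` are hypothetical names introduced only for
illustration.

```c++
#include <cstdint>

// Hypothetical summary of one benchmark run so far (not a library type).
struct RunSoFar {
  std::int64_t iterations;  // iterations executed so far
  double cpu_seconds;       // accumulated CPU time
  double real_seconds;      // accumulated wall-clock time
};

// Returns true when, per the rule described above, no more iterations are
// needed: the 1e9 cap was reached, CPU time exceeded the minimum time, or
// wall-clock time reached 5x the minimum time. At least one iteration is
// always required.
inline bool EnoughIterations(const RunSoFar& run, double min_time_seconds) {
  constexpr std::int64_t kMaxIterations = 1000000000;  // 1e9
  if (run.iterations < 1) return false;
  return run.iterations >= kMaxIterations ||
         run.cpu_seconds > min_time_seconds ||
         run.real_seconds >= 5.0 * min_time_seconds;
}
```

Under this reading, a benchmark that mostly sleeps (little CPU time, a lot of
wall-clock time) still terminates because of the 5x wall-clock bound.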

As well as the per-benchmark entries, a preamble in the report will include
information about the machine on which the benchmarks are run.

<a name="setup-teardown" />

## Setup/Teardown

Global setup/teardown specific to each benchmark can be done by
passing a callback to `Setup`/`Teardown`.

The setup/teardown callbacks will be invoked once for each benchmark. If the
benchmark is multi-threaded (will run in k threads), they will be invoked
exactly once before each run with k threads.

If the benchmark uses different size groups of threads, the above will be true
for each size group.

For example:

```c++
static void DoSetup(const benchmark::State& state) {
}

static void DoTeardown(const benchmark::State& state) {
}

static void BM_func(benchmark::State& state) {...}

BENCHMARK(BM_func)->Arg(1)->Arg(3)->Threads(16)->Threads(32)->Setup(DoSetup)->Teardown(DoTeardown);
```

In this example, `DoSetup` and `DoTeardown` will be invoked 4 times each,
specifically, once for each member of this family:
 - BM_func_Arg_1_Threads_16, BM_func_Arg_1_Threads_32
 - BM_func_Arg_3_Threads_16, BM_func_Arg_3_Threads_32

<a name="passing-arguments" />

## Passing Arguments

Sometimes a family of benchmarks can be implemented with just one routine that
takes an extra argument to specify which one of the family of benchmarks to
run.
For example, the following code defines a family of benchmarks for
measuring the speed of `memcpy()` calls of different lengths:

```c++
static void BM_memcpy(benchmark::State& state) {
  char* src = new char[state.range(0)];
  char* dst = new char[state.range(0)];
  memset(src, 'x', state.range(0));
  for (auto _ : state)
    memcpy(dst, src, state.range(0));
  state.SetBytesProcessed(int64_t(state.iterations()) *
                          int64_t(state.range(0)));
  delete[] src;
  delete[] dst;
}
BENCHMARK(BM_memcpy)->Arg(8)->Arg(64)->Arg(512)->Arg(4<<10)->Arg(8<<10);
```

The preceding code is quite repetitive, and can be replaced with the following
short-hand. The following invocation will pick a few appropriate arguments in
the specified range and will generate a benchmark for each such argument.

```c++
BENCHMARK(BM_memcpy)->Range(8, 8<<10);
```

By default the arguments in the range are generated in multiples of eight and
the command above selects [ 8, 64, 512, 4k, 8k ]. In the following code the
range multiplier is changed to multiples of two.

```c++
BENCHMARK(BM_memcpy)->RangeMultiplier(2)->Range(8, 8<<10);
```

Now arguments generated are [ 8, 16, 32, 64, 128, 256, 512, 1024, 2k, 4k, 8k ].

The preceding code shows a method of defining a sparse range. The following
example shows a method of defining a dense range. It is then used to benchmark
the performance of `std::vector` initialization for uniformly increasing sizes.

```c++
static void BM_DenseRange(benchmark::State& state) {
  for(auto _ : state) {
    std::vector<int> v(state.range(0), state.range(0));
    auto data = v.data();
    benchmark::DoNotOptimize(data);
    benchmark::ClobberMemory();
  }
}
BENCHMARK(BM_DenseRange)->DenseRange(0, 1024, 128);
```

Now arguments generated are [ 0, 128, 256, 384, 512, 640, 768, 896, 1024 ].
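
To make the sparse and dense argument lists above concrete, here is a small
standalone model that reproduces them. It is a sketch under assumptions, not
the library's code: `SparseRangeModel` and `DenseRangeModel` are hypothetical
helpers, and the sparse rule used here (both endpoints plus every power of the
multiplier strictly between them) is simply one rule that yields the lists
quoted in this section. It assumes `0 <= lo <= hi`, `multiplier >= 2` and
`step >= 1`.

```c++
#include <cstdint>
#include <vector>

// Hypothetical model of a sparse range: the two endpoints plus every power
// of `multiplier` that lies strictly between them.
inline std::vector<std::int64_t> SparseRangeModel(std::int64_t lo,
                                                  std::int64_t hi,
                                                  std::int64_t multiplier) {
  std::vector<std::int64_t> out{lo};
  for (std::int64_t p = 1; p < hi;) {
    if (p > lo) out.push_back(p);
    if (p > hi / multiplier) break;  // next power would exceed hi (and could overflow)
    p *= multiplier;
  }
  if (hi != lo) out.push_back(hi);
  return out;
}

// Hypothetical model of a dense range: start, start + step, ... up to and
// including limit when it is reached exactly.
inline std::vector<std::int64_t> DenseRangeModel(std::int64_t start,
                                                 std::int64_t limit,
                                                 std::int64_t step) {
  std::vector<std::int64_t> out;
  for (std::int64_t v = start; v <= limit; v += step) out.push_back(v);
  return out;
}
```

For instance, `SparseRangeModel(8, 8 << 10, 8)` yields 8, 64, 512, 4096, 8192,
and `DenseRangeModel(0, 1024, 128)` yields the nine values listed above.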

You might have a benchmark that depends on two or more inputs. For example, the
following code defines a family of benchmarks for measuring the speed of set
insertion.

```c++
static void BM_SetInsert(benchmark::State& state) {
  std::set<int> data;
  for (auto _ : state) {
    state.PauseTiming();
    data = ConstructRandomSet(state.range(0));
    state.ResumeTiming();
    for (int j = 0; j < state.range(1); ++j)
      data.insert(RandomNumber());
  }
}
BENCHMARK(BM_SetInsert)
    ->Args({1<<10, 128})
    ->Args({2<<10, 128})
    ->Args({4<<10, 128})
    ->Args({8<<10, 128})
    ->Args({1<<10, 512})
    ->Args({2<<10, 512})
    ->Args({4<<10, 512})
    ->Args({8<<10, 512});
```

The preceding code is quite repetitive, and can be replaced with the following
short-hand. The following macro will pick a few appropriate arguments in the
product of the two specified ranges and will generate a benchmark for each such
pair.

<!-- {% raw %} -->
```c++
BENCHMARK(BM_SetInsert)->Ranges({{1<<10, 8<<10}, {128, 512}});
```
<!-- {% endraw %} -->

Some benchmarks may require specific argument values that cannot be expressed
with `Ranges`. In this case, `ArgsProduct` offers the ability to generate a
benchmark input for each combination in the product of the supplied vectors.

<!-- {% raw %} -->
```c++
BENCHMARK(BM_SetInsert)
    ->ArgsProduct({{1<<10, 3<<10, 8<<10}, {20, 40, 60, 80}});
// would generate the same benchmark arguments as
BENCHMARK(BM_SetInsert)
    ->Args({1<<10, 20})
    ->Args({3<<10, 20})
    ->Args({8<<10, 20})
    ->Args({1<<10, 40})
    ->Args({3<<10, 40})
    ->Args({8<<10, 40})
    ->Args({1<<10, 60})
    ->Args({3<<10, 60})
    ->Args({8<<10, 60})
    ->Args({1<<10, 80})
    ->Args({3<<10, 80})
    ->Args({8<<10, 80});
```
<!-- {% endraw %} -->

For the most common scenarios, helper methods for creating a list of
integers for a given sparse or dense range are provided.

```c++
BENCHMARK(BM_SetInsert)
    ->ArgsProduct({
      benchmark::CreateRange(8, 128, /*multi=*/2),
      benchmark::CreateDenseRange(1, 4, /*step=*/1)
    });
// would generate the same benchmark arguments as
BENCHMARK(BM_SetInsert)
    ->ArgsProduct({
      {8, 16, 32, 64, 128},
      {1, 2, 3, 4}
    });
```

For more complex patterns of inputs, passing a custom function to `Apply` allows
programmatic specification of an arbitrary set of arguments on which to run the
benchmark. The following example enumerates a dense range on one parameter,
and a sparse range on the second.

```c++
static void CustomArguments(benchmark::internal::Benchmark* b) {
  for (int i = 0; i <= 10; ++i)
    for (int j = 32; j <= 1024*1024; j *= 8)
      b->Args({i, j});
}
BENCHMARK(BM_SetInsert)->Apply(CustomArguments);
```

### Passing Arbitrary Arguments to a Benchmark

In C++11 it is possible to define a benchmark that takes an arbitrary number
of extra arguments. The `BENCHMARK_CAPTURE(func, test_case_name, ...args)`
macro creates a benchmark that invokes `func` with the `benchmark::State` as
the first argument followed by the specified `args...`.
The `test_case_name` is appended to the name of the benchmark and
should describe the values passed.

```c++
template <class ...Args>
void BM_takes_args(benchmark::State& state, Args&&... args) {
  auto args_tuple = std::make_tuple(std::move(args)...);
  for (auto _ : state) {
    std::cout << std::get<0>(args_tuple) << ": " << std::get<1>(args_tuple)
              << '\n';
    [...]
  }
}
// Registers a benchmark named "BM_takes_args/int_string_test" that passes
// the specified values to `args`.
BENCHMARK_CAPTURE(BM_takes_args, int_string_test, 42, std::string("abc"));

// Registers the same benchmark "BM_takes_args/int_test" that passes
// the specified values to `args`.
BENCHMARK_CAPTURE(BM_takes_args, int_test, 42, 43);
```

Note that elements of `...args` may refer to global variables. Users should
avoid modifying global state inside of a benchmark.

<a name="asymptotic-complexity" />

## Calculating Asymptotic Complexity (Big O)

Asymptotic complexity might be calculated for a family of benchmarks. The
following code will calculate the coefficient for the high-order term in the
running time and the normalized root-mean square error of string comparison.

```c++
static void BM_StringCompare(benchmark::State& state) {
  std::string s1(state.range(0), '-');
  std::string s2(state.range(0), '-');
  for (auto _ : state) {
    auto comparison_result = s1.compare(s2);
    benchmark::DoNotOptimize(comparison_result);
  }
  state.SetComplexityN(state.range(0));
}
BENCHMARK(BM_StringCompare)
    ->RangeMultiplier(2)->Range(1<<10, 1<<18)->Complexity(benchmark::oN);
```

As shown in the following invocation, asymptotic complexity might also be
calculated automatically.

```c++
BENCHMARK(BM_StringCompare)
    ->RangeMultiplier(2)->Range(1<<10, 1<<18)->Complexity();
```

The following code will specify asymptotic complexity with a lambda function,
that might be used to customize high-order term calculation.

```c++
BENCHMARK(BM_StringCompare)->RangeMultiplier(2)
    ->Range(1<<10, 1<<18)->Complexity([](benchmark::IterationCount n)->double{return n; });
```

<a name="custom-benchmark-name" />

## Custom Benchmark Name

You can change the benchmark's name as follows:

```c++
BENCHMARK(BM_memcpy)->Name("memcpy")->RangeMultiplier(2)->Range(8, 8<<10);
```

The invocation will execute the benchmark as before using `BM_memcpy` but changes
the prefix in the report to `memcpy`.

<a name="templated-benchmarks" />

## Templated Benchmarks

This example produces and consumes messages of size `sizeof(v)` `range_x`
times. It also outputs throughput in the absence of multiprogramming.

```c++
template <class Q> void BM_Sequential(benchmark::State& state) {
  Q q;
  typename Q::value_type v;
  for (auto _ : state) {
    for (int i = state.range(0); i--; )
      q.push(v);
    for (int e = state.range(0); e--; )
      q.Wait(&v);
  }
  // actually messages, not bytes:
  state.SetBytesProcessed(
      static_cast<int64_t>(state.iterations())*state.range(0));
}
// C++03
BENCHMARK_TEMPLATE(BM_Sequential, WaitQueue<int>)->Range(1<<0, 1<<10);

// C++11 or newer, you can use the BENCHMARK macro with template parameters:
BENCHMARK(BM_Sequential<WaitQueue<int>>)->Range(1<<0, 1<<10);
```

Three macros are provided for adding benchmark templates.

```c++
#ifdef BENCHMARK_HAS_CXX11
#define BENCHMARK(func<...>) // Takes any number of parameters.
#else // C++ < C++11
#define BENCHMARK_TEMPLATE(func, arg1)
#endif
#define BENCHMARK_TEMPLATE1(func, arg1)
#define BENCHMARK_TEMPLATE2(func, arg1, arg2)
```

<a name="fixtures" />

## Fixtures

Fixture tests are created by first defining a type that derives from
`::benchmark::Fixture` and then creating/registering the tests using the
following macros:

* `BENCHMARK_F(ClassName, Method)`
* `BENCHMARK_DEFINE_F(ClassName, Method)`
* `BENCHMARK_REGISTER_F(ClassName, Method)`

For example:

```c++
class MyFixture : public benchmark::Fixture {
public:
  void SetUp(const ::benchmark::State& state) {
  }

  void TearDown(const ::benchmark::State& state) {
  }
};

BENCHMARK_F(MyFixture, FooTest)(benchmark::State& st) {
   for (auto _ : st) {
     ...
  }
}

BENCHMARK_DEFINE_F(MyFixture, BarTest)(benchmark::State& st) {
   for (auto _ : st) {
     ...
  }
}
/* BarTest is NOT registered */
BENCHMARK_REGISTER_F(MyFixture, BarTest)->Threads(2);
/* BarTest is now registered */
```

### Templated Fixtures

You can also create templated fixtures by using the following macros:

* `BENCHMARK_TEMPLATE_F(ClassName, Method, ...)`
* `BENCHMARK_TEMPLATE_DEFINE_F(ClassName, Method, ...)`

For example:

```c++
template<typename T>
class MyFixture : public benchmark::Fixture {};

BENCHMARK_TEMPLATE_F(MyFixture, IntTest, int)(benchmark::State& st) {
   for (auto _ : st) {
     ...
  }
}

BENCHMARK_TEMPLATE_DEFINE_F(MyFixture, DoubleTest, double)(benchmark::State& st) {
   for (auto _ : st) {
     ...
  }
}

BENCHMARK_REGISTER_F(MyFixture, DoubleTest)->Threads(2);
```

<a name="custom-counters" />

## Custom Counters

You can add your own counters with user-defined names.
The example below
will add columns "Foo", "Bar" and "Baz" in its output:

```c++
static void UserCountersExample1(benchmark::State& state) {
  double numFoos = 0, numBars = 0, numBazs = 0;
  for (auto _ : state) {
    // ... count Foo,Bar,Baz events
  }
  state.counters["Foo"] = numFoos;
  state.counters["Bar"] = numBars;
  state.counters["Baz"] = numBazs;
}
```

The `state.counters` object is a `std::map` with `std::string` keys
and `Counter` values. The latter is a `double`-like class, via an implicit
conversion to `double&`. Thus you can use all of the standard arithmetic
assignment operators (`=,+=,-=,*=,/=`) to change the value of each counter.

In multithreaded benchmarks, each counter is set on the calling thread only.
When the benchmark finishes, the counters from each thread will be summed;
the resulting sum is the value which will be shown for the benchmark.

The `Counter` constructor accepts three parameters: the value as a `double`;
a bit flag which allows you to show counters as rates, and/or as per-thread
iteration, and/or as per-thread averages, and/or iteration invariants,
and/or finally inverting the result; and a flag specifying the 'unit' - i.e.
is 1k a 1000 (default, `benchmark::Counter::OneK::kIs1000`), or 1024
(`benchmark::Counter::OneK::kIs1024`)?

```c++
  // sets a simple counter
  state.counters["Foo"] = numFoos;

  // Set the counter as a rate. It will be presented divided
  // by the duration of the benchmark.
  // Meaning: per one second, how many 'foo's are processed?
  state.counters["FooRate"] = benchmark::Counter(numFoos, benchmark::Counter::kIsRate);

  // Set the counter as a rate. It will be presented divided
  // by the duration of the benchmark, and the result inverted.
  // Meaning: how many seconds does it take to process one 'foo'?
  state.counters["FooInvRate"] = benchmark::Counter(numFoos, benchmark::Counter::kIsRate | benchmark::Counter::kInvert);

  // Set the counter as a thread-average quantity. It will
  // be presented divided by the number of threads.
  state.counters["FooAvg"] = benchmark::Counter(numFoos, benchmark::Counter::kAvgThreads);

  // There's also a combined flag:
  state.counters["FooAvgRate"] = benchmark::Counter(numFoos, benchmark::Counter::kAvgThreadsRate);

  // This says that we process with the rate of state.range(0) bytes every iteration:
  state.counters["BytesProcessed"] = benchmark::Counter(state.range(0), benchmark::Counter::kIsIterationInvariantRate, benchmark::Counter::OneK::kIs1024);
```

When you're compiling in C++11 mode or later you can use `insert()` with
`std::initializer_list`:

<!-- {% raw %} -->
```c++
  // With C++11, this can be done:
  state.counters.insert({{"Foo", numFoos}, {"Bar", numBars}, {"Baz", numBazs}});
  // ... instead of:
  state.counters["Foo"] = numFoos;
  state.counters["Bar"] = numBars;
  state.counters["Baz"] = numBazs;
```
<!-- {% endraw %} -->

### Counter Reporting

When using the console reporter, by default, user counters are printed at
the end after the table, the same way as ``bytes_processed`` and
``items_processed``. This is best for cases in which there are few counters,
or where there are only a couple of lines per benchmark. Here's an example of
the default output:

```
------------------------------------------------------------------------------
Benchmark                        Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------
BM_UserCounter/threads:8      2248 ns      10277 ns      68808 Bar=16 Bat=40 Baz=24 Foo=8
BM_UserCounter/threads:1      9797 ns       9788 ns      71523 Bar=2 Bat=5 Baz=3 Foo=1024m
BM_UserCounter/threads:2      4924 ns       9842 ns      71036 Bar=4 Bat=10 Baz=6 Foo=2
BM_UserCounter/threads:4      2589 ns      10284 ns      68012 Bar=8 Bat=20 Baz=12 Foo=4
BM_UserCounter/threads:8      2212 ns      10287 ns      68040 Bar=16 Bat=40 Baz=24 Foo=8
BM_UserCounter/threads:16     1782 ns      10278 ns      68144 Bar=32 Bat=80 Baz=48 Foo=16
BM_UserCounter/threads:32     1291 ns      10296 ns      68256 Bar=64 Bat=160 Baz=96 Foo=32
BM_UserCounter/threads:4      2615 ns      10307 ns      68040 Bar=8 Bat=20 Baz=12 Foo=4
BM_Factorial                    26 ns         26 ns   26608979 40320
BM_Factorial/real_time          26 ns         26 ns   26587936 40320
BM_CalculatePiRange/1           16 ns         16 ns   45704255 0
BM_CalculatePiRange/8           73 ns         73 ns    9520927 3.28374
BM_CalculatePiRange/64         609 ns        609 ns    1140647 3.15746
BM_CalculatePiRange/512       4900 ns       4901 ns     142696 3.14355
```

If this doesn't suit you, you can print each counter as a table column by
passing the flag `--benchmark_counters_tabular=true` to the benchmark
application. This is best for cases in which there are a lot of counters, or
a lot of lines per individual benchmark. Note that this will trigger a
reprinting of the table header any time the counter set changes between
individual benchmarks.
Here's an example of corresponding output when
`--benchmark_counters_tabular=true` is passed:

```
---------------------------------------------------------------------------------------
Benchmark                        Time           CPU Iterations    Bar   Bat   Baz   Foo
---------------------------------------------------------------------------------------
BM_UserCounter/threads:8      2198 ns       9953 ns      70688     16    40    24     8
BM_UserCounter/threads:1      9504 ns       9504 ns      73787      2     5     3     1
BM_UserCounter/threads:2      4775 ns       9550 ns      72606      4    10     6     2
BM_UserCounter/threads:4      2508 ns       9951 ns      70332      8    20    12     4
BM_UserCounter/threads:8      2055 ns       9933 ns      70344     16    40    24     8
BM_UserCounter/threads:16     1610 ns       9946 ns      70720     32    80    48    16
BM_UserCounter/threads:32     1192 ns       9948 ns      70496     64   160    96    32
BM_UserCounter/threads:4      2506 ns       9949 ns      70332      8    20    12     4
--------------------------------------------------------------
Benchmark                        Time           CPU Iterations
--------------------------------------------------------------
BM_Factorial                    26 ns         26 ns   26392245 40320
BM_Factorial/real_time          26 ns         26 ns   26494107 40320
BM_CalculatePiRange/1           15 ns         15 ns   45571597 0
BM_CalculatePiRange/8           74 ns         74 ns    9450212 3.28374
BM_CalculatePiRange/64         595 ns        595 ns    1173901 3.15746
BM_CalculatePiRange/512       4752 ns       4752 ns     147380 3.14355
BM_CalculatePiRange/4k       37970 ns      37972 ns      18453 3.14184
BM_CalculatePiRange/32k     303733 ns     303744 ns       2305 3.14162
BM_CalculatePiRange/256k   2434095 ns    2434186 ns        288 3.1416
BM_CalculatePiRange/1024k  9721140 ns    9721413 ns         71 3.14159
BM_CalculatePi/threads:8      2255 ns       9943 ns      70936
```

Note above the additional header printed when the benchmark changes from
``BM_UserCounter`` to ``BM_Factorial``. This is because ``BM_Factorial`` does
not have the same counter set as ``BM_UserCounter``.
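
Returning to the counter flags described earlier in this section, the
arithmetic behind "rate", "thread average" and "inverted" can be sketched
without the library. The code below is only an illustrative model of the
descriptions given above: `SketchFlags` and `PresentCounter` are hypothetical
names, the flag bits are not the library's, and the order in which the
transformations are applied is an assumption of this sketch.

```c++
// Hypothetical flag bits for this sketch only; the library's own flags are
// the benchmark::Counter constants shown earlier.
enum SketchFlags : unsigned {
  kSketchDefault = 0,
  kSketchIsRate = 1u << 0,      // divide by the benchmark duration
  kSketchAvgThreads = 1u << 1,  // divide by the number of threads
  kSketchInvert = 1u << 2,      // report 1 / value
};

// Models how a counter value (already summed over all threads) would be
// presented, following the descriptions in this section.
inline double PresentCounter(double summed_value, unsigned flags,
                             double seconds, int threads) {
  double v = summed_value;
  if (flags & kSketchIsRate) v /= seconds;
  if (flags & kSketchAvgThreads) v /= threads;
  if (flags & kSketchInvert) v = 1.0 / v;
  return v;
}
```

For example, 8 'foo's processed in 2 seconds give a rate of 4 per second, and
the inverted rate is 0.25 seconds per 'foo'.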

<a name="multithreaded-benchmarks"/>

## Multithreaded Benchmarks

In a multithreaded test (benchmark invoked by multiple threads simultaneously),
it is guaranteed that none of the threads will start until all have reached
the start of the benchmark loop, and all will have finished before any thread
exits the benchmark loop. (This behavior is also provided by the `KeepRunning()`
API.) As such, any global setup or teardown can be wrapped in a check against
the thread index:

```c++
static void BM_MultiThreaded(benchmark::State& state) {
  if (state.thread_index() == 0) {
    // Setup code here.
  }
  for (auto _ : state) {
    // Run the test as normal.
  }
  if (state.thread_index() == 0) {
    // Teardown code here.
  }
}
BENCHMARK(BM_MultiThreaded)->Threads(2);
```

To run the benchmark across a range of thread counts, instead of `Threads`, use
`ThreadRange`. This takes two parameters (`min_threads` and `max_threads`) and
runs the benchmark once for each thread count selected from the inclusive
range. For example:

```c++
BENCHMARK(BM_MultiThreaded)->ThreadRange(1, 8);
```

will run `BM_MultiThreaded` with thread counts 1, 2, 4, and 8.

If the benchmarked code itself uses threads and you want to compare it to
single-threaded code, you may want to use real-time ("wallclock") measurements
for latency comparisons:

```c++
BENCHMARK(BM_test)->Range(8, 8<<10)->UseRealTime();
```

Without `UseRealTime`, CPU time is used by default.

<a name="cpu-timers" />

## CPU Timers

By default, the CPU timer only measures the time spent by the main thread.
If the benchmark itself uses threads internally, this measurement may not
be what you are looking for. Instead, there is a way to measure the total
CPU usage of the process, by all the threads.

```c++
void callee(int i);

static void MyMain(int size) {
#pragma omp parallel for
  for(int i = 0; i < size; i++)
    callee(i);
}

static void BM_OpenMP(benchmark::State& state) {
  for (auto _ : state)
    MyMain(state.range(0));
}

// Measure the time spent by the main thread, use it to decide for how long to
// run the benchmark loop. Depending on the internal implementation details,
// this may measure anywhere from near-zero (the overhead spent before/after
// work handoff to worker thread[s]) to the whole single-thread time.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10);

// Measure the user-visible time, the wall clock (literally, the time that
// has passed on the clock on the wall), use it to decide for how long to
// run the benchmark loop. This will always be meaningful, and will match the
// time spent by the main thread in the single-threaded case, in general
// decreasing with the number of internal threads doing the work.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10)->UseRealTime();

// Measure the total CPU consumption, use it to decide for how long to
// run the benchmark loop. This will always measure no less than the
// time spent by the main thread in the single-threaded case.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10)->MeasureProcessCPUTime();

// A mixture of the last two. Measure the total CPU consumption, but use the
// wall clock to decide for how long to run the benchmark loop.
BENCHMARK(BM_OpenMP)->Range(8, 8<<10)->MeasureProcessCPUTime()->UseRealTime();
```

### Controlling Timers

Normally, the entire duration of the work loop (`for (auto _ : state) {}`)
is measured. But sometimes it is necessary to do some work inside of
that loop, every iteration, without counting that time towards the benchmark
time. That is possible, although it is not recommended, since it has high
overhead.

<!-- {% raw %} -->
```c++
static void BM_SetInsert_With_Timer_Control(benchmark::State& state) {
  std::set<int> data;
  for (auto _ : state) {
    state.PauseTiming(); // Stop timers. They will not count until they are resumed.
    data = ConstructRandomSet(state.range(0)); // Do something that should not be measured
    state.ResumeTiming(); // And resume timers. They are now counting again.
    // The rest will be measured.
    for (int j = 0; j < state.range(1); ++j)
      data.insert(RandomNumber());
  }
}
BENCHMARK(BM_SetInsert_With_Timer_Control)->Ranges({{1<<10, 8<<10}, {128, 512}});
```
<!-- {% endraw %} -->

<a name="manual-timing" />

## Manual Timing

For benchmarking something for which neither CPU time nor real-time is
correct or accurate enough, completely manual timing is supported using
the `UseManualTime` function.

When `UseManualTime` is used, the benchmarked code must call
`SetIterationTime` once per iteration of the benchmark loop to
report the manually measured time.

An example use case for this is benchmarking GPU execution (e.g. OpenCL
or CUDA kernels, OpenGL or Vulkan or Direct3D draw calls), which cannot
be accurately measured using CPU time or real-time. Instead, they can be
measured accurately using a dedicated API, and these measurement results
can be reported back with `SetIterationTime`.

```c++
static void BM_ManualTiming(benchmark::State& state) {
  int microseconds = state.range(0);
  std::chrono::duration<double, std::micro> sleep_duration {
    static_cast<double>(microseconds)
  };

  for (auto _ : state) {
    auto start = std::chrono::high_resolution_clock::now();
    // Simulate some useful workload with a sleep
    std::this_thread::sleep_for(sleep_duration);
    auto end = std::chrono::high_resolution_clock::now();

    auto elapsed_seconds =
      std::chrono::duration_cast<std::chrono::duration<double>>(
        end - start);

    state.SetIterationTime(elapsed_seconds.count());
  }
}
BENCHMARK(BM_ManualTiming)->Range(1, 1<<17)->UseManualTime();
```

<a name="setting-the-time-unit" />

## Setting the Time Unit

If a benchmark runs for a few milliseconds, it may be hard to visually compare
the measured times, since the output data is given in nanoseconds by default.
To change this, you can set the time unit explicitly:

```c++
BENCHMARK(BM_test)->Unit(benchmark::kMillisecond);
```

Additionally the default time unit can be set globally with the
`--benchmark_time_unit={ns|us|ms|s}` command line argument. The argument only
affects benchmarks where the time unit is not set explicitly.

<a name="preventing-optimization" />

## Preventing Optimization

To prevent a value or expression from being optimized away by the compiler
the `benchmark::DoNotOptimize(...)` and `benchmark::ClobberMemory()`
functions can be used.

```c++
static void BM_test(benchmark::State& state) {
  for (auto _ : state) {
    int x = 0;
    for (int i = 0; i < 64; ++i) {
      benchmark::DoNotOptimize(x += i);
    }
  }
}
```

`DoNotOptimize(<expr>)` forces the *result* of `<expr>` to be stored in either
memory or a register. For GNU based compilers it acts as a read/write barrier
for global memory.
More specifically, it forces the compiler to flush pending
writes to memory and reload any other values as necessary.

Note that `DoNotOptimize(<expr>)` does not prevent optimizations on `<expr>`
in any way. `<expr>` may even be removed entirely when the result is already
known. For example:

```c++
  /* Example 1: `<expr>` is removed entirely. */
  int foo(int x) { return x + 42; }
  while (...) DoNotOptimize(foo(0)); // Optimized to DoNotOptimize(42);

  /* Example 2: Result of '<expr>' is only reused */
  int bar(int) __attribute__((const));
  while (...) DoNotOptimize(bar(0)); // Optimized to:
  // int __result__ = bar(0);
  // while (...) DoNotOptimize(__result__);
```

The second tool for preventing optimizations is `ClobberMemory()`. In essence
`ClobberMemory()` forces the compiler to perform all pending writes to global
memory. Memory managed by block scope objects must be "escaped" using
`DoNotOptimize(...)` before it can be clobbered. In the example below,
`ClobberMemory()` prevents the call to `v.push_back(42)` from being optimized
away.

```c++
static void BM_vector_push_back(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<int> v;
    v.reserve(1);
    auto data = v.data();           // Allow v.data() to be clobbered. Pass as non-const
    benchmark::DoNotOptimize(data); // lvalue to avoid undesired compiler optimizations
    v.push_back(42);
    benchmark::ClobberMemory(); // Force 42 to be written to memory.
  }
}
```

Note that `ClobberMemory()` is only available for GNU or MSVC based compilers.

<a name="reporting-statistics" />

## Statistics: Reporting the Mean, Median and Standard Deviation / Coefficient of variation of Repeated Benchmarks

By default, each benchmark is run once and that single result is reported.
However, benchmarks are often noisy and a single result may not be representative
of the overall behavior. For this reason it's possible to repeatedly rerun the
benchmark.

The number of runs of each benchmark is specified globally by the
`--benchmark_repetitions` flag or on a per benchmark basis by calling
`Repetitions` on the registered benchmark object. When a benchmark is run more
than once, the mean, median, standard deviation and coefficient of variation
of the runs will be reported.

Additionally, the `--benchmark_report_aggregates_only={true|false}` and
`--benchmark_display_aggregates_only={true|false}` flags, or the
`ReportAggregatesOnly(bool)` and `DisplayAggregatesOnly(bool)` functions, can be
used to change how repeated tests are reported. By default the result of each
repeated run is reported. When the `report aggregates only` option is `true`,
only the aggregates (i.e. mean, median, standard deviation and coefficient
of variation, plus complexity measurements if they were requested) of the runs
are reported to both reporters: standard output (console) and the file.
However, when only the `display aggregates only` option is `true`,
only the aggregates are displayed in the standard output, while the file
output still contains everything.
Calling `ReportAggregatesOnly(bool)` / `DisplayAggregatesOnly(bool)` on a
registered benchmark object overrides the value of the appropriate flag for that
benchmark.

<a name="custom-statistics" />

## Custom Statistics

While having these aggregates is nice, this may not be enough for everyone.
For example, you may want to know what the largest observation is, e.g. because
you have some real-time constraints. This is easy. The following code will
specify a custom statistic to be calculated, defined by a lambda function.

```c++
void BM_spin_empty(benchmark::State& state) {
  for (auto _ : state) {
    for (int x = 0; x < state.range(0); ++x) {
      benchmark::DoNotOptimize(x);
    }
  }
}

BENCHMARK(BM_spin_empty)
  ->ComputeStatistics("max", [](const std::vector<double>& v) -> double {
    return *(std::max_element(std::begin(v), std::end(v)));
  })
  ->Arg(512);
```

While usually the statistics produce values in time units,
you can also produce percentages:

```c++
void BM_spin_empty(benchmark::State& state) {
  for (auto _ : state) {
    for (int x = 0; x < state.range(0); ++x) {
      benchmark::DoNotOptimize(x);
    }
  }
}

BENCHMARK(BM_spin_empty)
  ->ComputeStatistics("ratio", [](const std::vector<double>& v) -> double {
    // Ratio of the first observation to the last one.
    return v.front() / v.back();
  }, benchmark::StatisticUnit::kPercentage)
  ->Arg(512);
```

<a name="memory-usage" />

## Memory Usage

It's often useful to also track memory usage for benchmarks, alongside CPU
performance. For this reason, the library offers the `RegisterMemoryManager`
method that allows a custom `MemoryManager` to be injected.

If set, the `MemoryManager::Start` and `MemoryManager::Stop` methods will be
called at the start and end of benchmark runs to allow user code to fill out
a report on the number of allocations, bytes used, etc.

This data will then be reported alongside other performance data, currently
only when using JSON output.

<a name="using-register-benchmark" />

## Using RegisterBenchmark(name, func, args...)

The `RegisterBenchmark(name, func, args...)` function provides an alternative
way to create and register benchmarks.
`RegisterBenchmark(name, func, args...)` creates, registers, and returns a
pointer to a new benchmark with the specified `name` that invokes
`func(st, args...)` where `st` is a `benchmark::State` object.

Unlike the `BENCHMARK` registration macros, which can only be used at the global
scope, `RegisterBenchmark` can be called anywhere. This allows for
benchmark tests to be registered programmatically.

Additionally, `RegisterBenchmark` allows any callable object to be registered
as a benchmark, including capturing lambdas and function objects.

For example:

```c++
auto BM_test = [](benchmark::State& st, auto Inputs) { /* ... */ };

int main(int argc, char** argv) {
  for (auto& test_input : { /* ... */ })
    benchmark::RegisterBenchmark(test_input.name(), BM_test, test_input);
  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
  benchmark::Shutdown();
}
```

<a name="exiting-with-an-error" />

## Exiting with an Error

When errors caused by external influences, such as file I/O and network
communication, occur within a benchmark, the
`State::SkipWithError(const std::string& msg)` function can be used to skip that
run of the benchmark and report the error. Note that only future iterations of
the `KeepRunning()` loop are skipped. For the ranged-for version of the benchmark
loop, users must explicitly exit the loop, otherwise all iterations will be
performed. Users may explicitly return to exit the benchmark immediately.

The `SkipWithError(...)` function may be used at any point within the benchmark,
including before and after the benchmark loop. Moreover, if `SkipWithError(...)`
has been used, it is not required to reach the benchmark loop and one may return
from the benchmark function early.

For example:

```c++
static void BM_test(benchmark::State& state) {
  auto resource = GetResource();
  if (!resource.good()) {
    state.SkipWithError("Resource is not good!");
    // KeepRunning() loop will not be entered.
  }
  while (state.KeepRunning()) {
    auto data = resource.read_data();
    if (!resource.good()) {
      state.SkipWithError("Failed to read data!");
      break; // Needed to skip the rest of the iteration.
    }
    do_stuff(data);
  }
}

static void BM_test_ranged_for(benchmark::State& state) {
  auto resource = GetResource();
  if (!resource.good()) {
    state.SkipWithError("Resource is not good!");
    return; // Early return is allowed when SkipWithError() has been used.
  }
  for (auto _ : state) {
    auto data = resource.read_data();
    if (!resource.good()) {
      state.SkipWithError("Failed to read data!");
      break; // REQUIRED to prevent all further iterations.
    }
    do_stuff(data);
  }
}
```

<a name="a-faster-keep-running-loop" />

## A Faster KeepRunning Loop

In C++11 mode, a range-based for loop should be used in preference to
the `KeepRunning` loop for running the benchmarks. For example:

```c++
static void BM_Fast(benchmark::State &state) {
  for (auto _ : state) {
    FastOperation();
  }
}
BENCHMARK(BM_Fast);
```

The ranged-for loop is faster than `KeepRunning` because `KeepRunning`
requires a memory load and store of the iteration count
every iteration, whereas the ranged-for variant is able to keep the iteration count
in a register.

For example, an empty inner loop using the range-based for method looks like:

```asm
# Loop Init
  mov rbx, qword ptr [r14 + 104]
  call benchmark::State::StartKeepRunning()
  test rbx, rbx
  je .LoopEnd
.LoopHeader: # =>This Inner Loop Header: Depth=1
  add rbx, -1
  jne .LoopHeader
.LoopEnd:
```

Compared to an empty `KeepRunning` loop, which looks like:

```asm
.LoopHeader: # in Loop: Header=BB0_3 Depth=1
  cmp byte ptr [rbx], 1
  jne .LoopInit
.LoopBody: # =>This Inner Loop Header: Depth=1
  mov rax, qword ptr [rbx + 8]
  lea rcx, [rax + 1]
  mov qword ptr [rbx + 8], rcx
  cmp rax, qword ptr [rbx + 104]
  jb .LoopHeader
  jmp .LoopEnd
.LoopInit:
  mov rdi, rbx
  call benchmark::State::StartKeepRunning()
  jmp .LoopBody
.LoopEnd:
```

Unless C++03 compatibility is required, the ranged-for variant of writing
the benchmark loop should be preferred.

<a name="disabling-cpu-frequency-scaling" />

## Disabling CPU Frequency Scaling

If you see this warning:

```
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may
be noisy and will incur extra overhead.
```

you might want to disable the CPU frequency scaling while running the
benchmark, as well as consider other ways to stabilize the performance of
your system while benchmarking.

See [Reducing Variance](reducing_variance.md) for more information.
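
As a rough illustration of what "CPU scaling is enabled" can mean in practice, the sketch below checks on Linux whether a CPU is using a frequency governor other than `performance`. The helper names (`TrimRight`, `IsScalingGovernor`, `ReadGovernor`) are hypothetical, the path assumes the Linux `cpufreq` sysfs layout, and this is not necessarily how the library performs its own check.

```c++
#include <fstream>
#include <string>

// Hypothetical helper: removes trailing whitespace such as the newline
// that sysfs files usually end with.
std::string TrimRight(std::string s) {
  while (!s.empty() && (s.back() == '\n' || s.back() == '\r' ||
                        s.back() == ' ' || s.back() == '\t')) {
    s.pop_back();
  }
  return s;
}

// Hypothetical helper: treats any known governor other than "performance"
// as one that may change the CPU frequency during a run. An empty
// (unknown) governor is reported as not scaling.
bool IsScalingGovernor(const std::string& governor) {
  const std::string g = TrimRight(governor);
  return !g.empty() && g != "performance";
}

// Hypothetical helper: reads the governor of the given CPU from sysfs
// (assumed Linux layout). Returns an empty string if the file cannot be read.
std::string ReadGovernor(int cpu) {
  std::ifstream in("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                   "/cpufreq/scaling_governor");
  std::string governor;
  std::getline(in, governor);
  return governor;
}
```

Calling `IsScalingGovernor(ReadGovernor(0))` from a small `main` before a benchmark run gives a quick sanity check; a `true` result suggests following the steps in the linked page before trusting the timings.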