<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Rationale</title>
<link rel="stylesheet" href="../../../../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="../index.html" title="Chapter 1. Boost.Histogram">
<link rel="up" href="../index.html" title="Chapter 1. Boost.Histogram">
<link rel="prev" href="../boost/histogram/axis/variant.html" title="Class template variant">
<link rel="next" href="history.html" title="Revision history">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table cellpadding="2" width="100%"><tr>
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../boost.png"></td>
<td align="center"><a href="../../../../../index.html">Home</a></td>
<td align="center"><a href="../../../../libraries.htm">Libraries</a></td>
<td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
<td align="center"><a href="../../../../../more/index.htm">More</a></td>
</tr></table>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="../boost/histogram/axis/variant.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="history.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a>
</div>
<div class="section">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="histogram.rationale"></a><a class="link" href="rationale.html" title="Rationale">Rationale</a>
</h2></div></div></div>
<div class="toc"><dl class="toc">
<dt><span class="section"><a
href="rationale.html#histogram.rationale.motivation">Motivation</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.guidelines">Guidelines</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.no_lambdas">No lambdas as axis types</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.uoflow">Under- and overflow bins</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.index_type">Size method of axis returns
      signed integer</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.real_index_type">Continuous axis
      accepts real-valued cell index</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.variance">On variance estimates</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.weights">Support of weighted fills</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.python_support">Python support</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.support_of_boost_accumulators">Support
      of Boost.Accumulators</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.support_of_boost_range">Support of
      Boost.Range</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.support_of_serialization">Support
      of serialization</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.comparison_to_boost_accumulators">Comparison
      to Boost.Accumulators</a></span></dt>
<dt><span class="section"><a href="rationale.html#histogram.rationale.why_is_boost_histogram_not_built">Why
      is Boost.Histogram not built on top of Boost.MultiArray?</a></span></dt>
</dl></div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a
name="histogram.rationale.motivation"></a><a class="link" href="rationale.html#histogram.rationale.motivation" title="Motivation">Motivation</a>
</h3></div></div></div>
<p>
      C++ lacks a widely-used, free multi-dimensional histogram class. While it
      is easy to write a one-dimensional histogram, writing a general multi-dimensional
      histogram poses more of a challenge. If a few more features required by scientific
      professionals are added to the wish list, the implementation becomes
      non-trivial and a well-tested library solution becomes desirable.
    </p>
<p>
      The <a href="https://www.gnu.org/software/gsl" target="_top">GNU Scientific Library
      (GSL)</a> and the <a href="https://root.cern.ch" target="_top">ROOT framework</a>
      from CERN have histogram implementations. The GSL has histograms for one
      and two dimensions in C. The implementations are not customizable. ROOT has
      well-tested implementations of histograms, but they are not customizable
      and they are not easy to use correctly. ROOT also has new implementations
      in beta stage that are similar to this one, but they are still less flexible, not
      easy to use, and they cannot be used without the rest of ROOT, which is a
      huge library to install just to get histograms.
    </p>
<p>
      The templated histogram class in this library has a minimal interface and
      focuses on the core task of creating histograms from input data. It is very
      customizable and extensible through user-provided classes. A single implementation
      is used for one-dimensional and multi-dimensional histograms. While being safe,
      customizable, and convenient, the histogram is also very fast. The static version, which
      has an axis configuration that is hard-coded at compile-time, is faster than
      any tested competitor.
    </p>
<p>
      One of the central design goals was to hide the implementation details of
      the internal counters of the histogram.
      The internal counting mechanism is
      encapsulated in a storage class, which can be switched out. The default storage
      uses adaptive memory management that is safe to use, memory-efficient,
      and fast. The safety comes from the guarantee that counts cannot overflow
      or be capped. This is a rare guarantee, hardly found in other libraries.
      In the standard configuration, the histogram <span class="emphasis"><em>just works</em></span>
      under any circumstance. Yet, users with special requirements can implement
      their own custom storage class or use an alternative built-in array-based
      storage.
    </p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="histogram.rationale.guidelines"></a><a class="link" href="rationale.html#histogram.rationale.guidelines" title="Guidelines">Guidelines</a>
</h3></div></div></div>
<p>
      This library was written based on a decade of experience collected in working
      with big data, more precisely in the field of particle physics and astroparticle
      physics. The design is guided by advice from people like Bjarne Stroustrup,
      Scott Meyers, Herb Sutter, Andrei Alexandrescu, and Chandler Carruth.
      The <a href="https://www.python.org/dev/peps/pep-0020" target="_top">Zen of Python</a>
      (which also applies to other languages) was an inspiration, as were ideas from
      the <a href="https://eigen.tuxfamily.org/" target="_top">Eigen library</a>. The
      feature set was designed to be a superset of what is offered by the <a href="https://root.cern.ch" target="_top">ROOT framework</a> and the <a href="https://www.gnu.org/software/gsl" target="_top">GNU
      Scientific Library (GSL)</a>.
    </p>
<p>
      Design goals of the library:
    </p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem">
          Provide a simple and convenient default behavior for the casual user,
          yet allow a maximum of customization for the power user. Follow the "Don't
          pay for what you don't use" principle. Features that you don't use
          should not affect your performance negatively.
        </li>
<li class="listitem">
          Provide the same interface for one-dimensional and multi-dimensional
          histograms. This makes the interface easier to learn, and makes it easier
          to move a project from one-dimensional to multi-dimensional analysis.
        </li>
<li class="listitem">
          Hide the details of how the bin counters work. This design allows for
          interesting implementations, such as the default storage that provides
          a no-overflow-guarantee, which no other library offers.
        </li>
<li class="listitem">
          Minimalism, STL and Boost compatibility. Focus the library on the task
          of creating histograms. Functionality on top of that (drawing, further
          processing...) should come from other libraries. This gives users maximum
          flexibility to mix and match libraries. The histogram provides iterators
          and ranges over its internal counters, which give other libraries access
          to the histogram state and make it
          compatible with STL algorithms and other Boost libraries. In addition,
          the library was made compatible with <a href="../../../../../libs/accumulators/index.html" target="_top">Boost.Accumulators</a>
          and <a href="../../../../../libs/range/index.html" target="_top">Boost.Range</a>.
        </li>
</ul></div>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="histogram.rationale.no_lambdas"></a><a class="link" href="rationale.html#histogram.rationale.no_lambdas" title="No lambdas as axis types">No lambdas as axis types</a>
</h3></div></div></div>
<p>
      Lambdas were considered and rejected as a form of simple user-defined axis
      type, because they do not allow access to their state, such as the current
      axis size. Lambdas can be fully replaced by locally-defined structs. A local
      struct cannot be templated and cannot have templated methods, but this is
      not an issue. In the local context where the struct is created, all relevant
      types must already be known, so locally defined structs can simply use
      these concrete types and there is no need for templates.
    </p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="histogram.rationale.uoflow"></a><a class="link" href="rationale.html#histogram.rationale.uoflow" title="Under- and overflow bins">Under- and overflow bins</a>
</h3></div></div></div>
<p>
      Axis instances by default add extra bins that count values which fall below
      or above the range covered by the axis (for those types where that makes
      sense). These extra bins are called under- and overflow bins, respectively.
      The extra bins can be turned off individually for each axis to conserve memory,
      but it is generally recommended to have them. The normal bins, excluding
      under- and overflow, are called <span class="bold"><strong>inner bins</strong></span>.
    </p>
<p>
      Under- and overflow bins are useful in one-dimensional histograms, and nearly
      essential in multi-dimensional histograms.
      Here are the advantages:
    </p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem">
          No loss: The total sum over all bin counts is strictly equal to the number
          of times the histogram was filled. Even NaN values are counted; by
          convention, they are put in the overflow bin.
        </li>
<li class="listitem">
          Diagnosis: Unexpected extreme values, which may otherwise be overlooked,
          show up in the extra bins.
        </li>
<li class="listitem">
          Ability to reduce histograms: In multi-dimensional histograms, an out-of-range
          value along one axis may be paired with an in-range value along another
          axis. If under- and overflow bins are missing, such a value pair is lost
          completely. If you then apply a <code class="computeroutput"><span class="identifier">reduce</span></code>
          operation on the histogram, which removes some axes by summing all counts
          along those dimensions, the histogram is distorted
          along the remaining axes. When under- and overflow bins are present,
          the <code class="computeroutput"><span class="identifier">reduce</span></code> operation
          always produces a sub-histogram identical to the one that would be obtained
          by filling it directly with the original data.
        </li>
</ul></div>
<p>
      The presence of the extra bins does not interfere with normal indexing. On
      an axis with <code class="computeroutput"><span class="identifier">n</span></code> bins, the
      first bin has the index <code class="computeroutput"><span class="number">0</span></code>, the
      last bin <code class="computeroutput"><span class="identifier">n</span><span class="special">-</span><span class="number">1</span></code>, while the under- and overflow bins are accessible
      at the indices <code class="computeroutput"><span class="special">-</span><span class="number">1</span></code>
      and <code class="computeroutput"><span class="identifier">n</span></code>, respectively.
      This
      choice is optimized for users who are unaware of the existence of these extra
      bins. They would find the alternative indexing scheme surprising, in which the
      underflow bin has the index <code class="computeroutput"><span class="number">0</span></code> and
      the first normal bin has the index <code class="computeroutput"><span class="number">1</span></code>.
      Also, the chosen scheme allows one to turn off the extra bins in the code
      where the histogram is created, without changing any code downstream that
      addresses inner bins with indices.
    </p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="histogram.rationale.index_type"></a><a class="link" href="rationale.html#histogram.rationale.index_type" title="Size method of axis returns signed integer">Size method of axis returns
    signed integer</a>
</h3></div></div></div>
<p>
      The standard library returns a container size as an unsigned integer, because
      a container size cannot be negative. The <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code> method of the histogram class follows this
      rule, but the <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code>
      methods of axis types return a signed integral type. Why?
    </p>
<p>
      As explained in the <a class="link" href="rationale.html#histogram.rationale.uoflow" title="Under- and overflow bins">section about
      under- and overflow</a>, a histogram axis may have an optional underflow
      bin, which is addressed by the index <code class="computeroutput"><span class="special">-</span><span class="number">1</span></code>. It follows that the index type must be a signed
      integer for all axis types.
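    </p>
<p>
      The indexing convention can be sketched with a small axis type. This is
      an illustration under stated assumptions, not the library's implementation:
      the name <code class="computeroutput">simple_axis</code> and its members are
      hypothetical, and only the standard library is used. It maps a value to a
      signed index, with <code class="computeroutput">-1</code> for underflow and
      <code class="computeroutput">size()</code> for overflow and NaN.
    </p>
<pre class="programlisting">#include &lt;cassert&gt;
#include &lt;cmath&gt;

// Hypothetical minimal axis with equal-width bins (illustration only).
struct simple_axis {
  int n;          // number of inner bins
  double lo, hi;  // inner bins cover the half-open range [lo, hi)

  // Signed size, so it can be compared with signed indices without surprises.
  int size() const { return n; }

  // Returns -1 for underflow, 0 .. n-1 for inner bins, and n for overflow.
  int index(double x) const {
    if (std::isnan(x)) return n;  // by convention, NaN goes to the overflow bin
    if (x &lt; lo) return -1;
    if (x &gt;= hi) return n;
    const int i = static_cast&lt;int&gt;((x - lo) / (hi - lo) * n);
    return i &lt; n ? i : n - 1;  // guard against rounding at the upper edge
  }
};
</pre>
<p>
      With four bins over the range from 0 to 1, the value -0.5 maps to the index
      -1, the value 0.3 maps to the index 1, and both 1.0 and NaN map to the index
      4, which is equal to the value returned by <code class="computeroutput">size()</code>.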
    </p>
<p>
      The <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code>
      method of any axis returns the same signed integer type. The size of an axis
      cannot be negative, but this choice has two advantages. Firstly, the value
      returned by <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code>
      itself is guaranteed to be a valid index, which is good since it may address
      the overflow bin. Secondly, comparisons between an index and the value returned
      by <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code>
      are frequent. If <code class="computeroutput"><span class="identifier">size</span><span class="special">()</span></code>
      returned an unsigned integral type, compilers would produce a warning for
      each comparison, and rightly so. <a href="https://www.youtube.com/watch?v=wvtFGa6XJDU" target="_top">Something
      awful happens</a> on most machines when you compare <code class="computeroutput"><span class="special">-</span><span class="number">1</span></code> with an unsigned integer. The signed value is first converted
      to a large unsigned value, so <code class="computeroutput"><span class="special">-</span><span class="number">1</span> <span class="special">&lt;</span> <span class="number">1u</span></code>
      evaluates to <code class="computeroutput"><span class="keyword">false</span></code>,
      which causes a serious bug in the following innocent-looking loop:
    </p>
<pre class="programlisting"><span class="keyword">auto</span> <span class="identifier">my_axis</span> <span class="special">=</span> <span class="comment">/* ...
*/</span><span class="special">;</span>
<span class="comment">// naive loop to iterate over all bins, including underflow and overflow</span>
<span class="keyword">for</span> <span class="special">(</span><span class="keyword">int</span> <span class="identifier">i</span> <span class="special">=</span> <span class="special">-</span><span class="number">1</span><span class="special">;</span> <span class="identifier">i</span> <span class="special">&lt;=</span> <span class="identifier">my_axis</span><span class="special">.</span><span class="identifier">size</span><span class="special">();</span> <span class="special">++</span><span class="identifier">i</span><span class="special">)</span> <span class="special">{</span>
  <span class="comment">// body is never executed if return value of my_axis.size() is an unsigned integral type</span>
<span class="special">}</span>
</pre>
<p>
      The advantages clearly outweigh the disadvantages of this choice.
    </p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="histogram.rationale.real_index_type"></a><a class="link" href="rationale.html#histogram.rationale.real_index_type" title="Continuous axis accepts real-valued cell index">Continuous axis
    accepts real-valued cell index</a>
</h3></div></div></div>
<p>
      Each axis has a method called <code class="computeroutput"><span class="identifier">value</span><span class="special">(</span><span class="identifier">index_type</span><span class="special">)</span></code> which converts an index into the equivalent
      value at that index. If the axis is continuous, there are many possible values
      in the interval between two adjacent integer indices. Users often want to
      access the center of such an interval. An easy and very efficient way to
      access the center value is for this method to accept real-valued indices.
      Then, the center of the bin between index <code class="computeroutput"><span class="identifier">i</span></code>
      and <code class="computeroutput"><span class="identifier">i</span><span class="special">+</span><span class="number">1</span></code> is simply obtained by passing <code class="computeroutput"><span class="identifier">i</span><span class="special">+</span><span class="number">0.5</span></code>.
    </p>
<p>
      This scheme is computationally efficient and intuitive. Each continuous axis
      is required to accept a real-valued index; in fact, internal library code
      relies on this to detect whether an axis is continuous or discrete.
    </p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="histogram.rationale.variance"></a><a class="link" href="rationale.html#histogram.rationale.variance" title="On variance estimates">On variance estimates</a>
</h3></div></div></div>
<p>
      Once a histogram is filled, the bin counter can be accessed with the <code class="computeroutput"><span class="identifier">at</span><span class="special">(...)</span></code>
      method. Some accumulators offer a <code class="computeroutput"><span class="identifier">value</span><span class="special">()</span></code> method to return the cell value <span class="emphasis"><em>k</em></span>
      and a <code class="computeroutput"><span class="identifier">variance</span><span class="special">()</span></code>
      method, which returns an estimate <span class="emphasis"><em>v</em></span> of the <a href="https://en.wikipedia.org/wiki/Variance" target="_top">variance</a>
      of that cell.
    </p>
<p>
      If the input values for the histogram come from a <a href="https://en.wikipedia.org/wiki/Stochastic_process" target="_top">stochastic
      process</a>, the variance estimate provides useful additional information.
      Examples of a stochastic process are a physics experiment or a random person
      filling out a questionnaire <a href="#ftn.histogram.rationale.variance.f0" class="footnote" name="histogram.rationale.variance.f0"><sup class="footnote">[3]</sup></a>. The variance <span class="emphasis"><em>v</em></span> is the square of the <a href="https://en.wikipedia.org/wiki/Standard_deviation" target="_top">standard deviation</a>.
      The standard deviation is a number that tells us how much we can expect the
      observed value to fluctuate if we or someone else were to repeat our experiment
      with new random input.
    </p>
<p>
      Variance estimates are useful in many ways:
    </p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem">
          Error bars: Drawing an <a href="https://en.wikipedia.org/wiki/Error_bar" target="_top">error
          bar</a> over the interval <span class="emphasis"><em>(k - sqrt(v), k + sqrt(v))</em></span>
          is a simple visualization of the expected random scatter of the bin value
          <span class="emphasis"><em>k</em></span>, if the histogram were cleared and filled again
          with another independent sample of the same size (e.g. by repeating the
          physics experiment or asking more people to fill out a questionnaire). If
          you compare the result with a fitted model (see next item), about 2/3
          of the error bars should overlap with the model, if the model is correct.
        </li>
<li class="listitem">
          Least-squares fitting: Often you have a model of the expected number
          of counts <span class="emphasis"><em>lambda</em></span> per bin, which is a function of
          parameters with unknown values. A simple method to find good (sometimes
          the best) estimates for those parameter values is to vary them until
          the sum of squared residuals <span class="emphasis"><em>(k - lambda)^2/v</em></span> over
          all bins is minimized.
          This is the <a href="https://en.wikipedia.org/wiki/Least_squares" target="_top">method
          of least squares</a>, in which both the bin values <span class="emphasis"><em>k</em></span>
          and variance estimates <span class="emphasis"><em>v</em></span> enter.
        </li>
<li class="listitem">
          Pull distributions: If you have two histograms filled with the same number
          of samples and you want to know whether they are in agreement, you can
          look at the so-called pull distribution. It is formed by subtracting
          the counts and dividing by the square root of the sum of their variances, <span class="emphasis"><em>(k1
          - k2)/sqrt(v1 + v2)</em></span>. If the histograms are identical, the
          pull distribution randomly scatters around zero, and about 2/3 of the
          values are in the interval <span class="emphasis"><em>[ -1, 1]</em></span>.
        </li>
</ul></div>
<p>
      Why return the variance <span class="emphasis"><em>v</em></span> and not the standard deviation
      <span class="emphasis"><em>s = sqrt(v)</em></span>? The reason is that variances can be trivially
      added and it is computationally more efficient to return the variance. <a href="https://en.wikipedia.org/wiki/Variance#Properties" target="_top">Variances of independent
      samples can be added</a> like normal numbers: <span class="emphasis"><em>v3 = v1 + v2</em></span>.
      This is not true for standard deviations, where the addition law is more
      complex: <span class="emphasis"><em>s3 = sqrt(s1^2 + s2^2)</em></span>. In that sense, the variance
      is more straightforward to use during data processing. The user can take
      the square root at the end of the processing to obtain the standard deviation
      as needed.
    </p>
<p>
      How is the variance estimate <span class="emphasis"><em>v</em></span> computed for a normal
      counting histogram?
      If we knew the expected number of counts <span class="emphasis"><em>lambda</em></span>
      per bin, we could compute the variance as <span class="emphasis"><em>v = lambda</em></span>,
      because counts in a histogram follow the <a href="https://en.wikipedia.org/wiki/Poisson_distribution" target="_top">Poisson
      distribution</a> <a href="#ftn.histogram.rationale.variance.f1" class="footnote" name="histogram.rationale.variance.f1"><sup class="footnote">[4]</sup></a>. After filling a histogram, we do not know the expected number
      of counts <span class="emphasis"><em>lambda</em></span> for any particular bin, but we know
      the observed count <span class="emphasis"><em>k</em></span>, which is not too far from <span class="emphasis"><em>lambda</em></span>.
      We therefore might be tempted to just replace <span class="emphasis"><em>lambda</em></span>
      with <span class="emphasis"><em>k</em></span> in the formula, which gives <span class="emphasis"><em>v = k</em></span>.
      This is in fact the so-called non-parametric estimate for the variance based
      on the <a href="https://en.wikipedia.org/wiki/Plug-in_principle" target="_top">plug-in
      principle</a>. It is the best (and only) estimate for the variance, if
      we know nothing more about the underlying stochastic process which generated
      the inputs (or want to feign ignorance about it).
    </p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="histogram.rationale.weights"></a><a class="link" href="rationale.html#histogram.rationale.weights" title="Support of weighted fills">Support of weighted fills</a>
</h3></div></div></div>
<p>
      A histogram sorts input values into bins and increments a bin counter if
      an input value falls into the range covered by that bin.
      The <code class="computeroutput"><a class="link" href="../boost/histogram/unlimited_storage.html" title="Class template unlimited_storage">standard
      storage</a></code> uses integer types to store these counts; see the <a class="link" href="overview.html#histogram.overview.structure.storage" title="Storage types">storage section</a> for how
      integer overflow is avoided. However, sometimes histograms need to be filled
      with values that have a weight <span class="emphasis"><em>w</em></span> attached to them. In
      this case, the corresponding bin counter is not increased by one, but by
      the weight value <span class="emphasis"><em>w</em></span>.
    </p>
<div class="note"><table border="0" summary="Note">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../../../../doc/src/images/note.png"></td>
<th align="left">Note</th>
</tr>
<tr><td align="left" valign="top"><p>
        There are several use-cases for weighted increments. The main use in particle
        physics is to adapt simulated data of an experiment to real data. Simulations
        are needed to determine various corrections and efficiencies, but a simulated
        experiment is almost never a perfect replica of the real experiment. In
        addition, simulations are expensive to run. So, when deviations in a simulated
        distribution of a variable are found, one typically does not rerun the
        simulations, but assigns weights to match the simulated distribution to
        the real one.
      </p></td></tr>
</table></div>
<p>
      When the <code class="computeroutput"><a class="link" href="reference.html#boost.histogram.weight_storage">weight_storage</a></code>
      is used, histograms may be filled with weighted value tuples. Two real numbers
      per bin are stored in this case. The first keeps track of the sum of weights.
      The second keeps track of the sum of weights squared, which is the variance
      estimate in this case.
      The former is accessed with the <code class="computeroutput"><span class="identifier">value</span><span class="special">()</span></code> method of the bin counter, and the latter
      with the <code class="computeroutput"><span class="identifier">variance</span><span class="special">()</span></code>
      method.
    </p>
<p>
      Why the sum of weights squared is the variance estimate can be derived from
      the <a href="https://en.wikipedia.org/wiki/Variance#Properties" target="_top">mathematical
      properties of the variance</a>. Let us say a bin is filled <span class="emphasis"><em>k1</em></span>
      times with a fixed weight <span class="emphasis"><em>w1</em></span>. The sum of weights is
      then <span class="emphasis"><em>w1 k1</em></span>. It then follows from the variance properties
      that <span class="emphasis"><em>Var(w1 k1) = w1^2 Var(k1)</em></span>. Using the reasoning
      from before, the estimated variance of <span class="emphasis"><em>k1</em></span> is <span class="emphasis"><em>k1</em></span>,
      so that <span class="emphasis"><em>Var(w1 k1) = w1^2 Var(k1) = w1^2 k1</em></span>. Variances
      of independent samples are additive. If the bin is further filled <span class="emphasis"><em>k2</em></span>
      times with weight <span class="emphasis"><em>w2</em></span>, the sum of weights is <span class="emphasis"><em>w1
      k1 + w2 k2</em></span>, with variance <span class="emphasis"><em>w1^2 k1 + w2^2 k2</em></span>.
      This also holds for <span class="emphasis"><em>k1 = k2 = 1</em></span>. Therefore, the variance
      of the sum of weights <span class="emphasis"><em>w[i]</em></span> is the sum of <span class="emphasis"><em>w[i]^2</em></span>.
      In other words, to incrementally keep track of the variance of the sum of
      weights, we need to keep track of the sum of weights squared.
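    </p>
<p>
      The bookkeeping derived above can be sketched with a small cell accumulator.
      This is a minimal illustration using only the standard library; the name
      <code class="computeroutput">weighted_sum_sketch</code> and its members are
      assumptions made for this example and are not the interface of the library's
      weight storage.
    </p>
<pre class="programlisting">#include &lt;cassert&gt;

// Hypothetical cell accumulator (illustration only, not the library's class).
struct weighted_sum_sketch {
  double sum_w = 0.0;   // sum of weights
  double sum_w2 = 0.0;  // sum of weights squared

  // Fill the cell once with weight w.
  void fill(double w) {
    sum_w += w;
    sum_w2 += w * w;
  }

  double value() const { return sum_w; }      // sum of weights
  double variance() const { return sum_w2; }  // variance estimate of that sum
};
</pre>
<p>
      Filling such a cell three times with weight 2 and twice with weight 0.5
      gives the value 3 * 2 + 2 * 0.5 = 7 and the variance estimate
      3 * 2^2 + 2 * 0.5^2 = 12.5, in agreement with
      <span class="emphasis"><em>w1^2 k1 + w2^2 k2</em></span>.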
    </p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="histogram.rationale.python_support"></a><a class="link" href="rationale.html#histogram.rationale.python_support" title="Python support">Python support</a>
</h3></div></div></div>
<p>
      Python is a popular scripting language in the data science community. Thus,
      the library must be designed to support Python bindings, which are developed
      separately. The histogram should be usable as an interface between a complex
      simulation or data-storage system written in C++ and data-analysis/plotting
      in Python. Users are able to define a histogram in Python, let it be filled
      on the C++ side, and then get it back for further data analysis or plotting.
    </p>
<p>
      This is a major reason why a purely static design, in which the
      histogram must be fully configured at compile-time, was rejected. While this generates
      more efficient code, it does not work with Python, which requires one to
      configure histograms at run-time without recompiling the code.
    </p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="histogram.rationale.support_of_boost_accumulators"></a><a class="link" href="rationale.html#histogram.rationale.support_of_boost_accumulators" title="Support of Boost.Accumulators">Support
    of Boost.Accumulators</a>
</h3></div></div></div>
<p>
      Boost.Histogram can be configured to use arbitrary accumulators as cells,
      in particular the accumulators from <a href="../../../../../libs/accumulators/index.html" target="_top">Boost.Accumulators</a>.
      Sample values can be passed to the cell accumulator, which it may use to
      compute the mean, median, variance or other statistics of the samples sorted
      into each cell.
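    </p>
<p>
      To make the idea of a cell accumulator concrete, here is a minimal sketch
      of a type that tracks the mean and variance of the samples it receives. It
      is an illustration only: the name <code class="computeroutput">mean_sketch</code>
      and its members are hypothetical, it uses Welford's incremental update rather
      than anything taken from Boost.Accumulators, and it does not show the interface
      that the histogram expects from a cell.
    </p>
<pre class="programlisting">#include &lt;cassert&gt;
#include &lt;cstddef&gt;

// Hypothetical cell accumulator (illustration only): tracks the mean and
// variance of the samples sorted into one cell with Welford's update.
struct mean_sketch {
  std::size_t count = 0;
  double mean = 0.0;
  double sum_sq_dev = 0.0;  // sum of squared deviations from the mean

  // Add one sample to the cell.
  void add(double x) {
    ++count;
    const double delta = x - mean;
    mean += delta / static_cast&lt;double&gt;(count);
    sum_sq_dev += delta * (x - mean);
  }

  // Unbiased sample variance; returns 0 for fewer than two samples.
  double variance() const {
    return count &gt; 1 ? sum_sq_dev / static_cast&lt;double&gt;(count - 1) : 0.0;
  }
};
</pre>
<p>
      After adding the samples 1, 2, and 3, the mean is 2 and the sample variance
      is 1. How such a type is connected to the histogram is deliberately left out
      of this sketch.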
    </p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="histogram.rationale.support_of_boost_range"></a><a class="link" href="rationale.html#histogram.rationale.support_of_boost_range" title="Support of Boost.Range">Support of
    Boost.Range</a>
</h3></div></div></div>
<p>
      The histogram class is a valid range and can be used with the <a href="../../../../../libs/range/index.html" target="_top">Boost.Range</a>
      library. This library provides a custom adaptor generator, <code class="computeroutput"><span class="identifier">indexed</span></code>, analogous to the corresponding adaptor
      generator in Boost.Range, but with a potentially multi-dimensional index.
    </p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="histogram.rationale.support_of_serialization"></a><a class="link" href="rationale.html#histogram.rationale.support_of_serialization" title="Support of serialization">Support
    of serialization</a>
</h3></div></div></div>
<p>
      Serialization is implemented using <a href="../../../../../libs/serialization/index.html" target="_top">Boost.Serialization</a>.
      It would be great to have a portable binary archive with support for floating
      point data to store and retrieve histograms efficiently, but this is currently
      not available. The library has to remain open to other serialization libraries.
466 </p> 467</div> 468<div class="section"> 469<div class="titlepage"><div><div><h3 class="title"> 470<a name="histogram.rationale.comparison_to_boost_accumulators"></a><a class="link" href="rationale.html#histogram.rationale.comparison_to_boost_accumulators" title="Comparison to Boost.Accumulators">Comparison 471 to Boost.Accumulators</a> 472</h3></div></div></div> 473<p> 474 Boost.Histogram has a minor overlap with <a href="../../../../../libs/accumulators/index.html" target="_top">Boost.Accumulators</a>, 475 but the scopes are rather different. The statistical accumulators <code class="computeroutput"><span class="identifier">density</span></code> and <code class="computeroutput"><span class="identifier">weighted_density</span></code> 476 in Boost.Accumulators generate one-dimensional histograms. The axis range 477 and the bin widths are determined automatically from a cached sample of initial 478 values. They cannot be used for multi-dimensional data. Boost.Histogram focuses 479 on multi-dimensional data and gives the user full control of how the binning 480 should be done for each dimension. 481 </p> 482<p> 483 Automatic binning is not an option for Boost.Histogram, because it does not 484 scale well to many dimensions. Because of the Curse of Dimensionality, a 485 prohibitive number of samples would need to be collected. 486 </p> 487<div class="note"><table border="0" summary="Note"> 488<tr> 489<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../../../../doc/src/images/note.png"></td> 490<th align="left">Note</th> 491</tr> 492<tr><td align="left" valign="top"><p> 493 There is no scientific consensus on how to do automatic binning in an optimal 494 way, mostly because there is no consensus on the cost function (there 495 are many articles with different solutions in the literature). The problem 496 is not solved for one-dimensional data, and even less so for multi-dimensional 497 data. 
498 </p></td></tr> 499</table></div> 500<p> 501 Recommendation: 502 </p> 503<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "> 504<li class="listitem"> 505 Boost.Accumulators 506 <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "><li class="listitem"> 507 You have one-dimensional data about which you know nothing, 508 and you want a histogram quickly without worrying about binning 509 details. 510 </li></ul></div> 511 </li> 512<li class="listitem"> 513 Boost.Histogram 514 <div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; "> 515<li class="listitem"> 516 You have multi-dimensional data or you suspect you will switch 517 to multi-dimensional data later. 518 </li> 519<li class="listitem"> 520 You want to customize the binning by hand, for example, to make 521 bin edges coincide with special values or to handle special properties 522 of your values, like angles defined on a circle. 523 </li> 524</ul></div> 525 </li> 526</ul></div> 527</div> 528<div class="section"> 529<div class="titlepage"><div><div><h3 class="title"> 530<a name="histogram.rationale.why_is_boost_histogram_not_built"></a><a class="link" href="rationale.html#histogram.rationale.why_is_boost_histogram_not_built" title="Why is Boost.Histogram not built on top of Boost.MultiArray?">Why 531 is Boost.Histogram not built on top of Boost.MultiArray?</a> 532</h3></div></div></div> 533<p> 534 Boost.MultiArray implements a multi-dimensional array; it also converts an 535 index tuple into a global index that is used to access an element in the 536 array. Boost.Histogram and Boost.MultiArray share this functionality, but 537 Boost.Histogram cannot use Boost.MultiArray as a back-end. Boost.MultiArray 538 makes the rank of the array a compile-time property, while this library needs 539 the rank to be dynamic. 540 </p> 541<p> 542 Boost.MultiArray also does not allow the element type to be changed dynamically. 
543 This is needed to implement the adaptive storage mentioned further up. Using 544 a variant type as the element type of a Boost.MultiArray would not work, 545 because it creates this wasteful layout: 546 </p> 547<p> 548 <code class="computeroutput"><span class="special">[</span><span class="identifier">type</span><span class="special">-</span><span class="identifier">index</span> <span class="number">1</span><span class="special">][</span><span class="identifier">value</span> 549 <span class="number">1</span><span class="special">][</span><span class="identifier">type</span><span class="special">-</span><span class="identifier">index</span> 550 <span class="number">2</span><span class="special">][</span><span class="identifier">value</span> <span class="number">2</span><span class="special">]...</span></code> 551 </p> 552<p> 553 A type index is stored for each cell. Moreover, the variant is always as 554 large as the largest type in the union, so there is no way to save memory 555 by using a smaller type when the bin count is low, as is done by the adaptive 556 storage. The adaptive storage uses only one type index for the whole array 557 and allocates a homogeneous array of values of the same type that exactly 558 matches their sizes, creating the following layout: 559 </p> 560<p> 561 <code class="computeroutput"><span class="special">[</span><span class="identifier">type</span><span class="special">-</span><span class="identifier">index</span><span class="special">][</span><span class="identifier">value</span> <span class="number">1</span><span class="special">][</span><span class="identifier">value</span> 562 <span class="number">2</span><span class="special">][</span><span class="identifier">value</span> <span class="number">3</span><span class="special">]...</span></code> 563 </p> 564<p> 565 There is only one type index and the number of allocated bytes for the array 566 can be adapted dynamically to the size of the value type. 
567 </p> 568</div> 569<div class="footnotes"> 570<br><hr style="width:100; text-align:left;margin-left: 0"> 571<div id="ftn.histogram.rationale.variance.f0" class="footnote"><p><a href="#histogram.rationale.variance.f0" class="para"><sup class="para">[3] </sup></a> 572 The choices of the person are most likely not random, but if we pick a 573 random person from a group, we randomly sample from a pool of opinions. 574 </p></div> 575<div id="ftn.histogram.rationale.variance.f1" class="footnote"><p><a href="#histogram.rationale.variance.f1" class="para"><sup class="para">[4] </sup></a> 576 The Poisson distribution is correct as far as the counts <span class="emphasis"><em>k</em></span> 577 themselves are of interest. If the fractions per bin <span class="emphasis"><em>p = k / 578 N</em></span> are of interest, where <span class="emphasis"><em>N</em></span> is the total 579 number of counts, then the correct distribution to describe the fractions 580 is the <a href="https://en.wikipedia.org/wiki/Multinomial_distribution" target="_top">multinomial 581 distribution</a>. 582 </p></div> 583</div> 584</div> 585<table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr> 586<td align="left"></td> 587<td align="right"><div class="copyright-footer">Copyright © 2016-2019 Hans 588 Dembinski<p> 589 Distributed under the Boost Software License, Version 1.0. 
(See accompanying 590 file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) 591 </p> 592</div></td> 593</tr></table> 594<hr> 595<div class="spirit-nav"> 596<a accesskey="p" href="../boost/histogram/axis/variant.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="history.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a> 597</div> 598</body> 599</html> 600