1<html> 2<head> 3<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> 4<title>Overview</title> 5<link rel="stylesheet" href="../../../../../doc/src/boostbook.css" type="text/css"> 6<meta name="generator" content="DocBook XSL Stylesheets V1.79.1"> 7<link rel="home" href="../index.html" title="Chapter 1. Boost.Histogram"> 8<link rel="up" href="../index.html" title="Chapter 1. Boost.Histogram"> 9<link rel="prev" href="../index.html" title="Chapter 1. Boost.Histogram"> 10<link rel="next" href="getting_started.html" title="Getting started"> 11</head> 12<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"> 13<table cellpadding="2" width="100%"><tr> 14<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../boost.png"></td> 15<td align="center"><a href="../../../../../index.html">Home</a></td> 16<td align="center"><a href="../../../../libraries.htm">Libraries</a></td> 17<td align="center"><a href="http://www.boost.org/users/people.html">People</a></td> 18<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td> 19<td align="center"><a href="../../../../../more/index.htm">More</a></td> 20</tr></table> 21<hr> 22<div class="spirit-nav"> 23<a accesskey="p" href="../index.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="getting_started.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a> 24</div> 25<div class="section"> 26<div class="titlepage"><div><div><h2 class="title" style="clear: both"> 27<a name="histogram.overview"></a><a class="link" href="overview.html" title="Overview">Overview</a> 28</h2></div></div></div> 29<div class="toc"><dl class="toc"> 30<dt><span class="section"><a href="overview.html#histogram.overview.introduction">Introduction</a></span></dt> 31<dt><span class="section"><a href="overview.html#histogram.overview.structure">Structure</a></span></dt> 32<dd><dl> 33<dt><span class="section"><a href="overview.html#histogram.overview.structure.host">Histogram host class</a></span></dt> 34<dt><span class="section"><a href="overview.html#histogram.overview.structure.axis">Axis types</a></span></dt> 35<dt><span class="section"><a href="overview.html#histogram.overview.structure.storage">Storage types</a></span></dt> 36</dl></dd> 37</dl></div> 38<div class="section"> 39<div class="titlepage"><div><div><h3 class="title"> 40<a name="histogram.overview.introduction"></a><a class="link" href="overview.html#histogram.overview.introduction" title="Introduction">Introduction</a> 41</h3></div></div></div> 42<p> 43 <a href="https://en.wikipedia.org/wiki/Histogram" target="_top">Histograms</a> are 44 a basic tool in statistical analysis. A histogram consists of a number of 45 non-overlapping cells in data space. When an input value is passed to the 46 histogram, the corresponding cell that envelopes the value is found and an 47 associated counter is incremented. 48 </p> 49<p> 50 When analyzing a large low-dimensional data set, it is more convenient to 51 work with a histogram of the input values than the original values. Keeping 52 the cell counts in memory for analysis and/or processing the counts requires 53 far fewer resources than keeping the original values in memory and processing 54 them. Information present in the original can also be extracted from the 55 histogram<a href="#ftn.histogram.overview.introduction.f0" class="footnote" name="histogram.overview.introduction.f0"><sup class="footnote">[1]</sup></a>. Some information is lost in this way, but if the cells are small 56 enough<a href="#ftn.histogram.overview.introduction.f1" class="footnote" name="histogram.overview.introduction.f1"><sup class="footnote">[2]</sup></a>, the loss is negligible. A histogram is a kind of lossy data-compression. 57 It is also often used as a simple estimator for the <a href="https://en.wikipedia.org/wiki/Probability_density_function" target="_top">probability 58 density function</a> of the input data. More complex density estimators 59 exist, but histograms remain attractive because they are easy to reason about. 60 </p> 61<p> 62 This library provides a histogram for multi-dimensional data. In the multi-dimensional 63 case, the input consist of tuples of values which belong together and describing 64 different aspects of the same entity. For example, when you make a digital 65 image with a camera, photons hit a pixel detector. The photon is the entity 66 and it has two coordinates values where it hit the detector. The camera only 67 counts how often a photon hit each cell, so it is a real-life example of 68 making a two-dimensional histogram. A two-dimensional histogram collects 69 more information than two separate one-dimensional histograms, one for each 70 coordinate. For example, if the two-dimensional image looks like a checker 71 board, with high and low densities are alternating along each coordinate, 72 then the one-dimensional histograms along each coordinate would look flat. 73 There would be no hint that there is a complex structure in two dimensions. 74 </p> 75<p> 76 The term <span class="emphasis"><em>histogram</em></span> is usually strictly used for something 77 with cells over discrete or continuous data. This histogram class can also 78 process categorical variables and it even allows for non-consecutive cells 79 if that is desired. There is no restriction to numbers as input either. Any 80 C++ type can be fed into the histogram, if the user provides a specialized 81 axis class that maps values of this type to a cell index. The only remaining 82 restriction is that cells are non-overlapping, since there must be a unique 83 mapping from input value to cell. The library is not able to automatically 84 ensure this for user-provided axis classes, so the responsibly is on the 85 user. 86 </p> 87<p> 88 Furthermore, the histogram can handle weighted input. Normally, the cell 89 counter which is connected to an input tuple is incremented by one, but sometimes 90 it is useful to increment by a weight, an integral or floating point number. 91 Finally, the histogram can be configured to store any kind of accumulator 92 in each cell. Arbitrary samples can be passed to this accumulator, which 93 may compute the mean or other interesting quantities from the samples that 94 are sorted into the cell. When the accumulator computes a mean, the result 95 is called a <span class="emphasis"><em>profile</em></span>. The feature set is informed by 96 popular libraries for scientific computing, notably <a href="https://root.cern.ch" target="_top">CERN's 97 ROOT framework</a> and the <a href="https://www.gnu.org/software/gsl" target="_top">GNU 98 Scientific Library</a>. 99 </p> 100</div> 101<div class="section"> 102<div class="titlepage"><div><div><h3 class="title"> 103<a name="histogram.overview.structure"></a><a class="link" href="overview.html#histogram.overview.structure" title="Structure">Structure</a> 104</h3></div></div></div> 105<div class="toc"><dl class="toc"> 106<dt><span class="section"><a href="overview.html#histogram.overview.structure.host">Histogram host class</a></span></dt> 107<dt><span class="section"><a href="overview.html#histogram.overview.structure.axis">Axis types</a></span></dt> 108<dt><span class="section"><a href="overview.html#histogram.overview.structure.storage">Storage types</a></span></dt> 109</dl></div> 110<p> 111 The library consists of three orthogonal components: 112 </p> 113<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "> 114<li class="listitem"> 115 <a class="link" href="overview.html#histogram.overview.structure.host" title="Histogram host class">histogram host class</a>: 116 The histogram host class defines the public user interface and holds 117 axis objects (one for each dimension) and a storage object. The user 118 can chose whether axis objects are stored in a static tuple, a vector, 119 or a vector of variants. 120 </li> 121<li class="listitem"> 122 <a class="link" href="overview.html#histogram.overview.structure.axis" title="Axis types">axis types</a>: 123 Defines how input values are mapped to bins. Several axis types are provided 124 which implement different specializations. Users can make their own axis 125 types following the axis concept and use them with the library. A variant 126 type is provided, which can hold one of several concrete axis types. 127 </li> 128<li class="listitem"> 129 <a class="link" href="overview.html#histogram.overview.structure.storage" title="Storage types">storage types</a>: 130 Manages a collection of bin counters. The requirements for a storage 131 differ from those of an STL container, it needs to follow the storage 132 concept. Two implementations are provided. 133 </li> 134</ul></div> 135<div class="section"> 136<div class="titlepage"><div><div><h4 class="title"> 137<a name="histogram.overview.structure.host"></a><a class="link" href="overview.html#histogram.overview.structure.host" title="Histogram host class">Histogram host class</a> 138</h4></div></div></div> 139<p> 140 Histograms store axis objects and a storage object. A one-dimensional histogram 141 has one axis, a multi-dimensional histogram has several. When you pass 142 an input tuple, say (v1, v2, v3), then the first axis will map v1 onto 143 index i1, the second axis v2 onto i2, and so on, to generate the index 144 tuple (i1, i2, i3). The histogram host class then converts these indices 145 into a linear global index that is used to address bin counter in the storage 146 object. 147 </p> 148<div class="note"><table border="0" summary="Note"> 149<tr> 150<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../../../../doc/src/images/note.png"></td> 151<th align="left">Note</th> 152</tr> 153<tr><td align="left" valign="top"><p> 154 To understand the need for multi-dimensional histograms, think of point 155 coordinates. If all points that you consider lie on a line, you need 156 only one value to describe the point. If all points lie in a plane, you 157 need two values to describe the position. Three values are needed for 158 a point in space. A histogram puts a discrete grid over the line, the 159 plane or the space, and counts how many points lie in each cell of the 160 grid. To approximate a point distribution on a line, a 1d-histogram is 161 sufficient. To do the same in 3d-space, one needs a 3d-histogram. 162 </p></td></tr> 163</table></div> 164<p> 165 This library supports different axis types, so that the user can customize 166 how the mapping is done exactly, see <a class="link" href="overview.html#histogram.overview.structure.axis" title="Axis types">axis 167 types</a>. Users can furthermore chose between several ways of storing 168 axis types in the histogram. 169 </p> 170<p> 171 When the number and types of the axes are known at compile-time, the histogram 172 host class stores axis types in a <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">tuple</span></code>. 173 We call this a <span class="emphasis"><em>static histogram</em></span>. To access a particular 174 axis, one should use a compile-time number as index (a run-time index also 175 works with some limitations). A static histogram is extremely fast (see 176 <a class="link" href="benchmarks.html" title="Benchmarks">benchmark</a>), because there is 177 no overhead and the compiler can inline code, unroll loops, and more. Also, 178 many user errors are can be caught at compile-time rather than run-time. 179 </p> 180<p> 181 Static histograms are the best kind, but cannot be used when histograms 182 are to be created with an axis configuration that is only known at run-time. 183 This is the case, for example, when histograms are created from Python 184 or from a graphical user interface. Therefore also more dynamic kinds of 185 histograms are supported. 186 </p> 187<p> 188 There are two levels of dynamism. The histogram can hold instances of a 189 single axis type in a <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">vector</span></code>. 190 Now the number of axis instances per histogram can vary at run-time, but 191 the axis type must be the same for all instances. We call this a <span class="emphasis"><em>semi-dynamic 192 histogram</em></span>. 193 </p> 194<p> 195 If also the axis types need to vary at run-time, one can place <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">histogram</span><span class="special">::</span><span class="identifier">axis</span><span class="special">::</span><span class="identifier">variant</span></code> type in a <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">vector</span></code>, 196 which can hold one of a set of different concrete axis types. We call this 197 a <span class="emphasis"><em>dynamic histogram</em></span>. The dynamic histogram is a single 198 type that can store arbitrary sequences of different axes types, which 199 may be generated at run-time. The polymorphic behavior of the generic 200 <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">histogram</span><span class="special">::</span><span class="identifier">axis</span><span class="special">::</span><span class="identifier">variant</span></code> type has a run-time cost, however. 201 Typically, the performance is reduced by a factor of two compared to a 202 static histogram. 203 </p> 204<div class="note"><table border="0" summary="Note"> 205<tr> 206<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../../../../doc/src/images/note.png"></td> 207<th align="left">Note</th> 208</tr> 209<tr><td align="left" valign="top"><p> 210 The design decision to store axis types in the variant-like type <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">histogram</span><span class="special">::</span><span class="identifier">axis</span><span class="special">::</span><span class="identifier">variant</span></code> has several advantages over 211 forms of run-time polymorphism. Firstly, it guarantees that axis objects 212 which belong to the same histogram are stored locally together in memory, 213 which reduces cache misses when the histogram iterates over axis objects 214 in a tight loop, which it often does. Secondly, each axis can accept 215 a different value type in this scheme. Classic polymorphism with vtables 216 requires that all classes share the same method call signatures, but 217 we want different axis to allowed to accepted different types of arguments. 218 Variants work well for this case. 219 </p></td></tr> 220</table></div> 221</div> 222<div class="section"> 223<div class="titlepage"><div><div><h4 class="title"> 224<a name="histogram.overview.structure.axis"></a><a class="link" href="overview.html#histogram.overview.structure.axis" title="Axis types">Axis types</a> 225</h4></div></div></div> 226<p> 227 An axis defines an injective mapping of (a range of) input values to a 228 bin. The logic is encapsulated in an axis type. Users can create their 229 own axis classes and use them with the library, by implementing the <a class="link" href="concepts.html#histogram.concepts.Axis" title="Axis"><span class="bold"><strong>Axis</strong></span> 230 concept</a>. The library comes with four builtin types, which implement 231 different specializations. 232 </p> 233<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "> 234<li class="listitem"> 235 <code class="computeroutput"><a class="link" href="../boost/histogram/axis/regular.html" title="Class template regular">boost::histogram::axis::regular</a></code> 236 sorts real numbers into bins with equal width. The regular axis also 237 supports monotonic transforms, which are applied when the input values 238 are passed to the axis. This can be used to make a fast logarithmic 239 axis, where the bins have equal width in the logarithm of the variable. 240 </li> 241<li class="listitem"> 242 <code class="computeroutput"><a class="link" href="../boost/histogram/axis/variable.html" title="Class template variable">boost::histogram::axis::variable</a></code> 243 sorts real numbers into bins with varying width. 244 </li> 245<li class="listitem"> 246 <code class="computeroutput"><a class="link" href="../boost/histogram/axis/integer.html" title="Class template integer">boost::histogram::axis::integer</a></code> 247 is a specialization of a regular axis for a range of integers with 248 unit bin width. It is much faster than a regular axis. 249 </li> 250<li class="listitem"> 251 <code class="computeroutput"><a class="link" href="../boost/histogram/axis/category.html" title="Class template category">boost::histogram::axis::category</a></code> 252 is a bijective mapping of unique values onto bin indices and vice versa. 253 This can be used with discrete categorical data, like "red", 254 "green", "blue", for example. 255 </li> 256</ul></div> 257<p> 258 Each builtin axis type has a few compile-time options, which change its 259 behavior. 260 </p> 261<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "> 262<li class="listitem"> 263 All axis types can have an optional overflow bin. When the overflow 264 bin is enabled and an input value is above the range covered by the 265 axis, it is not discarded but counted in the overflow bin. 266 </li> 267<li class="listitem"> 268 All axis types except the category axis can have an optional underflow 269 bin. When the underflow bin is enabled and an input value is below 270 the range covered by the axis, it is not discarded but counted in the 271 underflow bin. 272 </li> 273<li class="listitem"> 274 All axis types except the category axis can be circular, meaning that 275 the axis range is periodic. This is useful for periodic data like polar 276 angles. 277 </li> 278<li class="listitem"> 279 All axis types can optionally grow. When an input value is outside 280 of the range of an axis which is configured to grow, the range of the 281 axis is extended until the value is in range. This option is incompatible 282 with the circular option, only either can be active. 283 </li> 284</ul></div> 285</div> 286<div class="section"> 287<div class="titlepage"><div><div><h4 class="title"> 288<a name="histogram.overview.structure.storage"></a><a class="link" href="overview.html#histogram.overview.structure.storage" title="Storage types">Storage types</a> 289</h4></div></div></div> 290<p> 291 A storage type holds the actual cell values. It uses a one-dimensional 292 index for cell lookup, computed by the histogram host from the indices 293 generated by its axes. The storage needs to know nothing about axes. Users 294 can integrate their own storage classes with the library, by implementing 295 the <a class="link" href="concepts.html#histogram.concepts.Storage" title="Storage">storage concept</a>. 296 Standard containers can be used as storage backends, the library adapts 297 them with the <code class="computeroutput"><a class="link" href="../boost/histogram/storage_adaptor.html" title="Class template storage_adaptor">boost::histogram::storage_adaptor</a></code>. 298 </p> 299<p> 300 Cell lookup is often happening in a tight loop and is random-access. A 301 normal <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">vector</span></code> works well as a storage backend. 302 Sometimes this is the best solution, but there are some caveats to this 303 approach. The user has to decide which type should represents the cell 304 counts and it is not an obvious choice. An integer type needs to be large 305 enough to avoid counter overflow, but only a fraction of the bits are used 306 if its capacity is too large. This is a waste of memory then. When memory 307 is wasted, more cache misses occur and performance is degraded (see the 308 benchmarks). The performance of modern CPUs depends a lot on effective 309 utilization of the CPU cache, which is still small. Using floating point 310 numbers instead of integers is also dangerous. They don't overflow, but 311 cap the bin count when the bits in the mantissa are used up. 312 </p> 313<p> 314 The default storage used in the library is <code class="computeroutput"><a class="link" href="../boost/histogram/unlimited_storage.html" title="Class template unlimited_storage">boost::histogram::unlimited_storage</a></code>. 315 It solves these issues with a dynamic counter type management, based on 316 the following insight. The <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality" target="_top">curse 317 of dimensionality</a> makes the total number of bins grow very fast 318 as the dimension of the histogram grows. However, having many bins also 319 reduces the typical number of counts per bin, since the input values are 320 spread over many more bins now. This means a small counter is often sufficient 321 for high-dimensional histograms. 322 </p> 323<p> 324 The default storage therefore starts with a minimum amount of memory per 325 cell, it uses an 1 byte. If the count in any cell is about to overflow, 326 all cells switch to the next larger integer type simultaneously. This goes 327 on, the capacity per cell is always doubled when it is used up, until 8 328 byte per bin are reached. The following images illustrate this progression 329 for a storage of 3 bin counters. A new memory block is always allocated 330 for all counters, when the first one of them hits its capacity limit. 331 </p> 332<p> 333 <span class="inlinemediaobject"><object type="image/svg+xml" data="../../storage_3_uint8.svg" width="65" height="23"></object></span> 334 </p> 335<p> 336 <span class="inlinemediaobject"><object type="image/svg+xml" data="../../storage_3_uint16.svg" width="129" height="23"></object></span> 337 </p> 338<p> 339 <span class="inlinemediaobject"><object type="image/svg+xml" data="../../storage_3_uint32.svg" width="256" height="23"></object></span> 340 </p> 341<p> 342 <span class="inlinemediaobject"><object type="image/svg+xml" data="../../storage_3_uint64.svg" width="511" height="23"></object></span> 343 </p> 344<p> 345 When even that is not enough, the default storage switches to a multiprecision 346 type similar to those in <a href="../../../../../libs/multiprecision/index.html" target="_top">Boost.Multiprecision</a>, 347 whose capacity is limited only by available memory. The following image 348 is not to scale: 349 </p> 350<p> 351 <span class="inlinemediaobject"><object type="image/svg+xml" data="../../storage_3_cpp_int.svg" width="511" height="23"></object></span> 352 </p> 353<p> 354 This approach is not only memory conserving, but also provides the strong 355 guarantee that bin counters cannot overflow. 356 </p> 357<div class="note"><table border="0" summary="Note"> 358<tr> 359<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../../../../doc/src/images/note.png"></td> 360<th align="left">Note</th> 361</tr> 362<tr><td align="left" valign="top"><p> 363 The no-overflow-guarantee only applies when the histogram is not using 364 weighted fills or if all weights are integral numbers. When floating 365 point weights are used, the default storage switches to a double counter 366 per cell to store the sum of such weights. A double cannot provide the 367 no-overflow-guarantee. 368 </p></td></tr> 369</table></div> 370<p> 371 The best part: this approach is even faster for a histogram with sufficient 372 size despite the run-time overheads of handling the counter type dynamically. 373 The benchmarks show that the gains from better cache usage outweigh the 374 run-time overheads of dynamic dispatching to the right bin counter type 375 and the occasional allocation costs. Doubling the size of the bin counters 376 each time helps, because the allocations happen only O(logN) times for 377 N increments. 378 </p> 379</div> 380</div> 381<div class="footnotes"> 382<br><hr style="width:100; text-align:left;margin-left: 0"> 383<div id="ftn.histogram.overview.introduction.f0" class="footnote"><p><a href="#histogram.overview.introduction.f0" class="para"><sup class="para">[1] </sup></a> 384 Parameters of interest, like the center of a distribution, can be extracted 385 from the histogram instead of the original data set; likewise, statistical 386 models can be fitted to histograms. 387 </p></div> 388<div id="ftn.histogram.overview.introduction.f1" class="footnote"><p><a href="#histogram.overview.introduction.f1" class="para"><sup class="para">[2] </sup></a> 389 What small enough means has to be decided case by case. 390 </p></div> 391</div> 392</div> 393<table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr> 394<td align="left"></td> 395<td align="right"><div class="copyright-footer">Copyright © 2016-2019 Hans 396 Dembinski<p> 397 Distributed under the Boost Software License, Version 1.0. (See accompanying 398 file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) 399 </p> 400</div></td> 401</tr></table> 402<hr> 403<div class="spirit-nav"> 404<a accesskey="p" href="../index.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="getting_started.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a> 405</div> 406</body> 407</html> 408