[/ def names all end in distrib to avoid clashes with names of functions]

[def __binomial_distrib [link math_toolkit.dist_ref.dists.binomial_dist Binomial Distribution]]
[def __chi_squared_distrib [link math_toolkit.dist_ref.dists.chi_squared_dist Chi Squared Distribution]]
[def __normal_distrib [link math_toolkit.dist_ref.dists.normal_dist Normal Distribution]]
[def __F_distrib [link math_toolkit.dist_ref.dists.f_dist Fisher F Distribution]]
[def __students_t_distrib [link math_toolkit.dist_ref.dists.students_t_dist Students t Distribution]]

[def __handbook [@http://www.itl.nist.gov/div898/handbook/
NIST/SEMATECH e-Handbook of Statistical Methods.]]

[section:stat_tut Statistical Distributions Tutorial]

This library is centred around statistical distributions. This tutorial
gives you an overview of what they are and how they can be used, and
provides a few worked examples of applying the library to statistical tests.

[section:overview Overview of Statistical Distributions]

[section:headers Headers and Namespaces]

All the code in this library is inside `namespace boost::math`.

In order to use a distribution /my_distribution/ you will need to include
either the header(s) `<boost/math/distributions/my_distribution.hpp>` (quicker compiles), or
the "include all the distributions" header: `<boost/math/distributions.hpp>`.

For example, to use the Students-t distribution include either
`<boost/math/distributions/students_t.hpp>` or
`<boost/math/distributions.hpp>`.

You also need to bring distribution names into scope,
perhaps with a `using namespace boost::math;` declaration,
or with specific `using` declarations like `using boost::math::normal;` (*recommended*).

[caution Some distribution names are also used in `namespace std`, so including `<random>` could cause ambiguity!]

[endsect] [/ section:headers Headers and Namespaces]

[section:objects Distributions are Objects]

Each kind of distribution in this library is a class type - an object, with member functions.

[tip If you are familiar with statistics libraries using functions,
and 'Distributions as Objects' seems alien, see
[link math_toolkit.stat_tut.weg.nag_library the comparison to
other statistics libraries.]
] [/tip]

[link policy Policies] provide optional fine-grained control
of the behaviour of these classes, allowing the user to customise
behaviour such as how errors are handled, or how the quantiles
of discrete distributions behave.

Making distributions class types does two things:

* It encapsulates the kind of distribution in the C++ type system;
so, for example, Students-t distributions are always a different C++ type from
Chi-Squared distributions.
* The distribution objects store any parameters associated with the
distribution: for example, the Students-t distribution has a
['degrees of freedom] parameter that controls the shape of the distribution.
This ['degrees of freedom] parameter has to be provided
to the Students-t object when it is constructed.

Although the distribution classes in this library are templates, there
are typedefs on type /double/ that mostly take the usual name of the
distribution
(except where there is a clash with a function of the same name: beta and gamma,
in which case using the default template arguments - `RealType = double` -
is nearly as convenient).
Probably 95% of uses are covered by these typedefs:

    // using namespace boost::math; // Avoid potential ambiguity with names in std <random>
    // Safer to declare specific functions with using statement(s):

    using boost::math::beta_distribution;
    using boost::math::binomial_distribution;
    using boost::math::students_t;

    // Construct a students_t distribution with 4 degrees of freedom:
    students_t d1(4);

    // Construct a double-precision beta distribution
    // with parameters a = 10, b = 20
    beta_distribution<> d2(10, 20); // Note: _distribution<> suffix !

If you need to use the distributions with a type other than `double`,
then you can instantiate the template directly: the names of the
templates are the same as the `double` typedef but with `_distribution`
appended, for example: __students_t_distrib or __binomial_distrib:

    // Construct a students_t distribution, of float type,
    // with 4 degrees of freedom:
    students_t_distribution<float> d3(4);

    // Construct a binomial distribution, of long double type,
    // with probability of success 0.3
    // and 20 trials in total:
    binomial_distribution<long double> d4(20, 0.3);

The parameters passed to the distributions can be accessed via getter member
functions:

    d1.degrees_of_freedom(); // returns 4.0

This is all well and good, but not very useful so far. What we often want
is to be able to calculate the /cumulative distribution functions/ and
/quantiles/ etc for these distributions.

[endsect] [/ section:objects Distributions are Objects]

[section:generic Generic operations common to all distributions are non-member functions]

Want to calculate the PDF (Probability Density Function) of a distribution?
No problem, just use:

    pdf(my_dist, x); // Returns PDF (density) at point x of distribution my_dist.

Or how about the CDF (Cumulative Distribution Function):

    cdf(my_dist, x); // Returns CDF (integral from -infinity to point x)
                     // of distribution my_dist.

And quantiles are just the same:

    quantile(my_dist, p); // Returns the value of the random variable x
                          // such that cdf(my_dist, x) == p.

If you're wondering why these aren't member functions, it's to
make the library more easily extensible: if you want to add additional
generic operations - let's say the /n'th moment/ - then all you have to
do is add the appropriate non-member functions, overloaded for each
implemented distribution type.

[tip

[*Random numbers that approximate Quantiles of Distributions]

If you want random numbers that are distributed in a specific way,
for example according to a uniform, normal or triangular distribution,
see [@http://www.boost.org/libs/random/ Boost.Random].

Whilst in principle there's nothing to prevent you from using the
quantile function to convert a uniformly distributed random
number to another distribution, in practice there are much more
efficient algorithms available that are specific to random number generation.
] [/tip Random numbers that approximate Quantiles of Distributions]

For example, the binomial distribution has two parameters:
n (the number of trials) and p (the probability of success on any one trial).

The `binomial_distribution` constructor therefore has two parameters:

`binomial_distribution(RealType n, RealType p);`

For this distribution the __random_variate is k: the number of successes observed.
The probability density\/mass function (pdf) is therefore written as ['f(k; n, p)].
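For instance, here is a minimal sketch using the double-precision `binomial` typedef
(the parameter values 20 and 0.25, and the variate value 4, are arbitrary choices for illustration):

    #include <boost/math/distributions/binomial.hpp>

    using boost::math::binomial; // typedef for binomial_distribution<double>

    binomial b(20, 0.25);  // n = 20 trials, p = 0.25 probability of success on each trial.
    double d = pdf(b, 4);  // Probability (mass) of observing exactly k = 4 successes.
    double c = cdf(b, 4);  // Probability of observing k <= 4 successes.

The distribution object carries the parameters n and p; the random variate k is
always passed separately, as the second argument to the non-member functions.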
[note

[*Random Variates and Distribution Parameters]

The concept of a __random_variable is closely linked to the term __random_variate:
a random variate is a particular value (outcome) of a random variable.
Random variates and [@http://en.wikipedia.org/wiki/Parameter distribution parameters]
are conventionally distinguished (for example in Wikipedia and Wolfram MathWorld)
by placing a semi-colon or vertical bar
/after/ the __random_variate (whose value you 'choose'),
to separate the variate from the parameter(s) that define the shape of the distribution.

For example, the binomial distribution probability distribution function (PDF) is written as
[role serif_italic ['f(k| n, p)] = Pr(K = k|n, p) = ] probability of observing k successes out of n trials.
K is the __random_variable, k is the __random_variate,
the parameters are n (trials) and p (probability).
] [/note Random Variates and Distribution Parameters]

[note By convention, __random_variate values are lower case, usually k if integral, x if real, and
__random_variable names are upper case, K if integral, X if real. But this implementation treats
them all as floating-point values of type `RealType`, so if you really want an integral result,
you must round: see the note on Discrete Probability Distributions below for details.]

As noted above, the non-member function `pdf` has one parameter for the distribution object,
and a second for the random variate. So taking our binomial distribution
example, we would write:

`pdf(binomial_distribution<RealType>(n, p), k);`

The ranges of __random_variate values that are permitted and that are supported can be
tested using the two functions `range` and `support`.

The distribution (effectively the __random_variate) is said to be 'supported'
over a range that is
[@http://en.wikipedia.org/wiki/Probability_distribution
 "the smallest closed set whose complement has probability zero"].
MathWorld uses the word 'defined' for this range.
Non-mathematicians might say it means the 'interesting' smallest range
of random variate x over which the cdf goes from zero to unity.
Outside it are uninteresting zones where the pdf is zero, and the cdf is zero or unity.

For most distributions, with probability distribution functions one might describe
as 'well-behaved', we have decided that it is most useful for the supported range
to *exclude* random variate values like exact zero *if the end point is discontinuous*.
For example, the Weibull (scale 1, shape 1) distribution smoothly heads for unity
as the random variate x declines towards zero.
But at x = zero, the value of the pdf is suddenly exactly zero, by definition.
If you are plotting the PDF, or otherwise calculating,
zero is not the most useful value for the lower limit of the supported range, as we discovered.
So for this, and similar distributions,
we have decided it is most numerically useful to use
the closest representable value to zero, `min_value`, for the lower limit of the supported range.
(The `range` still starts from zero, so you will still get `pdf(weibull, 0) == 0`.)
(Exponential and gamma distributions have similarly discontinuous functions.)
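The following minimal sketch (using the Weibull example above, with shape and scale both 1)
shows how `range` and `support` can be queried, and how they differ at the lower end:

    #include <boost/math/distributions/weibull.hpp>
    #include <iostream>
    #include <utility>

    using boost::math::weibull_distribution;

    weibull_distribution<> w(1., 1.); // Shape 1, scale 1, as discussed above.

    // range: all permitted random-variate values; the lower limit is exactly zero.
    std::pair<double, double> r = range(w);
    // support: the 'interesting' range; the lower limit is the closest
    // representable value to zero rather than zero itself.
    std::pair<double, double> s = support(w);

    std::cout << "range   [" << r.first << ", " << r.second << "]" << std::endl;
    std::cout << "support [" << s.first << ", " << s.second << "]" << std::endl;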
Mathematically, the functions may make sense with an infinite (+ or -) value,
but except for a few special cases (in the Normal and Cauchy distributions)
this implementation limits random variates to finite values, from the `min`
to the `max` for the `RealType`.
(See [link math_toolkit.sf_implementation.handling_of_floating_point_infin
Handling of Floating-Point Infinity] for the rationale.)

[note

[*Discrete Probability Distributions]

Note that the [@http://en.wikipedia.org/wiki/Discrete_probability_distribution
discrete distributions], including the binomial, negative binomial, Poisson & Bernoulli,
are all mathematically defined as discrete functions:
that is to say the functions `cdf` and `pdf` are only defined for integral values
of the random variate.

However, because the method of calculation often uses continuous functions,
it is convenient to treat them as if they were continuous functions,
and to permit non-integral values of their parameters.

Users wanting to enforce a strict mathematical model may use `floor`
or `ceil` functions on the random variate prior to calling the distribution
function.

The quantile functions for these distributions are hard to specify
in a manner that will satisfy everyone all of the time. The default
behaviour is to return an integer result, that has been rounded
/outwards/: that is to say, lower quantiles - where the probability
is less than 0.5 - are rounded downwards, while upper quantiles - where
the probability is greater than 0.5 - are rounded upwards. This behaviour
ensures that if an X% quantile is requested, then /at least/ the requested
coverage will be present in the central region, and /no more than/
the requested coverage will be present in the tails.

This behaviour can be changed so that the quantile functions are rounded
differently, or return a real-valued result, using
[link math_toolkit.pol_overview Policies]. It is strongly
recommended that you read the tutorial
[link math_toolkit.pol_tutorial.understand_dis_quant
Understanding Quantiles of Discrete Distributions] before
using the quantile function on a discrete distribution. The
[link math_toolkit.pol_ref.discrete_quant_ref reference docs]
describe how to change the rounding policy
for these distributions.

For similar reasons, continuous distributions with parameters like
"degrees of freedom",
that might appear to be integral, are treated as real values
(and are promoted from integer to floating-point if necessary).
In this case, however, there are a small number of situations where non-integral
degrees of freedom do have a genuine meaning.
]

[endsect] [/ section:generic Generic operations common to all distributions are non-member functions]

[section:complements Complements are supported too - and when to use them]

Often you don't want the value of the CDF, but its complement, which is
to say `1-p` rather than `p`. It is tempting to calculate the CDF and subtract
it from `1`, but if `p` is very close to `1` then cancellation error
will cause you to lose accuracy, perhaps totally.
[link why_complements See below ['"Why and when to use complements?"]]

In this library, whenever you want to receive a complement, just wrap
all the function arguments in a call to `complement(...)`, for example:

    students_t dist(5);
    cout << "CDF at t = 1 is " << cdf(dist, 1.0) << endl;
    cout << "Complement of CDF at t = 1 is " << cdf(complement(dist, 1.0)) << endl;

But wait, now that we have a complement, we have to be able to use it as well.
Any function that accepts a probability as an argument can also accept the complement
of the probability instead: just wrap all of its arguments in a call to `complement(...)`, for example:

    students_t dist(5);

    for(double i = 10; i < 1e10; i *= 10)
    {
       // Calculate the quantile for a 1 in i chance:
       double t = quantile(complement(dist, 1/i));
       // Print it out:
       cout << "Quantile of students-t with 5 degrees of freedom\n"
               "for a 1 in " << i << " chance is " << t << endl;
    }

[tip

[*Critical values are just quantiles]

Some texts talk about quantiles, or percentiles, or fractiles,
others about critical values; the basic rule is:

['Lower critical values] are the same as the quantile.

['Upper critical values] are the same as the quantile from the complement
of the probability.

For example, suppose we have a Bernoulli process, giving rise to a binomial
distribution with success ratio 0.1 and 100 trials in total. The
['lower critical value] for a probability of 0.05 is given by:

`quantile(binomial(100, 0.1), 0.05)`

and the ['upper critical value] is given by:

`quantile(complement(binomial(100, 0.1), 0.05))`

which return 4.82 and 14.63 respectively.
]

[#why_complements]
[tip

[*Why bother with complements anyway?]

It's very tempting to dispense with complements, and simply subtract
the probability from 1 when required. However, consider what happens when
the probability is very close to 1: let's say the probability expressed at
float precision is `0.999999940f`; then `1 - 0.999999940f = 5.96046448e-008`,
but the result is actually accurate to just ['one single bit]: the only
bit that didn't cancel out!

Or to look at this another way: consider that we want the risk of falsely
rejecting the null-hypothesis in the Student's t test to be 1 in 1 billion,
for a sample size of 10,000.
This gives a probability of 1 - 10[super -9], which is exactly 1 when
calculated at float precision. In this case calculating the quantile from
the complement neatly solves the problem, so for example:

`quantile(complement(students_t(10000), 1e-9))`

returns the expected t-statistic `6.00336`, whereas:

`quantile(students_t(10000), 1-1e-9f)`

raises an overflow error, since it is the same as:

`quantile(students_t(10000), 1)`

which has no finite result.

With all distributions, even for more reasonable probabilities
(unless the value of p can be represented exactly in the floating-point type),
the loss of accuracy quickly becomes significant if you simply calculate the probability as 1 - p
(because the result will be mostly garbage digits for p ~ 1).

So always avoid, for example, using a probability near to unity like 0.99999

`quantile(my_distribution, 0.99999)`

and instead use

`quantile(complement(my_distribution, 0.00001))`

since 1 - 0.99999 is not exactly equal to 0.00001 when using floating-point arithmetic.
This assumes that the 0.00001 value is either a constant,
or can be computed by some means other than subtracting 0.99999 from 1.

] [/ tip Why bother with complements anyway?]

[endsect] [/ section:complements Complements are supported too - and when to use them]

[section:parameters Parameters can be calculated]

Sometimes it's the parameters that define the distribution that you
need to find. Suppose, for example, you have conducted a Students-t test
for equal means and the result is borderline. Maybe your two samples
differ from each other, or maybe they don't; based on the result
of the test you can't be sure. A legitimate question to ask then is
"How many more measurements would I have to take before I would get
an X% probability that the difference is real?" Parameter finders
can answer questions like this, and are necessarily different for
each distribution. They are implemented as static member functions
of the distributions, for example:

    students_t::find_degrees_of_freedom(
       1.3,    // difference from true mean to detect
       0.05,   // maximum risk of falsely rejecting the null-hypothesis.
       0.1,    // maximum risk of falsely failing to reject the null-hypothesis.
       0.13);  // sample standard deviation

This returns the number of degrees of freedom required to obtain a 95%
probability that the observed difference in means is not down to
chance alone. In the case that a borderline Students-t test result
was previously obtained, this can be used to estimate how large the sample size
would have to become before the observed difference was considered
significant. It assumes, of course, that the sample mean and standard
deviation are invariant with sample size.

[endsect] [/ section:parameters Parameters can be calculated]

[section:summary Summary]

* Distributions are objects, which are constructed from whatever
parameters the distribution may have.
* Member functions allow you to retrieve the parameters of a distribution.
* Generic non-member functions provide access to the properties that
are common to all the distributions (PDF, CDF, quantile etc).
* Complements of probabilities are calculated by wrapping the function's
arguments in a call to `complement(...)`.
* Functions that accept a probability can accept a complement of the
probability as well, by wrapping the function's
arguments in a call to `complement(...)`.
* Static member functions allow the parameters of a distribution
to be found from other information.

These points are drawn together in the short sketch below; after that,
the next section looks at some worked examples.
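The sketch is purely illustrative (the 8 degrees of freedom, the variate 2.0,
and the probabilities used are arbitrary values, not taken from the examples above):

    #include <boost/math/distributions/students_t.hpp>
    #include <iostream>

    using boost::math::students_t;

    students_t dist(8);                    // Constructed from its parameter: 8 degrees of freedom.
    double df = dist.degrees_of_freedom(); // Member function retrieves the parameter: 8.
    double p  = cdf(dist, 2.0);            // Generic non-member function: P(X <= 2).
    double q  = quantile(dist, 0.975);     // Generic non-member function: 97.5% quantile.
    double u  = quantile(complement(dist, 0.025)); // The same upper critical value, via the complement.

    // Static member function: find a parameter from other information
    // (repeating the find_degrees_of_freedom example from the previous section):
    double n = students_t::find_degrees_of_freedom(1.3, 0.05, 0.1, 0.13);

    std::cout << df << " " << p << " " << q << " " << u << " " << n << std::endl;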
[endsect] [/ section:summary Summary]
[endsect] [/ section:overview Overview]

[section:weg Worked Examples]
[include distribution_construction.qbk]
[include students_t_examples.qbk]
[include chi_squared_examples.qbk]
[include f_dist_example.qbk]
[include binomial_example.qbk]
[include geometric_example.qbk]
[include negative_binomial_example.qbk]
[include normal_example.qbk]
[/include inverse_gamma_example.qbk]
[/include inverse_gaussian_example.qbk]
[include inverse_chi_squared_eg.qbk]
[include nc_chi_squared_example.qbk]
[include error_handling_example.qbk]
[include find_location_and_scale.qbk]
[include nag_library.qbk]
[include c_sharp.qbk]
[endsect] [/ section:weg Worked Examples]

[include background.qbk]

[endsect] [/ section:stat_tut Statistical Distributions Tutorial]

[/ dist_tutorial.qbk
  Copyright 2006, 2010, 2011 John Maddock and Paul A. Bristow.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]