1[section:hypergeometric_dist Hypergeometric Distribution] 2 3``#include <boost/math/distributions/hypergeometric.hpp>`` 4 5 namespace boost{ namespace math{ 6 7 template <class RealType = double, 8 class ``__Policy`` = ``__policy_class`` > 9 class hypergeometric_distribution; 10 11 template <class RealType, class Policy> 12 class hypergeometric_distribution 13 { 14 public: 15 typedef RealType value_type; 16 typedef Policy policy_type; 17 // Construct: 18 hypergeometric_distribution(unsigned r, unsigned n, unsigned N); 19 // Accessors: 20 unsigned total()const; 21 unsigned defective()const; 22 unsigned sample_count()const; 23 }; 24 25 typedef hypergeometric_distribution<> hypergeometric; 26 27 }} // namespaces 28 29The hypergeometric distribution describes the number of "events" /k/ 30from a sample /n/ drawn from a total population /N/ ['without replacement]. 31 32Imagine we have a sample of /N/ objects of which /r/ are "defective" 33and N-r are "not defective" 34(the terms "success\/failure" or "red\/blue" are also used). If we sample /n/ 35items /without replacement/ then what is the probability that exactly 36/k/ items in the sample are defective? The answer is given by the pdf of the 37hypergeometric distribution `f(k; r, n, N)`, whilst the probability of 38/k/ defectives or fewer is given by F(k; r, n, N), where F(k) is the 39CDF of the hypergeometric distribution. 40 41[note Unlike almost all of the other distributions in this library, 42the hypergeometric distribution is strictly discrete: it can not be 43extended to real valued arguments of its parameters or random variable.] 44 45The following graph shows how the distribution changes as the proportion 46of "defective" items changes, while keeping the population and sample sizes 47constant: 48 49[graph hypergeometric_pdf_1] 50 51Note that since the distribution is symmetrical in parameters /n/ and /r/, if we 52change the sample size and keep the population and proportion "defective" the same 53then we obtain basically the same graphs: 54 55[graph hypergeometric_pdf_2] 56 57[h4 Member Functions] 58 59 hypergeometric_distribution(unsigned r, unsigned n, unsigned N); 60 61Constructs a hypergeometric distribution with a population of /N/ objects, 62of which /r/ are defective, and from which /n/ are sampled. 63 64 unsigned total()const; 65 66Returns the total number of objects /N/. 67 68 unsigned defective()const; 69 70Returns the number of objects /r/ in population /N/ which are defective. 71 72 unsigned sample_count()const; 73 74Returns the number of objects /n/ which are sampled from the population /N/. 75 76[h4 Non-member Accessors] 77 78All the [link math_toolkit.dist_ref.nmp usual non-member accessor functions] 79that are generic to all distributions are supported: __usual_accessors. 80 81The domain of the random variable is the unsigned integers in the range 82\[max(0, n + r - N), min(n, r)\]. A __domain_error is raised if the 83random variable is outside this range, or is not an integral value. 84 85[caution 86The quantile function will by default return an integer result that has been 87/rounded outwards/. That is to say lower quantiles (where the probability is 88less than 0.5) are rounded downward, and upper quantiles (where the probability 89is greater than 0.5) are rounded upwards. This behaviour 90ensures that if an X% quantile is requested, then /at least/ the requested 91coverage will be present in the central region, and /no more than/ 92the requested coverage will be present in the tails. 93 94This behaviour can be changed so that the quantile functions are rounded 95differently using 96[link math_toolkit.pol_overview Policies]. It is strongly 97recommended that you read the tutorial 98[link math_toolkit.pol_tutorial.understand_dis_quant 99Understanding Quantiles of Discrete Distributions] before 100using the quantile function on the Hypergeometric distribution. The 101[link math_toolkit.pol_ref.discrete_quant_ref reference docs] 102describe how to change the rounding policy 103for these distributions. 104 105However, note that the implementation method of the quantile function 106always returns an integral value, therefore attempting to use a __Policy 107that requires (or produces) a real valued result will result in a 108compile time error. 109] [/ caution] 110 111 112[h4 Accuracy] 113 114For small N such that 115`N < boost::math::max_factorial<RealType>::value` then table based 116lookup of the results gives an accuracy to a few epsilon. 117`boost::math::max_factorial<RealType>::value` is 170 at double or long double 118precision. 119 120For larger N such that `N < boost::math::prime(boost::math::max_prime)` 121then only basic arithmetic is required for the calculation 122and the accuracy is typically < 20 epsilon. This takes care of N 123up to 104729. 124 125For `N > boost::math::prime(boost::math::max_prime)` then accuracy quickly 126degrades, with 5 or 6 decimal digits being lost for N = 110000. 127 128In general for very large N, the user should expect to lose log[sub 10]N 129decimal digits of precision during the calculation, with the results 130becoming meaningless for N >= 10[super 15]. 131 132[h4 Testing] 133 134There are three sets of tests: our implementation is tested against a table of values 135produced by Mathematica's implementation of this distribution. We also sanity check 136our implementation against some spot values computed using the online calculator 137here [@http://stattrek.com/Tables/Hypergeometric.aspx http://stattrek.com/Tables/Hypergeometric.aspx]. 138Finally we test accuracy against some high precision test data using 139this implementation and NTL::RR. 140 141[h4 Implementation] 142 143The PDF can be calculated directly using the formula: 144 145[equation hypergeometric1] 146 147However, this can only be used directly when the largest of the factorials 148is guaranteed not to overflow the floating point representation used. 149This formula is used directly when `N < max_factorial<RealType>::value` 150in which case table lookup of the factorials gives a rapid and accurate 151implementation method. 152 153For larger /N/ the method described in 154"An Accurate Computation of the Hypergeometric Distribution Function", 155Trong Wu, ACM Transactions on Mathematical Software, Vol. 19, No. 1, 156March 1993, Pages 33-43 is used. The method relies on the fact that 157there is an easy method for factorising a factorial into the product 158of prime numbers: 159 160[equation hypergeometric2] 161 162Where p[sub i] is the i'th prime number, and e[sub i] is a small 163positive integer or zero, which can be calculated via: 164 165[equation hypergeometric3] 166 167Further we can combine the factorials in the expression for the PDF 168to yield the PDF directly as the product of prime numbers: 169 170[equation hypergeometric4] 171 172With this time the exponents e[sub i] being either positive, negative 173or zero. Indeed such a degree of cancellation occurs in the calculation 174of the e[sub i] that many are zero, and typically most have a magnitude 175or no more than 1 or 2. 176 177Calculation of the product of the primes requires some care to prevent 178numerical overflow, we use a novel recursive method which splits the 179calculation into a series of sub-products, with a new sub-product 180started each time the next multiplication would cause either overflow 181or underflow. The sub-products are stored in a linked list on the 182program stack, and combined in an order that will guarantee no overflow 183or unnecessary-underflow once the last sub-product has been calculated. 184 185This method can be used as long as N is smaller than the largest prime 186number we have stored in our table of primes (currently 104729). The method 187is relatively slow (calculating the exponents requires the most time), but 188requires only a small number of arithmetic operations to 189calculate the result (indeed there is no shorter method involving only basic 190arithmetic once the exponents have been found), the method is therefore 191much more accurate than the alternatives. 192 193For much larger N, we can calculate the PDF from the factorials using 194either lgamma, or by directly combining lanczos approximations to avoid 195calculating via logarithms. We use the latter method, as it is usually 1961 or 2 decimal digits more accurate than computing via logarithms with 197lgamma. However, in this area where N > 104729, the user should expect 198to lose around log[sub 10]N decimal digits during the calculation in 199the worst case. 200 201The CDF and its complement is calculated by directly summing the PDF's. 202We start by deciding whether the CDF, or its complement, is likely to be 203the smaller of the two and then calculate the PDF at /k/ (or /k+1/ if we're 204calculating the complement) and calculate successive PDF values via the 205recurrence relations: 206 207[equation hypergeometric5] 208 209Until we either reach the end of the distributions domain, or the next 210PDF value to be summed would be too small to affect the result. 211 212The quantile is calculated in a similar manner to the CDF: we first guess 213which end of the distribution we're nearer to, and then sum PDFs starting 214from the end of the distribution this time, until we have some value /k/ that 215gives the required CDF. 216 217The median is simply the quantile at 0.5, and the remaining properties are 218calculated via: 219 220[equation hypergeometric6] 221 222[endsect] 223 224[/ hypergeometric.qbk 225 Copyright 2008 John Maddock. 226 Distributed under the Boost Software License, Version 1.0. 227 (See accompanying file LICENSE_1_0.txt or copy at 228 http://www.boost.org/LICENSE_1_0.txt). 229] 230