1[/ 2 Copyright 2018 Nick Thompson 3 4 Distributed under the Boost Software License, Version 1.0. 5 (See accompanying file LICENSE_1_0.txt or copy at 6 http://www.boost.org/LICENSE_1_0.txt). 7] 8 9[section:univariate_statistics Univariate Statistics] 10 11[heading Synopsis] 12 13`` 14#include <boost/math/statistics/univariate_statistics.hpp> 15 16namespace boost{ namespace math{ namespace statistics { 17 18 template<class Container> 19 auto mean(Container const & c); 20 21 template<class ForwardIterator> 22 auto mean(ForwardIterator first, ForwardIterator last); 23 24 template<class Container> 25 auto variance(Container const & c); 26 27 template<class ForwardIterator> 28 auto variance(ForwardIterator first, ForwardIterator last); 29 30 template<class Container> 31 auto sample_variance(Container const & c); 32 33 template<class ForwardIterator> 34 auto sample_variance(ForwardIterator first, ForwardIterator last); 35 36 template<class Container> 37 auto mean_and_sample_variance(Container const & c); 38 39 template<class Container> 40 auto skewness(Container const & c); 41 42 template<class ForwardIterator> 43 auto skewness(ForwardIterator first, ForwardIterator last); 44 45 template<class Container> 46 auto kurtosis(Container const & c); 47 48 template<class ForwardIterator> 49 auto kurtosis(ForwardIterator first, ForwardIterator last); 50 51 template<class Container> 52 auto excess_kurtosis(Container const & c); 53 54 template<class ForwardIterator> 55 auto excess_kurtosis(ForwardIterator first, ForwardIterator last); 56 57 template<class Container> 58 auto first_four_moments(Container const & c); 59 60 template<class ForwardIterator> 61 auto first_four_moments(ForwardIterator first, ForwardIterator last); 62 63 template<class Container> 64 auto median(Container & c); 65 66 template<class ForwardIterator> 67 auto median(ForwardIterator first, ForwardIterator last); 68 69 template<class RandomAccessIterator> 70 auto median_absolute_deviation(ForwardIterator first, ForwardIterator last, typename std::iterator_traits<RandomAccessIterator>::value_type center=std::numeric_limits<Real>::quiet_NaN()); 71 72 template<class RandomAccessContainer> 73 auto median_absolute_deviation(RandomAccessContainer v, typename RandomAccessContainer::value_type center=std::numeric_limits<Real>::quiet_NaN()); 74 75 template<class RandomAccessIterator> 76 auto interquartile_range(ForwardIterator first, ForwardIterator last); 77 78 template<class RandomAccessContainer> 79 auto interquartile_range(RandomAccessContainer v); 80 81 template<class Container> 82 auto gini_coefficient(Container & c); 83 84 template<class ForwardIterator> 85 auto gini_coefficient(ForwardIterator first, ForwardIterator last); 86 87 template<class Container> 88 auto sample_gini_coefficient(Container & c); 89 90 template<class ForwardIterator> 91 auto sample_gini_coefficient(ForwardIterator first, ForwardIterator last); 92 93}}} 94`` 95 96[heading Description] 97 98The file `boost/math/statistics/univariate_statistics.hpp` is a set of facilities for computing scalar values from vectors. 99 100Many of these functionals have trivial naive implementations, but experienced programmers will recognize that even trivial algorithms are easy to screw up, and that numerical instabilities often lurk in corner cases. 101We have attempted to do our "due diligence" to root out these problems-scouring the literature for numerically stable algorithms for even the simplest of functionals. 102 103/Nota bene/: Some similar functionality is provided in [@https://www.boost.org/doc/libs/1_68_0/doc/html/accumulators/user_s_guide.html Boost Accumulators Framework]. 104These accumulators should be used in real-time applications; `univariate_statistics.hpp` should be used when CPU vectorization is needed. 105As a reminder, remember that to actually /get/ vectorization, compile with `-march=native -O3` flags. 106 107We now describe each functional in detail. 108Our examples use `std::vector<double>` to hold the data, but this not required. 109In general, you can store your data in an Eigen array, and Armadillo vector, `std::array`, and for many of the routines, a `std::forward_list`. 110These routines are usable in float, double, long double, and Boost.Multiprecision precision, as well as their complex extensions whenever the computation is well-defined. 111For certain operations (total variation, for example) integer inputs are supported. 112 113[heading Mean] 114 115 std::vector<double> v{1,2,3,4,5}; 116 double mu = boost::math::statistics::mean(v.cbegin(), v.cend()); 117 // Alternative syntax if you want to use entire container: 118 mu = boost::math::statistics::mean(v); 119 120The implementation follows [@https://doi.org/10.1137/1.9780898718027 Higham 1.6a]. 121The data is not modified and must be forward iterable. 122Works with real and integer data. 123If the input is an integer type, the output is a double precision float. 124 125[heading Variance] 126 127 std::vector<double> v{1,2,3,4,5}; 128 Real sigma_sq = boost::math::statistics::variance(v.cbegin(), v.cend()); 129 130If you don't need to calculate on a subset of the input, then the range call is more terse: 131 132 std::vector<double> v{1,2,3,4,5}; 133 Real sigma_sq = boost::math::statistics::variance(v); 134 135The implementation follows [@https://doi.org/10.1137/1.9780898718027 Higham 1.6b]. 136The input data must be forward iterable and the range `[first, last)` must contain at least two elements. 137It is /not/ in general sensible to pass complex numbers to this routine. 138If integers are passed as input, then the output is a double precision float. 139 140`boost::math::statistics::variance` returns the population variance. 141If you want a sample variance, use 142 143 std::vector<double> v{1,2,3,4,5}; 144 Real sn_sq = boost::math::statistics::sample_variance(v); 145 146 147[heading Skewness] 148 149Computes the skewness of a dataset: 150 151 std::vector<double> v{1,2,3,4,5}; 152 double skewness = boost::math::statistics::skewness(v); 153 // skewness = 0. 154 155The input vector is not modified, works with integral and real data. 156If the input data is integral, the output is a double precision float. 157 158For a dataset consisting of a single constant value, we take the skewness to be zero by definition. 159 160The implementation follows [@https://prod.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf Pebay]. 161 162[heading Kurtosis] 163 164Computes the kurtosis of a dataset: 165 166 std::vector<double> v{1,2,3,4,5}; 167 double kurtosis = boost::math::statistics::kurtosis(v); 168 // kurtosis = 17/10 169 170The implementation follows [@https://prod.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf Pebay]. 171The input data must be forward iterable and must consist of real or integral values. 172If the input data is integral, the output is a double precision float. 173Note that this is /not/ the excess kurtosis. 174If you require the excess kurtosis, use `boost::math::statistics::excess_kurtosis`. 175This function simply subtracts 3 from the kurtosis, but it makes eminently clear our definition of kurtosis. 176 177[heading First four moments] 178 179Simultaneously computes the first four [@https://en.wikipedia.org/wiki/Central_moment central moments] in a single pass through the data: 180 181 std::vector<double> v{1,2,3,4,5}; 182 auto [M1, M2, M3, M4] = boost::math::statistics::first_four_moments(v); 183 184 185[heading Median] 186 187Computes the median of a dataset: 188 189 std::vector<double> v{1,2,3,4,5}; 190 double m = boost::math::statistics::median(v.begin(), v.end()); 191 192/Nota bene: The input vector is modified./ 193The calculation of the median is a thin wrapper around the C++11 [@https://en.cppreference.com/w/cpp/algorithm/nth_element `nth_element`]. 194Therefore, all requirements of `std::nth_element` are inherited by the median calculation. 195In particular, the container must allow random access. 196 197[heading Median Absolute Deviation] 198 199Computes the [@https://en.wikipedia.org/wiki/Median_absolute_deviation median absolute deviation] of a dataset: 200 201 std::vector<double> v{1,2,3,4,5}; 202 double mad = boost::math::statistics::median_absolute_deviation(v); 203 204By default, the deviation from the median is used. 205If you have some prior that the median is zero, or wish to compute the median absolute deviation from the mean, 206use the following: 207 208 // prior is that center is zero: 209 double center = 0; 210 double mad = boost::math::statistics::median_absolute_deviation(v, center); 211 212 // compute median absolute deviation from the mean: 213 double mu = boost::math::statistics::mean(v); 214 double mad = boost::math::statistics::median_absolute_deviation(v, mu); 215 216/Nota bene:/ The input vector is modified. 217Again the vector is passed into a call to [@https://en.cppreference.com/w/cpp/algorithm/nth_element `nth_element`]. 218 219[heading Interquartile Range] 220 221Computes the [@https://en.wikipedia.org/wiki/Interquartile_range interquartile range] of a dataset: 222 223 std::vector<double> v{1,2,3,4,5}; 224 double iqr = boost::math::statistics::interquartile_range(v); 225 // Q1 = 1.5, Q3 = 4.5 => iqr = 3 226 227For a vector of length /2n+1/ or /2n/, the first quartile /Q/[sub 1] is the median of the /n/ smallest values, 228and the third quartile /Q/[sub 3] is the median of the /n/ largest values. 229The interquartile range is then /Q/[sub 3] - /Q/[sub 1]. 230The function `interquartile_range`, like the `median`, calls into `std::nth_element`, and hence partially sorts the data. 231 232[heading Gini Coefficient] 233 234Compute the Gini coefficient of a dataset: 235 236 std::vector<double> v{1,0,0,0}; 237 double gini = boost::math::statistics::gini_coefficient(v); 238 // gini = 3/4 239 double s_gini = boost::math::statistics::sample_gini_coefficient(v); 240 // s_gini = 1. 241 std::vector<double> w{1,1,1,1}; 242 gini = boost::math::statistics::gini_coefficient(w.begin(), w.end()); 243 // gini = 0, as all elements are now equal. 244 245/Nota bene/: The input data is altered: in particular, it is sorted. Makes a call to `std::sort`, and as such requires random access iterators. 246 247The sample Gini coefficient lies in the range [0,1], whereas the population Gini coefficient is in the range [0, 1 - 1/ /n/]. 248 249/Nota bene:/ There is essentially no reason to pass negative values to the Gini coefficient function. 250However, a use case (measuring wealth inequality when some people have negative wealth) exists, so we do not throw an exception when negative values are encountered. 251You should have /very/ good cause to pass negative values to the Gini coefficient calculator. 252Another use case is found in signal processing, but the sorting is by magnitude and hence has a different implementation. 253See `absolute_gini_coefficient` for details. 254 255[heading References] 256 257* Higham, Nicholas J. ['Accuracy and stability of numerical algorithms.] Vol. 80. Siam, 2002. 258* Philippe P. Pébay: ["Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments.] Technical Report SAND2008-6212, Sandia National Laboratories, September 2008. 259 260[endsect] 261[/section:univariate_statistics Univariate Statistics] 262