• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1[/
2  Copyright 2018 Nick Thompson
3
4  Distributed under the Boost Software License, Version 1.0.
5  (See accompanying file LICENSE_1_0.txt or copy at
6  http://www.boost.org/LICENSE_1_0.txt).
7]
8
9[section:univariate_statistics Univariate Statistics]
10
11[heading Synopsis]
12
13``
14#include <boost/math/statistics/univariate_statistics.hpp>
15
16namespace boost{ namespace math{ namespace statistics {
17
18    template<class Container>
19    auto mean(Container const & c);
20
21    template<class ForwardIterator>
22    auto mean(ForwardIterator first, ForwardIterator last);
23
24    template<class Container>
25    auto variance(Container const & c);
26
27    template<class ForwardIterator>
28    auto variance(ForwardIterator first, ForwardIterator last);
29
30    template<class Container>
31    auto sample_variance(Container const & c);
32
33    template<class ForwardIterator>
34    auto sample_variance(ForwardIterator first, ForwardIterator last);
35
36    template<class Container>
37    auto mean_and_sample_variance(Container const & c);
38
39    template<class Container>
40    auto skewness(Container const & c);
41
42    template<class ForwardIterator>
43    auto skewness(ForwardIterator first, ForwardIterator last);
44
45    template<class Container>
46    auto kurtosis(Container const & c);
47
48    template<class ForwardIterator>
49    auto kurtosis(ForwardIterator first, ForwardIterator last);
50
51    template<class Container>
52    auto excess_kurtosis(Container const & c);
53
54    template<class ForwardIterator>
55    auto excess_kurtosis(ForwardIterator first, ForwardIterator last);
56
57    template<class Container>
58    auto first_four_moments(Container const & c);
59
60    template<class ForwardIterator>
61    auto first_four_moments(ForwardIterator first, ForwardIterator last);
62
63    template<class Container>
64    auto median(Container & c);
65
66    template<class ForwardIterator>
67    auto median(ForwardIterator first, ForwardIterator last);
68
69    template<class RandomAccessIterator>
70    auto median_absolute_deviation(ForwardIterator first, ForwardIterator last, typename std::iterator_traits<RandomAccessIterator>::value_type center=std::numeric_limits<Real>::quiet_NaN());
71
72    template<class RandomAccessContainer>
73    auto median_absolute_deviation(RandomAccessContainer v, typename RandomAccessContainer::value_type center=std::numeric_limits<Real>::quiet_NaN());
74
75    template<class RandomAccessIterator>
76    auto interquartile_range(ForwardIterator first, ForwardIterator last);
77
78    template<class RandomAccessContainer>
79    auto interquartile_range(RandomAccessContainer v);
80
81    template<class Container>
82    auto gini_coefficient(Container & c);
83
84    template<class ForwardIterator>
85    auto gini_coefficient(ForwardIterator first, ForwardIterator last);
86
87    template<class Container>
88    auto sample_gini_coefficient(Container & c);
89
90    template<class ForwardIterator>
91    auto sample_gini_coefficient(ForwardIterator first, ForwardIterator last);
92
93}}}
94``
95
96[heading Description]
97
98The file `boost/math/statistics/univariate_statistics.hpp` is a set of facilities for computing scalar values from vectors.
99
100Many of these functionals have trivial naive implementations, but experienced programmers will recognize that even trivial algorithms are easy to screw up, and that numerical instabilities often lurk in corner cases.
101We have attempted to do our "due diligence" to root out these problems-scouring the literature for numerically stable algorithms for even the simplest of functionals.
102
103/Nota bene/: Some similar functionality is provided in [@https://www.boost.org/doc/libs/1_68_0/doc/html/accumulators/user_s_guide.html Boost Accumulators Framework].
104These accumulators should be used in real-time applications; `univariate_statistics.hpp` should be used when CPU vectorization is needed.
105As a reminder, remember that to actually /get/ vectorization, compile with `-march=native -O3` flags.
106
107We now describe each functional in detail.
108Our examples use `std::vector<double>` to hold the data, but this not required.
109In general, you can store your data in an Eigen array, and Armadillo vector, `std::array`, and for many of the routines, a `std::forward_list`.
110These routines are usable in float, double, long double, and Boost.Multiprecision precision, as well as their complex extensions whenever the computation is well-defined.
111For certain operations (total variation, for example) integer inputs are supported.
112
113[heading Mean]
114
115    std::vector<double> v{1,2,3,4,5};
116    double mu = boost::math::statistics::mean(v.cbegin(), v.cend());
117    // Alternative syntax if you want to use entire container:
118    mu = boost::math::statistics::mean(v);
119
120The implementation follows [@https://doi.org/10.1137/1.9780898718027 Higham 1.6a].
121The data is not modified and must be forward iterable.
122Works with real and integer data.
123If the input is an integer type, the output is a double precision float.
124
125[heading Variance]
126
127    std::vector<double> v{1,2,3,4,5};
128    Real sigma_sq = boost::math::statistics::variance(v.cbegin(), v.cend());
129
130If you don't need to calculate on a subset of the input, then the range call is more terse:
131
132    std::vector<double> v{1,2,3,4,5};
133    Real sigma_sq = boost::math::statistics::variance(v);
134
135The implementation follows [@https://doi.org/10.1137/1.9780898718027 Higham 1.6b].
136The input data must be forward iterable and the range `[first, last)` must contain at least two elements.
137It is /not/ in general sensible to pass complex numbers to this routine.
138If integers are passed as input, then the output is a double precision float.
139
140`boost::math::statistics::variance` returns the population variance.
141If you want a sample variance, use
142
143    std::vector<double> v{1,2,3,4,5};
144    Real sn_sq = boost::math::statistics::sample_variance(v);
145
146
147[heading Skewness]
148
149Computes the skewness of a dataset:
150
151    std::vector<double> v{1,2,3,4,5};
152    double skewness = boost::math::statistics::skewness(v);
153    // skewness = 0.
154
155The input vector is not modified, works with integral and real data.
156If the input data is integral, the output is a double precision float.
157
158For a dataset consisting of a single constant value, we take the skewness to be zero by definition.
159
160The implementation follows [@https://prod.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf Pebay].
161
162[heading Kurtosis]
163
164Computes the kurtosis of a dataset:
165
166    std::vector<double> v{1,2,3,4,5};
167    double kurtosis = boost::math::statistics::kurtosis(v);
168    // kurtosis = 17/10
169
170The implementation follows [@https://prod.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf Pebay].
171The input data must be forward iterable and must consist of real or integral values.
172If the input data is integral, the output is a double precision float.
173Note that this is /not/ the excess kurtosis.
174If you require the excess kurtosis, use `boost::math::statistics::excess_kurtosis`.
175This function simply subtracts 3 from the kurtosis, but it makes eminently clear our definition of kurtosis.
176
177[heading First four moments]
178
179Simultaneously computes the first four [@https://en.wikipedia.org/wiki/Central_moment central moments] in a single pass through the data:
180
181    std::vector<double> v{1,2,3,4,5};
182    auto [M1, M2, M3, M4] = boost::math::statistics::first_four_moments(v);
183
184
185[heading Median]
186
187Computes the median of a dataset:
188
189    std::vector<double> v{1,2,3,4,5};
190    double m = boost::math::statistics::median(v.begin(), v.end());
191
192/Nota bene: The input vector is modified./
193The calculation of the median is a thin wrapper around the C++11 [@https://en.cppreference.com/w/cpp/algorithm/nth_element `nth_element`].
194Therefore, all requirements of `std::nth_element` are inherited by the median calculation.
195In particular, the container must allow random access.
196
197[heading Median Absolute Deviation]
198
199Computes the [@https://en.wikipedia.org/wiki/Median_absolute_deviation median absolute deviation] of a dataset:
200
201    std::vector<double> v{1,2,3,4,5};
202    double mad = boost::math::statistics::median_absolute_deviation(v);
203
204By default, the deviation from the median is used.
205If you have some prior that the median is zero, or wish to compute the median absolute deviation from the mean,
206use the following:
207
208    // prior is that center is zero:
209    double center = 0;
210    double mad = boost::math::statistics::median_absolute_deviation(v, center);
211
212    // compute median absolute deviation from the mean:
213    double mu = boost::math::statistics::mean(v);
214    double mad = boost::math::statistics::median_absolute_deviation(v, mu);
215
216/Nota bene:/ The input vector is modified.
217Again the vector is passed into a call to [@https://en.cppreference.com/w/cpp/algorithm/nth_element `nth_element`].
218
219[heading Interquartile Range]
220
221Computes the [@https://en.wikipedia.org/wiki/Interquartile_range interquartile range] of a dataset:
222
223    std::vector<double> v{1,2,3,4,5};
224    double iqr = boost::math::statistics::interquartile_range(v);
225    // Q1 = 1.5, Q3 = 4.5 => iqr = 3
226
227For a vector of length /2n+1/ or /2n/, the first quartile /Q/[sub 1] is the median of the /n/ smallest values,
228and the third quartile /Q/[sub 3] is the median of the /n/ largest values.
229The interquartile range is then /Q/[sub 3] - /Q/[sub 1].
230The function `interquartile_range`, like the `median`, calls into `std::nth_element`, and hence partially sorts the data.
231
232[heading Gini Coefficient]
233
234Compute the Gini coefficient of a dataset:
235
236    std::vector<double> v{1,0,0,0};
237    double gini = boost::math::statistics::gini_coefficient(v);
238    // gini = 3/4
239    double s_gini = boost::math::statistics::sample_gini_coefficient(v);
240    // s_gini = 1.
241    std::vector<double> w{1,1,1,1};
242    gini = boost::math::statistics::gini_coefficient(w.begin(), w.end());
243    // gini = 0, as all elements are now equal.
244
245/Nota bene/: The input data is altered: in particular, it is sorted. Makes a call to `std::sort`, and as such requires random access iterators.
246
247The sample Gini coefficient lies in the range [0,1], whereas the population Gini coefficient is in the range [0, 1 - 1/ /n/].
248
249/Nota bene:/ There is essentially no reason to pass negative values to the Gini coefficient function.
250However, a use case (measuring wealth inequality when some people have negative wealth) exists, so we do not throw an exception when negative values are encountered.
251You should have /very/ good cause to pass negative values to the Gini coefficient calculator.
252Another use case is found in signal processing, but the sorting is by magnitude and hence has a different implementation.
253See `absolute_gini_coefficient` for details.
254
255[heading References]
256
257* Higham, Nicholas J. ['Accuracy and stability of numerical algorithms.] Vol. 80. Siam, 2002.
258* Philippe P. Pébay: ["Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments.] Technical Report SAND2008-6212, Sandia National Laboratories, September 2008.
259
260[endsect]
261[/section:univariate_statistics Univariate Statistics]
262