:mod:`!statistics` --- Mathematical statistics functions
========================================================

.. module:: statistics
   :synopsis: Mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   import math
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`~numbers.Real`-valued) data.

The module is not intended to be a competitor to third-party libraries such
as `NumPy <https://numpy.org>`_, `SciPy <https://scipy.org/>`_, or
proprietary full-featured statistics packages aimed at professional
statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
graphing and scientific calculators.

Unless explicitly noted, these functions support :class:`int`,
:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
currently unsupported.  Collections with a mix of types are also undefined
and implementation-dependent.  If your input data consists of mixed types,
you may be able to use :func:`map` to ensure a consistent result, for
example: ``map(float, input_data)``.

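For instance, a collection mixing ints, floats, Decimals, and Fractions can be
coerced to all-floats before being passed to any of these functions (a minimal
sketch; the variable names are illustrative):

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# A mixed-type collection has undefined behaviour; coerce it first.
mixed = [1, 2.5, Decimal("3.5"), Fraction(1, 2)]
consistent = list(map(float, mixed))
print(mean(consistent))   # 1.875
```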
Some datasets use ``NaN`` (not a number) values to represent missing data.
Since NaNs have unusual comparison semantics, they cause surprising or
undefined behaviors in the statistics functions that sort data or that count
occurrences.  The functions affected are ``median()``, ``median_low()``,
``median_high()``, ``median_grouped()``, ``mode()``, ``multimode()``, and
``quantiles()``.  The ``NaN`` values should be stripped before calling these
functions::

    >>> from statistics import median
    >>> from math import isnan
    >>> from itertools import filterfalse

    >>> data = [20.7, float('NaN'), 19.2, 18.3, float('NaN'), 14.4]
    >>> sorted(data)  # This has surprising behavior
    [20.7, nan, 14.4, 18.3, 19.2, nan]
    >>> median(data)  # This result is unexpected
    16.35

    >>> sum(map(isnan, data))    # Number of missing values
    2
    >>> clean = list(filterfalse(isnan, data))  # Strip NaN values
    >>> clean
    [20.7, 19.2, 18.3, 14.4]
    >>> sorted(clean)  # Sorting now works as expected
    [14.4, 18.3, 19.2, 20.7]
    >>> median(clean)       # This result is now well defined
    18.75


Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  ===============================================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`fmean`            Fast, floating-point arithmetic mean, with optional weighting.
:func:`geometric_mean`   Geometric mean of data.
:func:`harmonic_mean`    Harmonic mean of data.
:func:`kde`              Estimate the probability density distribution of the data.
:func:`kde_random`       Random sampling from the PDF generated by kde().
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median (50th percentile) of grouped data.
:func:`mode`             Single mode (most common value) of discrete or nominal data.
:func:`multimode`        List of modes (most common values) of discrete or nominal data.
:func:`quantiles`        Divide data into intervals with equal probability.
=======================  ===============================================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================

Statistics for relations between two inputs
-------------------------------------------

These functions calculate statistics regarding relations between two inputs.

=========================  =====================================================
:func:`covariance`         Sample covariance for two variables.
:func:`correlation`        Pearson and Spearman's correlation coefficients.
:func:`linear_regression`  Slope and intercept for simple linear regression.
=========================  =====================================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data* which can be a sequence or iterable.

   The arithmetic mean is the sum of the data divided by the number of data
   points.  It is commonly called "the average", although it is only one of many
   different mathematical averages.  It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by `outliers
      <https://en.wikipedia.org/wiki/Outlier>`_ and is not necessarily a
      typical example of the data points. For a more robust, although less
      efficient, measure of `central tendency
      <https://en.wikipedia.org/wiki/Central_tendency>`_, see :func:`median`.

      The sample mean gives an unbiased estimate of the true population mean,
      so that when taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population.  If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.


.. function:: fmean(data, weights=None)

   Convert *data* to floats and compute the arithmetic mean.

   This runs faster than the :func:`mean` function and it always returns a
   :class:`float`.  The *data* may be a sequence or iterable.  If the input
   dataset is empty, raises a :exc:`StatisticsError`.

   .. doctest::

      >>> fmean([3.5, 4.0, 5.25])
      4.25

   Optional weighting is supported.  For example, a professor assigns a
   grade for a course by weighting quizzes at 20%, homework at 20%, a
   midterm exam at 30%, and a final exam at 30%:

   .. doctest::

      >>> grades = [85, 92, 83, 91]
      >>> weights = [0.20, 0.20, 0.30, 0.30]
      >>> fmean(grades, weights)
      87.6

   If *weights* is supplied, it must be the same length as the *data* or
   a :exc:`ValueError` will be raised.

   .. versionadded:: 3.8

   .. versionchanged:: 3.11
      Added support for *weights*.


.. function:: geometric_mean(data)

   Convert *data* to floats and compute the geometric mean.

   The geometric mean indicates the central tendency or typical value of the
   *data* using the product of the values (as opposed to the arithmetic mean
   which uses their sum).

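Equivalently, the geometric mean of *n* values is the *n*-th root of their
product, which can be checked by hand (an illustrative sketch):

```python
from math import isclose
from statistics import geometric_mean

data = [54, 24, 36]
product = 54 * 24 * 36                  # 46_656
# The cube root of the product equals the geometric mean (36.0 here).
assert isclose(geometric_mean(data), product ** (1 / 3))
```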
   Raises a :exc:`StatisticsError` if the input dataset is empty,
   if it contains a zero, or if it contains a negative value.
   The *data* may be a sequence or iterable.

   No special efforts are made to achieve exact results.
   (However, this may change in the future.)

   .. doctest::

      >>> round(geometric_mean([54, 24, 36]), 1)
      36.0

   .. versionadded:: 3.8


.. function:: harmonic_mean(data, weights=None)

   Return the harmonic mean of *data*, a sequence or iterable of
   real-valued numbers.  If *weights* is omitted or ``None``, then
   equal weighting is assumed.

   The harmonic mean is the reciprocal of the arithmetic :func:`mean` of the
   reciprocals of the data. For example, the harmonic mean of three values *a*,
   *b* and *c* will be equivalent to ``3/(1/a + 1/b + 1/c)``.  If one of the
   values is zero, the result will be zero.

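That definition can be verified directly (a sketch; the values here are
arbitrary):

```python
from math import isclose
from statistics import harmonic_mean, mean

a, b, c = 2.0, 4.0, 8.0

# Reciprocal of the arithmetic mean of the reciprocals.
by_hand = 1 / mean([1 / a, 1 / b, 1 / c])
assert isclose(harmonic_mean([a, b, c]), by_hand)
assert isclose(by_hand, 3 / (1 / a + 1 / b + 1 / c))
```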
   The harmonic mean is a type of average, a measure of the central
   location of the data.  It is often appropriate when averaging
   ratios or rates, for example speeds.

   Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
   What is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60])
      48.0

   Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
   speeds up to 60 km/hr for the remaining 30 km of the journey. What
   is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60], weights=[5, 30])
      56.0

   :exc:`StatisticsError` is raised if *data* is empty, any element
   is less than zero, or if the weighted sum isn't positive.

   The current algorithm has an early-out when it encounters a zero
   in the input.  This means that the subsequent inputs are not tested
   for validity.  (This behavior may change in the future.)

   .. versionadded:: 3.6

   .. versionchanged:: 3.10
      Added support for *weights*.

.. function:: kde(data, h, kernel='normal', *, cumulative=False)

   `Kernel Density Estimation (KDE)
   <https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf>`_:
   Create a continuous probability density function or cumulative
   distribution function from discrete samples.

   The basic idea is to smooth the data using `a kernel function
   <https://en.wikipedia.org/wiki/Kernel_(statistics)>`_
   to help draw inferences about a population from a sample.

   The degree of smoothing is controlled by the scaling parameter *h*
   which is called the bandwidth.  Smaller values emphasize local
   features while larger values give smoother results.

   The *kernel* determines the relative weights of the sample data
   points.  Generally, the choice of kernel shape does not matter
   as much as the more influential bandwidth smoothing parameter.

   Kernels that give some weight to every sample point include
   *normal* (*gauss*), *logistic*, and *sigmoid*.

   Kernels that only give weight to sample points within the bandwidth
   include *rectangular* (*uniform*), *triangular*, *parabolic*
   (*epanechnikov*), *quartic* (*biweight*), *triweight*, and *cosine*.

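The underlying idea can be sketched in a few lines: with the normal kernel,
the estimate at a point *x* is the average of Gaussian bumps of width *h*
centered at each sample.  This is an illustrative reimplementation, not the
module's actual code:

```python
from math import exp, pi, sqrt

def normal_kde(data, h):
    """Return an estimated PDF: the mean of normal kernels centered at each sample."""
    n = len(data)
    def f_hat(x):
        return sum(exp(-0.5 * ((x - t) / h) ** 2) for t in data) / (n * h * sqrt(2 * pi))
    return f_hat

f_hat = normal_kde([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2], h=1.5)

# The estimate integrates to 1.0, as a probability density should
# (approximated here by a Riemann sum over a wide interval).
area = sum(f_hat(i / 10) for i in range(-200, 200)) / 10
```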
   If *cumulative* is true, will return a cumulative distribution function.

   A :exc:`StatisticsError` will be raised if the *data* sequence is empty.

   `Wikipedia has an example
   <https://en.wikipedia.org/wiki/Kernel_density_estimation#Example>`_
   where we can use :func:`kde` to generate and plot a probability
   density function estimated from a small sample:

   .. doctest::

      >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
      >>> f_hat = kde(sample, h=1.5)
      >>> xarr = [i/100 for i in range(-750, 1100)]
      >>> yarr = [f_hat(x) for x in xarr]

   The points in ``xarr`` and ``yarr`` can be used to make a PDF plot:

   .. image:: kde_example.png
      :alt: Scatter plot of the estimated probability density function.

   .. versionadded:: 3.13


.. function:: kde_random(data, h, kernel='normal', *, seed=None)

   Return a function that makes a random selection from the estimated
   probability density function produced by ``kde(data, h, kernel)``.

   Providing a *seed* allows reproducible selections. In the future, the
   values may change slightly as more accurate kernel inverse CDF estimates
   are implemented.  The seed may be an integer, float, str, or bytes.

   A :exc:`StatisticsError` will be raised if the *data* sequence is empty.

   Continuing the example for :func:`kde`, we can use
   :func:`kde_random` to generate new random selections from an
   estimated probability density function:

      >>> data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
      >>> rand = kde_random(data, h=1.5, seed=8675309)
      >>> new_selections = [rand() for i in range(10)]
      >>> [round(x, 1) for x in new_selections]
      [0.7, 6.2, 1.2, 6.9, 7.0, 1.8, 2.5, -0.5, -1.8, 5.6]

   .. versionadded:: 3.13

.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method.  If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterable.

   The median is a robust measure of central location and is less affected by
   the presence of outliers.  When the number of data points is odd, the
   middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suited for when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   If the data is ordinal (supports order operations) but not numeric (doesn't
   support addition), consider using :func:`median_low` or :func:`median_high`
   instead.

.. function:: median_low(data)

   Return the low median of numeric data.  If *data* is empty,
   :exc:`StatisticsError` is raised.  *data* can be a sequence or iterable.

   The low median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterable.

   The high median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.

.. function:: median_grouped(data, interval=1.0)

   Estimates the median for numeric data that has been `grouped or binned
   <https://en.wikipedia.org/wiki/Data_binning>`_ around the midpoints
   of consecutive, fixed-width intervals.

   The *data* can be any iterable of numeric data with each value being
   exactly the midpoint of a bin.  At least one value must be present.

   The *interval* is the width of each bin.

   For example, demographic information may have been summarized into
   consecutive ten-year age groups with each group being represented
   by the 5-year midpoints of the intervals:

   .. doctest::

      >>> from collections import Counter
      >>> demographics = Counter({
      ...    25: 172,   # 20 to 30 years old
      ...    35: 484,   # 30 to 40 years old
      ...    45: 387,   # 40 to 50 years old
      ...    55:  22,   # 50 to 60 years old
      ...    65:   6,   # 60 to 70 years old
      ... })
      ...

   The 50th percentile (median) is the 536th person out of the 1071
   member cohort.  That person is in the 30 to 40 year old age group.

   The regular :func:`median` function would assume that everyone in the
   tricenarian age group was exactly 35 years old.  A more tenable
   assumption is that the 484 members of that age group are evenly
   distributed between 30 and 40.  For that, we use
   :func:`median_grouped`:

   .. doctest::

       >>> data = list(demographics.elements())
       >>> median(data)
       35
       >>> round(median_grouped(data, interval=10), 1)
       37.5

   The caller is responsible for making sure the data points are separated
   by exact multiples of *interval*.  This is essential for getting a
   correct result.  The function does not check this precondition.

   Inputs may be any numeric type that can be coerced to a float during
   the interpolation step.

.. function:: mode(data)

   Return the single most common data point from discrete or nominal *data*.
   The mode (when it exists) is the most typical value and serves as a
   measure of central location.

   If there are multiple modes with the same frequency, returns the first one
   encountered in the *data*.  If the smallest or largest of those is
   desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
   If the input *data* is empty, :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this package that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

   Only hashable inputs are supported.  To handle type :class:`set`,
   consider casting to :class:`frozenset`.  To handle type :class:`list`,
   consider casting to :class:`tuple`.  For mixed or nested inputs, consider
   using this slower quadratic algorithm that only depends on equality tests:
   ``max(data, key=data.count)``.

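For instance, lists are unhashable, so ``mode()`` cannot count them directly;
the equality-based fallback still works (an illustrative sketch):

```python
# Nested lists are unhashable, so mode() would raise TypeError here.
data = [[1, 2], [3, 4], [1, 2], [5, 6], [1, 2]]

# Quadratic fallback from the text: count each element by equality.
most_common = max(data, key=data.count)
print(most_common)   # [1, 2]
```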
   .. versionchanged:: 3.8
      Now handles multimodal datasets by returning the first mode encountered.
      Formerly, it raised :exc:`StatisticsError` when more than one mode was
      found.


.. function:: multimode(data)

   Return a list of the most frequently occurring values in the order they
   were first encountered in the *data*.  Will return more than one result if
   there are multiple modes or an empty list if the *data* is empty:

   .. doctest::

        >>> multimode('aabbbbccddddeeffffgg')
        ['b', 'd', 'f']
        >>> multimode('')
        []

   .. versionadded:: 3.8


.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance).  See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251

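The stated relationship to :func:`pvariance` can be confirmed numerically
(a sketch):

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

# pstdev is the square root of the population variance.
assert isclose(pstdev(data), sqrt(pvariance(data)))
```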

.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty sequence or iterable
   of real-valued numbers.  Variance, or second moment about the mean, is a
   measure of the variability (spread or dispersion) of data.  A large
   variance indicates that the data is spread out; a small variance indicates
   it is clustered closely around the mean.

   If the optional second argument *mu* is given, it should be the *population*
   mean of the *data*.  It can also be used to compute the second moment around
   a point that is not the mean.  If it is missing or ``None`` (the default),
   the arithmetic mean is automatically calculated.

   Use this function to calculate the variance from the entire population.  To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ².  When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this
      function to calculate the variance of a sample, giving the known
      population mean as the second argument.  Provided the data points are a
      random sample of the population, the result will be an unbiased estimate
      of the population variance.


.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance).  See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers.  Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data.  A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the *sample*
   mean of *data*.  If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the sample mean of your data, you can pass it
   as the optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*.  Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom.  Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.

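Bessel's correction means the sample variance differs from the population
variance by a factor of ``n / (n - 1)``, which is easy to confirm (a sketch):

```python
from math import isclose
from statistics import pvariance, variance

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
n = len(data)

# variance divides the sum of squared deviations by n - 1 instead of n.
assert isclose(variance(data), pvariance(data) * n / (n - 1))
```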
.. function:: quantiles(data, *, n=4, method='exclusive')

   Divide *data* into *n* continuous intervals with equal probability.
   Returns a list of ``n - 1`` cut points separating the intervals.

   Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.  Set
   *n* to 100 for percentiles which gives the 99 cut points that separate
   *data* into 100 equal-sized groups.  Raises :exc:`StatisticsError` if *n*
   is not at least 1.

   The *data* can be any iterable containing sample data.  For meaningful
   results, the number of data points in *data* should be larger than *n*.
   Raises :exc:`StatisticsError` if there is not at least one data point.

   The cut points are linearly interpolated from the
   two nearest data points.  For example, if a cut point falls one-third
   of the distance between two sample values, ``100`` and ``112``, the
   cut-point will evaluate to ``104``.

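The interpolation can be seen with a tiny two-point dataset: splitting
``[100, 112]`` into thirds with the inclusive method puts the first cut point
one-third of the way between the samples (a sketch):

```python
from statistics import quantiles

# One-third of the way from 100 to 112 is 104; two-thirds is 108.
print(quantiles([100, 112], n=3, method='inclusive'))   # [104.0, 108.0]
```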
   The *method* for computing quantiles can be varied depending on
   whether the *data* includes or excludes the lowest and
   highest possible values from the population.

   The default *method* is "exclusive" and is used for data sampled from
   a population that can have more extreme values than found in the
   samples.  The portion of the population falling below the *i-th* of
   *m* sorted data points is computed as ``i / (m + 1)``.  Given nine
   sample values, the method sorts them and assigns the following
   percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.

   Setting the *method* to "inclusive" is used for describing population
   data or for samples that are known to include the most extreme values
   from the population.  The minimum value in *data* is treated as the 0th
   percentile and the maximum value is treated as the 100th percentile.
   The portion of the population falling below the *i-th* of *m* sorted
   data points is computed as ``(i - 1) / (m - 1)``.  Given 11 sample
   values, the method sorts them and assigns the following percentiles:
   0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

   .. doctest::

        # Decile cut points for empirically sampled data
        >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
        ...         100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
        ...         106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
        ...         111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
        ...         103, 107, 101, 81, 109, 104]
        >>> [round(q, 1) for q in quantiles(data, n=10)]
        [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

   .. versionadded:: 3.8

   .. versionchanged:: 3.13
      No longer raises an exception for an input with only a single data point.
      This allows quantile estimates to be built up one sample point
      at a time, becoming gradually more refined with each new data point.

.. function:: covariance(x, y, /)

   Return the sample covariance of two inputs *x* and *y*. Covariance
   is a measure of the joint variability of two inputs.

   Both inputs must be of the same length (no less than two), otherwise
   :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
      >>> covariance(x, y)
      0.75
      >>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> covariance(x, z)
      -7.5
      >>> covariance(z, x)
      -7.5

   .. versionadded:: 3.10

.. function:: correlation(x, y, /, *, method='linear')

   Return the `Pearson's correlation coefficient
   <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
   for two inputs. Pearson's correlation coefficient *r* takes values
   between -1 and +1. It measures the strength and direction of a linear
   relationship.

   If *method* is "ranked", computes `Spearman's rank correlation coefficient
   <https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient>`_
   for two inputs. The data is replaced by ranks.  Ties are averaged so that
   equal values receive the same rank.  The resulting coefficient measures the
   strength of a monotonic relationship.

   Spearman's correlation coefficient is appropriate for ordinal data or for
   continuous data that doesn't meet the linear proportion requirement for
   Pearson's correlation coefficient.

   Both inputs must be of the same length (no less than two), and neither
   input can be constant; otherwise :exc:`StatisticsError` is raised.

   Example with `Kepler's laws of planetary motion
   <https://en.wikipedia.org/wiki/Kepler's_laws_of_planetary_motion>`_:

   .. doctest::

      >>> # Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune
      >>> orbital_period = [88, 225, 365, 687, 4331, 10_756, 30_687, 60_190]    # days
      >>> dist_from_sun = [58, 108, 150, 228, 778, 1_400, 2_900, 4_500] # million km

      >>> # Show that a perfect monotonic relationship exists
      >>> correlation(orbital_period, dist_from_sun, method='ranked')
      1.0

      >>> # Observe that a linear relationship is imperfect
      >>> round(correlation(orbital_period, dist_from_sun), 4)
      0.9882

      >>> # Demonstrate Kepler's third law: There is a linear correlation
      >>> # between the square of the orbital period and the cube of the
      >>> # distance from the sun.
      >>> period_squared = [p * p for p in orbital_period]
      >>> dist_cubed = [d * d * d for d in dist_from_sun]
      >>> round(correlation(period_squared, dist_cubed), 4)
      1.0

   .. versionadded:: 3.10

   .. versionchanged:: 3.12
      Added support for Spearman's rank correlation coefficient.

792.. function:: linear_regression(x, y, /, *, proportional=False)
793
794   Return the slope and intercept of `simple linear regression
795   <https://en.wikipedia.org/wiki/Simple_linear_regression>`_
796   parameters estimated using ordinary least squares. Simple linear
797   regression describes the relationship between an independent variable *x* and
798   a dependent variable *y* in terms of this linear function:
799
800      *y = slope \* x + intercept + noise*
801
802   where ``slope`` and ``intercept`` are the regression parameters that are
803   estimated, and ``noise`` represents the
804   variability of the data that was not explained by the linear regression
805   (it is equal to the difference between predicted and actual values
806   of the dependent variable).
807
808   Both inputs must be of the same length (no less than two), and
809   the independent variable *x* cannot be constant;
810   otherwise a :exc:`StatisticsError` is raised.
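
   For instance, both error conditions can be demonstrated directly (the
   printed strings here are our own labels, not the exception messages):

   .. doctest::

      >>> try:
      ...     linear_regression([1, 1, 1], [2, 3, 4])   # constant x
      ... except StatisticsError:
      ...     print('rejected')
      rejected
      >>> try:
      ...     linear_regression([1], [2])               # fewer than two points
      ... except StatisticsError:
      ...     print('rejected')
      rejected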

   For example, we can use the `release dates of the Monty
   Python films <https://en.wikipedia.org/wiki/Monty_Python#Films>`_
   to predict the cumulative number of Monty Python films
   that would have been produced by 2019
   assuming that they had kept up the pace.

   .. doctest::

      >>> year = [1971, 1975, 1979, 1982, 1983]
      >>> films_total = [1, 2, 3, 4, 5]
      >>> slope, intercept = linear_regression(year, films_total)
      >>> round(slope * 2019 + intercept)
      16

   If *proportional* is true, the independent variable *x* and the
   dependent variable *y* are assumed to be directly proportional.
   The data is fit to a line passing through the origin.
   Since the *intercept* will always be 0.0, the underlying linear
   function simplifies to:

      *y = slope \* x + noise*

   Continuing the example from :func:`correlation`, we look to see
   how well a model based on major planets can predict the orbital
   distances for dwarf planets:

   .. doctest::

      >>> model = linear_regression(period_squared, dist_cubed, proportional=True)
      >>> slope = model.slope

      >>> # Dwarf planets:   Pluto,  Eris,    Makemake, Haumea, Ceres
      >>> orbital_periods = [90_560, 204_199, 111_845, 103_410, 1_680]  # days
      >>> predicted_dist = [math.cbrt(slope * (p * p)) for p in orbital_periods]
      >>> list(map(round, predicted_dist))
      [5912, 10166, 6806, 6459, 414]

      >>> [5_906, 10_152, 6_796, 6_450, 414]  # actual distance in million km
      [5906, 10152, 6796, 6450, 414]

   .. versionadded:: 3.10

   .. versionchanged:: 3.11
      Added support for *proportional*.

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.
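
   For example, passing an empty dataset to an averaging function raises it:

   .. doctest::

      >>> try:
      ...     mean([])
      ... except StatisticsError as exc:
      ...     print(exc)
      mean requires at least one data point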


:class:`NormalDist` objects
---------------------------

:class:`NormalDist` is a tool for creating and manipulating normal
distributions of a `random variable
<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_.  It is a
class that treats the mean and standard deviation of data
measurements as a single entity.

Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics.

.. class:: NormalDist(mu=0.0, sigma=1.0)

    Returns a new *NormalDist* object where *mu* represents the `arithmetic
    mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
    represents the `standard deviation
    <https://en.wikipedia.org/wiki/Standard_deviation>`_.

    If *sigma* is negative, raises :exc:`StatisticsError`.

    .. attribute:: mean

       A read-only property for the `arithmetic mean
       <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
       distribution.

    .. attribute:: median

       A read-only property for the `median
       <https://en.wikipedia.org/wiki/Median>`_ of a normal
       distribution.

    .. attribute:: mode

       A read-only property for the `mode
       <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
       distribution.

    .. attribute:: stdev

       A read-only property for the `standard deviation
       <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
       distribution.

    .. attribute:: variance

       A read-only property for the `variance
       <https://en.wikipedia.org/wiki/Variance>`_ of a normal
       distribution. Equal to the square of the standard deviation.

    .. classmethod:: NormalDist.from_samples(data)

       Makes a normal distribution instance with *mu* and *sigma* parameters
       estimated from the *data* using :func:`fmean` and :func:`stdev`.

       The *data* can be any :term:`iterable` and should consist of values
       that can be converted to type :class:`float`.  If *data* does not
       contain at least two elements, raises :exc:`StatisticsError` because it
       takes at least one point to estimate a central value and at least two
       points to estimate dispersion.
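
       For example:

       .. doctest::

          >>> nd = NormalDist.from_samples([1.0, 2.0, 3.0, 4.0])
          >>> nd.mean
          2.5
          >>> round(nd.stdev, 4)
          1.291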

    .. method:: NormalDist.samples(n, *, seed=None)

       Generates *n* random samples for a given mean and standard deviation.
       Returns a :class:`list` of :class:`float` values.

       If *seed* is given, creates a new instance of the underlying random
       number generator.  This is useful for creating reproducible results,
       even in a multi-threading context.

       .. versionchanged:: 3.13

          Switched to a faster algorithm.  To reproduce samples from previous
          versions, use :func:`random.seed` and :func:`random.gauss`.
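
       For example, within a given Python version, the same *seed* always
       reproduces the same samples:

       .. doctest::

          >>> draws = NormalDist(10, 2).samples(5, seed=8675309)
          >>> len(draws)
          5
          >>> draws == NormalDist(10, 2).samples(5, seed=8675309)
          True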

    .. method:: NormalDist.pdf(x)

       Using a `probability density function (pdf)
       <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
       the relative likelihood that a random variable *X* will be near the
       given value *x*.  Mathematically, it is the limit of the ratio ``P(x <=
       X < x+dx) / dx`` as *dx* approaches zero.

       The relative likelihood is computed as the probability of a sample
       occurring in a narrow range divided by the width of the range (hence
       the word "density").  Since the likelihood is relative to other points,
       its value can be greater than ``1.0``.
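
       For example, the standard normal density peaks at about ``0.3989``
       at the mean, while a narrower distribution exceeds ``1.0`` there:

       .. doctest::

          >>> round(NormalDist().pdf(0), 4)
          0.3989
          >>> round(NormalDist(sigma=0.25).pdf(0), 4)
          1.5958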

    .. method:: NormalDist.cdf(x)

       Using a `cumulative distribution function (cdf)
       <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
       compute the probability that a random variable *X* will be less than or
       equal to *x*.  Mathematically, it is written ``P(X <= x)``.
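
       For example, half of the probability mass of the standard normal
       distribution lies below its mean:

       .. doctest::

          >>> NormalDist().cdf(0)
          0.5
          >>> round(NormalDist().cdf(1.96), 4)
          0.975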

    .. method:: NormalDist.inv_cdf(p)

       Compute the inverse cumulative distribution function, also known as the
       `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
       or the `percent-point
       <https://web.archive.org/web/20190203145224/https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
       function.  Mathematically, it is written ``x : P(X <= x) = p``.

       Finds the value *x* of the random variable *X* such that the
       probability of the variable being less than or equal to that value
       equals the given probability *p*.
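
       For example, it inverts ``cdf()``, recovering the familiar two-sided
       95% critical value of the standard normal distribution:

       .. doctest::

          >>> NormalDist().inv_cdf(0.5)
          0.0
          >>> round(NormalDist().inv_cdf(0.975), 2)
          1.96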

    .. method:: NormalDist.overlap(other)

       Measures the agreement between two normal probability distributions.
       Returns a value between 0.0 and 1.0 giving `the overlapping area for
       the two probability density functions
       <https://www.rasch.org/rmt/rmt101r.htm>`_.
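
       For example, identical distributions overlap completely, while
       equal-spread distributions whose means are one standard deviation
       apart share about 61.7% of their area:

       .. doctest::

          >>> NormalDist(2.5, 1.0).overlap(NormalDist(2.5, 1.0))
          1.0
          >>> round(NormalDist(0, 1).overlap(NormalDist(1, 1)), 4)
          0.6171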

    .. method:: NormalDist.quantiles(n=4)

        Divide the normal distribution into *n* continuous intervals with
        equal probability.  Returns a list of (n - 1) cut points separating
        the intervals.

        Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.
        Set *n* to 100 for percentiles, which gives the 99 cut points that
        separate the normal distribution into 100 equal-sized groups.
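
        For example, the quartiles of the standard normal distribution:

        .. doctest::

            >>> [round(q, 4) for q in NormalDist().quantiles()]
            [-0.6745, 0.0, 0.6745]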

    .. method:: NormalDist.zscore(x)

        Compute the
        `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
        describing *x* in terms of the number of standard deviations
        above or below the mean of the normal distribution:
        ``(x - mean) / stdev``.

        .. versionadded:: 3.9
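
        For example, a SAT score of 1255 is exactly one standard deviation
        above a mean of 1060 when the standard deviation is 195:

        .. doctest::

            >>> NormalDist(1060, 195).zscore(1255)
            1.0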

    Instances of :class:`NormalDist` support addition, subtraction,
    multiplication and division by a constant.  These operations
    are used for translation and scaling.  For example:

    .. doctest::

        >>> temperature_february = NormalDist(5, 2.5)             # Celsius
        >>> temperature_february * (9/5) + 32                     # Fahrenheit
        NormalDist(mu=41.0, sigma=4.5)

    Dividing a constant by an instance of :class:`NormalDist` is not supported
    because the result wouldn't be normally distributed.

    Since normal distributions arise from additive effects of independent
    variables, it is possible to `add and subtract two independent normally
    distributed random variables
    <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
    represented as instances of :class:`NormalDist`.  For example:

    .. doctest::

        >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
        >>> drug_effects = NormalDist(0.4, 0.15)
        >>> combined = birth_weights + drug_effects
        >>> round(combined.mean, 1)
        3.1
        >>> round(combined.stdev, 1)
        0.5

    .. versionadded:: 3.8


Examples and Recipes
--------------------


Classic probability problems
****************************

:class:`NormalDist` readily solves classic probability problems.

For example, given `historical data for SAT exams
<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
that scores are normally distributed with a mean of 1060 and a standard
deviation of 195, determine the percentage of students with test scores
between 1100 and 1200, after rounding to the nearest whole number:

.. doctest::

    >>> sat = NormalDist(1060, 195)
    >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
    >>> round(fraction * 100.0, 1)
    18.4

Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:

.. doctest::

    >>> list(map(round, sat.quantiles()))
    [928, 1060, 1192]
    >>> list(map(round, sat.quantiles(n=10)))
    [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]


Monte Carlo inputs for simulations
**********************************

To estimate the distribution for a model that isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:

.. doctest::

    >>> def model(x, y, z):
    ...     return (3*x + 7*x*y - 5*y) / (11 * z)
    ...
    >>> n = 100_000
    >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
    >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
    >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
    >>> quantiles(map(model, X, Y, Z))       # doctest: +SKIP
    [1.4591308524824727, 1.8035946855390597, 2.175091447274739]

Approximating binomial distributions
************************************

Normal distributions can be used to approximate `Binomial
distributions <https://mathworld.wolfram.com/BinomialDistribution.html>`_
when the sample size is large and when the probability of a successful
trial is near 50%.

For example, an open source conference has 750 attendees and two rooms with a
500 person capacity.  There is a talk about Python and another about Ruby.
In previous conferences, 65% of the attendees preferred to listen to Python
talks.  Assuming the population preferences haven't changed, what is the
probability that the Python room will stay within its capacity limits?

.. doctest::

    >>> n = 750             # Sample size
    >>> p = 0.65            # Preference for Python
    >>> q = 1.0 - p         # Preference for Ruby
    >>> k = 500             # Room capacity

    >>> # Approximation using the cumulative normal distribution
    >>> from math import sqrt
    >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
    0.8402

    >>> # Exact solution using the cumulative binomial distribution
    >>> from math import comb, fsum
    >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
    0.8402

    >>> # Approximation using a simulation
    >>> from random import seed, binomialvariate
    >>> seed(8675309)
    >>> mean(binomialvariate(n, p) <= k for i in range(10_000))
    0.8406

Naive Bayesian classifier
*************************

Normal distributions commonly arise in machine learning problems.

Wikipedia has a `nice example of a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Person_classification>`_.
The challenge is to predict a person's gender from measurements of normally
distributed features including height, weight, and foot size.

We're given a training dataset with measurements for eight people.  The
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:

.. doctest::

    >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
    >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
    >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
    >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
    >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
    >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

Next, we encounter a new person whose feature measurements are known but whose
gender is unknown:

.. doctest::

    >>> ht = 6.0        # height
    >>> wt = 130        # weight
    >>> fs = 8          # foot size

Starting with a 50% `prior probability
<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
we compute the posterior as the prior times the product of likelihoods for the
feature measurements given the gender:

.. doctest::

   >>> prior_male = 0.5
   >>> prior_female = 0.5
   >>> posterior_male = (prior_male * height_male.pdf(ht) *
   ...                   weight_male.pdf(wt) * foot_size_male.pdf(fs))

   >>> posterior_female = (prior_female * height_female.pdf(ht) *
   ...                     weight_female.pdf(wt) * foot_size_female.pdf(fs))

The final prediction goes to the largest posterior.  This is known as the
`maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:

.. doctest::

  >>> 'male' if posterior_male > posterior_female else 'female'
  'female'


..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;