:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: Mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`~numbers.Real`-valued) data.

The module is not intended to be a competitor to third-party libraries such
as `NumPy <https://numpy.org>`_, `SciPy <https://www.scipy.org/>`_, or
proprietary full-featured statistics packages aimed at professional
statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
graphing and scientific calculators.

Unless explicitly noted, these functions support :class:`int`,
:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
currently unsupported.  Collections with a mix of types are also undefined
and implementation-dependent.  If your input data consists of mixed types,
you may be able to use :func:`map` to ensure a consistent result, for
example: ``map(float, input_data)``.
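
For instance, a mixed collection can be normalized to :class:`float` before
averaging (a sketch; the sample values here are our own):

```python
from statistics import mean
from fractions import Fraction

mixed = [1, Fraction(3, 2), 2.5]    # int, Fraction and float mixed together
uniform = list(map(float, mixed))   # [1.0, 1.5, 2.5] -- one consistent type
result = mean(uniform)              # a float, since the inputs are all floats
```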

Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  ===============================================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`fmean`            Fast, floating point arithmetic mean.
:func:`geometric_mean`   Geometric mean of data.
:func:`harmonic_mean`    Harmonic mean of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Single mode (most common value) of discrete or nominal data.
:func:`multimode`        List of modes (most common values) of discrete or nominal data.
:func:`quantiles`        Divide data into intervals with equal probability.
=======================  ===============================================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================

Statistics for relations between two inputs
-------------------------------------------

These functions calculate statistics regarding relations between two inputs.

=========================  =====================================================
:func:`covariance`         Sample covariance for two variables.
:func:`correlation`        Pearson's correlation coefficient for two variables.
:func:`linear_regression`  Slope and intercept for simple linear regression.
=========================  =====================================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data*, which can be a sequence or
   iterable.

   The arithmetic mean is the sum of the data divided by the number of data
   points.  It is commonly called "the average", although it is only one of many
   different mathematical averages.  It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by `outliers
      <https://en.wikipedia.org/wiki/Outlier>`_ and is not necessarily a
      typical example of the data points. For a more robust, although less
      efficient, measure of `central tendency
      <https://en.wikipedia.org/wiki/Central_tendency>`_, see :func:`median`.

      The sample mean gives an unbiased estimate of the true population mean,
      so that when taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population.  If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.

.. function:: fmean(data)

   Convert *data* to floats and compute the arithmetic mean.

   This runs faster than the :func:`mean` function and it always returns a
   :class:`float`.  The *data* may be a sequence or iterable.  If the input
   dataset is empty, raises a :exc:`StatisticsError`.

   .. doctest::

      >>> fmean([3.5, 4.0, 5.25])
      4.25

   .. versionadded:: 3.8


.. function:: geometric_mean(data)

   Convert *data* to floats and compute the geometric mean.

   The geometric mean indicates the central tendency or typical value of the
   *data* using the product of the values (as opposed to the arithmetic mean
   which uses their sum).

   Raises a :exc:`StatisticsError` if the input dataset is empty,
   if it contains a zero, or if it contains a negative value.
   The *data* may be a sequence or iterable.

   No special efforts are made to achieve exact results.
   (However, this may change in the future.)

   .. doctest::

      >>> round(geometric_mean([54, 24, 36]), 1)
      36.0

   .. versionadded:: 3.8


.. function:: harmonic_mean(data, weights=None)

   Return the harmonic mean of *data*, a sequence or iterable of
   real-valued numbers.  If *weights* is omitted or *None*, then
   equal weighting is assumed.

   The harmonic mean is the reciprocal of the arithmetic :func:`mean` of the
   reciprocals of the data. For example, the harmonic mean of three values *a*,
   *b* and *c* will be equivalent to ``3/(1/a + 1/b + 1/c)``.  If one of the
   values is zero, the result will be zero.

   The harmonic mean is a type of average, a measure of the central
   location of the data.  It is often appropriate when averaging
   ratios or rates, for example speeds.

   Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
   What is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60])
      48.0

   Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
   speeds up to 60 km/hr for the remaining 30 km of the journey.  What
   is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60], weights=[5, 30])
      56.0

   :exc:`StatisticsError` is raised if *data* is empty, any element
   is less than zero, or if the weighted sum isn't positive.

   The current algorithm has an early-out when it encounters a zero
   in the input.  This means that the subsequent inputs are not tested
   for validity.  (This behavior may change in the future.)

   .. versionadded:: 3.6

   .. versionchanged:: 3.10
      Added support for *weights*.
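
The ``3/(1/a + 1/b + 1/c)`` formula above can be checked numerically
(a sketch; the values 40, 60 and 80 are arbitrary):

```python
from statistics import harmonic_mean

a, b, c = 40, 60, 80
by_formula = 3 / (1/a + 1/b + 1/c)      # reciprocal of the mean of reciprocals
by_function = harmonic_mean([a, b, c])  # same value, about 55.38
```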

.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method.  If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterable.

   The median is a robust measure of central location and is less affected by
   the presence of outliers.  When the number of data points is odd, the
   middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suited for when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   If the data is ordinal (supports order operations) but not numeric (doesn't
   support addition), consider using :func:`median_low` or :func:`median_high`
   instead.

.. function:: median_low(data)

   Return the low median of numeric data.  If *data* is empty,
   :exc:`StatisticsError` is raised.  *data* can be a sequence or iterable.

   The low median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterable.

   The high median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterable.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
   is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc.  With the data
   given, the middle value falls somewhere in the class 3.5--4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1.  Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats.  This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.


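The interpolated result ``3.7`` shown earlier can be reproduced by hand with
the textbook grouped-median formula ``L + interval * (n/2 - cf) / f``
(a sketch; the variable names are ours):

```python
from statistics import median_grouped

data = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
L = 3.5        # lower limit of the median class 3.5--4.5
cf = 4         # number of data points below the median class
f = 5          # number of data points inside the median class
n = len(data)
by_hand = L + 1 * (n / 2 - cf) / f   # 3.5 + (5 - 4)/5 = 3.7
```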
.. function:: mode(data)

   Return the single most common data point from discrete or nominal *data*.
   The mode (when it exists) is the most typical value and serves as a
   measure of central location.

   If there are multiple modes with the same frequency, returns the first one
   encountered in the *data*.  If the smallest or largest of those is
   desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
   If the input *data* is empty, :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this package that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

   .. versionchanged:: 3.8
      Now handles multimodal datasets by returning the first mode encountered.
      Formerly, it raised :exc:`StatisticsError` when more than one mode was
      found.


.. function:: multimode(data)

   Return a list of the most frequently occurring values in the order they
   were first encountered in the *data*.  Will return more than one result if
   there are multiple modes or an empty list if the *data* is empty:

   .. doctest::

        >>> multimode('aabbbbccddddeeffffgg')
        ['b', 'd', 'f']
        >>> multimode('')
        []

   .. versionadded:: 3.8


.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance).  See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty sequence or iterable
   of real-valued numbers.  Variance, or second moment about the mean, is a
   measure of the variability (spread or dispersion) of data.  A large
   variance indicates that the data is spread out; a small variance indicates
   it is clustered closely around the mean.

   If the optional second argument *mu* is given, it is typically the mean of
   the *data*.  It can also be used to compute the second moment around a
   point that is not the mean.  If it is missing or ``None`` (the default),
   the arithmetic mean is automatically calculated.

   Use this function to calculate the variance from the entire population.  To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ².  When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this
      function to calculate the variance of a sample, giving the known
      population mean as the second argument.  Provided the data points are a
      random sample of the population, the result will be an unbiased estimate
      of the population variance.


.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance).  See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers.  Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data.  A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*.  If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*.  Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom.  Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ, you should pass it to
      the :func:`pvariance` function as the *mu* parameter to get the variance
      of a sample.

.. function:: quantiles(data, *, n=4, method='exclusive')

   Divide *data* into *n* continuous intervals with equal probability.
   Returns a list of ``n - 1`` cut points separating the intervals.

   Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.  Set
   *n* to 100 for percentiles, which gives the 99 cut points that separate
   *data* into 100 equal sized groups.  Raises :exc:`StatisticsError` if *n*
   is not at least 1.

   The *data* can be any iterable containing sample data.  For meaningful
   results, the number of data points in *data* should be larger than *n*.
   Raises :exc:`StatisticsError` if there are not at least two data points.

   The cut points are linearly interpolated from the
   two nearest data points.  For example, if a cut point falls one-third
   of the distance between two sample values, ``100`` and ``112``, the
   cut-point will evaluate to ``104``.

   The *method* for computing quantiles can be varied depending on
   whether the *data* includes or excludes the lowest and
   highest possible values from the population.

   The default *method* is "exclusive" and is used for data sampled from
   a population that can have more extreme values than found in the
   samples.  The portion of the population falling below the *i-th* of
   *m* sorted data points is computed as ``i / (m + 1)``.  Given nine
   sample values, the method sorts them and assigns the following
   percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.

   Setting the *method* to "inclusive" is used for describing population
   data or for samples that are known to include the most extreme values
   from the population.  The minimum value in *data* is treated as the 0th
   percentile and the maximum value is treated as the 100th percentile.
   The portion of the population falling below the *i-th* of *m* sorted
   data points is computed as ``(i - 1) / (m - 1)``.  Given 11 sample
   values, the method sorts them and assigns the following percentiles:
   0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

   .. doctest::

        # Decile cut points for empirically sampled data
        >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
        ...         100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
        ...         106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
        ...         111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
        ...         103, 107, 101, 81, 109, 104]
        >>> [round(q, 1) for q in quantiles(data, n=10)]
        [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

   .. versionadded:: 3.8

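The effect of the two *method* settings can be compared on a small dataset
(a sketch; the five data points are arbitrary):

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5]
exclusive = quantiles(data, n=4)                       # default method
inclusive = quantiles(data, n=4, method="inclusive")   # min/max as 0th/100th percentile
# the inclusive cut points sit closer to the middle of the data
```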
.. function:: covariance(x, y, /)

   Return the sample covariance of two inputs *x* and *y*. Covariance
   is a measure of the joint variability of two inputs.

   Both inputs must be of the same length (no less than two), otherwise
   :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
      >>> covariance(x, y)
      0.75
      >>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> covariance(x, z)
      -7.5
      >>> covariance(z, x)
      -7.5

   .. versionadded:: 3.10

.. function:: correlation(x, y, /)

   Return the `Pearson's correlation coefficient
   <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
   for two inputs. Pearson's correlation coefficient *r* takes values
   between -1 and +1. It measures the strength and direction of the linear
   relationship, where +1 indicates a perfectly positive linear relationship,
   -1 a perfectly negative linear relationship, and 0 no linear relationship.

   Both inputs must be of the same length (no less than two), and must not be
   constant; otherwise :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> correlation(x, x)
      1.0
      >>> correlation(x, y)
      -1.0

   .. versionadded:: 3.10

.. function:: linear_regression(x, y, /)

   Return the slope and intercept of `simple linear regression
   <https://en.wikipedia.org/wiki/Simple_linear_regression>`_
   parameters estimated using ordinary least squares. Simple linear
   regression describes the relationship between an independent variable *x* and
   a dependent variable *y* in terms of this linear function:

      *y = slope \* x + intercept + noise*

   where ``slope`` and ``intercept`` are the regression parameters that are
   estimated, and ``noise`` represents the
   variability of the data that was not explained by the linear regression
   (it is equal to the difference between predicted and actual values
   of the dependent variable).

   Both inputs must be of the same length (no less than two), and
   the independent variable *x* cannot be constant;
   otherwise a :exc:`StatisticsError` is raised.

   For example, we can use the `release dates of the Monty
   Python films <https://en.wikipedia.org/wiki/Monty_Python#Films>`_
   to predict the cumulative number of Monty Python films
   that would have been produced by 2019
   assuming that they had kept the pace.

   .. doctest::

      >>> year = [1971, 1975, 1979, 1982, 1983]
      >>> films_total = [1, 2, 3, 4, 5]
      >>> slope, intercept = linear_regression(year, films_total)
      >>> round(slope * 2019 + intercept)
      16

   .. versionadded:: 3.10


Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.


:class:`NormalDist` objects
---------------------------

:class:`NormalDist` is a tool for creating and manipulating normal
distributions of a `random variable
<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_.  It is a
class that treats the mean and standard deviation of data
measurements as a single entity.

Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics.

.. class:: NormalDist(mu=0.0, sigma=1.0)

    Returns a new *NormalDist* object where *mu* represents the `arithmetic
    mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
    represents the `standard deviation
    <https://en.wikipedia.org/wiki/Standard_deviation>`_.

    If *sigma* is negative, raises :exc:`StatisticsError`.

    .. attribute:: mean

       A read-only property for the `arithmetic mean
       <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
       distribution.

    .. attribute:: median

       A read-only property for the `median
       <https://en.wikipedia.org/wiki/Median>`_ of a normal
       distribution.

    .. attribute:: mode

       A read-only property for the `mode
       <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
       distribution.

    .. attribute:: stdev

       A read-only property for the `standard deviation
       <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
       distribution.

    .. attribute:: variance

       A read-only property for the `variance
       <https://en.wikipedia.org/wiki/Variance>`_ of a normal
       distribution. Equal to the square of the standard deviation.

    .. classmethod:: NormalDist.from_samples(data)

       Makes a normal distribution instance with *mu* and *sigma* parameters
       estimated from the *data* using :func:`fmean` and :func:`stdev`.

       The *data* can be any :term:`iterable` and should consist of values
       that can be converted to type :class:`float`.  If *data* does not
       contain at least two elements, raises :exc:`StatisticsError` because it
       takes at least one point to estimate a central value and at least two
       points to estimate dispersion.

    .. method:: NormalDist.samples(n, *, seed=None)

       Generates *n* random samples for a given mean and standard deviation.
       Returns a :class:`list` of :class:`float` values.

       If *seed* is given, creates a new instance of the underlying random
       number generator.  This is useful for creating reproducible results,
       even in a multi-threading context.

    .. method:: NormalDist.pdf(x)

       Using a `probability density function (pdf)
       <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
       the relative likelihood that a random variable *X* will be near the
       given value *x*.  Mathematically, it is the limit of the ratio ``P(x <=
       X < x+dx) / dx`` as *dx* approaches zero.

       The relative likelihood is computed as the probability of a sample
       occurring in a narrow range divided by the width of the range (hence
       the word "density").  Since the likelihood is relative to other points,
       its value can be greater than ``1.0``.

    .. method:: NormalDist.cdf(x)

       Using a `cumulative distribution function (cdf)
       <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
       compute the probability that a random variable *X* will be less than or
       equal to *x*.  Mathematically, it is written ``P(X <= x)``.

    .. method:: NormalDist.inv_cdf(p)

       Compute the inverse cumulative distribution function, also known as the
       `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
       or the `percent-point
       <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
       function.  Mathematically, it is written ``x : P(X <= x) = p``.

       Finds the value *x* of the random variable *X* such that the
       probability of the variable being less than or equal to that value
       equals the given probability *p*.

    .. method:: NormalDist.overlap(other)

       Measures the agreement between two normal probability distributions.
       Returns a value between 0.0 and 1.0 giving `the overlapping area for
       the two probability density functions
       <https://www.rasch.org/rmt/rmt101r.htm>`_.

    .. method:: NormalDist.quantiles(n=4)

        Divide the normal distribution into *n* continuous intervals with
        equal probability.  Returns a list of ``n - 1`` cut points separating
        the intervals.

        Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.
        Set *n* to 100 for percentiles, which gives the 99 cut points that
        separate the normal distribution into 100 equal sized groups.

    .. method:: NormalDist.zscore(x)

        Compute the
        `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
        describing *x* in terms of the number of standard deviations
        above or below the mean of the normal distribution:
        ``(x - mean) / stdev``.

        .. versionadded:: 3.9

808    Instances of :class:`NormalDist` support addition, subtraction,
809    multiplication and division by a constant.  These operations
810    are used for translation and scaling.  For example:
811
812    .. doctest::
813
814        >>> temperature_february = NormalDist(5, 2.5)             # Celsius
815        >>> temperature_february * (9/5) + 32                     # Fahrenheit
816        NormalDist(mu=41.0, sigma=4.5)
817
818    Dividing a constant by an instance of :class:`NormalDist` is not supported
819    because the result wouldn't be normally distributed.
820
   Since normal distributions arise from additive effects of independent
   variables, it is possible to `add and subtract two independent normally
   distributed random variables
   <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
   represented as instances of :class:`NormalDist`.  For example:

   .. doctest::

        >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
        >>> drug_effects = NormalDist(0.4, 0.15)
        >>> combined = birth_weights + drug_effects
        >>> round(combined.mean, 1)
        3.1
        >>> round(combined.stdev, 1)
        0.5

   .. versionadded:: 3.8

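The combined spread follows from adding variances: for independent distributions, the means add while the standard deviations add in quadrature. A sketch with a 3-4-5 triangle of sigmas (arbitrary example values) makes this visible:

```python
from math import hypot
from statistics import NormalDist

a = NormalDist(mu=1.0, sigma=3.0)
b = NormalDist(mu=2.0, sigma=4.0)
combined = a + b

print(combined.mean)                        # 3.0: means add
print(combined.stdev)                       # 5.0: sqrt(3**2 + 4**2)
print(combined.stdev == hypot(3.0, 4.0))    # True
```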

:class:`NormalDist` Examples and Recipes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`NormalDist` readily solves classic probability problems.

For example, given `historical data for SAT exams
<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
that scores are normally distributed with a mean of 1060 and a standard
deviation of 195, determine the percentage of students with test scores
between 1100 and 1200, after rounding to the nearest whole number:

.. doctest::

    >>> sat = NormalDist(1060, 195)
    >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
    >>> round(fraction * 100.0, 1)
    18.4

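The analytic answer can be sanity-checked by sampling from the same distribution (a sketch; the seed is arbitrary, and the simulated percentage only approximates the exact value):

```python
from statistics import NormalDist, fmean

sat = NormalDist(1060, 195)
scores = sat.samples(100_000, seed=42)

# Fraction of simulated scores falling between 1100 and 1200.
pct = fmean(1100 <= s <= 1200 for s in scores) * 100
print(round(pct, 1))     # close to the analytic 18.4
```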
Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:

.. doctest::

    >>> list(map(round, sat.quantiles()))
    [928, 1060, 1192]
    >>> list(map(round, sat.quantiles(n=10)))
    [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]

To estimate the distribution for a model that isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:

.. doctest::

    >>> def model(x, y, z):
    ...     return (3*x + 7*x*y - 5*y) / (11 * z)
    ...
    >>> n = 100_000
    >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
    >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
    >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
    >>> quantiles(map(model, X, Y, Z))       # doctest: +SKIP
    [1.4591308524824727, 1.8035946855390597, 2.175091447274739]

Normal distributions can be used to approximate `Binomial
distributions <http://mathworld.wolfram.com/BinomialDistribution.html>`_
when the sample size is large and when the probability of a successful
trial is near 50%.

For example, an open source conference has 750 attendees and two rooms with a
500-person capacity.  There is a talk about Python and another about Ruby.
In previous conferences, 65% of the attendees preferred to listen to Python
talks.  Assuming the population preferences haven't changed, what is the
probability that the Python room will stay within its capacity limits?

.. doctest::

    >>> n = 750             # Sample size
    >>> p = 0.65            # Preference for Python
    >>> q = 1.0 - p         # Preference for Ruby
    >>> k = 500             # Room capacity

    >>> # Approximation using the cumulative normal distribution
    >>> from math import sqrt
    >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
    0.8402

    >>> # Solution using the cumulative binomial distribution
    >>> from math import comb, fsum
    >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
    0.8402

    >>> # Approximation using a simulation
    >>> from random import seed, choices
    >>> seed(8675309)
    >>> def trial():
    ...     return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
    >>> mean(trial() <= k for i in range(10_000))
    0.8398

Normal distributions commonly arise in machine learning problems.

Wikipedia has a `nice example of a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
The challenge is to predict a person's gender from measurements of normally
distributed features including height, weight, and foot size.

We're given a training dataset with measurements for eight people.  The
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:

.. doctest::

    >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
    >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
    >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
    >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
    >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
    >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

Next, we encounter a new person whose feature measurements are known but whose
gender is unknown:

.. doctest::

    >>> ht = 6.0        # height
    >>> wt = 130        # weight
    >>> fs = 8          # foot size

Starting with a 50% `prior probability
<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
we compute the posterior as the prior times the product of likelihoods for the
feature measurements given the gender:

.. doctest::

    >>> prior_male = 0.5
    >>> prior_female = 0.5
    >>> posterior_male = (prior_male * height_male.pdf(ht) *
    ...                   weight_male.pdf(wt) * foot_size_male.pdf(fs))

    >>> posterior_female = (prior_female * height_female.pdf(ht) *
    ...                     weight_female.pdf(wt) * foot_size_female.pdf(fs))

The final prediction goes to the largest posterior. This is known as the
`maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:

.. doctest::

    >>> 'male' if posterior_male > posterior_female else 'female'
    'female'

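The steps above can be collected into a small reusable helper (an illustrative sketch; the ``naive_bayes`` function and its dictionary layout are our own, not part of the module):

```python
from statistics import NormalDist

def naive_bayes(priors, likelihoods, observation):
    """Return the label with the maximum posterior probability.

    priors: {label: prior probability}
    likelihoods: {label: [NormalDist per feature]}
    observation: [measured value per feature]
    """
    def posterior(label):
        p = priors[label]
        for dist, x in zip(likelihoods[label], observation):
            p *= dist.pdf(x)
        return p
    return max(priors, key=posterior)

# Reusing the training summaries from the example above:
height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
weight_male = NormalDist.from_samples([180, 190, 170, 165])
weight_female = NormalDist.from_samples([100, 150, 130, 150])
foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

likelihoods = {
    'male': [height_male, weight_male, foot_size_male],
    'female': [height_female, weight_female, foot_size_female],
}
prediction = naive_bayes({'male': 0.5, 'female': 0.5},
                         likelihoods, [6.0, 130, 8])
print(prediction)    # 'female'
```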

..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
