:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: Mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`~numbers.Real`-valued) data.

The module is not intended to be a competitor to third-party libraries such
as `NumPy <https://numpy.org>`_, `SciPy <https://www.scipy.org/>`_, or
proprietary full-featured statistics packages aimed at professional
statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
graphing and scientific calculators.

Unless explicitly noted, these functions support :class:`int`,
:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
currently unsupported.  Behaviour with collections containing a mix of
types is likewise undefined and implementation-dependent.  If your input
data consists of mixed types, you may be able to use :func:`map` to ensure
a consistent result, for example: ``map(float, input_data)``.

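For example, a list containing a mix of :class:`int`, :class:`float` and
:class:`~fractions.Fraction` values can be converted to a uniform type
before being averaged (the sample data here is purely illustrative):

.. doctest::

    >>> from fractions import Fraction
    >>> input_data = [1, Fraction(1, 2), 0.75]
    >>> mean(map(float, input_data))
    0.75
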
Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  ===============================================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`fmean`            Fast, floating point arithmetic mean.
:func:`geometric_mean`   Geometric mean of data.
:func:`harmonic_mean`    Harmonic mean of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Single mode (most common value) of discrete or nominal data.
:func:`multimode`        List of modes (most common values) of discrete or nominal data.
:func:`quantiles`        Divide data into intervals with equal probability.
=======================  ===============================================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data* which can be a sequence or iterable.

   The arithmetic mean is the sum of the data divided by the number of data
   points.  It is commonly called "the average", although it is only one of many
   different mathematical averages.  It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by outliers and is not a robust estimator
      for central location: the mean is not necessarily a typical example of
      the data points.  For more robust measures of central location, see
      :func:`median` and :func:`mode`.

      The sample mean gives an unbiased estimate of the true population mean,
      so that when taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population.  If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.


.. function:: fmean(data)

   Convert *data* to floats and compute the arithmetic mean.

   This runs faster than the :func:`mean` function and it always returns a
   :class:`float`.  The *data* may be a sequence or iterable.  If the input
   dataset is empty, raises a :exc:`StatisticsError`.

   .. doctest::

      >>> fmean([3.5, 4.0, 5.25])
      4.25

   .. versionadded:: 3.8


.. function:: geometric_mean(data)

   Convert *data* to floats and compute the geometric mean.

   The geometric mean indicates the central tendency or typical value of the
   *data* using the product of the values (as opposed to the arithmetic mean
   which uses their sum).

   Raises a :exc:`StatisticsError` if the input dataset is empty,
   if it contains a zero, or if it contains a negative value.
   The *data* may be a sequence or iterable.

   No special efforts are made to achieve exact results.
   (However, this may change in the future.)

   .. doctest::

      >>> round(geometric_mean([54, 24, 36]), 1)
      36.0

   .. versionadded:: 3.8


.. function:: harmonic_mean(data)

   Return the harmonic mean of *data*, a sequence or iterable of
   real-valued numbers.

   The harmonic mean, sometimes called the subcontrary mean, is the
   reciprocal of the arithmetic :func:`mean` of the reciprocals of the
   data. For example, the harmonic mean of three values *a*, *b* and *c*
   will be equivalent to ``3/(1/a + 1/b + 1/c)``.  If one of the values
   is zero, the result will be zero.

   The harmonic mean is a type of average, a measure of the central
   location of the data.  It is often appropriate when averaging
   rates or ratios, for example speeds.

   Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
   What is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60])
      48.0

   Suppose an investor purchases an equal value of shares in each of
   three companies, with P/E (price/earnings) ratios of 2.5, 3 and 10.
   What is the average P/E ratio for the investor's portfolio?

   .. doctest::

      >>> harmonic_mean([2.5, 3, 10])  # For an equal investment portfolio.
      3.6

   :exc:`StatisticsError` is raised if *data* is empty, or any element
   is less than zero.

   The current algorithm has an early-out when it encounters a zero
   in the input.  This means that the subsequent inputs are not tested
   for validity.  (This behavior may change in the future.)

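   For example, a single zero anywhere in the input short-circuits the
   computation and the result is zero (an illustrative case):

   .. doctest::

      >>> harmonic_mean([40, 0, 60])
      0
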
   .. versionadded:: 3.6


.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method.  If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterable.

   The median is a robust measure of central location and is less affected by
   the presence of outliers.  When the number of data points is odd, the
   middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suited for when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   If the data is ordinal (supports order operations) but not numeric (doesn't
   support addition), consider using :func:`median_low` or :func:`median_high`
   instead.

.. function:: median_low(data)

   Return the low median of numeric data.  If *data* is empty,
   :exc:`StatisticsError` is raised.  *data* can be a sequence or iterable.

   The low median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterable.

   The high median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterable.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
   is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc.  With the data
   given, the middle value falls somewhere in the class 3.5--4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1.  Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats.  This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.


.. function:: mode(data)

   Return the single most common data point from discrete or nominal *data*.
   The mode (when it exists) is the most typical value and serves as a
   measure of central location.

   If there are multiple modes with the same frequency, returns the first one
   encountered in the *data*.  If the smallest or largest of those is
   desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
   If the input *data* is empty, :exc:`StatisticsError` is raised.

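   For example, when two values are tied for most common, the one appearing
   first in the *data* is returned (illustrative values):

   .. doctest::

      >>> mode([4, 1, 4, 1, 9])
      4
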
   ``mode`` assumes discrete data and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this package that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

   .. versionchanged:: 3.8
      Now handles multimodal datasets by returning the first mode encountered.
      Formerly, it raised :exc:`StatisticsError` when more than one mode was
      found.


.. function:: multimode(data)

   Return a list of the most frequently occurring values in the order they
   were first encountered in the *data*.  Will return more than one result if
   there are multiple modes or an empty list if the *data* is empty:

   .. doctest::

        >>> multimode('aabbbbccddddeeffffgg')
        ['b', 'd', 'f']
        >>> multimode('')
        []

   .. versionadded:: 3.8


.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance).  See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty sequence or iterable
   of real-valued numbers.  Variance, or second moment about the mean, is a
   measure of the variability (spread or dispersion) of data.  A large
   variance indicates that the data is spread out; a small variance indicates
   it is clustered closely around the mean.

   If the optional second argument *mu* is given, it is typically the mean of
   the *data*.  It can also be used to compute the second moment around a
   point that is not the mean.  If it is missing or ``None`` (the default),
   the arithmetic mean is automatically calculated.

   Use this function to calculate the variance from the entire population.  To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ².  When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this
      function to calculate the variance of a sample, giving the known
      population mean as the second argument.  Provided the data points are a
      random sample of the population, the result will be an unbiased estimate
      of the population variance.


.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance).  See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers.  Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data.  A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*.  If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*.  Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom.  Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.

.. function:: quantiles(data, *, n=4, method='exclusive')

   Divide *data* into *n* continuous intervals with equal probability.
   Returns a list of ``n - 1`` cut points separating the intervals.

   Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.  Set
   *n* to 100 for percentiles, which gives the 99 cut points that separate
   *data* into 100 equal sized groups.  Raises :exc:`StatisticsError` if *n*
   is not at least 1.

   The *data* can be any iterable containing sample data.  For meaningful
   results, the number of data points in *data* should be larger than *n*.
   Raises :exc:`StatisticsError` if there are not at least two data points.

   The cut points are linearly interpolated from the
   two nearest data points.  For example, if a cut point falls one-third
   of the distance between two sample values, ``100`` and ``112``, the
   cut point will evaluate to ``104``.

   The *method* for computing quantiles can be varied depending on
   whether the *data* includes or excludes the lowest and
   highest possible values from the population.

   The default *method* is "exclusive" and is used for data sampled from
   a population that can have more extreme values than found in the
   samples.  The portion of the population falling below the *i-th* of
   *m* sorted data points is computed as ``i / (m + 1)``.  Given nine
   sample values, the method sorts them and assigns the following
   percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.

   Set the *method* to "inclusive" for describing population
   data or for samples that are known to include the most extreme values
   from the population.  The minimum value in *data* is treated as the 0th
   percentile and the maximum value is treated as the 100th percentile.
   The portion of the population falling below the *i-th* of *m* sorted
   data points is computed as ``(i - 1) / (m - 1)``.  Given 11 sample
   values, the method sorts them and assigns the following percentiles:
   0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

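   For example, following the formulas above, the two methods give different
   cut points for the same five illustrative values:

   .. doctest::

      >>> quantiles([1, 2, 3, 4, 5], n=4)
      [1.5, 3.0, 4.5]
      >>> quantiles([1, 2, 3, 4, 5], n=4, method='inclusive')
      [2.0, 3.0, 4.0]
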
   .. doctest::

        # Decile cut points for empirically sampled data
        >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
        ...         100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
        ...         106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
        ...         111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
        ...         103, 107, 101, 81, 109, 104]
        >>> [round(q, 1) for q in quantiles(data, n=10)]
        [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

   .. versionadded:: 3.8


Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.

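   For example, because :exc:`StatisticsError` is a subclass of
   :exc:`ValueError`, code that already catches :exc:`ValueError` will also
   catch statistics errors:

   .. doctest::

      >>> try:
      ...     mean([])
      ... except ValueError:
      ...     print('caught')
      caught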

:class:`NormalDist` objects
---------------------------

:class:`NormalDist` is a tool for creating and manipulating normal
distributions of a `random variable
<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_.  It is a
class that treats the mean and standard deviation of data
measurements as a single entity.

Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics.

.. class:: NormalDist(mu=0.0, sigma=1.0)

    Returns a new *NormalDist* object where *mu* represents the `arithmetic
    mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
    represents the `standard deviation
    <https://en.wikipedia.org/wiki/Standard_deviation>`_.

    If *sigma* is negative, raises :exc:`StatisticsError`.

    .. attribute:: mean

       A read-only property for the `arithmetic mean
       <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
       distribution.

    .. attribute:: median

       A read-only property for the `median
       <https://en.wikipedia.org/wiki/Median>`_ of a normal
       distribution.

    .. attribute:: mode

       A read-only property for the `mode
       <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
       distribution.

    .. attribute:: stdev

       A read-only property for the `standard deviation
       <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
       distribution.

    .. attribute:: variance

       A read-only property for the `variance
       <https://en.wikipedia.org/wiki/Variance>`_ of a normal
       distribution. Equal to the square of the standard deviation.

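    For example, an instance constructed from *mu* and *sigma* exposes those
    values through the read-only properties described above (for a normal
    distribution the mean, median and mode coincide; the numbers here are
    purely illustrative):

    .. doctest::

        >>> nd = NormalDist(mu=100.0, sigma=15.0)
        >>> nd.mean, nd.median, nd.mode
        (100.0, 100.0, 100.0)
        >>> nd.variance
        225.0
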
    .. classmethod:: NormalDist.from_samples(data)

       Makes a normal distribution instance with *mu* and *sigma* parameters
       estimated from the *data* using :func:`fmean` and :func:`stdev`.

       The *data* can be any :term:`iterable` and should consist of values
       that can be converted to type :class:`float`.  If *data* does not
       contain at least two elements, raises :exc:`StatisticsError` because it
       takes at least one point to estimate a central value and at least two
       points to estimate dispersion.

    .. method:: NormalDist.samples(n, *, seed=None)

       Generates *n* random samples for a given mean and standard deviation.
       Returns a :class:`list` of :class:`float` values.

       If *seed* is given, creates a new instance of the underlying random
       number generator.  This is useful for creating reproducible results,
       even in a multi-threading context.

    .. method:: NormalDist.pdf(x)

       Using a `probability density function (pdf)
       <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
       the relative likelihood that a random variable *X* will be near the
       given value *x*.  Mathematically, it is the limit of the ratio ``P(x <=
       X < x+dx) / dx`` as *dx* approaches zero.

       The relative likelihood is computed as the probability of a sample
       occurring in a narrow range divided by the width of the range (hence
       the word "density").  Since the likelihood is relative to other points,
       its value can be greater than ``1.0``.

    .. method:: NormalDist.cdf(x)

       Using a `cumulative distribution function (cdf)
       <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
       compute the probability that a random variable *X* will be less than or
       equal to *x*.  Mathematically, it is written ``P(X <= x)``.

    .. method:: NormalDist.inv_cdf(p)

       Compute the inverse cumulative distribution function, also known as the
       `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
       or the `percent-point
       <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
       function.  Mathematically, it is written ``x : P(X <= x) = p``.

       Finds the value *x* of the random variable *X* such that the
       probability of the variable being less than or equal to that value
       equals the given probability *p*.

    .. method:: NormalDist.overlap(other)

       Measures the agreement between two normal probability distributions.
       Returns a value between 0.0 and 1.0 giving `the overlapping area for
       the two probability density functions
       <https://www.rasch.org/rmt/rmt101r.htm>`_.

    .. method:: NormalDist.quantiles(n=4)

        Divide the normal distribution into *n* continuous intervals with
        equal probability.  Returns a list of ``n - 1`` cut points separating
        the intervals.

        Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.
        Set *n* to 100 for percentiles, which gives the 99 cut points that
        separate the normal distribution into 100 equal sized groups.

    .. method:: NormalDist.zscore(x)

        Compute the
        `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
        describing *x* in terms of the number of standard deviations
        above or below the mean of the normal distribution:
        ``(x - mean) / stdev``.

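        For example, a score of 130 on an exam with a mean of 100 and a
        standard deviation of 15 is two standard deviations above the mean
        (illustrative numbers):

        .. doctest::

            >>> NormalDist(100, 15).zscore(130)
            2.0
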
        .. versionadded:: 3.9

    Instances of :class:`NormalDist` support addition, subtraction,
    multiplication and division by a constant.  These operations
    are used for translation and scaling.  For example:

    .. doctest::

        >>> temperature_february = NormalDist(5, 2.5)             # Celsius
        >>> temperature_february * (9/5) + 32                     # Fahrenheit
        NormalDist(mu=41.0, sigma=4.5)

    Dividing a constant by an instance of :class:`NormalDist` is not supported
    because the result wouldn't be normally distributed.

    Since normal distributions arise from additive effects of independent
    variables, it is possible to `add and subtract two independent normally
    distributed random variables
    <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
    represented as instances of :class:`NormalDist`.  For example:

    .. doctest::

        >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
        >>> drug_effects = NormalDist(0.4, 0.15)
        >>> combined = birth_weights + drug_effects
        >>> round(combined.mean, 1)
        3.1
        >>> round(combined.stdev, 1)
        0.5

    .. versionadded:: 3.8


:class:`NormalDist` Examples and Recipes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`NormalDist` readily solves classic probability problems.

For example, given `historical data for SAT exams
<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
that scores are normally distributed with a mean of 1060 and a standard
deviation of 195, determine the percentage of students with test scores
between 1100 and 1200, after rounding to the nearest whole number:

.. doctest::

    >>> sat = NormalDist(1060, 195)
    >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
    >>> round(fraction * 100.0, 1)
    18.4

Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:

.. doctest::

    >>> list(map(round, sat.quantiles()))
    [928, 1060, 1192]
    >>> list(map(round, sat.quantiles(n=10)))
    [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]

To estimate the distribution for a model that isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:

.. doctest::

    >>> def model(x, y, z):
    ...     return (3*x + 7*x*y - 5*y) / (11 * z)
    ...
    >>> n = 100_000
    >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
    >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
    >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
    >>> quantiles(map(model, X, Y, Z))       # doctest: +SKIP
    [1.4591308524824727, 1.8035946855390597, 2.175091447274739]

Normal distributions can be used to approximate `Binomial
distributions <http://mathworld.wolfram.com/BinomialDistribution.html>`_
when the sample size is large and when the probability of a successful
trial is near 50%.

For example, an open source conference has 750 attendees and two rooms with a
500 person capacity.  There is a talk about Python and another about Ruby.
In previous conferences, 65% of the attendees preferred to listen to Python
talks.  Assuming the population preferences haven't changed, what is the
probability that the Python room will stay within its capacity limits?

.. doctest::

    >>> n = 750             # Sample size
    >>> p = 0.65            # Preference for Python
    >>> q = 1.0 - p         # Preference for Ruby
    >>> k = 500             # Room capacity

    >>> # Approximation using the cumulative normal distribution
    >>> from math import sqrt
    >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
    0.8402

    >>> # Solution using the cumulative binomial distribution
    >>> from math import comb, fsum
    >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
    0.8402

    >>> # Approximation using a simulation
    >>> from random import seed, choices
    >>> seed(8675309)
    >>> def trial():
    ...     return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
    >>> mean(trial() <= k for i in range(10_000))
    0.8398

Normal distributions commonly arise in machine learning problems.

Wikipedia has a `nice example of a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
The challenge is to predict a person's gender from measurements of normally
distributed features including height, weight, and foot size.

We're given a training dataset with measurements for eight people.  The
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:

.. doctest::

    >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
    >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
    >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
    >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
    >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
    >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

Next, we encounter a new person whose feature measurements are known but whose
gender is unknown:

.. doctest::

    >>> ht = 6.0        # height
    >>> wt = 130        # weight
    >>> fs = 8          # foot size

Starting with a 50% `prior probability
<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
we compute the posterior as the prior times the product of likelihoods for the
feature measurements given the gender:

.. doctest::

   >>> prior_male = 0.5
   >>> prior_female = 0.5
   >>> posterior_male = (prior_male * height_male.pdf(ht) *
   ...                   weight_male.pdf(wt) * foot_size_male.pdf(fs))

   >>> posterior_female = (prior_female * height_female.pdf(ht) *
   ...                     weight_female.pdf(wt) * foot_size_female.pdf(fs))

The final prediction goes to the largest posterior. This is known as the
`maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:

.. doctest::

  >>> 'male' if posterior_male > posterior_female else 'female'
  'female'


..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
