1:mod:`statistics` --- Mathematical statistics functions 2======================================================= 3 4.. module:: statistics 5 :synopsis: Mathematical statistics functions 6 7.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info> 8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info> 9 10.. versionadded:: 3.4 11 12**Source code:** :source:`Lib/statistics.py` 13 14.. testsetup:: * 15 16 from statistics import * 17 __name__ = '<doctest>' 18 19-------------- 20 21This module provides functions for calculating mathematical statistics of 22numeric (:class:`~numbers.Real`-valued) data. 23 24The module is not intended to be a competitor to third-party libraries such 25as `NumPy <https://numpy.org>`_, `SciPy <https://www.scipy.org/>`_, or 26proprietary full-featured statistics packages aimed at professional 27statisticians such as Minitab, SAS and Matlab. It is aimed at the level of 28graphing and scientific calculators. 29 30Unless explicitly noted, these functions support :class:`int`, 31:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`. 32Behaviour with other types (whether in the numeric tower or not) is 33currently unsupported. Collections with a mix of types are also undefined 34and implementation-dependent. If your input data consists of mixed types, 35you may be able to use :func:`map` to ensure a consistent result, for 36example: ``map(float, input_data)``. 37 38Averages and measures of central location 39----------------------------------------- 40 41These functions calculate an average or typical value from a population 42or sample. 43 44======================= =============================================================== 45:func:`mean` Arithmetic mean ("average") of data. 46:func:`fmean` Fast, floating point arithmetic mean. 47:func:`geometric_mean` Geometric mean of data. 48:func:`harmonic_mean` Harmonic mean of data. 49:func:`median` Median (middle value) of data. 50:func:`median_low` Low median of data. 51:func:`median_high` High median of data. 52:func:`median_grouped` Median, or 50th percentile, of grouped data. 53:func:`mode` Single mode (most common value) of discrete or nominal data. 54:func:`multimode` List of modes (most common values) of discrete or nomimal data. 55:func:`quantiles` Divide data into intervals with equal probability. 56======================= =============================================================== 57 58Measures of spread 59------------------ 60 61These functions calculate a measure of how much the population or sample 62tends to deviate from the typical or average values. 63 64======================= ============================================= 65:func:`pstdev` Population standard deviation of data. 66:func:`pvariance` Population variance of data. 67:func:`stdev` Sample standard deviation of data. 68:func:`variance` Sample variance of data. 69======================= ============================================= 70 71 72Function details 73---------------- 74 75Note: The functions do not require the data given to them to be sorted. 76However, for reading convenience, most of the examples show sorted sequences. 77 78.. function:: mean(data) 79 80 Return the sample arithmetic mean of *data* which can be a sequence or iterable. 81 82 The arithmetic mean is the sum of the data divided by the number of data 83 points. It is commonly called "the average", although it is only one of many 84 different mathematical averages. It is a measure of the central location of 85 the data. 86 87 If *data* is empty, :exc:`StatisticsError` will be raised. 88 89 Some examples of use: 90 91 .. doctest:: 92 93 >>> mean([1, 2, 3, 4, 4]) 94 2.8 95 >>> mean([-1.0, 2.5, 3.25, 5.75]) 96 2.625 97 98 >>> from fractions import Fraction as F 99 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)]) 100 Fraction(13, 21) 101 102 >>> from decimal import Decimal as D 103 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")]) 104 Decimal('0.5625') 105 106 .. note:: 107 108 The mean is strongly affected by outliers and is not a robust estimator 109 for central location: the mean is not necessarily a typical example of 110 the data points. For more robust measures of central location, see 111 :func:`median` and :func:`mode`. 112 113 The sample mean gives an unbiased estimate of the true population mean, 114 so that when taken on average over all the possible samples, 115 ``mean(sample)`` converges on the true mean of the entire population. If 116 *data* represents the entire population rather than a sample, then 117 ``mean(data)`` is equivalent to calculating the true population mean μ. 118 119 120.. function:: fmean(data) 121 122 Convert *data* to floats and compute the arithmetic mean. 123 124 This runs faster than the :func:`mean` function and it always returns a 125 :class:`float`. The *data* may be a sequence or iterable. If the input 126 dataset is empty, raises a :exc:`StatisticsError`. 127 128 .. doctest:: 129 130 >>> fmean([3.5, 4.0, 5.25]) 131 4.25 132 133 .. versionadded:: 3.8 134 135 136.. function:: geometric_mean(data) 137 138 Convert *data* to floats and compute the geometric mean. 139 140 The geometric mean indicates the central tendency or typical value of the 141 *data* using the product of the values (as opposed to the arithmetic mean 142 which uses their sum). 143 144 Raises a :exc:`StatisticsError` if the input dataset is empty, 145 if it contains a zero, or if it contains a negative value. 146 The *data* may be a sequence or iterable. 147 148 No special efforts are made to achieve exact results. 149 (However, this may change in the future.) 150 151 .. doctest:: 152 153 >>> round(geometric_mean([54, 24, 36]), 1) 154 36.0 155 156 .. versionadded:: 3.8 157 158 159.. function:: harmonic_mean(data) 160 161 Return the harmonic mean of *data*, a sequence or iterable of 162 real-valued numbers. 163 164 The harmonic mean, sometimes called the subcontrary mean, is the 165 reciprocal of the arithmetic :func:`mean` of the reciprocals of the 166 data. For example, the harmonic mean of three values *a*, *b* and *c* 167 will be equivalent to ``3/(1/a + 1/b + 1/c)``. If one of the values 168 is zero, the result will be zero. 169 170 The harmonic mean is a type of average, a measure of the central 171 location of the data. It is often appropriate when averaging 172 rates or ratios, for example speeds. 173 174 Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr. 175 What is the average speed? 176 177 .. doctest:: 178 179 >>> harmonic_mean([40, 60]) 180 48.0 181 182 Suppose an investor purchases an equal value of shares in each of 183 three companies, with P/E (price/earning) ratios of 2.5, 3 and 10. 184 What is the average P/E ratio for the investor's portfolio? 185 186 .. doctest:: 187 188 >>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio. 189 3.6 190 191 :exc:`StatisticsError` is raised if *data* is empty, or any element 192 is less than zero. 193 194 The current algorithm has an early-out when it encounters a zero 195 in the input. This means that the subsequent inputs are not tested 196 for validity. (This behavior may change in the future.) 197 198 .. versionadded:: 3.6 199 200 201.. function:: median(data) 202 203 Return the median (middle value) of numeric data, using the common "mean of 204 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised. 205 *data* can be a sequence or iterable. 206 207 The median is a robust measure of central location and is less affected by 208 the presence of outliers. When the number of data points is odd, the 209 middle data point is returned: 210 211 .. doctest:: 212 213 >>> median([1, 3, 5]) 214 3 215 216 When the number of data points is even, the median is interpolated by taking 217 the average of the two middle values: 218 219 .. doctest:: 220 221 >>> median([1, 3, 5, 7]) 222 4.0 223 224 This is suited for when your data is discrete, and you don't mind that the 225 median may not be an actual data point. 226 227 If the data is ordinal (supports order operations) but not numeric (doesn't 228 support addition), consider using :func:`median_low` or :func:`median_high` 229 instead. 230 231.. function:: median_low(data) 232 233 Return the low median of numeric data. If *data* is empty, 234 :exc:`StatisticsError` is raised. *data* can be a sequence or iterable. 235 236 The low median is always a member of the data set. When the number of data 237 points is odd, the middle value is returned. When it is even, the smaller of 238 the two middle values is returned. 239 240 .. doctest:: 241 242 >>> median_low([1, 3, 5]) 243 3 244 >>> median_low([1, 3, 5, 7]) 245 3 246 247 Use the low median when your data are discrete and you prefer the median to 248 be an actual data point rather than interpolated. 249 250 251.. function:: median_high(data) 252 253 Return the high median of data. If *data* is empty, :exc:`StatisticsError` 254 is raised. *data* can be a sequence or iterable. 255 256 The high median is always a member of the data set. When the number of data 257 points is odd, the middle value is returned. When it is even, the larger of 258 the two middle values is returned. 259 260 .. doctest:: 261 262 >>> median_high([1, 3, 5]) 263 3 264 >>> median_high([1, 3, 5, 7]) 265 5 266 267 Use the high median when your data are discrete and you prefer the median to 268 be an actual data point rather than interpolated. 269 270 271.. function:: median_grouped(data, interval=1) 272 273 Return the median of grouped continuous data, calculated as the 50th 274 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError` 275 is raised. *data* can be a sequence or iterable. 276 277 .. doctest:: 278 279 >>> median_grouped([52, 52, 53, 54]) 280 52.5 281 282 In the following example, the data are rounded, so that each value represents 283 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2 284 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data 285 given, the middle value falls somewhere in the class 3.5--4.5, and 286 interpolation is used to estimate it: 287 288 .. doctest:: 289 290 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]) 291 3.7 292 293 Optional argument *interval* represents the class interval, and defaults 294 to 1. Changing the class interval naturally will change the interpolation: 295 296 .. doctest:: 297 298 >>> median_grouped([1, 3, 3, 5, 7], interval=1) 299 3.25 300 >>> median_grouped([1, 3, 3, 5, 7], interval=2) 301 3.5 302 303 This function does not check whether the data points are at least 304 *interval* apart. 305 306 .. impl-detail:: 307 308 Under some circumstances, :func:`median_grouped` may coerce data points to 309 floats. This behaviour is likely to change in the future. 310 311 .. seealso:: 312 313 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and 314 Larry B Wallnau (8th Edition). 315 316 * The `SSMEDIAN 317 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_ 318 function in the Gnome Gnumeric spreadsheet, including `this discussion 319 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_. 320 321 322.. function:: mode(data) 323 324 Return the single most common data point from discrete or nominal *data*. 325 The mode (when it exists) is the most typical value and serves as a 326 measure of central location. 327 328 If there are multiple modes with the same frequency, returns the first one 329 encountered in the *data*. If the smallest or largest of those is 330 desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``. 331 If the input *data* is empty, :exc:`StatisticsError` is raised. 332 333 ``mode`` assumes discrete data and returns a single value. This is the 334 standard treatment of the mode as commonly taught in schools: 335 336 .. doctest:: 337 338 >>> mode([1, 1, 2, 3, 3, 3, 3, 4]) 339 3 340 341 The mode is unique in that it is the only statistic in this package that 342 also applies to nominal (non-numeric) data: 343 344 .. doctest:: 345 346 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"]) 347 'red' 348 349 .. versionchanged:: 3.8 350 Now handles multimodal datasets by returning the first mode encountered. 351 Formerly, it raised :exc:`StatisticsError` when more than one mode was 352 found. 353 354 355.. function:: multimode(data) 356 357 Return a list of the most frequently occurring values in the order they 358 were first encountered in the *data*. Will return more than one result if 359 there are multiple modes or an empty list if the *data* is empty: 360 361 .. doctest:: 362 363 >>> multimode('aabbbbccddddeeffffgg') 364 ['b', 'd', 'f'] 365 >>> multimode('') 366 [] 367 368 .. versionadded:: 3.8 369 370 371.. function:: pstdev(data, mu=None) 372 373 Return the population standard deviation (the square root of the population 374 variance). See :func:`pvariance` for arguments and other details. 375 376 .. doctest:: 377 378 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]) 379 0.986893273527251 380 381 382.. function:: pvariance(data, mu=None) 383 384 Return the population variance of *data*, a non-empty sequence or iterable 385 of real-valued numbers. Variance, or second moment about the mean, is a 386 measure of the variability (spread or dispersion) of data. A large 387 variance indicates that the data is spread out; a small variance indicates 388 it is clustered closely around the mean. 389 390 If the optional second argument *mu* is given, it is typically the mean of 391 the *data*. It can also be used to compute the second moment around a 392 point that is not the mean. If it is missing or ``None`` (the default), 393 the arithmetic mean is automatically calculated. 394 395 Use this function to calculate the variance from the entire population. To 396 estimate the variance from a sample, the :func:`variance` function is usually 397 a better choice. 398 399 Raises :exc:`StatisticsError` if *data* is empty. 400 401 Examples: 402 403 .. doctest:: 404 405 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25] 406 >>> pvariance(data) 407 1.25 408 409 If you have already calculated the mean of your data, you can pass it as the 410 optional second argument *mu* to avoid recalculation: 411 412 .. doctest:: 413 414 >>> mu = mean(data) 415 >>> pvariance(data, mu) 416 1.25 417 418 Decimals and Fractions are supported: 419 420 .. doctest:: 421 422 >>> from decimal import Decimal as D 423 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")]) 424 Decimal('24.815') 425 426 >>> from fractions import Fraction as F 427 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)]) 428 Fraction(13, 72) 429 430 .. note:: 431 432 When called with the entire population, this gives the population variance 433 σ². When called on a sample instead, this is the biased sample variance 434 s², also known as variance with N degrees of freedom. 435 436 If you somehow know the true population mean μ, you may use this 437 function to calculate the variance of a sample, giving the known 438 population mean as the second argument. Provided the data points are a 439 random sample of the population, the result will be an unbiased estimate 440 of the population variance. 441 442 443.. function:: stdev(data, xbar=None) 444 445 Return the sample standard deviation (the square root of the sample 446 variance). See :func:`variance` for arguments and other details. 447 448 .. doctest:: 449 450 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]) 451 1.0810874155219827 452 453 454.. function:: variance(data, xbar=None) 455 456 Return the sample variance of *data*, an iterable of at least two real-valued 457 numbers. Variance, or second moment about the mean, is a measure of the 458 variability (spread or dispersion) of data. A large variance indicates that 459 the data is spread out; a small variance indicates it is clustered closely 460 around the mean. 461 462 If the optional second argument *xbar* is given, it should be the mean of 463 *data*. If it is missing or ``None`` (the default), the mean is 464 automatically calculated. 465 466 Use this function when your data is a sample from a population. To calculate 467 the variance from the entire population, see :func:`pvariance`. 468 469 Raises :exc:`StatisticsError` if *data* has fewer than two values. 470 471 Examples: 472 473 .. doctest:: 474 475 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5] 476 >>> variance(data) 477 1.3720238095238095 478 479 If you have already calculated the mean of your data, you can pass it as the 480 optional second argument *xbar* to avoid recalculation: 481 482 .. doctest:: 483 484 >>> m = mean(data) 485 >>> variance(data, m) 486 1.3720238095238095 487 488 This function does not attempt to verify that you have passed the actual mean 489 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or 490 impossible results. 491 492 Decimal and Fraction values are supported: 493 494 .. doctest:: 495 496 >>> from decimal import Decimal as D 497 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")]) 498 Decimal('31.01875') 499 500 >>> from fractions import Fraction as F 501 >>> variance([F(1, 6), F(1, 2), F(5, 3)]) 502 Fraction(67, 108) 503 504 .. note:: 505 506 This is the sample variance s² with Bessel's correction, also known as 507 variance with N-1 degrees of freedom. Provided that the data points are 508 representative (e.g. independent and identically distributed), the result 509 should be an unbiased estimate of the true population variance. 510 511 If you somehow know the actual population mean μ you should pass it to the 512 :func:`pvariance` function as the *mu* parameter to get the variance of a 513 sample. 514 515.. function:: quantiles(data, *, n=4, method='exclusive') 516 517 Divide *data* into *n* continuous intervals with equal probability. 518 Returns a list of ``n - 1`` cut points separating the intervals. 519 520 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set 521 *n* to 100 for percentiles which gives the 99 cuts points that separate 522 *data* into 100 equal sized groups. Raises :exc:`StatisticsError` if *n* 523 is not least 1. 524 525 The *data* can be any iterable containing sample data. For meaningful 526 results, the number of data points in *data* should be larger than *n*. 527 Raises :exc:`StatisticsError` if there are not at least two data points. 528 529 The cut points are linearly interpolated from the 530 two nearest data points. For example, if a cut point falls one-third 531 of the distance between two sample values, ``100`` and ``112``, the 532 cut-point will evaluate to ``104``. 533 534 The *method* for computing quantiles can be varied depending on 535 whether the *data* includes or excludes the lowest and 536 highest possible values from the population. 537 538 The default *method* is "exclusive" and is used for data sampled from 539 a population that can have more extreme values than found in the 540 samples. The portion of the population falling below the *i-th* of 541 *m* sorted data points is computed as ``i / (m + 1)``. Given nine 542 sample values, the method sorts them and assigns the following 543 percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%. 544 545 Setting the *method* to "inclusive" is used for describing population 546 data or for samples that are known to include the most extreme values 547 from the population. The minimum value in *data* is treated as the 0th 548 percentile and the maximum value is treated as the 100th percentile. 549 The portion of the population falling below the *i-th* of *m* sorted 550 data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample 551 values, the method sorts them and assigns the following percentiles: 552 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%. 553 554 .. doctest:: 555 556 # Decile cut points for empirically sampled data 557 >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 558 ... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129, 559 ... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86, 560 ... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95, 561 ... 103, 107, 101, 81, 109, 104] 562 >>> [round(q, 1) for q in quantiles(data, n=10)] 563 [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0] 564 565 .. versionadded:: 3.8 566 567 568Exceptions 569---------- 570 571A single exception is defined: 572 573.. exception:: StatisticsError 574 575 Subclass of :exc:`ValueError` for statistics-related exceptions. 576 577 578:class:`NormalDist` objects 579--------------------------- 580 581:class:`NormalDist` is a tool for creating and manipulating normal 582distributions of a `random variable 583<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a 584class that treats the mean and standard deviation of data 585measurements as a single entity. 586 587Normal distributions arise from the `Central Limit Theorem 588<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range 589of applications in statistics. 590 591.. class:: NormalDist(mu=0.0, sigma=1.0) 592 593 Returns a new *NormalDist* object where *mu* represents the `arithmetic 594 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma* 595 represents the `standard deviation 596 <https://en.wikipedia.org/wiki/Standard_deviation>`_. 597 598 If *sigma* is negative, raises :exc:`StatisticsError`. 599 600 .. attribute:: mean 601 602 A read-only property for the `arithmetic mean 603 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal 604 distribution. 605 606 .. attribute:: median 607 608 A read-only property for the `median 609 <https://en.wikipedia.org/wiki/Median>`_ of a normal 610 distribution. 611 612 .. attribute:: mode 613 614 A read-only property for the `mode 615 <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal 616 distribution. 617 618 .. attribute:: stdev 619 620 A read-only property for the `standard deviation 621 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal 622 distribution. 623 624 .. attribute:: variance 625 626 A read-only property for the `variance 627 <https://en.wikipedia.org/wiki/Variance>`_ of a normal 628 distribution. Equal to the square of the standard deviation. 629 630 .. classmethod:: NormalDist.from_samples(data) 631 632 Makes a normal distribution instance with *mu* and *sigma* parameters 633 estimated from the *data* using :func:`fmean` and :func:`stdev`. 634 635 The *data* can be any :term:`iterable` and should consist of values 636 that can be converted to type :class:`float`. If *data* does not 637 contain at least two elements, raises :exc:`StatisticsError` because it 638 takes at least one point to estimate a central value and at least two 639 points to estimate dispersion. 640 641 .. method:: NormalDist.samples(n, *, seed=None) 642 643 Generates *n* random samples for a given mean and standard deviation. 644 Returns a :class:`list` of :class:`float` values. 645 646 If *seed* is given, creates a new instance of the underlying random 647 number generator. This is useful for creating reproducible results, 648 even in a multi-threading context. 649 650 .. method:: NormalDist.pdf(x) 651 652 Using a `probability density function (pdf) 653 <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute 654 the relative likelihood that a random variable *X* will be near the 655 given value *x*. Mathematically, it is the limit of the ratio ``P(x <= 656 X < x+dx) / dx`` as *dx* approaches zero. 657 658 The relative likelihood is computed as the probability of a sample 659 occurring in a narrow range divided by the width of the range (hence 660 the word "density"). Since the likelihood is relative to other points, 661 its value can be greater than `1.0`. 662 663 .. method:: NormalDist.cdf(x) 664 665 Using a `cumulative distribution function (cdf) 666 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_, 667 compute the probability that a random variable *X* will be less than or 668 equal to *x*. Mathematically, it is written ``P(X <= x)``. 669 670 .. method:: NormalDist.inv_cdf(p) 671 672 Compute the inverse cumulative distribution function, also known as the 673 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_ 674 or the `percent-point 675 <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_ 676 function. Mathematically, it is written ``x : P(X <= x) = p``. 677 678 Finds the value *x* of the random variable *X* such that the 679 probability of the variable being less than or equal to that value 680 equals the given probability *p*. 681 682 .. method:: NormalDist.overlap(other) 683 684 Measures the agreement between two normal probability distributions. 685 Returns a value between 0.0 and 1.0 giving `the overlapping area for 686 the two probability density functions 687 <https://www.rasch.org/rmt/rmt101r.htm>`_. 688 689 .. method:: NormalDist.quantiles(n=4) 690 691 Divide the normal distribution into *n* continuous intervals with 692 equal probability. Returns a list of (n - 1) cut points separating 693 the intervals. 694 695 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. 696 Set *n* to 100 for percentiles which gives the 99 cuts points that 697 separate the normal distribution into 100 equal sized groups. 698 699 .. method:: NormalDist.zscore(x) 700 701 Compute the 702 `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_ 703 describing *x* in terms of the number of standard deviations 704 above or below the mean of the normal distribution: 705 ``(x - mean) / stdev``. 706 707 .. versionadded:: 3.9 708 709 Instances of :class:`NormalDist` support addition, subtraction, 710 multiplication and division by a constant. These operations 711 are used for translation and scaling. For example: 712 713 .. doctest:: 714 715 >>> temperature_february = NormalDist(5, 2.5) # Celsius 716 >>> temperature_february * (9/5) + 32 # Fahrenheit 717 NormalDist(mu=41.0, sigma=4.5) 718 719 Dividing a constant by an instance of :class:`NormalDist` is not supported 720 because the result wouldn't be normally distributed. 721 722 Since normal distributions arise from additive effects of independent 723 variables, it is possible to `add and subtract two independent normally 724 distributed random variables 725 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_ 726 represented as instances of :class:`NormalDist`. For example: 727 728 .. doctest:: 729 730 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5]) 731 >>> drug_effects = NormalDist(0.4, 0.15) 732 >>> combined = birth_weights + drug_effects 733 >>> round(combined.mean, 1) 734 3.1 735 >>> round(combined.stdev, 1) 736 0.5 737 738 .. versionadded:: 3.8 739 740 741:class:`NormalDist` Examples and Recipes 742^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 743 744:class:`NormalDist` readily solves classic probability problems. 745 746For example, given `historical data for SAT exams 747<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing 748that scores are normally distributed with a mean of 1060 and a standard 749deviation of 195, determine the percentage of students with test scores 750between 1100 and 1200, after rounding to the nearest whole number: 751 752.. doctest:: 753 754 >>> sat = NormalDist(1060, 195) 755 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5) 756 >>> round(fraction * 100.0, 1) 757 18.4 758 759Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles 760<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores: 761 762.. doctest:: 763 764 >>> list(map(round, sat.quantiles())) 765 [928, 1060, 1192] 766 >>> list(map(round, sat.quantiles(n=10))) 767 [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310] 768 769To estimate the distribution for a model than isn't easy to solve 770analytically, :class:`NormalDist` can generate input samples for a `Monte 771Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_: 772 773.. doctest:: 774 775 >>> def model(x, y, z): 776 ... return (3*x + 7*x*y - 5*y) / (11 * z) 777 ... 778 >>> n = 100_000 779 >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728) 780 >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471) 781 >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453) 782 >>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP 783 [1.4591308524824727, 1.8035946855390597, 2.175091447274739] 784 785Normal distributions can be used to approximate `Binomial 786distributions <http://mathworld.wolfram.com/BinomialDistribution.html>`_ 787when the sample size is large and when the probability of a successful 788trial is near 50%. 789 790For example, an open source conference has 750 attendees and two rooms with a 791500 person capacity. There is a talk about Python and another about Ruby. 792In previous conferences, 65% of the attendees preferred to listen to Python 793talks. Assuming the population preferences haven't changed, what is the 794probability that the Python room will stay within its capacity limits? 795 796.. doctest:: 797 798 >>> n = 750 # Sample size 799 >>> p = 0.65 # Preference for Python 800 >>> q = 1.0 - p # Preference for Ruby 801 >>> k = 500 # Room capacity 802 803 >>> # Approximation using the cumulative normal distribution 804 >>> from math import sqrt 805 >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4) 806 0.8402 807 808 >>> # Solution using the cumulative binomial distribution 809 >>> from math import comb, fsum 810 >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4) 811 0.8402 812 813 >>> # Approximation using a simulation 814 >>> from random import seed, choices 815 >>> seed(8675309) 816 >>> def trial(): 817 ... return choices(('Python', 'Ruby'), (p, q), k=n).count('Python') 818 >>> mean(trial() <= k for i in range(10_000)) 819 0.8398 820 821Normal distributions commonly arise in machine learning problems. 822 823Wikipedia has a `nice example of a Naive Bayesian Classifier 824<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_. 825The challenge is to predict a person's gender from measurements of normally 826distributed features including height, weight, and foot size. 827 828We're given a training dataset with measurements for eight people. The 829measurements are assumed to be normally distributed, so we summarize the data 830with :class:`NormalDist`: 831 832.. doctest:: 833 834 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92]) 835 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75]) 836 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165]) 837 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150]) 838 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10]) 839 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9]) 840 841Next, we encounter a new person whose feature measurements are known but whose 842gender is unknown: 843 844.. doctest:: 845 846 >>> ht = 6.0 # height 847 >>> wt = 130 # weight 848 >>> fs = 8 # foot size 849 850Starting with a 50% `prior probability 851<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female, 852we compute the posterior as the prior times the product of likelihoods for the 853feature measurements given the gender: 854 855.. doctest:: 856 857 >>> prior_male = 0.5 858 >>> prior_female = 0.5 859 >>> posterior_male = (prior_male * height_male.pdf(ht) * 860 ... weight_male.pdf(wt) * foot_size_male.pdf(fs)) 861 862 >>> posterior_female = (prior_female * height_female.pdf(ht) * 863 ... weight_female.pdf(wt) * foot_size_female.pdf(fs)) 864 865The final prediction goes to the largest posterior. This is known as the 866`maximum a posteriori 867<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP: 868 869.. doctest:: 870 871 >>> 'male' if posterior_male > posterior_female else 'female' 872 'female' 873 874 875.. 876 # This modelines must appear within the last ten lines of the file. 877 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8; 878