:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: Mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`~numbers.Real`-valued) data.

The module is not intended to be a competitor to third-party libraries such
as `NumPy <https://numpy.org>`_, `SciPy <https://www.scipy.org/>`_, or
proprietary full-featured statistics packages aimed at professional
statisticians such as Minitab, SAS and Matlab.  It is aimed at the level of
graphing and scientific calculators.

Unless explicitly noted, these functions support :class:`int`,
:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
currently unsupported.  Collections with a mix of types are also undefined
and implementation-dependent.  If your input data consists of mixed types,
you may be able to use :func:`map` to ensure a consistent result, for
example: ``map(float, input_data)``.

Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  ===============================================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`fmean`            Fast, floating point arithmetic mean.
:func:`geometric_mean`   Geometric mean of data.
:func:`harmonic_mean`    Harmonic mean of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Single mode (most common value) of discrete or nominal data.
:func:`multimode`        List of modes (most common values) of discrete or nominal data.
:func:`quantiles`        Divide data into intervals with equal probability.
=======================  ===============================================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================

Statistics for relations between two inputs
-------------------------------------------

These functions calculate statistics regarding relations between two inputs.

=========================  =====================================================
:func:`covariance`         Sample covariance for two variables.
:func:`correlation`        Pearson's correlation coefficient for two variables.
:func:`linear_regression`  Slope and intercept for simple linear regression.
=========================  =====================================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data* which can be a sequence or iterable.

   The arithmetic mean is the sum of the data divided by the number of data
   points.
   It is commonly called "the average", although it is only one of many
   different mathematical averages.  It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by `outliers
      <https://en.wikipedia.org/wiki/Outlier>`_ and is not necessarily a
      typical example of the data points.  For a more robust, although less
      efficient, measure of `central tendency
      <https://en.wikipedia.org/wiki/Central_tendency>`_, see :func:`median`.

      The sample mean gives an unbiased estimate of the true population mean,
      so that when taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population.  If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.


.. function:: fmean(data)

   Convert *data* to floats and compute the arithmetic mean.

   This runs faster than the :func:`mean` function and it always returns a
   :class:`float`.  The *data* may be a sequence or iterable.  If the input
   dataset is empty, raises a :exc:`StatisticsError`.

   .. doctest::

      >>> fmean([3.5, 4.0, 5.25])
      4.25

   .. versionadded:: 3.8


.. function:: geometric_mean(data)

   Convert *data* to floats and compute the geometric mean.
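   As an informal by-hand sketch of the definition (the geometric mean of
   *n* values is the *n*-th root of their product):

   ```python
   from statistics import geometric_mean

   data = [54, 24, 36]
   by_hand = (54 * 24 * 36) ** (1 / 3)  # n-th root of the product, computed directly
   print(round(by_hand, 1), round(geometric_mean(data), 1))  # both round to 36.0
   ```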
   The geometric mean indicates the central tendency or typical value of the
   *data* using the product of the values (as opposed to the arithmetic mean
   which uses their sum).

   Raises a :exc:`StatisticsError` if the input dataset is empty,
   if it contains a zero, or if it contains a negative value.
   The *data* may be a sequence or iterable.

   No special efforts are made to achieve exact results.
   (However, this may change in the future.)

   .. doctest::

      >>> round(geometric_mean([54, 24, 36]), 1)
      36.0

   .. versionadded:: 3.8


.. function:: harmonic_mean(data, weights=None)

   Return the harmonic mean of *data*, a sequence or iterable of
   real-valued numbers.  If *weights* is omitted or *None*, then
   equal weighting is assumed.

   The harmonic mean is the reciprocal of the arithmetic :func:`mean` of the
   reciprocals of the data.  For example, the harmonic mean of three values *a*,
   *b* and *c* will be equivalent to ``3/(1/a + 1/b + 1/c)``.  If one of the
   values is zero, the result will be zero.

   The harmonic mean is a type of average, a measure of the central
   location of the data.  It is often appropriate when averaging
   ratios or rates, for example speeds.

   Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
   What is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60])
      48.0

   Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
   speeds up to 60 km/hr for the remaining 30 km of the journey.  What
   is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60], weights=[5, 30])
      56.0

   :exc:`StatisticsError` is raised if *data* is empty, any element
   is less than zero, or if the weighted sum isn't positive.

   The current algorithm has an early-out when it encounters a zero
   in the input.
   This means that the subsequent inputs are not tested
   for validity.  (This behavior may change in the future.)

   .. versionadded:: 3.6

   .. versionchanged:: 3.10
      Added support for *weights*.

.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method.  If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterable.

   The median is a robust measure of central location and is less affected by
   the presence of outliers.  When the number of data points is odd, the
   middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suited for when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   If the data is ordinal (supports order operations) but not numeric (doesn't
   support addition), consider using :func:`median_low` or :func:`median_high`
   instead.

.. function:: median_low(data)

   Return the low median of numeric data.  If *data* is empty,
   :exc:`StatisticsError` is raised.  *data* can be a sequence or iterable.

   The low median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterable.
   The high median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterable.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
   is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc.  With the
   data given, the middle value falls somewhere in the class 3.5--4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1.  Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats.  This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).
      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.


.. function:: mode(data)

   Return the single most common data point from discrete or nominal *data*.
   The mode (when it exists) is the most typical value and serves as a
   measure of central location.

   If there are multiple modes with the same frequency, returns the first one
   encountered in the *data*.  If the smallest or largest of those is
   desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
   If the input *data* is empty, :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data and returns a single value.  This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this package that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

   .. versionchanged:: 3.8
      Now handles multimodal datasets by returning the first mode encountered.
      Formerly, it raised :exc:`StatisticsError` when more than one mode was
      found.


.. function:: multimode(data)

   Return a list of the most frequently occurring values in the order they
   were first encountered in the *data*.  Will return more than one result if
   there are multiple modes or an empty list if the *data* is empty:

   .. doctest::

      >>> multimode('aabbbbccddddeeffffgg')
      ['b', 'd', 'f']
      >>> multimode('')
      []

   .. versionadded:: 3.8


.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance).
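   As an informal sketch of that relationship (not one of the module's
   documented examples), the two functions can be checked against each other:

   ```python
   from math import isclose, sqrt
   from statistics import pstdev, pvariance

   data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]
   # The population standard deviation is the square root
   # of the population variance.
   assert isclose(pstdev(data), sqrt(pvariance(data)))
   ```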
   See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty sequence or iterable
   of real-valued numbers.  Variance, or second moment about the mean, is a
   measure of the variability (spread or dispersion) of data.  A large
   variance indicates that the data is spread out; a small variance indicates
   it is clustered closely around the mean.

   If the optional second argument *mu* is given, it is typically the mean of
   the *data*.  It can also be used to compute the second moment around a
   point that is not the mean.  If it is missing or ``None`` (the default),
   the arithmetic mean is automatically calculated.

   Use this function to calculate the variance from the entire population.  To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ².
      When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this
      function to calculate the variance of a sample, giving the known
      population mean as the second argument.  Provided the data points are a
      random sample of the population, the result will be an unbiased estimate
      of the population variance.


.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance).  See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers.  Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data.  A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*.  If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population.  To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*.  Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom.  Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.

.. function:: quantiles(data, *, n=4, method='exclusive')

   Divide *data* into *n* continuous intervals with equal probability.
   Returns a list of ``n - 1`` cut points separating the intervals.

   Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.  Set
   *n* to 100 for percentiles which gives the 99 cut points that separate
   *data* into 100 equal sized groups.  Raises :exc:`StatisticsError` if *n*
   is not at least 1.

   The *data* can be any iterable containing sample data.  For meaningful
   results, the number of data points in *data* should be larger than *n*.
   Raises :exc:`StatisticsError` if there are not at least two data points.

   The cut points are linearly interpolated from the
   two nearest data points.
   For example, if a cut point falls one-third
   of the distance between two sample values, ``100`` and ``112``, the
   cut-point will evaluate to ``104``.

   The *method* for computing quantiles can be varied depending on
   whether the *data* includes or excludes the lowest and
   highest possible values from the population.

   The default *method* is "exclusive" and is used for data sampled from
   a population that can have more extreme values than found in the
   samples.  The portion of the population falling below the *i-th* of
   *m* sorted data points is computed as ``i / (m + 1)``.  Given nine
   sample values, the method sorts them and assigns the following
   percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.

   Setting the *method* to "inclusive" is used for describing population
   data or for samples that are known to include the most extreme values
   from the population.  The minimum value in *data* is treated as the 0th
   percentile and the maximum value is treated as the 100th percentile.
   The portion of the population falling below the *i-th* of *m* sorted
   data points is computed as ``(i - 1) / (m - 1)``.  Given 11 sample
   values, the method sorts them and assigns the following percentiles:
   0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

   .. doctest::

      # Decile cut points for empirically sampled data
      >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
      ...         100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
      ...         106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
      ...         111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
      ...         103, 107, 101, 81, 109, 104]
      >>> [round(q, 1) for q in quantiles(data, n=10)]
      [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

   .. versionadded:: 3.8

.. function:: covariance(x, y, /)

   Return the sample covariance of two inputs *x* and *y*.
   Covariance
   is a measure of the joint variability of two inputs.

   Both inputs must be of the same length (no less than two), otherwise
   :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
      >>> covariance(x, y)
      0.75
      >>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> covariance(x, z)
      -7.5
      >>> covariance(z, x)
      -7.5

   .. versionadded:: 3.10

.. function:: correlation(x, y, /)

   Return the `Pearson's correlation coefficient
   <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
   for two inputs.  Pearson's correlation coefficient *r* takes values
   between -1 and +1.  It measures the strength and direction of the linear
   relationship, where +1 means very strong, positive linear relationship,
   -1 very strong, negative linear relationship, and 0 no linear relationship.

   Both inputs must be of the same length (no less than two), and
   neither can be constant; otherwise :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> correlation(x, x)
      1.0
      >>> correlation(x, y)
      -1.0

   .. versionadded:: 3.10

.. function:: linear_regression(x, y, /)

   Return the slope and intercept of `simple linear regression
   <https://en.wikipedia.org/wiki/Simple_linear_regression>`_
   parameters estimated using ordinary least squares.
   Simple linear
   regression describes the relationship between an independent variable *x* and
   a dependent variable *y* in terms of this linear function:

      *y = slope \* x + intercept + noise*

   where ``slope`` and ``intercept`` are the regression parameters that are
   estimated, and ``noise`` represents the
   variability of the data that was not explained by the linear regression
   (it is equal to the difference between predicted and actual values
   of the dependent variable).

   Both inputs must be of the same length (no less than two), and
   the independent variable *x* cannot be constant;
   otherwise a :exc:`StatisticsError` is raised.

   For example, we can use the `release dates of the Monty
   Python films <https://en.wikipedia.org/wiki/Monty_Python#Films>`_
   to predict the cumulative number of Monty Python films
   that would have been produced by 2019
   assuming that they had kept the pace.

   .. doctest::

      >>> year = [1971, 1975, 1979, 1982, 1983]
      >>> films_total = [1, 2, 3, 4, 5]
      >>> slope, intercept = linear_regression(year, films_total)
      >>> round(slope * 2019 + intercept)
      16

   .. versionadded:: 3.10


Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.


:class:`NormalDist` objects
---------------------------

:class:`NormalDist` is a tool for creating and manipulating normal
distributions of a `random variable
<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_.  It is a
class that treats the mean and standard deviation of data
measurements as a single entity.

Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics.

.. class:: NormalDist(mu=0.0, sigma=1.0)

   Returns a new *NormalDist* object where *mu* represents the `arithmetic
   mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
   represents the `standard deviation
   <https://en.wikipedia.org/wiki/Standard_deviation>`_.

   If *sigma* is negative, raises :exc:`StatisticsError`.

   .. attribute:: mean

      A read-only property for the `arithmetic mean
      <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
      distribution.

   .. attribute:: median

      A read-only property for the `median
      <https://en.wikipedia.org/wiki/Median>`_ of a normal
      distribution.

   .. attribute:: mode

      A read-only property for the `mode
      <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
      distribution.

   .. attribute:: stdev

      A read-only property for the `standard deviation
      <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
      distribution.

   .. attribute:: variance

      A read-only property for the `variance
      <https://en.wikipedia.org/wiki/Variance>`_ of a normal
      distribution.  Equal to the square of the standard deviation.

   .. classmethod:: NormalDist.from_samples(data)

      Makes a normal distribution instance with *mu* and *sigma* parameters
      estimated from the *data* using :func:`fmean` and :func:`stdev`.

      The *data* can be any :term:`iterable` and should consist of values
      that can be converted to type :class:`float`.  If *data* does not
      contain at least two elements, raises :exc:`StatisticsError` because it
      takes at least one point to estimate a central value and at least two
      points to estimate dispersion.

   .. method:: NormalDist.samples(n, *, seed=None)

      Generates *n* random samples for a given mean and standard deviation.
      Returns a :class:`list` of :class:`float` values.
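      As a rough sketch (the particular parameters and sample count here are
      arbitrary), the returned samples can be checked against the requested
      *mu* and *sigma*:

      ```python
      from statistics import NormalDist, fmean, stdev

      # Draw 1,000 samples from a distribution with mu=10 and sigma=2.5
      # (numbers chosen arbitrarily for illustration).
      data = NormalDist(10, 2.5).samples(1_000)
      assert len(data) == 1000 and all(isinstance(v, float) for v in data)
      print(round(fmean(data)))      # typically 10
      print(round(stdev(data), 1))   # typically close to 2.5
      ```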
      If *seed* is given, creates a new instance of the underlying random
      number generator.  This is useful for creating reproducible results,
      even in a multi-threading context.

   .. method:: NormalDist.pdf(x)

      Using a `probability density function (pdf)
      <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
      the relative likelihood that a random variable *X* will be near the
      given value *x*.  Mathematically, it is the limit of the ratio ``P(x <=
      X < x+dx) / dx`` as *dx* approaches zero.

      The relative likelihood is computed as the probability of a sample
      occurring in a narrow range divided by the width of the range (hence
      the word "density").  Since the likelihood is relative to other points,
      its value can be greater than ``1.0``.

   .. method:: NormalDist.cdf(x)

      Using a `cumulative distribution function (cdf)
      <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
      compute the probability that a random variable *X* will be less than or
      equal to *x*.  Mathematically, it is written ``P(X <= x)``.

   .. method:: NormalDist.inv_cdf(p)

      Compute the inverse cumulative distribution function, also known as the
      `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
      or the `percent-point
      <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
      function.  Mathematically, it is written ``x : P(X <= x) = p``.

      Finds the value *x* of the random variable *X* such that the
      probability of the variable being less than or equal to that value
      equals the given probability *p*.

   .. method:: NormalDist.overlap(other)

      Measures the agreement between two normal probability distributions.
      Returns a value between 0.0 and 1.0 giving `the overlapping area for
      the two probability density functions
      <https://www.rasch.org/rmt/rmt101r.htm>`_.

   .. method:: NormalDist.quantiles(n=4)

      Divide the normal distribution into *n* continuous intervals with
      equal probability.  Returns a list of (n - 1) cut points separating
      the intervals.

      Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.
      Set *n* to 100 for percentiles which gives the 99 cut points that
      separate the normal distribution into 100 equal sized groups.

   .. method:: NormalDist.zscore(x)

      Compute the
      `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
      describing *x* in terms of the number of standard deviations
      above or below the mean of the normal distribution:
      ``(x - mean) / stdev``.

      .. versionadded:: 3.9

   Instances of :class:`NormalDist` support addition, subtraction,
   multiplication and division by a constant.  These operations
   are used for translation and scaling.  For example:

   .. doctest::

      >>> temperature_february = NormalDist(5, 2.5)             # Celsius
      >>> temperature_february * (9/5) + 32                     # Fahrenheit
      NormalDist(mu=41.0, sigma=4.5)

   Dividing a constant by an instance of :class:`NormalDist` is not supported
   because the result wouldn't be normally distributed.

   Since normal distributions arise from additive effects of independent
   variables, it is possible to `add and subtract two independent normally
   distributed random variables
   <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
   represented as instances of :class:`NormalDist`.  For example:

   .. doctest::

      >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
      >>> drug_effects = NormalDist(0.4, 0.15)
      >>> combined = birth_weights + drug_effects
      >>> round(combined.mean, 1)
      3.1
      >>> round(combined.stdev, 1)
      0.5

   .. versionadded:: 3.8


:class:`NormalDist` Examples and Recipes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`NormalDist` readily solves classic probability problems.

For example, given `historical data for SAT exams
<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
that scores are normally distributed with a mean of 1060 and a standard
deviation of 195, determine the percentage of students with test scores
between 1100 and 1200, after rounding to the nearest whole number:

.. doctest::

   >>> sat = NormalDist(1060, 195)
   >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
   >>> round(fraction * 100.0, 1)
   18.4

Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:

.. doctest::

   >>> list(map(round, sat.quantiles()))
   [928, 1060, 1192]
   >>> list(map(round, sat.quantiles(n=10)))
   [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]

To estimate the distribution for a model that isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:

.. doctest::

   >>> def model(x, y, z):
   ...     return (3*x + 7*x*y - 5*y) / (11 * z)
   ...
   >>> n = 100_000
   >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
   >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
   >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
   >>> quantiles(map(model, X, Y, Z))       # doctest: +SKIP
   [1.4591308524824727, 1.8035946855390597, 2.175091447274739]

Normal distributions can be used to approximate `Binomial
distributions <http://mathworld.wolfram.com/BinomialDistribution.html>`_
when the sample size is large and when the probability of a successful
trial is near 50%.
For example, an open source conference has 750 attendees and two rooms with a
500 person capacity.  There is a talk about Python and another about Ruby.
In previous conferences, 65% of the attendees preferred to listen to Python
talks.  Assuming the population preferences haven't changed, what is the
probability that the Python room will stay within its capacity limits?

.. doctest::

   >>> n = 750             # Sample size
   >>> p = 0.65            # Preference for Python
   >>> q = 1.0 - p         # Preference for Ruby
   >>> k = 500             # Room capacity

   >>> # Approximation using the cumulative normal distribution
   >>> from math import sqrt
   >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
   0.8402

   >>> # Solution using the cumulative binomial distribution
   >>> from math import comb, fsum
   >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
   0.8402

   >>> # Approximation using a simulation
   >>> from random import seed, choices
   >>> seed(8675309)
   >>> def trial():
   ...     return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
   >>> mean(trial() <= k for i in range(10_000))
   0.8398

Normal distributions commonly arise in machine learning problems.

Wikipedia has a `nice example of a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
The challenge is to predict a person's gender from measurements of normally
distributed features including height, weight, and foot size.

We're given a training dataset with measurements for eight people.  The
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:

.. doctest::

   >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
   >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
   >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
   >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
   >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
   >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

Next, we encounter a new person whose feature measurements are known but whose
gender is unknown:

.. doctest::

   >>> ht = 6.0        # height
   >>> wt = 130        # weight
   >>> fs = 8          # foot size

Starting with a 50% `prior probability
<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
we compute the posterior as the prior times the product of likelihoods for the
feature measurements given the gender:

.. doctest::

   >>> prior_male = 0.5
   >>> prior_female = 0.5
   >>> posterior_male = (prior_male * height_male.pdf(ht) *
   ...                   weight_male.pdf(wt) * foot_size_male.pdf(fs))

   >>> posterior_female = (prior_female * height_female.pdf(ht) *
   ...                     weight_female.pdf(wt) * foot_size_female.pdf(fs))

The final prediction goes to the largest posterior.  This is known as the
`maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:

.. doctest::

   >>> 'male' if posterior_male > posterior_female else 'female'
   'female'


..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;