:mod:`!statistics` --- Mathematical statistics functions
========================================================

.. module:: statistics
   :synopsis: Mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   import math
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`~numbers.Real`-valued) data.

The module is not intended to be a competitor to third-party libraries such
as `NumPy <https://numpy.org>`_, `SciPy <https://scipy.org/>`_, or
proprietary full-featured statistics packages aimed at professional
statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
graphing and scientific calculators.

Unless explicitly noted, these functions support :class:`int`,
:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
currently unsupported. Collections with a mix of types are also undefined
and implementation-dependent. If your input data consists of mixed types,
you may be able to use :func:`map` to ensure a consistent result, for
example: ``map(float, input_data)``.

Some datasets use ``NaN`` (not a number) values to represent missing data.
Since NaNs have unusual comparison semantics, they cause surprising or
undefined behaviors in the statistics functions that sort data or that count
occurrences. The functions affected are ``median()``, ``median_low()``,
``median_high()``, ``median_grouped()``, ``mode()``, ``multimode()``, and
``quantiles()``.
The ``NaN`` values should be stripped before calling these
functions::

    >>> from statistics import median
    >>> from math import isnan
    >>> from itertools import filterfalse

    >>> data = [20.7, float('NaN'), 19.2, 18.3, float('NaN'), 14.4]
    >>> sorted(data)  # This has surprising behavior
    [20.7, nan, 14.4, 18.3, 19.2, nan]
    >>> median(data)  # This result is unexpected
    16.35

    >>> sum(map(isnan, data))  # Number of missing values
    2
    >>> clean = list(filterfalse(isnan, data))  # Strip NaN values
    >>> clean
    [20.7, 19.2, 18.3, 14.4]
    >>> sorted(clean)  # Sorting now works as expected
    [14.4, 18.3, 19.2, 20.7]
    >>> median(clean)  # This result is now well defined
    18.75


Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

======================= ===============================================================
:func:`mean`            Arithmetic mean ("average") of data.
:func:`fmean`           Fast, floating-point arithmetic mean, with optional weighting.
:func:`geometric_mean`  Geometric mean of data.
:func:`harmonic_mean`   Harmonic mean of data.
:func:`kde`             Estimate the probability density distribution of the data.
:func:`kde_random`      Random sampling from the PDF generated by kde().
:func:`median`          Median (middle value) of data.
:func:`median_low`      Low median of data.
:func:`median_high`     High median of data.
:func:`median_grouped`  Median (50th percentile) of grouped data.
:func:`mode`            Single mode (most common value) of discrete or nominal data.
:func:`multimode`       List of modes (most common values) of discrete or nominal data.
:func:`quantiles`       Divide data into intervals with equal probability.
======================= ===============================================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

======================= =============================================
:func:`pstdev`          Population standard deviation of data.
:func:`pvariance`       Population variance of data.
:func:`stdev`           Sample standard deviation of data.
:func:`variance`        Sample variance of data.
======================= =============================================

Statistics for relations between two inputs
-------------------------------------------

These functions calculate statistics regarding relations between two inputs.

========================= =====================================================
:func:`covariance`        Sample covariance for two variables.
:func:`correlation`       Pearson and Spearman's correlation coefficients.
:func:`linear_regression` Slope and intercept for simple linear regression.
========================= =====================================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data* which can be a sequence or iterable.

   The arithmetic mean is the sum of the data divided by the number of data
   points. It is commonly called "the average", although it is only one of many
   different mathematical averages. It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by `outliers
      <https://en.wikipedia.org/wiki/Outlier>`_ and is not necessarily a
      typical example of the data points. For a more robust, although less
      efficient, measure of `central tendency
      <https://en.wikipedia.org/wiki/Central_tendency>`_, see :func:`median`.

      The sample mean gives an unbiased estimate of the true population mean,
      so that when taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population. If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.


.. function:: fmean(data, weights=None)

   Convert *data* to floats and compute the arithmetic mean.

   This runs faster than the :func:`mean` function and it always returns a
   :class:`float`. The *data* may be a sequence or iterable. If the input
   dataset is empty, raises a :exc:`StatisticsError`.

   .. doctest::

      >>> fmean([3.5, 4.0, 5.25])
      4.25

   Optional weighting is supported. For example, a professor assigns a
   grade for a course by weighting quizzes at 20%, homework at 20%, a
   midterm exam at 30%, and a final exam at 30%:

   .. doctest::

      >>> grades = [85, 92, 83, 91]
      >>> weights = [0.20, 0.20, 0.30, 0.30]
      >>> fmean(grades, weights)
      87.6

   If *weights* is supplied, it must be the same length as the *data* or
   a :exc:`ValueError` will be raised.

   .. versionadded:: 3.8

   .. versionchanged:: 3.11
      Added support for *weights*.
.. function:: geometric_mean(data)

   Convert *data* to floats and compute the geometric mean.

   The geometric mean indicates the central tendency or typical value of the
   *data* using the product of the values (as opposed to the arithmetic mean
   which uses their sum).

   Raises a :exc:`StatisticsError` if the input dataset is empty,
   if it contains a zero, or if it contains a negative value.
   The *data* may be a sequence or iterable.

   No special efforts are made to achieve exact results.
   (However, this may change in the future.)

   .. doctest::

      >>> round(geometric_mean([54, 24, 36]), 1)
      36.0

   .. versionadded:: 3.8


.. function:: harmonic_mean(data, weights=None)

   Return the harmonic mean of *data*, a sequence or iterable of
   real-valued numbers. If *weights* is omitted or ``None``, then
   equal weighting is assumed.

   The harmonic mean is the reciprocal of the arithmetic :func:`mean` of the
   reciprocals of the data. For example, the harmonic mean of three values *a*,
   *b* and *c* will be equivalent to ``3/(1/a + 1/b + 1/c)``. If one of the
   values is zero, the result will be zero.

   The harmonic mean is a type of average, a measure of the central
   location of the data. It is often appropriate when averaging
   ratios or rates, for example speeds.

   Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
   What is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60])
      48.0

   Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
   speeds up to 60 km/hr for the remaining 30 km of the journey. What
   is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60], weights=[5, 30])
      56.0

   :exc:`StatisticsError` is raised if *data* is empty, any element
   is less than zero, or if the weighted sum isn't positive.
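The zero rule stated above can be checked directly. A small sketch, not part of the official doctests:

```python
from statistics import harmonic_mean

# A single zero forces the harmonic mean to zero, because the
# reciprocal of zero would otherwise blow up the sum of reciprocals.
result = harmonic_mean([40, 60, 0])
assert result == 0
print(result)
```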
   The current algorithm has an early-out when it encounters a zero
   in the input. This means that the subsequent inputs are not tested
   for validity. (This behavior may change in the future.)

   .. versionadded:: 3.6

   .. versionchanged:: 3.10
      Added support for *weights*.


.. function:: kde(data, h, kernel='normal', *, cumulative=False)

   `Kernel Density Estimation (KDE)
   <https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf>`_:
   Create a continuous probability density function or cumulative
   distribution function from discrete samples.

   The basic idea is to smooth the data using `a kernel function
   <https://en.wikipedia.org/wiki/Kernel_(statistics)>`_
   to help draw inferences about a population from a sample.

   The degree of smoothing is controlled by the scaling parameter *h*
   which is called the bandwidth. Smaller values emphasize local
   features while larger values give smoother results.

   The *kernel* determines the relative weights of the sample data
   points. Generally, the choice of kernel shape does not matter
   as much as the more influential bandwidth smoothing parameter.

   Kernels that give some weight to every sample point include
   *normal* (*gauss*), *logistic*, and *sigmoid*.

   Kernels that only give weight to sample points within the bandwidth
   include *rectangular* (*uniform*), *triangular*, *parabolic*
   (*epanechnikov*), *quartic* (*biweight*), *triweight*, and *cosine*.

   If *cumulative* is true, returns a cumulative distribution function.

   A :exc:`StatisticsError` will be raised if the *data* sequence is empty.

   `Wikipedia has an example
   <https://en.wikipedia.org/wiki/Kernel_density_estimation#Example>`_
   where we can use :func:`kde` to generate and plot a probability
   density function estimated from a small sample:

   .. doctest::

      >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
      >>> f_hat = kde(sample, h=1.5)
      >>> xarr = [i/100 for i in range(-750, 1100)]
      >>> yarr = [f_hat(x) for x in xarr]

   The points in ``xarr`` and ``yarr`` can be used to make a PDF plot:

   .. image:: kde_example.png
      :alt: Scatter plot of the estimated probability density function.

   .. versionadded:: 3.13


.. function:: kde_random(data, h, kernel='normal', *, seed=None)

   Return a function that makes a random selection from the estimated
   probability density function produced by ``kde(data, h, kernel)``.

   Providing a *seed* allows reproducible selections. In the future, the
   values may change slightly as more accurate kernel inverse CDF estimates
   are implemented. The seed may be an integer, float, str, or bytes.

   A :exc:`StatisticsError` will be raised if the *data* sequence is empty.

   Continuing the example for :func:`kde`, we can use
   :func:`kde_random` to generate new random selections from an
   estimated probability density function:

      >>> data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
      >>> rand = kde_random(data, h=1.5, seed=8675309)
      >>> new_selections = [rand() for i in range(10)]
      >>> [round(x, 1) for x in new_selections]
      [0.7, 6.2, 1.2, 6.9, 7.0, 1.8, 2.5, -0.5, -1.8, 5.6]

   .. versionadded:: 3.13


.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterable.

   The median is a robust measure of central location and is less affected by
   the presence of outliers. When the number of data points is odd, the
   middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suited for when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   If the data is ordinal (supports order operations) but not numeric (doesn't
   support addition), consider using :func:`median_low` or :func:`median_high`
   instead.

.. function:: median_low(data)

   Return the low median of numeric data. If *data* is empty,
   :exc:`StatisticsError` is raised. *data* can be a sequence or iterable.

   The low median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data. If *data* is empty, :exc:`StatisticsError`
   is raised. *data* can be a sequence or iterable.

   The high median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_grouped(data, interval=1.0)

   Estimates the median for numeric data that has been `grouped or binned
   <https://en.wikipedia.org/wiki/Data_binning>`_ around the midpoints
   of consecutive, fixed-width intervals.

   The *data* can be any iterable of numeric data with each value being
   exactly the midpoint of a bin. At least one value must be present.

   The *interval* is the width of each bin.

   For example, demographic information may have been summarized into
   consecutive ten-year age groups with each group being represented
   by the 5-year midpoints of the intervals:

   .. doctest::

      >>> from collections import Counter
      >>> demographics = Counter({
      ...     25: 172,   # 20 to 30 years old
      ...     35: 484,   # 30 to 40 years old
      ...     45: 387,   # 40 to 50 years old
      ...     55:  22,   # 50 to 60 years old
      ...     65:   6,   # 60 to 70 years old
      ... })
      ...

   The 50th percentile (median) is the 536th person out of the 1071
   member cohort. That person is in the 30 to 40 year old age group.

   The regular :func:`median` function would assume that everyone in the
   tricenarian age group was exactly 35 years old. A more tenable
   assumption is that the 484 members of that age group are evenly
   distributed between 30 and 40. For that, we use
   :func:`median_grouped`:

   .. doctest::

      >>> data = list(demographics.elements())
      >>> median(data)
      35
      >>> round(median_grouped(data, interval=10), 1)
      37.5

   The caller is responsible for making sure the data points are separated
   by exact multiples of *interval*. This is essential for getting a
   correct result. The function does not check this precondition.

   Inputs may be any numeric type that can be coerced to a float during
   the interpolation step.


.. function:: mode(data)

   Return the single most common data point from discrete or nominal *data*.
   The mode (when it exists) is the most typical value and serves as a
   measure of central location.

   If there are multiple modes with the same frequency, returns the first one
   encountered in the *data*. If the smallest or largest of those is
   desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
   If the input *data* is empty, :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this package that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

   Only hashable inputs are supported. To handle type :class:`set`,
   consider casting to :class:`frozenset`. To handle type :class:`list`,
   consider casting to :class:`tuple`. For mixed or nested inputs, consider
   using this slower quadratic algorithm that only depends on equality tests:
   ``max(data, key=data.count)``.

   .. versionchanged:: 3.8
      Now handles multimodal datasets by returning the first mode encountered.
      Formerly, it raised :exc:`StatisticsError` when more than one mode was
      found.


.. function:: multimode(data)

   Return a list of the most frequently occurring values in the order they
   were first encountered in the *data*. Will return more than one result if
   there are multiple modes or an empty list if the *data* is empty:

   .. doctest::

      >>> multimode('aabbbbccddddeeffffgg')
      ['b', 'd', 'f']
      >>> multimode('')
      []

   .. versionadded:: 3.8


.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance).
   See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty sequence or iterable
   of real-valued numbers. Variance, or second moment about the mean, is a
   measure of the variability (spread or dispersion) of data. A large
   variance indicates that the data is spread out; a small variance indicates
   it is clustered closely around the mean.

   If the optional second argument *mu* is given, it should be the *population*
   mean of the *data*. It can also be used to compute the second moment around
   a point that is not the mean. If it is missing or ``None`` (the default),
   the arithmetic mean is automatically calculated.

   Use this function to calculate the variance from the entire population. To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ².
      When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this
      function to calculate the variance of a sample, giving the known
      population mean as the second argument. Provided the data points are a
      random sample of the population, the result will be an unbiased estimate
      of the population variance.


.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance). See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the *sample*
   mean of *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the sample mean of your data, you can pass it
   as the optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom. Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.

.. function:: quantiles(data, *, n=4, method='exclusive')

   Divide *data* into *n* continuous intervals with equal probability.
   Returns a list of ``n - 1`` cut points separating the intervals.

   Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
   *n* to 100 for percentiles, which gives the 99 cut points that separate
   *data* into 100 equal-sized groups. Raises :exc:`StatisticsError` if *n*
   is not at least 1.

   The *data* can be any iterable containing sample data. For meaningful
   results, the number of data points in *data* should be larger than *n*.
   Raises :exc:`StatisticsError` if there is not at least one data point.

   The cut points are linearly interpolated from the
   two nearest data points.
   For example, if a cut point falls one-third
   of the distance between two sample values, ``100`` and ``112``, the
   cut-point will evaluate to ``104``.

   The *method* for computing quantiles can be varied depending on
   whether the *data* includes or excludes the lowest and
   highest possible values from the population.

   The default *method* is "exclusive" and is used for data sampled from
   a population that can have more extreme values than found in the
   samples. The portion of the population falling below the *i-th* of
   *m* sorted data points is computed as ``i / (m + 1)``. Given nine
   sample values, the method sorts them and assigns the following
   percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.

   Setting the *method* to "inclusive" is used for describing population
   data or for samples that are known to include the most extreme values
   from the population. The minimum value in *data* is treated as the 0th
   percentile and the maximum value is treated as the 100th percentile.
   The portion of the population falling below the *i-th* of *m* sorted
   data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
   values, the method sorts them and assigns the following percentiles:
   0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

   .. doctest::

      # Decile cut points for empirically sampled data
      >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
      ...         100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
      ...         106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
      ...         111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
      ...         103, 107, 101, 81, 109, 104]
      >>> [round(q, 1) for q in quantiles(data, n=10)]
      [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

   .. versionadded:: 3.8

   .. versionchanged:: 3.13
      No longer raises an exception for an input with only a single data point.
      This allows quantile estimates to be built up one sample point
      at a time, becoming gradually more refined with each new data point.

.. function:: covariance(x, y, /)

   Return the sample covariance of two inputs *x* and *y*. Covariance
   is a measure of the joint variability of two inputs.

   Both inputs must be of the same length (no less than two), otherwise
   :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
      >>> covariance(x, y)
      0.75
      >>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> covariance(x, z)
      -7.5
      >>> covariance(z, x)
      -7.5

   .. versionadded:: 3.10

.. function:: correlation(x, y, /, *, method='linear')

   Return the `Pearson's correlation coefficient
   <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
   for two inputs. Pearson's correlation coefficient *r* takes values
   between -1 and +1. It measures the strength and direction of a linear
   relationship.

   If *method* is "ranked", computes `Spearman's rank correlation coefficient
   <https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient>`_
   for two inputs. The data is replaced by ranks. Ties are averaged so that
   equal values receive the same rank. The resulting coefficient measures the
   strength of a monotonic relationship.

   Spearman's correlation coefficient is appropriate for ordinal data or for
   continuous data that doesn't meet the linear proportion requirement for
   Pearson's correlation coefficient.

   Both inputs must be of the same length (no less than two) and must not
   be constant; otherwise :exc:`StatisticsError` is raised.

   Example with `Kepler's laws of planetary motion
   <https://en.wikipedia.org/wiki/Kepler's_laws_of_planetary_motion>`_:

   .. doctest::

      >>> # Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune
      >>> orbital_period = [88, 225, 365, 687, 4331, 10_756, 30_687, 60_190]    # days
      >>> dist_from_sun = [58, 108, 150, 228, 778, 1_400, 2_900, 4_500]         # million km

      >>> # Show that a perfect monotonic relationship exists
      >>> correlation(orbital_period, dist_from_sun, method='ranked')
      1.0

      >>> # Observe that a linear relationship is imperfect
      >>> round(correlation(orbital_period, dist_from_sun), 4)
      0.9882

      >>> # Demonstrate Kepler's third law: There is a linear correlation
      >>> # between the square of the orbital period and the cube of the
      >>> # distance from the sun.
      >>> period_squared = [p * p for p in orbital_period]
      >>> dist_cubed = [d * d * d for d in dist_from_sun]
      >>> round(correlation(period_squared, dist_cubed), 4)
      1.0

   .. versionadded:: 3.10

   .. versionchanged:: 3.12
      Added support for Spearman's rank correlation coefficient.

.. function:: linear_regression(x, y, /, *, proportional=False)

   Return the slope and intercept of `simple linear regression
   <https://en.wikipedia.org/wiki/Simple_linear_regression>`_
   parameters estimated using ordinary least squares. Simple linear
   regression describes the relationship between an independent variable *x* and
   a dependent variable *y* in terms of this linear function:

      *y = slope \* x + intercept + noise*

   where ``slope`` and ``intercept`` are the regression parameters that are
   estimated, and ``noise`` represents the
   variability of the data that was not explained by the linear regression
   (it is equal to the difference between predicted and actual values
   of the dependent variable).

   Both inputs must be of the same length (no less than two), and
   the independent variable *x* cannot be constant;
   otherwise a :exc:`StatisticsError` is raised.
   For example, we can use the `release dates of the Monty
   Python films <https://en.wikipedia.org/wiki/Monty_Python#Films>`_
   to predict the cumulative number of Monty Python films
   that would have been produced by 2019
   assuming that they had kept the pace.

   .. doctest::

      >>> year = [1971, 1975, 1979, 1982, 1983]
      >>> films_total = [1, 2, 3, 4, 5]
      >>> slope, intercept = linear_regression(year, films_total)
      >>> round(slope * 2019 + intercept)
      16

   If *proportional* is true, the independent variable *x* and the
   dependent variable *y* are assumed to be directly proportional.
   The data is fit to a line passing through the origin.
   Since the *intercept* will always be 0.0, the underlying linear
   function simplifies to:

      *y = slope \* x + noise*

   Continuing the example from :func:`correlation`, we look to see
   how well a model based on major planets can predict the orbital
   distances for dwarf planets:

   .. doctest::

      >>> model = linear_regression(period_squared, dist_cubed, proportional=True)
      >>> slope = model.slope

      >>> # Dwarf planets: Pluto, Eris, Makemake, Haumea, Ceres
      >>> orbital_periods = [90_560, 204_199, 111_845, 103_410, 1_680]    # days
      >>> predicted_dist = [math.cbrt(slope * (p * p)) for p in orbital_periods]
      >>> list(map(round, predicted_dist))
      [5912, 10166, 6806, 6459, 414]

      >>> [5_906, 10_152, 6_796, 6_450, 414]    # actual distance in million km
      [5906, 10152, 6796, 6450, 414]

   .. versionadded:: 3.10

   .. versionchanged:: 3.11
      Added support for *proportional*.

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.
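Because :exc:`StatisticsError` subclasses :exc:`ValueError`, existing ``except ValueError`` handlers catch it as well. A small sketch:

```python
from statistics import mean, StatisticsError

assert issubclass(StatisticsError, ValueError)

try:
    mean([])                      # empty data is invalid
except ValueError as exc:         # also catches StatisticsError
    print(type(exc).__name__)    # prints StatisticsError
```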
:class:`NormalDist` objects
---------------------------

:class:`NormalDist` is a tool for creating and manipulating normal
distributions of a `random variable
<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_.  It is a
class that treats the mean and standard deviation of data
measurements as a single entity.

Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics.

.. class:: NormalDist(mu=0.0, sigma=1.0)

   Returns a new *NormalDist* object where *mu* represents the `arithmetic
   mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
   represents the `standard deviation
   <https://en.wikipedia.org/wiki/Standard_deviation>`_.

   If *sigma* is negative, raises :exc:`StatisticsError`.

   .. attribute:: mean

      A read-only property for the `arithmetic mean
      <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
      distribution.

   .. attribute:: median

      A read-only property for the `median
      <https://en.wikipedia.org/wiki/Median>`_ of a normal
      distribution.

   .. attribute:: mode

      A read-only property for the `mode
      <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
      distribution.

   .. attribute:: stdev

      A read-only property for the `standard deviation
      <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
      distribution.

   .. attribute:: variance

      A read-only property for the `variance
      <https://en.wikipedia.org/wiki/Variance>`_ of a normal
      distribution.  Equal to the square of the standard deviation.

   .. classmethod:: NormalDist.from_samples(data)

      Makes a normal distribution instance with *mu* and *sigma* parameters
      estimated from the *data* using :func:`fmean` and :func:`stdev`.
      The *data* can be any :term:`iterable` and should consist of values
      that can be converted to type :class:`float`.  If *data* does not
      contain at least two elements, raises :exc:`StatisticsError` because it
      takes at least one point to estimate a central value and at least two
      points to estimate dispersion.

   .. method:: NormalDist.samples(n, *, seed=None)

      Generates *n* random samples for a given mean and standard deviation.
      Returns a :class:`list` of :class:`float` values.

      If *seed* is given, creates a new instance of the underlying random
      number generator.  This is useful for creating reproducible results,
      even in a multi-threading context.

      .. versionchanged:: 3.13

         Switched to a faster algorithm.  To reproduce samples from previous
         versions, use :func:`random.seed` and :func:`random.gauss`.

   .. method:: NormalDist.pdf(x)

      Using a `probability density function (pdf)
      <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
      the relative likelihood that a random variable *X* will be near the
      given value *x*.  Mathematically, it is the limit of the ratio ``P(x <=
      X < x+dx) / dx`` as *dx* approaches zero.

      The relative likelihood is computed as the probability of a sample
      occurring in a narrow range divided by the width of the range (hence
      the word "density").  Since the likelihood is relative to other points,
      its value can be greater than ``1.0``.

   .. method:: NormalDist.cdf(x)

      Using a `cumulative distribution function (cdf)
      <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
      compute the probability that a random variable *X* will be less than or
      equal to *x*.  Mathematically, it is written ``P(X <= x)``.
   .. method:: NormalDist.inv_cdf(p)

      Compute the inverse cumulative distribution function, also known as the
      `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
      or the `percent-point
      <https://web.archive.org/web/20190203145224/https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
      function.  Mathematically, it is written ``x : P(X <= x) = p``.

      Finds the value *x* of the random variable *X* such that the
      probability of the variable being less than or equal to that value
      equals the given probability *p*.

   .. method:: NormalDist.overlap(other)

      Measures the agreement between two normal probability distributions.
      Returns a value between 0.0 and 1.0 giving `the overlapping area for
      the two probability density functions
      <https://www.rasch.org/rmt/rmt101r.htm>`_.

   .. method:: NormalDist.quantiles(n=4)

      Divide the normal distribution into *n* continuous intervals with
      equal probability.  Returns a list of (n - 1) cut points separating
      the intervals.

      Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.
      Set *n* to 100 for percentiles, which gives the 99 cut points that
      separate the normal distribution into 100 equal-sized groups.

   .. method:: NormalDist.zscore(x)

      Compute the
      `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
      describing *x* in terms of the number of standard deviations
      above or below the mean of the normal distribution:
      ``(x - mean) / stdev``.

      .. versionadded:: 3.9

   Instances of :class:`NormalDist` support addition, subtraction,
   multiplication and division by a constant.  These operations
   are used for translation and scaling.  For example:
   .. doctest::

      >>> temperature_february = NormalDist(5, 2.5)             # Celsius
      >>> temperature_february * (9/5) + 32                     # Fahrenheit
      NormalDist(mu=41.0, sigma=4.5)

   Dividing a constant by an instance of :class:`NormalDist` is not supported
   because the result wouldn't be normally distributed.

   Since normal distributions arise from additive effects of independent
   variables, it is possible to `add and subtract two independent normally
   distributed random variables
   <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
   represented as instances of :class:`NormalDist`.  For example:

   .. doctest::

      >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
      >>> drug_effects = NormalDist(0.4, 0.15)
      >>> combined = birth_weights + drug_effects
      >>> round(combined.mean, 1)
      3.1
      >>> round(combined.stdev, 1)
      0.5

   .. versionadded:: 3.8


Examples and Recipes
--------------------


Classic probability problems
****************************

:class:`NormalDist` readily solves classic probability problems.

For example, given `historical data for SAT exams
<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
that scores are normally distributed with a mean of 1060 and a standard
deviation of 195, determine the percentage of students with test scores
between 1100 and 1200, after rounding to the nearest whole number:

.. doctest::

   >>> sat = NormalDist(1060, 195)
   >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
   >>> round(fraction * 100.0, 1)
   18.4

Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
.. doctest::

   >>> list(map(round, sat.quantiles()))
   [928, 1060, 1192]
   >>> list(map(round, sat.quantiles(n=10)))
   [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]


Monte Carlo inputs for simulations
**********************************

To estimate the distribution for a model that isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:

.. doctest::

   >>> def model(x, y, z):
   ...     return (3*x + 7*x*y - 5*y) / (11 * z)
   ...
   >>> n = 100_000
   >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
   >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
   >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
   >>> quantiles(map(model, X, Y, Z))       # doctest: +SKIP
   [1.4591308524824727, 1.8035946855390597, 2.175091447274739]


Approximating binomial distributions
************************************

Normal distributions can be used to approximate `Binomial
distributions <https://mathworld.wolfram.com/BinomialDistribution.html>`_
when the sample size is large and when the probability of a successful
trial is near 50%.

For example, an open source conference has 750 attendees and two rooms with a
500 person capacity.  There is a talk about Python and another about Ruby.
In previous conferences, 65% of the attendees preferred to listen to Python
talks.  Assuming the population preferences haven't changed, what is the
probability that the Python room will stay within its capacity limits?
.. doctest::

   >>> n = 750             # Sample size
   >>> p = 0.65            # Preference for Python
   >>> q = 1.0 - p         # Preference for Ruby
   >>> k = 500             # Room capacity

   >>> # Approximation using the cumulative normal distribution
   >>> from math import sqrt
   >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
   0.8402

   >>> # Exact solution using the cumulative binomial distribution
   >>> from math import comb, fsum
   >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
   0.8402

   >>> # Approximation using a simulation
   >>> from random import seed, binomialvariate
   >>> seed(8675309)
   >>> mean(binomialvariate(n, p) <= k for i in range(10_000))
   0.8406


Naive Bayesian classifier
*************************

Normal distributions commonly arise in machine learning problems.

Wikipedia has a `nice example of a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Person_classification>`_.
The challenge is to predict a person's gender from measurements of normally
distributed features including height, weight, and foot size.

We're given a training dataset with measurements for eight people.  The
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:

.. doctest::

   >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
   >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
   >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
   >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
   >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
   >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

Next, we encounter a new person whose feature measurements are known but whose
gender is unknown:
.. doctest::

   >>> ht = 6.0        # height
   >>> wt = 130        # weight
   >>> fs = 8          # foot size

Starting with a 50% `prior probability
<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
we compute the posterior as the prior times the product of likelihoods for the
feature measurements given the gender:

.. doctest::

   >>> prior_male = 0.5
   >>> prior_female = 0.5
   >>> posterior_male = (prior_male * height_male.pdf(ht) *
   ...                   weight_male.pdf(wt) * foot_size_male.pdf(fs))

   >>> posterior_female = (prior_female * height_female.pdf(ht) *
   ...                     weight_female.pdf(wt) * foot_size_female.pdf(fs))

The final prediction goes to the largest posterior.  This is known as the
`maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:

.. doctest::

   >>> 'male' if posterior_male > posterior_female else 'female'
   'female'

..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;