:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: Mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`~numbers.Real`-valued) data.

The module is not intended to be a competitor to third-party libraries such
as `NumPy <https://numpy.org>`_, `SciPy <https://scipy.org/>`_, or
proprietary full-featured statistics packages aimed at professional
statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
graphing and scientific calculators.

Unless explicitly noted, these functions support :class:`int`,
:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
currently unsupported.
Behaviour with collections of mixed types is also undefined
and implementation-dependent. If your input data consists of mixed types,
you may be able to use :func:`map` to ensure a consistent result, for
example: ``map(float, input_data)``.

Some datasets use ``NaN`` (not a number) values to represent missing data.
Since NaNs have unusual comparison semantics, they cause surprising or
undefined behaviors in the statistics functions that sort data or that count
occurrences. The functions affected are ``median()``, ``median_low()``,
``median_high()``, ``median_grouped()``, ``mode()``, ``multimode()``, and
``quantiles()``. The ``NaN`` values should be stripped before calling these
functions::

    >>> from statistics import median
    >>> from math import isnan
    >>> from itertools import filterfalse

    >>> data = [20.7, float('NaN'), 19.2, 18.3, float('NaN'), 14.4]
    >>> sorted(data)  # This has surprising behavior
    [20.7, nan, 14.4, 18.3, 19.2, nan]
    >>> median(data)  # This result is unexpected
    16.35

    >>> sum(map(isnan, data))    # Number of missing values
    2
    >>> clean = list(filterfalse(isnan, data))  # Strip NaN values
    >>> clean
    [20.7, 19.2, 18.3, 14.4]
    >>> sorted(clean)  # Sorting now works as expected
    [14.4, 18.3, 19.2, 20.7]
    >>> median(clean)  # This result is now well defined
    18.75


Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  ===============================================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`fmean`            Fast, floating point arithmetic mean, with optional weighting.
:func:`geometric_mean`   Geometric mean of data.
:func:`harmonic_mean`    Harmonic mean of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Single mode (most common value) of discrete or nominal data.
:func:`multimode`        List of modes (most common values) of discrete or nominal data.
:func:`quantiles`        Divide data into intervals with equal probability.
=======================  ===============================================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================

Statistics for relations between two inputs
-------------------------------------------

These functions calculate statistics regarding relations between two inputs.

=========================  =====================================================
:func:`covariance`         Sample covariance for two variables.
:func:`correlation`        Pearson's correlation coefficient for two variables.
:func:`linear_regression`  Slope and intercept for simple linear regression.
=========================  =====================================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data*, which can be a sequence or iterable.

   The arithmetic mean is the sum of the data divided by the number of data
   points. It is commonly called "the average", although it is only one of many
   different mathematical averages. It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by `outliers
      <https://en.wikipedia.org/wiki/Outlier>`_ and is not necessarily a
      typical example of the data points. For a more robust, although less
      efficient, measure of `central tendency
      <https://en.wikipedia.org/wiki/Central_tendency>`_, see :func:`median`.

      The sample mean gives an unbiased estimate of the true population mean,
      so that when taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population. If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.


.. function:: fmean(data, weights=None)

   Convert *data* to floats and compute the arithmetic mean.

   This runs faster than the :func:`mean` function and it always returns a
   :class:`float`. The *data* may be a sequence or iterable. If the input
   dataset is empty, raises a :exc:`StatisticsError`.

   .. doctest::

      >>> fmean([3.5, 4.0, 5.25])
      4.25

   Optional weighting is supported. For example, a professor assigns a
   grade for a course by weighting quizzes at 20%, homework at 20%, a
   midterm exam at 30%, and a final exam at 30%:

   .. doctest::

      >>> grades = [85, 92, 83, 91]
      >>> weights = [0.20, 0.20, 0.30, 0.30]
      >>> fmean(grades, weights)
      87.6

   If *weights* is supplied, it must be the same length as the *data* or
   a :exc:`ValueError` will be raised.

   .. versionadded:: 3.8

   .. versionchanged:: 3.11
      Added support for *weights*.


.. function:: geometric_mean(data)

   Convert *data* to floats and compute the geometric mean.

   The geometric mean indicates the central tendency or typical value of the
   *data* using the product of the values (as opposed to the arithmetic mean
   which uses their sum).

   Raises a :exc:`StatisticsError` if the input dataset is empty,
   if it contains a zero, or if it contains a negative value.
   The *data* may be a sequence or iterable.

   No special efforts are made to achieve exact results.
   (However, this may change in the future.)

   .. doctest::

      >>> round(geometric_mean([54, 24, 36]), 1)
      36.0

   .. versionadded:: 3.8


.. function:: harmonic_mean(data, weights=None)

   Return the harmonic mean of *data*, a sequence or iterable of
   real-valued numbers. If *weights* is omitted or ``None``, then
   equal weighting is assumed.

   The harmonic mean is the reciprocal of the arithmetic :func:`mean` of the
   reciprocals of the data. For example, the harmonic mean of three values *a*,
   *b* and *c* will be equivalent to ``3/(1/a + 1/b + 1/c)``. If one of the
   values is zero, the result will be zero.

   The harmonic mean is a type of average, a measure of the central
   location of the data. It is often appropriate when averaging
   ratios or rates, for example speeds.

   Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
   What is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60])
      48.0

   Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
   speeds up to 60 km/hr for the remaining 30 km of the journey. What
   is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60], weights=[5, 30])
      56.0

   :exc:`StatisticsError` is raised if *data* is empty, any element
   is less than zero, or if the weighted sum isn't positive.

   The current algorithm has an early-out when it encounters a zero
   in the input. This means that the subsequent inputs are not tested
   for validity. (This behavior may change in the future.)

   .. versionadded:: 3.6

   .. versionchanged:: 3.10
      Added support for *weights*.

.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterable.

   The median is a robust measure of central location and is less affected by
   the presence of outliers. When the number of data points is odd, the
   middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suited for when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   If the data is ordinal (supports order operations) but not numeric (doesn't
   support addition), consider using :func:`median_low` or :func:`median_high`
   instead.

.. function:: median_low(data)

   Return the low median of numeric data. If *data* is empty,
   :exc:`StatisticsError` is raised. *data* can be a sequence or iterable.

   The low median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data. If *data* is empty, :exc:`StatisticsError`
   is raised. *data* can be a sequence or iterable.

   The high median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
   is raised. *data* can be a sequence or iterable.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
   is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
   given, the middle value falls somewhere in the class 3.5--4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1. Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats. This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.


.. function:: mode(data)

   Return the single most common data point from discrete or nominal *data*.
   The mode (when it exists) is the most typical value and serves as a
   measure of central location.

   If there are multiple modes with the same frequency, returns the first one
   encountered in the *data*. If the smallest or largest of those is
   desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
   If the input *data* is empty, :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this package that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

   .. versionchanged:: 3.8
      Now handles multimodal datasets by returning the first mode encountered.
      Formerly, it raised :exc:`StatisticsError` when more than one mode was
      found.


.. function:: multimode(data)

   Return a list of the most frequently occurring values in the order they
   were first encountered in the *data*. Will return more than one result if
   there are multiple modes or an empty list if the *data* is empty:

   .. doctest::

      >>> multimode('aabbbbccddddeeffffgg')
      ['b', 'd', 'f']
      >>> multimode('')
      []

   .. versionadded:: 3.8


.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance). See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty sequence or iterable
   of real-valued numbers. Variance, or second moment about the mean, is a
   measure of the variability (spread or dispersion) of data. A large
   variance indicates that the data is spread out; a small variance indicates
   it is clustered closely around the mean.
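
   As a rough illustration of this definition, the sketch below computes the
   mean of the squared deviations by hand and compares it with ``pvariance``.
   The data values and the name ``manual_pvariance`` are assumptions made for
   this example only; they are not part of the module.

   ```python
   from statistics import pvariance

   # Illustrative data chosen for this sketch (an assumption, not module data).
   data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

   # Second moment about the mean: the average of the squared deviations.
   mu = sum(data) / len(data)
   manual_pvariance = sum((x - mu) ** 2 for x in data) / len(data)

   print(manual_pvariance)   # hand-computed second moment about the mean
   print(pvariance(data))    # value reported by the statistics module
   ```

   For well-behaved float data such as this, the two printed values should
   agree.
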

   If the optional second argument *mu* is given, it is typically the mean of
   the *data*. It can also be used to compute the second moment around a
   point that is not the mean. If it is missing or ``None`` (the default),
   the arithmetic mean is automatically calculated.

   Use this function to calculate the variance from the entire population. To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ². When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this
      function to calculate the variance of a sample, giving the known
      population mean as the second argument. Provided the data points are a
      random sample of the population, the result will be an unbiased estimate
      of the population variance.


.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance). See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom. Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.

.. function:: quantiles(data, *, n=4, method='exclusive')

   Divide *data* into *n* continuous intervals with equal probability.
   Returns a list of ``n - 1`` cut points separating the intervals.

   Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
   *n* to 100 for percentiles, which gives the 99 cut points that separate
   *data* into 100 equal-sized groups. Raises :exc:`StatisticsError` if *n*
   is not at least 1.

   The *data* can be any iterable containing sample data. For meaningful
   results, the number of data points in *data* should be larger than *n*.
   Raises :exc:`StatisticsError` if there are not at least two data points.

   The cut points are linearly interpolated from the
   two nearest data points. For example, if a cut point falls one-third
   of the distance between two sample values, ``100`` and ``112``, the
   cut-point will evaluate to ``104``.

   The *method* for computing quantiles can be varied depending on
   whether the *data* includes or excludes the lowest and
   highest possible values from the population.

   The default *method* is "exclusive" and is used for data sampled from
   a population that can have more extreme values than found in the
   samples. The portion of the population falling below the *i-th* of
   *m* sorted data points is computed as ``i / (m + 1)``.
   Given nine sample values, the method sorts them and assigns the following
   percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.

   Set the *method* to "inclusive" when describing population data or
   samples that are known to include the most extreme values
   from the population.  The minimum value in *data* is treated as the 0th
   percentile and the maximum value is treated as the 100th percentile.
   The portion of the population falling below the *i-th* of *m* sorted
   data points is computed as ``(i - 1) / (m - 1)``.  Given 11 sample
   values, the method sorts them and assigns the following percentiles:
   0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

   .. doctest::

      # Decile cut points for empirically sampled data
      >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
      ...         100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
      ...         106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
      ...         111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
      ...         103, 107, 101, 81, 109, 104]
      >>> [round(q, 1) for q in quantiles(data, n=10)]
      [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

   .. versionadded:: 3.8

.. function:: covariance(x, y, /)

   Return the sample covariance of two inputs *x* and *y*.
   Covariance is a measure of the joint variability of two inputs.

   Both inputs must be of the same length (no less than two), otherwise
   :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
      >>> covariance(x, y)
      0.75
      >>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> covariance(x, z)
      -7.5
      >>> covariance(z, x)
      -7.5

   .. versionadded:: 3.10

.. function:: correlation(x, y, /)

   Return the `Pearson's correlation coefficient
   <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
   for two inputs.  Pearson's correlation coefficient *r* takes values
   between -1 and +1.  It measures the strength and direction of the linear
   relationship, where +1 means a very strong positive linear relationship,
   -1 a very strong negative linear relationship, and 0 no linear relationship.

   Both inputs must be of the same length (no less than two), and neither
   may be constant; otherwise :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> correlation(x, x)
      1.0
      >>> correlation(x, y)
      -1.0

   .. versionadded:: 3.10

.. function:: linear_regression(x, y, /, *, proportional=False)

   Return the slope and intercept of `simple linear regression
   <https://en.wikipedia.org/wiki/Simple_linear_regression>`_
   parameters estimated using ordinary least squares.  Simple linear
   regression describes the relationship between an independent variable *x* and
   a dependent variable *y* in terms of this linear function:

      *y = slope \* x + intercept + noise*

   where ``slope`` and ``intercept`` are the regression parameters that are
   estimated, and ``noise`` represents the
   variability of the data that was not explained by the linear regression
   (it is equal to the difference between predicted and actual values
   of the dependent variable).

   Both inputs must be of the same length (no less than two), and
   the independent variable *x* cannot be constant;
   otherwise a :exc:`StatisticsError` is raised.
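
   As a rough cross-check of what an ordinary least squares fit produces, the
   sketch below recovers the same two parameters from :func:`covariance`,
   :func:`variance` and :func:`mean`.  The helper name ``ols_by_hand`` and the
   sample data are made up for illustration, and the code shows the textbook
   closed form rather than a description of how the module computes its result
   internally:

   .. code-block:: python

      from math import isclose
      from statistics import covariance, linear_regression, mean, variance

      def ols_by_hand(x, y):
          """Hypothetical helper: textbook closed form for a straight-line fit."""
          slope = covariance(x, y) / variance(x)   # joint spread over spread of x
          intercept = mean(y) - slope * mean(x)    # line passes through the means
          return slope, intercept

      # Illustrative data only
      x = [1, 2, 3, 4, 5]
      y = [2.1, 3.9, 6.2, 7.8, 10.1]

      slope, intercept = linear_regression(x, y)
      hand_slope, hand_intercept = ols_by_hand(x, y)

      # The two approaches should agree up to floating-point rounding
      assert isclose(slope, hand_slope, rel_tol=1e-9)
      assert isclose(intercept, hand_intercept, rel_tol=1e-9, abs_tol=1e-9)

   Because both :func:`covariance` and :func:`variance` divide by the same
   ``n - 1``, that factor cancels in the ratio.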

   For example, we can use the `release dates of the Monty
   Python films <https://en.wikipedia.org/wiki/Monty_Python#Films>`_
   to predict the cumulative number of Monty Python films
   that would have been produced by 2019
   assuming that they had kept the pace.

   .. doctest::

      >>> year = [1971, 1975, 1979, 1982, 1983]
      >>> films_total = [1, 2, 3, 4, 5]
      >>> slope, intercept = linear_regression(year, films_total)
      >>> round(slope * 2019 + intercept)
      16

   If *proportional* is true, the independent variable *x* and the
   dependent variable *y* are assumed to be directly proportional.
   The data is fit to a line passing through the origin.
   Since the *intercept* will always be 0.0, the underlying linear
   function simplifies to:

      *y = slope \* x + noise*

   .. versionadded:: 3.10

   .. versionchanged:: 3.11
      Added support for *proportional*.

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.
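
   A minimal sketch of handling this exception, assuming a caller that would
   rather receive ``None`` than an error when there is too little data.  The
   helper name ``safe_variance`` is hypothetical and not part of the module:

   .. code-block:: python

      from statistics import StatisticsError, variance

      def safe_variance(data):
          """Hypothetical helper: return None when variance() cannot be computed."""
          try:
              return variance(data)
          except StatisticsError:
              # Raised because variance() needs at least two data points
              return None

      too_short = safe_variance([1.5])        # a single value is not enough
      usable = safe_variance([1, 2, 3, 4])    # four values are fine

      assert too_short is None
      assert usable is not None

      # Existing handlers written for ValueError also catch StatisticsError
      assert issubclass(StatisticsError, ValueError)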


:class:`NormalDist` objects
---------------------------

:class:`NormalDist` is a tool for creating and manipulating normal
distributions of a `random variable
<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_.  It is a
class that treats the mean and standard deviation of data
measurements as a single entity.

Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics.

.. class:: NormalDist(mu=0.0, sigma=1.0)

   Returns a new *NormalDist* object where *mu* represents the `arithmetic
   mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
   represents the `standard deviation
   <https://en.wikipedia.org/wiki/Standard_deviation>`_.

   If *sigma* is negative, raises :exc:`StatisticsError`.

   .. attribute:: mean

      A read-only property for the `arithmetic mean
      <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
      distribution.

   .. attribute:: median

      A read-only property for the `median
      <https://en.wikipedia.org/wiki/Median>`_ of a normal
      distribution.

   .. attribute:: mode

      A read-only property for the `mode
      <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
      distribution.

   .. attribute:: stdev

      A read-only property for the `standard deviation
      <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
      distribution.

   .. attribute:: variance

      A read-only property for the `variance
      <https://en.wikipedia.org/wiki/Variance>`_ of a normal
      distribution.  Equal to the square of the standard deviation.

   .. classmethod:: NormalDist.from_samples(data)

      Makes a normal distribution instance with *mu* and *sigma* parameters
      estimated from the *data* using :func:`fmean` and :func:`stdev`.

      The *data* can be any :term:`iterable` and should consist of values
      that can be converted to type :class:`float`.  If *data* does not
      contain at least two elements, raises :exc:`StatisticsError` because it
      takes at least one point to estimate a central value and at least two
      points to estimate dispersion.
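
      The relationship to :func:`fmean` and :func:`stdev` can be checked
      directly.  This is only a sketch with made-up sample values; the
      comparison uses a tolerance because the estimates may differ in the
      last few floating-point digits:

      .. code-block:: python

         from math import isclose
         from statistics import NormalDist, fmean, stdev

         # Illustrative measurements only
         samples = [2.5, 3.1, 2.1, 2.4, 2.7, 3.5]

         dist = NormalDist.from_samples(samples)

         # mu is estimated by the mean, sigma by the sample standard deviation
         assert isclose(dist.mean, fmean(samples), rel_tol=1e-9)
         assert isclose(dist.stdev, stdev(samples), rel_tol=1e-9)

         # The variance attribute is the square of the standard deviation
         assert isclose(dist.variance, dist.stdev ** 2, rel_tol=1e-9)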

   .. method:: NormalDist.samples(n, *, seed=None)

      Generates *n* random samples for a given mean and standard deviation.
      Returns a :class:`list` of :class:`float` values.

      If *seed* is given, creates a new instance of the underlying random
      number generator.  This is useful for creating reproducible results,
      even in a multi-threading context.

   .. method:: NormalDist.pdf(x)

      Using a `probability density function (pdf)
      <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
      the relative likelihood that a random variable *X* will be near the
      given value *x*.  Mathematically, it is the limit of the ratio ``P(x <=
      X < x+dx) / dx`` as *dx* approaches zero.

      The relative likelihood is computed as the probability of a sample
      occurring in a narrow range divided by the width of the range (hence
      the word "density").  Since the likelihood is relative to other points,
      its value can be greater than ``1.0``.

   .. method:: NormalDist.cdf(x)

      Using a `cumulative distribution function (cdf)
      <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
      compute the probability that a random variable *X* will be less than or
      equal to *x*.  Mathematically, it is written ``P(X <= x)``.
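
      The link between the density and the cumulative probability can be
      illustrated numerically.  The sketch below approximates the density as
      the probability of a narrow range divided by its width; the
      distribution parameters, the point ``x`` and the width ``dx`` are
      arbitrary choices made for illustration:

      .. code-block:: python

         from math import isclose
         from statistics import NormalDist

         X = NormalDist(mu=100, sigma=15)
         x, dx = 110.0, 1e-4

         # P(x <= X < x + dx) / dx, with the probability taken from the cdf
         density_estimate = (X.cdf(x + dx) - X.cdf(x)) / dx
         assert isclose(density_estimate, X.pdf(x), rel_tol=1e-3)

         # Half of the probability mass lies at or below the mean
         assert isclose(X.cdf(100), 0.5)

         # A density is not a probability: for a narrow distribution it exceeds 1.0
         assert NormalDist(mu=0, sigma=0.1).pdf(0) > 1.0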

   .. method:: NormalDist.inv_cdf(p)

      Compute the inverse cumulative distribution function, also known as the
      `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
      or the `percent-point
      <https://web.archive.org/web/20190203145224/https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
      function.  Mathematically, it is written ``x : P(X <= x) = p``.

      Finds the value *x* of the random variable *X* such that the
      probability of the variable being less than or equal to that value
      equals the given probability *p*.

   .. method:: NormalDist.overlap(other)

      Measures the agreement between two normal probability distributions.
      Returns a value between 0.0 and 1.0 giving `the overlapping area for
      the two probability density functions
      <https://www.rasch.org/rmt/rmt101r.htm>`_.

   .. method:: NormalDist.quantiles(n=4)

      Divide the normal distribution into *n* continuous intervals with
      equal probability.  Returns a list of ``n - 1`` cut points separating
      the intervals.

      Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.
      Set *n* to 100 for percentiles, which gives the 99 cut points that
      separate the normal distribution into 100 equal-sized groups.
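
      One way to read "equal probability" is that the cumulative probability
      at the *i*-th cut point should be ``i / n``.  The sketch below checks
      that reading for a standard normal distribution; the choice of ``n``
      and the tolerance are arbitrary, and this is a consistency check
      rather than a description of the implementation:

      .. code-block:: python

         from math import isclose
         from statistics import NormalDist

         Z = NormalDist()          # standard normal: mu=0.0, sigma=1.0
         n = 10                    # deciles
         cuts = Z.quantiles(n)

         # n intervals are separated by n - 1 cut points
         assert len(cuts) == n - 1

         # Each cut point leaves i/n of the probability mass at or below it
         for i, cut in enumerate(cuts, start=1):
             assert isclose(Z.cdf(cut), i / n, abs_tol=1e-9)

         # Cut points are returned in increasing order
         assert cuts == sorted(cuts)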

   .. method:: NormalDist.zscore(x)

      Compute the
      `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
      describing *x* in terms of the number of standard deviations
      above or below the mean of the normal distribution:
      ``(x - mean) / stdev``.

      .. versionadded:: 3.9

   Instances of :class:`NormalDist` support addition, subtraction,
   multiplication and division by a constant.  These operations
   are used for translation and scaling.  For example:

   .. doctest::

      >>> temperature_february = NormalDist(5, 2.5)             # Celsius
      >>> temperature_february * (9/5) + 32                     # Fahrenheit
      NormalDist(mu=41.0, sigma=4.5)

   Dividing a constant by an instance of :class:`NormalDist` is not supported
   because the result wouldn't be normally distributed.

   Since normal distributions arise from additive effects of independent
   variables, it is possible to `add and subtract two independent normally
   distributed random variables
   <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
   represented as instances of :class:`NormalDist`.  For example:

   .. doctest::

      >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
      >>> drug_effects = NormalDist(0.4, 0.15)
      >>> combined = birth_weights + drug_effects
      >>> round(combined.mean, 1)
      3.1
      >>> round(combined.stdev, 1)
      0.5

   .. versionadded:: 3.8


:class:`NormalDist` Examples and Recipes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`NormalDist` readily solves classic probability problems.

For example, given `historical data for SAT exams
<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
that scores are normally distributed with a mean of 1060 and a standard
deviation of 195, determine the percentage of students with test scores
between 1100 and 1200, after rounding to the nearest whole number:

.. doctest::

    >>> sat = NormalDist(1060, 195)
    >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
    >>> round(fraction * 100.0, 1)
    18.4

Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:

.. doctest::

    >>> list(map(round, sat.quantiles()))
    [928, 1060, 1192]
    >>> list(map(round, sat.quantiles(n=10)))
    [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]

To estimate the distribution for a model that isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:

.. doctest::

    >>> def model(x, y, z):
    ...     return (3*x + 7*x*y - 5*y) / (11 * z)
    ...
    >>> n = 100_000
    >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
    >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
    >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
    >>> quantiles(map(model, X, Y, Z))       # doctest: +SKIP
    [1.4591308524824727, 1.8035946855390597, 2.175091447274739]

Normal distributions can be used to approximate `Binomial
distributions <https://mathworld.wolfram.com/BinomialDistribution.html>`_
when the sample size is large and when the probability of a successful
trial is near 50%.

For example, an open source conference has 750 attendees and two rooms with a
500 person capacity.  There is a talk about Python and another about Ruby.
In previous conferences, 65% of the attendees preferred to listen to Python
talks.  Assuming the population preferences haven't changed, what is the
probability that the Python room will stay within its capacity limits?

.. doctest::

    >>> n = 750             # Sample size
    >>> p = 0.65            # Preference for Python
    >>> q = 1.0 - p         # Preference for Ruby
    >>> k = 500             # Room capacity

    >>> # Approximation using the cumulative normal distribution
    >>> from math import sqrt
    >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
    0.8402

    >>> # Solution using the cumulative binomial distribution
    >>> from math import comb, fsum
    >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
    0.8402

    >>> # Approximation using a simulation
    >>> from random import seed, choices
    >>> seed(8675309)
    >>> def trial():
    ...     return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
    >>> mean(trial() <= k for i in range(10_000))
    0.8398

Normal distributions commonly arise in machine learning problems.

Wikipedia has a `nice example of a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Person_classification>`_.
The challenge is to predict a person's gender from measurements of normally
distributed features including height, weight, and foot size.

We're given a training dataset with measurements for eight people.  The
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:

.. doctest::

    >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
    >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
    >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
    >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
    >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
    >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

Next, we encounter a new person whose feature measurements are known but whose
gender is unknown:

.. doctest::

    >>> ht = 6.0        # height
    >>> wt = 130        # weight
    >>> fs = 8          # foot size

Starting with a 50% `prior probability
<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
we compute the posterior as the prior times the product of likelihoods for the
feature measurements given the gender:

.. doctest::

   >>> prior_male = 0.5
   >>> prior_female = 0.5
   >>> posterior_male = (prior_male * height_male.pdf(ht) *
   ...                   weight_male.pdf(wt) * foot_size_male.pdf(fs))

   >>> posterior_female = (prior_female * height_female.pdf(ht) *
   ...                     weight_female.pdf(wt) * foot_size_female.pdf(fs))

The final prediction goes to the largest posterior.  This is known as the
`maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:

.. doctest::

  >>> 'male' if posterior_male > posterior_female else 'female'
  'female'


..
   # This modelines must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;