# We did not author this file nor do we maintain it. Skip linting it.
#pylint: skip-file
# Copyright (c) 1999-2008 Gary Strangman; All Rights Reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
#
# Comments and/or additions are welcome (send e-mail to:
# strang@nmr.mgh.harvard.edu).
#
"""stats.py module

(Requires pstat.py module.)

#################################################
#######  Written by:  Gary Strangman  ###########
#######  Last modified:  Oct 31, 2008 ###########
#################################################

A collection of basic statistical functions for Python.  The function
names appear below.

IMPORTANT:  There are really *3* sets of functions.  The first set has an 'l'
prefix, which can be used with list or tuple arguments.  The second set has
an 'a' prefix, which can accept NumPy array arguments.  These latter
functions are defined only when NumPy is available on the system.  The third
type has NO prefix (i.e., has the name that appears below).  Functions of
this set are members of a "Dispatch" class, c/o David Ascher.  This class
allows different functions to be called depending on the type of the passed
arguments.  Thus, stats.mean is a member of the Dispatch class and
stats.mean(range(20)) will call stats.lmean(range(20)) while
stats.mean(Numeric.arange(20)) will call stats.amean(Numeric.arange(20)).
This is a handy way to keep consistent function names when different
argument types require different functions to be called.  Having
implemented the Dispatch class, however, means that to get info on
a given function, you must use the REAL function name ... that is,
"print stats.lmean.__doc__" or "print stats.amean.__doc__" work fine,
while "print stats.mean.__doc__" will print the doc for the Dispatch
class.  NUMPY FUNCTIONS ('a' prefix) generally have more argument options
but should otherwise be consistent with the corresponding list functions.

Disclaimers:  The function list is obviously incomplete and, worse, the
functions are not optimized.  All functions have been tested (some more
so than others), but they are far from bulletproof.  Thus, as with any
free software, no warranty or guarantee is expressed or implied. :-)  A
few extra functions that don't appear in the list below can be found by
interested treasure-hunters.
These functions don't necessarily have both list and array versions but
were deemed useful.

CENTRAL TENDENCY:  geometricmean
                   harmonicmean
                   mean
                   median
                   medianscore
                   mode

MOMENTS:  moment
          variation
          skew
          kurtosis
          skewtest     (for Numpy arrays only)
          kurtosistest (for Numpy arrays only)
          normaltest   (for Numpy arrays only)

ALTERED VERSIONS:  tmean   (for Numpy arrays only)
                   tvar    (for Numpy arrays only)
                   tmin    (for Numpy arrays only)
                   tmax    (for Numpy arrays only)
                   tstdev  (for Numpy arrays only)
                   tsem    (for Numpy arrays only)
                   describe

FREQUENCY STATS:  itemfreq
                  scoreatpercentile
                  percentileofscore
                  histogram
                  cumfreq
                  relfreq

VARIABILITY:  obrientransform
              samplevar
              samplestdev
              signaltonoise (for Numpy arrays only)
              var
              stdev
              sterr
              sem
              z
              zs
              zmap (for Numpy arrays only)

TRIMMING FCNS:  threshold (for Numpy arrays only)
                trimboth
                trim1
                round (round all vals to 'n' decimals; Numpy only)

CORRELATION FCNS:  covariance  (for Numpy arrays only)
                   correlation (for Numpy arrays only)
                   paired
                   pearsonr
                   spearmanr
                   pointbiserialr
                   kendalltau
                   linregress

INFERENTIAL STATS:  ttest_1samp
                    ttest_ind
                    ttest_rel
                    chisquare
                    ks_2samp
                    mannwhitneyu
                    ranksums
                    wilcoxont
                    kruskalwallish
                    friedmanchisquare

PROBABILITY CALCS:  chisqprob
                    erfcc
                    zprob
                    ksprob
                    fprob
                    betacf
                    gammln
                    betai

ANOVA FUNCTIONS:  F_oneway
                  F_value

SUPPORT FUNCTIONS:  writecc
                    incr
                    sign  (for Numpy arrays only)
                    sum
                    cumsum
                    ss
                    summult
                    sumdiffsquared
                    square_of_sums
                    shellsort
                    rankdata
                    outputpairedstats
                    findwithin
"""
## CHANGE LOG:
## ===========
## 09-07-21 ... added capability for getting the 'proportion' out of l/amannwhitneyu (but comment-disabled)
## 08-10-31 ... fixed import LinearAlgebra bug before glm fcns
## 07-11-26 ... conversion for numpy started
## 07-05-16 ... added Lin's Concordance Correlation Coefficient (alincc) and acov
## 05-08-21 ... added "Dice's coefficient"
## 04-10-26 ... added ap2t(), an ugly fcn for converting p-vals to T-vals
## 04-04-03 ... added amasslinregress() function to do regression on N-D arrays
## 03-01-03 ... CHANGED VERSION TO 0.6
##              fixed atsem() to properly handle limits=None case
##              improved histogram and median functions (estbinwidth) and
##              fixed atvar() function (wrong answers for neg numbers?!?)
## 02-11-19 ... fixed attest_ind and attest_rel for div-by-zero Overflows
## 02-05-10 ... fixed lchisqprob indentation (failed when df=even)
## 00-12-28 ... removed aanova() to separate module, fixed licensing to
##              match Python License, fixed doc string & imports
## 00-04-13 ... pulled all "global" statements, except from aanova()
##              added/fixed lots of documentation, removed io.py dependency
##              changed to version 0.5
## 99-11-13 ... added asign() function
## 99-11-01 ... changed version to 0.4 ... enough incremental changes now
## 99-10-25 ... added acovariance and acorrelation functions
## 99-10-10 ... fixed askew/akurtosis to avoid divide-by-zero errors
##              added aglm function (crude, but will be improved)
## 99-10-04 ... upgraded acumsum, ass, asummult, asamplevar, avar, etc. to
##              all handle lists of 'dimension's and keepdims
##              REMOVED ar0, ar2, ar3, ar4 and replaced them with around
##              reinserted fixes for abetai to avoid math overflows
## 99-09-05 ... rewrote achisqprob/aerfcc/aksprob/afprob/abetacf/abetai to
##              handle multi-dimensional arrays (whew!)
## 99-08-30 ... fixed l/amoment, l/askew, l/akurtosis per D'Agostino (1990)
##              added anormaltest per same reference
##              re-wrote azprob to calc arrays of probs all at once
## 99-08-22 ... edited attest_ind printing section so arrays could be rounded
## 99-08-19 ... fixed amean and aharmonicmean for non-error(!) overflow on
##              short/byte arrays (mean of #s btw 100-300 = -150??)
## 99-08-09 ... fixed asum so that the None case works for Byte arrays
## 99-08-08 ... fixed 7/3 'improvement' to handle t-calcs on N-D arrays
## 99-07-03 ... improved attest_ind, attest_rel (zero-division errortrap)
## 99-06-24 ... fixed bug(?) in attest_ind (n1=a.shape[0])
## 04/11/99 ... added asignaltonoise, athreshold functions, changed all
##              max/min in array section to N.maximum/N.minimum,
##              fixed square_of_sums to prevent integer overflow
## 04/10/99 ... !!! Changed function name ... sumsquared ==> square_of_sums
## 03/18/99 ... Added ar0, ar2, ar3 and ar4 rounding functions
## 02/28/99 ... Fixed aobrientransform to return an array rather than a list
## 01/15/99 ... Essentially ceased updating list-versions of functions (!!!)
## 01/13/99 ... CHANGED TO VERSION 0.3
##              fixed bug in a/lmannwhitneyu p-value calculation
## 12/31/98 ... fixed variable-name bug in ldescribe
## 12/19/98 ... fixed bug in findwithin (fcns needed pstat. prefix)
## 12/16/98 ... changed amedianscore to return float (not array) for 1 score
## 12/14/98 ... added atmin and atmax functions
##              removed umath from import line (not needed)
##              l/ageometricmean modified to reduce chance of overflows (take
##              nth root first, then multiply)
## 12/07/98 ... added __version__ variable (now 0.2)
##              removed all 'stats.' from anova() fcn
## 12/06/98 ... changed those functions (except shellsort) that altered
##              arguments in-place ... cumsum, ranksort, ...
##              updated (and fixed some) doc-strings
## 12/01/98 ... added anova() function (requires NumPy)
##              incorporated Dispatch class
## 11/12/98 ... added functionality to amean, aharmonicmean, ageometricmean
##              added 'asum' function (added functionality to N.add.reduce)
##              fixed both moment and amoment (two errors)
##              changed name of skewness and askewness to skew and askew
##              fixed (a)histogram (which sometimes counted points <lowerlimit)

import pstat  # required 3rd party module
import math, string, copy  # required python modules
from types import *

__version__ = 0.6

############# DISPATCH CODE ##############


class Dispatch:
  """
The Dispatch class, care of David Ascher, allows different functions to
be called depending on the argument types.  This way, there can be one
function name regardless of the argument type.  To access function doc
in stats.py module, prefix the function with an 'l' or 'a' for list or
array arguments, respectively.  That is, print stats.lmean.__doc__ or
print stats.amean.__doc__ or whatever.
"""

  def __init__(self, *tuples):
    self._dispatch = {}
    for func, types in tuples:
      for t in types:
        if t in self._dispatch.keys():
          raise ValueError, "can't have two dispatches on " + str(t)
        self._dispatch[t] = func
    self._types = self._dispatch.keys()

  def __call__(self, arg1, *args, **kw):
    if type(arg1) not in self._types:
      raise TypeError, "don't know how to dispatch %s arguments" % type(arg1)
    return apply(self._dispatch[type(arg1)], (arg1,) + args, kw)
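
# Illustrative sketch (not executed): how the Dispatch instances defined at
# the bottom of this module route a call to the matching typed implementation.
# The names used ('mean', 'lmean') are the module's own; the sample values
# are made up for illustration only.
#
#   mean = Dispatch((lmean, (ListType, TupleType)),)
#   mean(range(20))      # a list  -> dispatches to lmean -> 9.5
#   mean((1, 2, 3, 4))   # a tuple -> dispatches to lmean -> 2.5
#   mean(N.arange(20))   # raises TypeError unless an array handler
#                        # (e.g. amean) was also registered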

##########################################################################
########################  LIST-BASED FUNCTIONS  #########################
##########################################################################

### Define these regardless

####################################
#######  CENTRAL TENDENCY  #########
####################################


def lgeometricmean(inlist):
  """
Calculates the geometric mean of the values in the passed list.
That is:  n-th root of (x1 * x2 * ... * xn).  Assumes a '1D' list.

Usage:   lgeometricmean(inlist)
"""
  mult = 1.0
  one_over_n = 1.0 / len(inlist)
  for item in inlist:
    mult = mult * pow(item, one_over_n)
  return mult


def lharmonicmean(inlist):
  """
Calculates the harmonic mean of the values in the passed list.
That is:  n / (1/x1 + 1/x2 + ... + 1/xn).  Assumes a '1D' list.

Usage:   lharmonicmean(inlist)
"""
  sum = 0
  for item in inlist:
    sum = sum + 1.0 / item
  return len(inlist) / sum


def lmean(inlist):
  """
Returns the arithmetic mean of the values in the passed list.
Assumes a '1D' list, but will function on the 1st dim of an array(!).

Usage:   lmean(inlist)
"""
  sum = 0
  for item in inlist:
    sum = sum + item
  return sum / float(len(inlist))


def lmedian(inlist, numbins=1000):
  """
Returns the computed median value of a list of numbers, given the
number of bins to use for the histogram (more bins brings the computed value
closer to the median score, default number of bins = 1000).  See G.W.
Heiman's Basic Stats (1st Edition), or CRC Probability & Statistics.

Usage:   lmedian(inlist, numbins=1000)
"""
  (hist, smallest, binsize, extras) = histogram(
      inlist, numbins, [min(inlist), max(inlist)])  # make histog
  cumhist = cumsum(hist)  # make cumulative histogram
  for i in range(len(cumhist)):  # get 1st(!) index holding 50%ile score
    if cumhist[i] >= len(inlist) / 2.0:
      cfbin = i
      break
  LRL = smallest + binsize * cfbin  # get lower read limit of that bin
  cfbelow = cumhist[cfbin - 1]
  freq = float(hist[cfbin])  # frequency IN the 50%ile bin
  median = LRL + (
      (len(inlist) / 2.0 - cfbelow) / float(freq)) * binsize  # median formula
  return median


def lmedianscore(inlist):
  """
Returns the 'middle' score of the passed list.  If there is an even
number of scores, the mean of the 2 middle scores is returned.

Usage:   lmedianscore(inlist)
"""
  newlist = copy.deepcopy(inlist)
  newlist.sort()
  if len(newlist) % 2 == 0:  # if even number of scores, average middle 2
    index = len(newlist) / 2  # integer division correct
    median = float(newlist[index] + newlist[index - 1]) / 2
  else:
    index = len(newlist) / 2  # int division gives mid value when count from 0
    median = newlist[index]
  return median


def lmode(inlist):
  """
Returns a list of the modal (most common) score(s) in the passed
list.  If there is more than one such score, all are returned.  The
bin-count for the mode(s) is also returned.

Usage:   lmode(inlist)
Returns: bin-count for mode(s), a list of modal value(s)
"""
  scores = pstat.unique(inlist)
  scores.sort()
  freq = []
  for item in scores:
    freq.append(inlist.count(item))
  maxfreq = max(freq)
  mode = []
  stillmore = 1
  while stillmore:
    try:
      indx = freq.index(maxfreq)
      mode.append(scores[indx])
      del freq[indx]
      del scores[indx]
    except ValueError:
      stillmore = 0
  return maxfreq, mode

####################################
############  MOMENTS  #############
####################################


def lmoment(inlist, moment=1):
  """
Calculates the nth moment about the mean for a sample (defaults to
the 1st moment).  Used to calculate coefficients of skewness and kurtosis.

Usage:   lmoment(inlist,moment=1)
Returns: appropriate moment (r) from ... 1/n * SUM((inlist(i)-mean)**r)
"""
  if moment == 1:
    return 0.0
  else:
    mn = mean(inlist)
    n = len(inlist)
    s = 0
    for x in inlist:
      s = s + (x - mn)**moment
    return s / float(n)


def lvariation(inlist):
  """
Returns the coefficient of variation, as defined in CRC Standard
Probability and Statistics, p.6.

Usage:   lvariation(inlist)
"""
  return 100.0 * samplestdev(inlist) / float(mean(inlist))


def lskew(inlist):
  """
Returns the skewness of a distribution, as defined in Numerical
Recipes (alternate defn in CRC Standard Probability and Statistics, p.6.)

Usage:   lskew(inlist)
"""
  return moment(inlist, 3) / pow(moment(inlist, 2), 1.5)


def lkurtosis(inlist):
  """
Returns the kurtosis of a distribution, as defined in Numerical
Recipes (alternate defn in CRC Standard Probability and Statistics, p.6.)

Usage:   lkurtosis(inlist)
"""
  return moment(inlist, 4) / pow(moment(inlist, 2), 2.0)


def ldescribe(inlist):
  """
Returns some descriptive statistics of the passed list (assumed to be 1D).

Usage:   ldescribe(inlist)
Returns: n, mean, standard deviation, skew, kurtosis
"""
  n = len(inlist)
  mm = (min(inlist), max(inlist))
  m = mean(inlist)
  sd = stdev(inlist)
  sk = skew(inlist)
  kurt = kurtosis(inlist)
  return n, mm, m, sd, sk, kurt

####################################
#######  FREQUENCY STATS  ##########
####################################


def litemfreq(inlist):
  """
Returns a list of pairs.  Each pair consists of one of the scores in inlist
and its frequency count.  Assumes a 1D list is passed.

Usage:   litemfreq(inlist)
Returns: a 2D frequency table (col [0:n-1]=scores, col n=frequencies)
"""
  scores = pstat.unique(inlist)
  scores.sort()
  freq = []
  for item in scores:
    freq.append(inlist.count(item))
  return pstat.abut(scores, freq)


def lscoreatpercentile(inlist, percent):
  """
Returns the score at a given percentile relative to the distribution
given by inlist.

Usage:   lscoreatpercentile(inlist,percent)
"""
  if percent > 1:
    print '\nDividing percent>1 by 100 in lscoreatpercentile().\n'
    percent = percent / 100.0
  targetcf = percent * len(inlist)
  h, lrl, binsize, extras = histogram(inlist)
  cumhist = cumsum(copy.deepcopy(h))
  for i in range(len(cumhist)):
    if cumhist[i] >= targetcf:
      break
  score = binsize * (
      (targetcf - cumhist[i - 1]) / float(h[i])) + (lrl + binsize * i)
  return score


def lpercentileofscore(inlist, score, histbins=10, defaultlimits=None):
  """
Returns the percentile value of a score relative to the distribution
given by inlist.  Formula depends on the values used to histogram the data(!).

Usage:   lpercentileofscore(inlist,score,histbins=10,defaultlimits=None)
"""
  h, lrl, binsize, extras = histogram(inlist, histbins, defaultlimits)
  cumhist = cumsum(copy.deepcopy(h))
  i = int((score - lrl) / float(binsize))
  pct = (cumhist[i - 1] + (
      (score -
       (lrl + binsize * i)) / float(binsize)) * h[i]) / float(len(inlist)) * 100
  return pct


def lhistogram(inlist, numbins=10, defaultreallimits=None, printextras=0):
  """
Returns (i) a list of histogram bin counts, (ii) the smallest value
of the histogram binning, and (iii) the bin width (the last 2 are not
necessarily integers).  Default number of bins is 10.  If no sequence object
is given for defaultreallimits, the routine picks (usually non-pretty) bins
spanning all the numbers in the inlist.

Usage:   lhistogram (inlist, numbins=10, defaultreallimits=None, printextras=0)
Returns: list of bin values, lowerreallimit, binsize, extrapoints
"""
  if (defaultreallimits <> None):
    if type(defaultreallimits) not in [ListType, TupleType] or len(
        defaultreallimits) == 1:  # only one limit given, assumed to be lower one & upper is calc'd
      lowerreallimit = defaultreallimits
      upperreallimit = 1.000001 * max(inlist)
    else:  # assume both limits given
      lowerreallimit = defaultreallimits[0]
      upperreallimit = defaultreallimits[1]
    binsize = (upperreallimit - lowerreallimit) / float(numbins)
  else:  # no limits given for histogram, both must be calc'd
    estbinwidth = (max(inlist) -
                   min(inlist)) / float(numbins) + 1e-6  # 1=>cover all
    binsize = ((max(inlist) - min(inlist) + estbinwidth)) / float(numbins)
    lowerreallimit = min(inlist) - binsize / 2  # lower real limit, 1st bin
  bins = [0] * (numbins)
  extrapoints = 0
  for num in inlist:
    try:
      if (num - lowerreallimit) < 0:
        extrapoints = extrapoints + 1
      else:
        bintoincrement = int((num - lowerreallimit) / float(binsize))
        bins[bintoincrement] = bins[bintoincrement] + 1
    except:
      extrapoints = extrapoints + 1
  if (extrapoints > 0 and printextras == 1):
    print '\nPoints outside given histogram range =', extrapoints
  return (bins, lowerreallimit, binsize, extrapoints)


def lcumfreq(inlist, numbins=10, defaultreallimits=None):
  """
Returns a cumulative frequency histogram, using the histogram function.

Usage:   lcumfreq(inlist,numbins=10,defaultreallimits=None)
Returns: list of cumfreq bin values, lowerreallimit, binsize, extrapoints
"""
  h, l, b, e = histogram(inlist, numbins, defaultreallimits)
  cumhist = cumsum(copy.deepcopy(h))
  return cumhist, l, b, e


def lrelfreq(inlist, numbins=10, defaultreallimits=None):
  """
Returns a relative frequency histogram, using the histogram function.

Usage:   lrelfreq(inlist,numbins=10,defaultreallimits=None)
Returns: list of relfreq bin values, lowerreallimit, binsize, extrapoints
"""
  h, l, b, e = histogram(inlist, numbins, defaultreallimits)
  for i in range(len(h)):
    h[i] = h[i] / float(len(inlist))
  return h, l, b, e

####################################
#####  VARIABILITY FUNCTIONS  ######
####################################


def lobrientransform(*args):
  """
Computes a transform on input data (any number of columns).  Used to
test for homogeneity of variance prior to running one-way stats.  From
Maxwell and Delaney, p.112.

Usage:   lobrientransform(*args)
Returns: transformed data for use in an ANOVA
"""
  TINY = 1e-10
  k = len(args)
  n = [0.0] * k
  v = [0.0] * k
  m = [0.0] * k
  nargs = []
  for i in range(k):
    nargs.append(copy.deepcopy(args[i]))
    n[i] = float(len(nargs[i]))
    v[i] = var(nargs[i])
    m[i] = mean(nargs[i])
  for j in range(k):
    for i in range(n[j]):
      t1 = (n[j] - 1.5) * n[j] * (nargs[j][i] - m[j])**2
      t2 = 0.5 * v[j] * (n[j] - 1.0)
      t3 = (n[j] - 1.0) * (n[j] - 2.0)
      nargs[j][i] = (t1 - t2) / float(t3)
  check = 1
  for j in range(k):
    if v[j] - mean(nargs[j]) > TINY:
      check = 0
  if check <> 1:
    raise ValueError, 'Problem in obrientransform.'
  else:
    return nargs


def lsamplevar(inlist):
  """
Returns the variance of the values in the passed list using
N for the denominator (i.e., DESCRIBES the sample variance only).

Usage:   lsamplevar(inlist)
"""
  n = len(inlist)
  mn = mean(inlist)
  deviations = []
  for item in inlist:
    deviations.append(item - mn)
  return ss(deviations) / float(n)


def lsamplestdev(inlist):
  """
Returns the standard deviation of the values in the passed list using
N for the denominator (i.e., DESCRIBES the sample stdev only).

Usage:   lsamplestdev(inlist)
"""
  return math.sqrt(samplevar(inlist))


def lcov(x, y, keepdims=0):
  """
Returns the estimated covariance of the values in the passed
array (i.e., N-1).  Dimension can equal None (ravel array first), an
integer (the dimension over which to operate), or a sequence (operate
over multiple dimensions).  Set keepdims=1 to return an array with the
same number of dimensions as inarray.

Usage:   lcov(x,y,keepdims=0)
"""
  n = len(x)
  xmn = mean(x)
  ymn = mean(y)
  xdeviations = [0] * len(x)
  ydeviations = [0] * len(y)
  for i in range(len(x)):
    xdeviations[i] = x[i] - xmn
    ydeviations[i] = y[i] - ymn
  ss = 0.0
  for i in range(len(xdeviations)):
    ss = ss + xdeviations[i] * ydeviations[i]
  return ss / float(n - 1)


def lvar(inlist):
  """
Returns the variance of the values in the passed list using N-1
for the denominator (i.e., for estimating population variance).

Usage:   lvar(inlist)
"""
  n = len(inlist)
  mn = mean(inlist)
  deviations = [0] * len(inlist)
  for i in range(len(inlist)):
    deviations[i] = inlist[i] - mn
  return ss(deviations) / float(n - 1)


def lstdev(inlist):
  """
Returns the standard deviation of the values in the passed list
using N-1 in the denominator (i.e., to estimate population stdev).

Usage:   lstdev(inlist)
"""
  return math.sqrt(var(inlist))


def lsterr(inlist):
  """
Returns the standard error of the values in the passed list using N-1
in the denominator (i.e., to estimate population standard error).

Usage:   lsterr(inlist)
"""
  return stdev(inlist) / float(math.sqrt(len(inlist)))


def lsem(inlist):
  """
Returns the estimated standard error of the mean (sx-bar) of the
values in the passed list.  sem = stdev / sqrt(n)

Usage:   lsem(inlist)
"""
  sd = stdev(inlist)
  n = len(inlist)
  return sd / math.sqrt(n)


def lz(inlist, score):
  """
Returns the z-score for a given input score, given that score and the
list from which that score came.  Not appropriate for population calculations.

Usage:   lz(inlist, score)
"""
  z = (score - mean(inlist)) / samplestdev(inlist)
  return z


def lzs(inlist):
  """
Returns a list of z-scores, one for each score in the passed list.

Usage:   lzs(inlist)
"""
  zscores = []
  for item in inlist:
    zscores.append(z(inlist, item))
  return zscores

####################################
#######  TRIMMING FUNCTIONS  #######
####################################


def ltrimboth(l, proportiontocut):
  """
Slices off the passed proportion of items from BOTH ends of the passed
list (i.e., with proportiontocut=0.1, slices 'leftmost' 10% AND 'rightmost'
10% of scores).  Assumes list is sorted by magnitude.  Slices off LESS if
proportion results in a non-integer slice index (i.e., conservatively
slices off proportiontocut).

Usage:   ltrimboth (l,proportiontocut)
Returns: trimmed version of list l
"""
  lowercut = int(proportiontocut * len(l))
  uppercut = len(l) - lowercut
  return l[lowercut:uppercut]


def ltrim1(l, proportiontocut, tail='right'):
  """
Slices off the passed proportion of items from ONE end of the passed
list (i.e., if proportiontocut=0.1, slices off 'leftmost' or 'rightmost'
10% of scores).  Slices off LESS if proportion results in a non-integer
slice index (i.e., conservatively slices off proportiontocut).

Usage:   ltrim1 (l,proportiontocut,tail='right')  or set tail='left'
Returns: trimmed version of list l
"""
  if tail == 'right':
    lowercut = 0
    uppercut = len(l) - int(proportiontocut * len(l))
  elif tail == 'left':
    lowercut = int(proportiontocut * len(l))
    uppercut = len(l)
  return l[lowercut:uppercut]
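
# Illustrative sketch (not executed): how the two trimming helpers above
# behave on a small, already-sorted list.  The sample values are made up
# for illustration only.
#
#   ltrimboth([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.1)             # -> [2, ..., 9]
#   ltrim1([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.2, tail='right')  # -> [1, ..., 8]
#   ltrim1([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.2, tail='left')   # -> [3, ..., 10]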

####################################
#####  CORRELATION FUNCTIONS  ######
####################################


def lpaired(x, y):
  """
Interactively determines the type of data and then runs the
appropriate statistic for paired group data.

Usage:   lpaired(x,y)
Returns: appropriate statistic name, value, and probability
"""
  samples = ''
  while samples not in ['i', 'r', 'I', 'R', 'c', 'C']:
    print '\nIndependent or related samples, or correlation (i,r,c): ',
    samples = raw_input()

  if samples in ['i', 'I', 'r', 'R']:
    print '\nComparing variances ...',
    # USE O'BRIEN'S TEST FOR HOMOGENEITY OF VARIANCE, Maxwell & Delaney, p.112
    r = obrientransform(x, y)
    f, p = F_oneway(pstat.colex(r, 0), pstat.colex(r, 1))
    if p < 0.05:
      vartype = 'unequal, p=' + str(round(p, 4))
    else:
      vartype = 'equal'
    print vartype
    if samples in ['i', 'I']:
      if vartype[0] == 'e':
        t, p = ttest_ind(x, y, 0)
        print '\nIndependent samples t-test:  ', round(t, 4), round(p, 4)
      else:
        if len(x) > 20 or len(y) > 20:
          z, p = ranksums(x, y)
          print '\nRank Sums test (NONparametric, n>20):  ', round(z, 4), round(
              p, 4)
        else:
          u, p = mannwhitneyu(x, y)
          print '\nMann-Whitney U-test (NONparametric, ns<20):  ', round(
              u, 4), round(p, 4)

    else:  # RELATED SAMPLES
      if vartype[0] == 'e':
        t, p = ttest_rel(x, y, 0)
        print '\nRelated samples t-test:  ', round(t, 4), round(p, 4)
      else:
        t, p = ranksums(x, y)
        print '\nWilcoxon T-test (NONparametric):  ', round(t, 4), round(p, 4)
  else:  # CORRELATION ANALYSIS
    corrtype = ''
    while corrtype not in ['c', 'C', 'r', 'R', 'd', 'D']:
      print '\nIs the data Continuous, Ranked, or Dichotomous (c,r,d): ',
      corrtype = raw_input()
    if corrtype in ['c', 'C']:
      m, b, r, p, see = linregress(x, y)
      print '\nLinear regression for continuous variables ...'
      lol = [['Slope', 'Intercept', 'r', 'Prob', 'SEestimate'],
             [round(m, 4), round(b, 4), round(r, 4), round(p, 4), round(see, 4)]]
      pstat.printcc(lol)
    elif corrtype in ['r', 'R']:
      r, p = spearmanr(x, y)
      print '\nCorrelation for ranked variables ...'
      print "Spearman's r: ", round(r, 4), round(p, 4)
    else:  # DICHOTOMOUS
      r, p = pointbiserialr(x, y)
      print '\nAssuming x contains a dichotomous variable ...'
      print 'Point Biserial r: ', round(r, 4), round(p, 4)
  print '\n\n'
  return None


def lpearsonr(x, y):
  """
Calculates a Pearson correlation coefficient and the associated
probability value.  Taken from Heiman's Basic Statistics for the Behav.
Sci (2nd), p.195.

Usage:   lpearsonr(x,y)      where x and y are equal-length lists
Returns: Pearson's r value, two-tailed p-value
"""
  TINY = 1.0e-30
  if len(x) <> len(y):
    raise ValueError, 'Input values not paired in pearsonr.  Aborting.'
  n = len(x)
  x = map(float, x)
  y = map(float, y)
  xmean = mean(x)
  ymean = mean(y)
  r_num = n * (summult(x, y)) - sum(x) * sum(y)
  r_den = math.sqrt((n * ss(x) - square_of_sums(x)) *
                    (n * ss(y) - square_of_sums(y)))
  r = (r_num / r_den)  # denominator already a float
  df = n - 2
  t = r * math.sqrt(df / ((1.0 - r + TINY) * (1.0 + r + TINY)))
  prob = betai(0.5 * df, 0.5, df / float(df + t * t))
  return r, prob


def llincc(x, y):
  """
Calculates Lin's concordance correlation coefficient.

Usage:   llincc(x,y)    where x, y are equal-length lists
Returns: Lin's CC
"""
  covar = lcov(x, y) * (len(x) - 1) / float(len(x))  # correct denom to n
  xvar = lvar(x) * (len(x) - 1) / float(len(x))  # correct denom to n
  yvar = lvar(y) * (len(y) - 1) / float(len(y))  # correct denom to n
  lincc = (2 * covar) / ((xvar + yvar) + ((mean(x) - mean(y))**2))
  return lincc


def lspearmanr(x, y):
  """
Calculates a Spearman rank-order correlation coefficient.  Taken
from Heiman's Basic Statistics for the Behav. Sci (1st), p.192.

Usage:   lspearmanr(x,y)      where x and y are equal-length lists
Returns: Spearman's r, two-tailed p-value
"""
  TINY = 1e-30
  if len(x) <> len(y):
    raise ValueError, 'Input values not paired in spearmanr.  Aborting.'
  n = len(x)
  rankx = rankdata(x)
  ranky = rankdata(y)
  dsq = sumdiffsquared(rankx, ranky)
  rs = 1 - 6 * dsq / float(n * (n**2 - 1))
  t = rs * math.sqrt((n - 2) / ((rs + 1.0) * (1.0 - rs)))
  df = n - 2
  probrs = betai(0.5 * df, 0.5, df / (df + t * t))  # t already a float
  # probability values for rs are from part 2 of the spearman function in
  # Numerical Recipes, p.510.  They are close to tables, but not exact. (?)
  return rs, probrs


def lpointbiserialr(x, y):
  """
Calculates a point-biserial correlation coefficient and the associated
probability value.  Taken from Heiman's Basic Statistics for the Behav.
Sci (1st), p.194.

Usage:   lpointbiserialr(x,y)      where x,y are equal-length lists
Returns: Point-biserial r, two-tailed p-value
"""
  TINY = 1e-30
  if len(x) <> len(y):
    raise ValueError, 'INPUT VALUES NOT PAIRED IN pointbiserialr.  ABORTING.'
  data = pstat.abut(x, y)
  categories = pstat.unique(x)
  if len(categories) <> 2:
    raise ValueError, 'Exactly 2 categories required for pointbiserialr().'
  else:  # there are 2 categories, continue
    codemap = pstat.abut(categories, range(2))
    recoded = pstat.recode(data, codemap, 0)
    x = pstat.linexand(data, 0, categories[0])
    y = pstat.linexand(data, 0, categories[1])
    xmean = mean(pstat.colex(x, 1))
    ymean = mean(pstat.colex(y, 1))
    n = len(data)
    adjust = math.sqrt((len(x) / float(n)) * (len(y) / float(n)))
    rpb = (ymean - xmean) / samplestdev(pstat.colex(data, 1)) * adjust
    df = n - 2
    t = rpb * math.sqrt(df / ((1.0 - rpb + TINY) * (1.0 + rpb + TINY)))
    prob = betai(0.5 * df, 0.5, df / (df + t * t))  # t already a float
    return rpb, prob


def lkendalltau(x, y):
  """
Calculates Kendall's tau ... correlation of ordinal data.  Adapted
from function kendl1 in Numerical Recipes.  Needs good test-routine.@@@

Usage:   lkendalltau(x,y)
Returns: Kendall's tau, two-tailed p-value
"""
  n1 = 0
  n2 = 0
  iss = 0
  for j in range(len(x) - 1):
    for k in range(j, len(y)):
      a1 = x[j] - x[k]
      a2 = y[j] - y[k]
      aa = a1 * a2
      if (aa):  # neither list has a tie
        n1 = n1 + 1
        n2 = n2 + 1
        if aa > 0:
          iss = iss + 1
        else:
          iss = iss - 1
      else:
        if (a1):
          n1 = n1 + 1
        else:
          n2 = n2 + 1
  tau = iss / math.sqrt(n1 * n2)
  svar = (4.0 * len(x) + 10.0) / (9.0 * len(x) * (len(x) - 1))
  z = tau / math.sqrt(svar)
  prob = erfcc(abs(z) / 1.4142136)
  return tau, prob


def llinregress(x, y):
  """
Calculates a regression line on x,y pairs.

Usage:   llinregress(x,y)      x,y are equal-length lists of x-y coordinates
Returns: slope, intercept, r, two-tailed prob, sterr-of-estimate
"""
  TINY = 1.0e-20
  if len(x) <> len(y):
    raise ValueError, 'Input values not paired in linregress.  Aborting.'
  n = len(x)
  x = map(float, x)
  y = map(float, y)
  xmean = mean(x)
  ymean = mean(y)
  r_num = float(n * (summult(x, y)) - sum(x) * sum(y))
  r_den = math.sqrt((n * ss(x) - square_of_sums(x)) *
                    (n * ss(y) - square_of_sums(y)))
  r = r_num / r_den
  z = 0.5 * math.log((1.0 + r + TINY) / (1.0 - r + TINY))
  df = n - 2
  t = r * math.sqrt(df / ((1.0 - r + TINY) * (1.0 + r + TINY)))
  prob = betai(0.5 * df, 0.5, df / (df + t * t))
  slope = r_num / float(n * ss(x) - square_of_sums(x))
  intercept = ymean - slope * xmean
  sterrest = math.sqrt(1 - r * r) * samplestdev(y)
  return slope, intercept, r, prob, sterrest

####################################
#####  INFERENTIAL STATISTICS  #####
####################################


def lttest_1samp(a, popmean, printit=0, name='Sample', writemode='a'):
  """
Calculates the t-obtained for the single-sample T-test on ONE group
of scores a, given a population mean.  If printit=1, results are printed
to the screen.  If printit='filename', the results are output to 'filename'
using the given writemode (default=append).  Returns t-value, and prob.

Usage:   lttest_1samp(a,popmean,Name='Sample',printit=0,writemode='a')
Returns: t-value, two-tailed prob
"""
  x = mean(a)
  v = var(a)
  n = len(a)
  df = n - 1
  svar = ((n - 1) * v) / float(df)
  t = (x - popmean) / math.sqrt(svar * (1.0 / n))
  prob = betai(0.5 * df, 0.5, float(df) / (df + t * t))

  if printit <> 0:
    statname = 'Single-sample T-test.'
    outputpairedstats(printit, writemode, 'Population', '--', popmean, 0, 0, 0,
                      name, n, x, v, min(a), max(a), statname, t, prob)
  return t, prob


def lttest_ind(a, b, printit=0, name1='Samp1', name2='Samp2', writemode='a'):
  """
Calculates the t-obtained T-test on TWO INDEPENDENT samples of
scores a, and b.  From Numerical Recipes, p.483.  If printit=1, results
are printed to the screen.  If printit='filename', the results are output
to 'filename' using the given writemode (default=append).  Returns t-value,
and prob.

Usage:   lttest_ind(a,b,printit=0,name1='Samp1',name2='Samp2',writemode='a')
Returns: t-value, two-tailed prob
"""
  x1 = mean(a)
  x2 = mean(b)
  v1 = stdev(a)**2
  v2 = stdev(b)**2
  n1 = len(a)
  n2 = len(b)
  df = n1 + n2 - 2
  svar = ((n1 - 1) * v1 + (n2 - 1) * v2) / float(df)
  if not svar:
    svar = 1.0e-26
  t = (x1 - x2) / math.sqrt(svar * (1.0 / n1 + 1.0 / n2))
  prob = betai(0.5 * df, 0.5, df / (df + t * t))

  if printit <> 0:
    statname = 'Independent samples T-test.'
    outputpairedstats(printit, writemode, name1, n1, x1, v1, min(a), max(a),
                      name2, n2, x2, v2, min(b), max(b), statname, t, prob)
  return t, prob


def lttest_rel(a,
               b,
               printit=0,
               name1='Sample1',
               name2='Sample2',
               writemode='a'):
  """
Calculates the t-obtained T-test on TWO RELATED samples of scores,
a and b.  From Numerical Recipes, p.483.  If printit=1, results are
printed to the screen.  If printit='filename', the results are output to
'filename' using the given writemode (default=append).
Returns t-value, and prob.

Usage:   lttest_rel(a,b,printit=0,name1='Sample1',name2='Sample2',writemode='a')
Returns: t-value, two-tailed prob
"""
  if len(a) <> len(b):
    raise ValueError, 'Unequal length lists in ttest_rel.'
  x1 = mean(a)
  x2 = mean(b)
  v1 = var(a)
  v2 = var(b)
  n = len(a)
  cov = 0
  for i in range(len(a)):
    cov = cov + (a[i] - x1) * (b[i] - x2)
  df = n - 1
  cov = cov / float(df)
  sd = math.sqrt((v1 + v2 - 2.0 * cov) / float(n))
  t = (x1 - x2) / sd
  prob = betai(0.5 * df, 0.5, df / (df + t * t))

  if printit <> 0:
    statname = 'Related samples T-test.'
    outputpairedstats(printit, writemode, name1, n, x1, v1, min(a), max(a),
                      name2, n, x2, v2, min(b), max(b), statname, t, prob)
  return t, prob


def lchisquare(f_obs, f_exp=None):
  """
Calculates a one-way chi square for list of observed frequencies and returns
the result.  If no expected frequencies are given, the total N is assumed to
be equally distributed across all groups.

Usage:   lchisquare(f_obs, f_exp=None)   f_obs = list of observed cell freq.
Returns: chisquare-statistic, associated p-value
"""
  k = len(f_obs)  # number of groups
  if f_exp == None:
    f_exp = [sum(f_obs) / float(k)] * len(f_obs)  # create k bins with = freq.
  chisq = 0
  for i in range(len(f_obs)):
    chisq = chisq + (f_obs[i] - f_exp[i])**2 / float(f_exp[i])
  return chisq, chisqprob(chisq, k - 1)


def lks_2samp(data1, data2):
  """
Computes the Kolmogorov-Smirnov statistic on 2 samples.  From
Numerical Recipes in C, page 493.

Usage:   lks_2samp(data1,data2)   data1&2 are lists of values for 2 conditions
Returns: KS D-value, associated p-value
"""
  j1 = 0
  j2 = 0
  fn1 = 0.0
  fn2 = 0.0
  n1 = len(data1)
  n2 = len(data2)
  en1 = n1
  en2 = n2
  d = 0.0
  data1.sort()
  data2.sort()
  while j1 < n1 and j2 < n2:
    d1 = data1[j1]
    d2 = data2[j2]
    if d1 <= d2:
      fn1 = (j1) / float(en1)
      j1 = j1 + 1
    if d2 <= d1:
      fn2 = (j2) / float(en2)
      j2 = j2 + 1
    dt = (fn2 - fn1)
    if math.fabs(dt) > math.fabs(d):
      d = dt
  try:
    en = math.sqrt(en1 * en2 / float(en1 + en2))
    prob = ksprob((en + 0.12 + 0.11 / en) * abs(d))
  except:
    prob = 1.0
  return d, prob


def lmannwhitneyu(x, y):
  """
Calculates a Mann-Whitney U statistic on the provided scores and
returns the result.  Use only when the n in each condition is < 20 and
you have 2 independent samples of ranks.  NOTE: Mann-Whitney U is
significant if the u-obtained is LESS THAN or equal to the critical
value of U found in the tables.  Equivalent to Kruskal-Wallis H with
just 2 groups.

Usage:   lmannwhitneyu(x,y)
Returns: u-statistic, one-tailed p-value (i.e., p(z(U)))
"""
  n1 = len(x)
  n2 = len(y)
  ranked = rankdata(x + y)
  rankx = ranked[0:n1]  # get the x-ranks
  ranky = ranked[n1:]  # the rest are y-ranks
  u1 = n1 * n2 + (n1 * (n1 + 1)) / 2.0 - sum(rankx)  # calc U for x
  u2 = n1 * n2 - u1  # remainder is U for y
  bigu = max(u1, u2)
  smallu = min(u1, u2)
  proportion = bigu / float(n1 * n2)
  T = math.sqrt(tiecorrect(ranked))  # correction factor for tied scores
  if T == 0:
    raise ValueError, 'All numbers are identical in lmannwhitneyu'
  sd = math.sqrt(T * n1 * n2 * (n1 + n2 + 1) / 12.0)
  z = abs((bigu - n1 * n2 / 2.0) / sd)  # normal approximation for prob calc
  return smallu, 1.0 - zprob(z)  #, proportion


def ltiecorrect(rankvals):
  """
Corrects for ties in Mann Whitney U and Kruskal Wallis H tests.  See
Siegel, S. (1956) Nonparametric Statistics for the Behavioral Sciences.
New York: McGraw-Hill.  Code adapted from |Stat rankind.c code.

Usage:   ltiecorrect(rankvals)
Returns: T correction factor for U or H
"""
  sorted, posn = shellsort(rankvals)
  n = len(sorted)
  T = 0.0
  i = 0
  while (i < n - 1):
    if sorted[i] == sorted[i + 1]:
      nties = 1
      while (i < n - 1) and (sorted[i] == sorted[i + 1]):
        nties = nties + 1
        i = i + 1
      T = T + nties**3 - nties
    i = i + 1
  T = T / float(n**3 - n)
  return 1.0 - T


def lranksums(x, y):
  """
Calculates the rank sums statistic on the provided scores and
returns the result.  Use only when the n in each condition is > 20 and you
have 2 independent samples of ranks.

Usage:   lranksums(x,y)
Returns: a z-statistic, two-tailed p-value
"""
  n1 = len(x)
  n2 = len(y)
  alldata = x + y
  ranked = rankdata(alldata)
  x = ranked[:n1]
  y = ranked[n1:]
  s = sum(x)
  expected = n1 * (n1 + n2 + 1) / 2.0
  z = (s - expected) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
  prob = 2 * (1.0 - zprob(abs(z)))
  return z, prob


def lwilcoxont(x, y):
  """
Calculates the Wilcoxon T-test for related samples and returns the
result.  A non-parametric T-test.

Usage:   lwilcoxont(x,y)
Returns: a t-statistic, two-tail probability estimate
"""
  if len(x) <> len(y):
    raise ValueError, 'Unequal N in wilcoxont.  Aborting.'
  d = []
  for i in range(len(x)):
    diff = x[i] - y[i]
    if diff <> 0:
      d.append(diff)
  count = len(d)
  absd = map(abs, d)
  absranked = rankdata(absd)
  r_plus = 0.0
  r_minus = 0.0
  for i in range(len(absd)):
    if d[i] < 0:
      r_minus = r_minus + absranked[i]
    else:
      r_plus = r_plus + absranked[i]
  wt = min(r_plus, r_minus)
  mn = count * (count + 1) * 0.25
  se = math.sqrt(count * (count + 1) * (2.0 * count + 1.0) / 24.0)
  z = math.fabs(wt - mn) / se
  prob = 2 * (1.0 - zprob(abs(z)))
  return wt, prob


def lkruskalwallish(*args):
  """
The Kruskal-Wallis H-test is a non-parametric ANOVA for 3 or more
groups, requiring at least 5 subjects in each group.  This function
calculates the Kruskal-Wallis H-test for 3 or more independent samples
and returns the result.

Usage:   lkruskalwallish(*args)
Returns: H-statistic (corrected for ties), associated p-value
"""
  args = list(args)
  n = [0] * len(args)
  all = []
  n = map(len, args)
  for i in range(len(args)):
    all = all + args[i]
  ranked = rankdata(all)
  T = tiecorrect(ranked)
  for i in range(len(args)):
    args[i] = ranked[0:n[i]]
    del ranked[0:n[i]]
  rsums = []
  for i in range(len(args)):
    rsums.append(sum(args[i])**2)
    rsums[i] = rsums[i] / float(n[i])
  ssbn = sum(rsums)
  totaln = sum(n)
  h = 12.0 / (totaln * (totaln + 1)) * ssbn - 3 * (totaln + 1)
  df = len(args) - 1
  if T == 0:
    raise ValueError, 'All numbers are identical in lkruskalwallish'
  h = h / float(T)
  return h, chisqprob(h, df)


def lfriedmanchisquare(*args):
  """
Friedman Chi-Square is a non-parametric, one-way within-subjects
ANOVA.  This function calculates the Friedman Chi-square test for repeated
measures and returns the result, along with the associated probability
value.  It assumes 3 or more repeated measures.  With only 3 levels, a
minimum of 10 subjects in the study is required; four levels requires 5
subjects per level(??).

Usage:   lfriedmanchisquare(*args)
Returns: chi-square statistic, associated p-value
"""
  k = len(args)
  if k < 3:
    raise ValueError, 'Less than 3 levels.  Friedman test not appropriate.'
  n = len(args[0])
  data = apply(pstat.abut, tuple(args))
  for i in range(len(data)):
    data[i] = rankdata(data[i])
  ssbn = 0
  for i in range(k):
    ssbn = ssbn + sum(args[i])**2
  chisq = 12.0 / (k * n * (k + 1)) * ssbn - 3 * n * (k + 1)
  return chisq, chisqprob(chisq, k - 1)

####################################
####  PROBABILITY CALCULATIONS  ####
####################################


def lchisqprob(chisq, df):
  """
Returns the (1-tailed) probability value associated with the provided
chi-square value and df.  Adapted from chisq.c in Gary Perlman's |Stat.

Usage:   lchisqprob(chisq,df)
"""
  BIG = 20.0

  def ex(x):
    BIG = 20.0
    if x < -BIG:
      return 0.0
    else:
      return math.exp(x)

  if chisq <= 0 or df < 1:
    return 1.0
  a = 0.5 * chisq
  if df % 2 == 0:
    even = 1
  else:
    even = 0
  if df > 1:
    y = ex(-a)
  if even:
    s = y
  else:
    s = 2.0 * zprob(-math.sqrt(chisq))
  if (df > 2):
    chisq = 0.5 * (df - 1.0)
    if even:
      z = 1.0
    else:
      z = 0.5
    if a > BIG:
      if even:
        e = 0.0
      else:
        e = math.log(math.sqrt(math.pi))
      c = math.log(a)
      while (z <= chisq):
        e = math.log(z) + e
        s = s + ex(c * z - a - e)
        z = z + 1.0
      return s
    else:
      if even:
        e = 1.0
      else:
        e = 1.0 / math.sqrt(math.pi) / math.sqrt(a)
      c = 0.0
      while (z <= chisq):
        e = e * (a / float(z))
        c = c + e
        z = z + 1.0
      return (c * y + s)
  else:
    return s


def lerfcc(x):
  """
Returns the complementary error function erfc(x) with fractional
error everywhere less than 1.2e-7.  Adapted from Numerical Recipes.

Usage:   lerfcc(x)
"""
  z = abs(x)
  t = 1.0 / (1.0 + 0.5 * z)
  ans = t * math.exp(-z * z - 1.26551223 + t * (1.00002368 + t * (
      0.37409196 + t * (0.09678418 + t * (-0.18628806 + t * (0.27886807 + t * (
          -1.13520398 + t * (1.48851587 + t * (-0.82215223 + t * 0.17087277)))))
      ))))
  if x >= 0:
    return ans
  else:
    return 2.0 - ans


def lzprob(z):
  """
Returns the area under the normal curve 'to the left of' the given z value.
Thus,
    for z<0, zprob(z) = 1-tail probability
    for z>0, 1.0-zprob(z) = 1-tail probability
    for any z, 2.0*(1.0-zprob(abs(z))) = 2-tail probability
Adapted from z.c in Gary Perlman's |Stat.

Usage:   lzprob(z)
"""
  Z_MAX = 6.0  # maximum meaningful z-value
  if z == 0.0:
    x = 0.0
  else:
    y = 0.5 * math.fabs(z)
    if y >= (Z_MAX * 0.5):
      x = 1.0
    elif (y < 1.0):
      w = y * y
      x = ((
          ((((((0.000124818987 * w - 0.001075204047) * w + 0.005198775019) * w -
              0.019198292004) * w + 0.059054035642) * w - 0.151968751364) * w +
           0.319152932694) * w - 0.531923007300) * w + 0.797884560593) * y * 2.0
    else:
      y = y - 2.0
      x = (((((((
          ((((((-0.000045255659 * y + 0.000152529290) * y - 0.000019538132) * y
              - 0.000676904986) * y + 0.001390604284) * y - 0.000794620820) * y
           - 0.002034254874) * y + 0.006549791214) * y - 0.010557625006) * y +
          0.011630447319) * y - 0.009279453341) * y + 0.005353579108) * y -
          0.002141268741) * y + 0.000535310849) * y + 0.999936657524
  if z > 0.0:
    prob = ((x + 1.0) * 0.5)
  else:
    prob = ((1.0 - x) * 0.5)
  return prob


def lksprob(alam):
  """
Computes a Kolmogorov-Smirnov test significance level.  Adapted from
Numerical Recipes.

Usage:   lksprob(alam)
"""
  fac = 2.0
  sum = 0.0
  termbf = 0.0
  a2 = -2.0 * alam * alam
  for j in range(1, 201):
    term = fac * math.exp(a2 * j * j)
    sum = sum + term
    if math.fabs(term) <= (0.001 * termbf) or math.fabs(term) < (1.0e-8 * sum):
      return sum
    fac = -fac
    termbf = math.fabs(term)
  return 1.0  # Get here only if fails to converge; was 0.0!!


def lfprob(dfnum, dfden, F):
  """
Returns the (1-tailed) significance level (p-value) of an F
statistic given the degrees of freedom for the numerator (dfR-dfF) and
the degrees of freedom for the denominator (dfF).

Usage:   lfprob(dfnum, dfden, F)   where usually dfnum=dfbn, dfden=dfwn
"""
  p = betai(0.5 * dfden, 0.5 * dfnum, dfden / float(dfden + dfnum * F))
  return p


def lbetacf(a, b, x):
  """
This function evaluates the continued fraction form of the incomplete
Beta function, betai.  (Adapted from: Numerical Recipes in C.)

Usage:   lbetacf(a,b,x)
"""
  ITMAX = 200
  EPS = 3.0e-7

  bm = az = am = 1.0
  qab = a + b
  qap = a + 1.0
  qam = a - 1.0
  bz = 1.0 - qab * x / qap
  for i in range(ITMAX + 1):
    em = float(i + 1)
    tem = em + em
    d = em * (b - em) * x / ((qam + tem) * (a + tem))
    ap = az + d * am
    bp = bz + d * bm
    d = -(a + em) * (qab + em) * x / ((qap + tem) * (a + tem))
    app = ap + d * az
    bpp = bp + d * bz
    aold = az
    am = ap / bpp
    bm = bp / bpp
    az = app / bpp
    bz = 1.0
    if (abs(az - aold) < (EPS * abs(az))):
      return az
  print 'a or b too big, or ITMAX too small in Betacf.'
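
# Illustrative sketch (not executed): the incomplete beta function betai
# (defined below, built on lbetacf and lgammln) is what the t-test and
# correlation routines above use to turn a t statistic into a two-tailed
# p-value.  The sample numbers here are made up for illustration only.
#
#   t, df = 2.0, 10
#   p = betai(0.5 * df, 0.5, df / float(df + t * t))   # ~0.073, two-tailed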


def lgammln(xx):
  """
Returns the gamma function of xx.
    Gamma(z) = Integral(0,infinity) of t^(z-1)exp(-t) dt.
(Adapted from: Numerical Recipes in C.)

Usage:   lgammln(xx)
"""
  coeff = [76.18009173, -86.50532033, 24.01409822, -1.231739516, 0.120858003e-2,
           -0.536382e-5]
  x = xx - 1.0
  tmp = x + 5.5
  tmp = tmp - (x + 0.5) * math.log(tmp)
  ser = 1.0
  for j in range(len(coeff)):
    x = x + 1
    ser = ser + coeff[j] / x
  return -tmp + math.log(2.50662827465 * ser)


def lbetai(a, b, x):
  """
Returns the incomplete beta function:

    I-sub-x(a,b) = 1/B(a,b)*(Integral(0,x) of t^(a-1)(1-t)^(b-1) dt)

where a,b>0 and B(a,b) = G(a)*G(b)/(G(a+b)) where G(a) is the gamma
function of a.  The continued fraction formulation is implemented here,
using the betacf function.  (Adapted from: Numerical Recipes in C.)

Usage:   lbetai(a,b,x)
"""
  if (x < 0.0 or x > 1.0):
    raise ValueError, 'Bad x in lbetai'
  if (x == 0.0 or x == 1.0):
    bt = 0.0
  else:
    bt = math.exp(gammln(a + b) - gammln(a) - gammln(b) + a * math.log(x) + b *
                  math.log(1.0 - x))
  if (x < (a + 1.0) / (a + b + 2.0)):
    return bt * betacf(a, b, x) / float(a)
  else:
    return 1.0 - bt * betacf(b, a, 1.0 - x) / float(b)

####################################
#######  ANOVA CALCULATIONS  #######
####################################


def lF_oneway(*lists):
  """
Performs a 1-way ANOVA, returning an F-value and probability given
any number of groups.  From Heiman, pp.394-7.

Usage:   F_oneway(*lists)    where *lists is any number of lists, one per
                             treatment group
Returns: F value, one-tailed p-value
"""
  a = len(lists)  # ANOVA on 'a' groups, each in its own list
  means = [0] * a
  vars = [0] * a
  ns = [0] * a
  alldata = []
  tmp = map(N.array, lists)
  means = map(amean, tmp)
  vars = map(avar, tmp)
  ns = map(len, lists)
  for i in range(len(lists)):
    alldata = alldata + lists[i]
  alldata = N.array(alldata)
  bign = len(alldata)
  sstot = ass(alldata) - (asquare_of_sums(alldata) / float(bign))
  ssbn = 0
  for list in lists:
    ssbn = ssbn + asquare_of_sums(N.array(list)) / float(len(list))
  ssbn = ssbn - (asquare_of_sums(alldata) / float(bign))
  sswn = sstot - ssbn
  dfbn = a - 1
  dfwn = bign - a
  msb = ssbn / float(dfbn)
  msw = sswn / float(dfwn)
  f = msb / msw
  prob = fprob(dfbn, dfwn, f)
  return f, prob


def lF_value(ER, EF, dfnum, dfden):
  """
Returns an F-statistic given the following:
        ER  = error associated with the null hypothesis (the Restricted model)
        EF  = error associated with the alternate hypothesis (the Full model)
        dfR-dfF = degrees of freedom of the numerator
        dfF = degrees of freedom associated with the denominator/Full model

Usage:   lF_value(ER,EF,dfnum,dfden)
"""
  return ((ER - EF) / float(dfnum) / (EF / float(dfden)))

####################################
########  SUPPORT FUNCTIONS  #######
####################################


def writecc(listoflists, file, writetype='w', extra=2):
  """
Writes a list of lists to a file in columns, customized by the max
size of items within the columns (max size of items in col, +2 characters)
to specified file.  File-overwrite is the default.

Usage:   writecc (listoflists,file,writetype='w',extra=2)
Returns: None
"""
  if type(listoflists[0]) not in [ListType, TupleType]:
    listoflists = [listoflists]
  outfile = open(file, writetype)
  rowstokill = []
  list2print = copy.deepcopy(listoflists)
  for i in range(len(listoflists)):
    if listoflists[i] == [
        '\n'
    ] or listoflists[i] == '\n' or listoflists[i] == 'dashes':
      rowstokill = rowstokill + [i]
  rowstokill.reverse()
  for row in rowstokill:
    del list2print[row]
  maxsize = [0] * len(list2print[0])
  for col in range(len(list2print[0])):
    items = pstat.colex(list2print, col)
    items = map(pstat.makestr, items)
    maxsize[col] = max(map(len, items)) + extra
  for row in listoflists:
    if row == ['\n'] or row == '\n':
      outfile.write('\n')
    elif row == ['dashes'] or row == 'dashes':
      dashes = [0] * len(maxsize)
      for j in range(len(maxsize)):
        dashes[j] = '-' * (maxsize[j] - 2)
      outfile.write(pstat.lineincustcols(dashes, maxsize))
    else:
      outfile.write(pstat.lineincustcols(row, maxsize))
    outfile.write('\n')
  outfile.close()
  return None


def lincr(l, cap):  # to increment a list up to a max-list of 'cap'
  """
Simulate a counting system from an n-dimensional list.

Usage:   lincr(l,cap)   l=list to increment, cap=max values for each list pos'n
Returns: next set of values for list l, OR -1 (if overflow)
"""
  l[0] = l[0] + 1  # e.g., [0,0,0] --> [2,4,3] (=cap)
  for i in range(len(l)):
    if l[i] > cap[i] and i < len(l) - 1:  # if carryover AND not done
      l[i] = 0
      l[i + 1] = l[i + 1] + 1
    elif l[i] > cap[i] and i == len(
        l) - 1:  # overflow past last column, must be finished
      l = -1
  return l


def lsum(inlist):
  """
Returns the sum of the items in the passed list.

Usage:   lsum(inlist)
"""
  s = 0
  for item in inlist:
    s = s + item
  return s


def lcumsum(inlist):
  """
Returns a list consisting of the cumulative sum of the items in the
passed list.

Usage:   lcumsum(inlist)
"""
  newlist = copy.deepcopy(inlist)
  for i in range(1, len(newlist)):
    newlist[i] = newlist[i] + newlist[i - 1]
  return newlist


def lss(inlist):
  """
Squares each value in the passed list, adds up these squares and
returns the result.

Usage:   lss(inlist)
"""
  ss = 0
  for item in inlist:
    ss = ss + item * item
  return ss


def lsummult(list1, list2):
  """
Multiplies elements in list1 and list2, element by element, and
returns the sum of all resulting multiplications.  Must provide equal
length lists.

Usage:   lsummult(list1,list2)
"""
  if len(list1) <> len(list2):
    raise ValueError, 'Lists not equal length in summult.'
  s = 0
  for item1, item2 in pstat.abut(list1, list2):
    s = s + item1 * item2
  return s


def lsumdiffsquared(x, y):
  """
Takes pairwise differences of the values in lists x and y, squares
these differences, and returns the sum of these squares.

Usage:   lsumdiffsquared(x,y)
Returns: sum[(x[i]-y[i])**2]
"""
  sds = 0
  for i in range(len(x)):
    sds = sds + (x[i] - y[i])**2
  return sds


def lsquare_of_sums(inlist):
  """
Adds the values in the passed list, squares the sum, and returns
the result.

Usage:   lsquare_of_sums(inlist)
Returns: sum(inlist[i])**2
"""
  s = sum(inlist)
  return float(s) * s


def lshellsort(inlist):
  """
Shellsort algorithm.  Sorts a 1D-list.

Usage:   lshellsort(inlist)
Returns: sorted-inlist, sorting-index-vector (for original list)
"""
  n = len(inlist)
  svec = copy.deepcopy(inlist)
  ivec = range(n)
  gap = n / 2  # integer division needed
  while gap > 0:
    for i in range(gap, n):
      for j in range(i - gap, -1, -gap):
        while j >= 0 and svec[j] > svec[j + gap]:
          temp = svec[j]
          svec[j] = svec[j + gap]
          svec[j + gap] = temp
          itemp = ivec[j]
          ivec[j] = ivec[j + gap]
          ivec[j + gap] = itemp
    gap = gap / 2  # integer division needed
  # svec is now sorted inlist, and ivec has the order svec[i] = vec[ivec[i]]
  return svec, ivec


def lrankdata(inlist):
  """
Ranks the data in inlist, dealing with ties appropriately.  Assumes
a 1D inlist.  Adapted from Gary Perlman's |Stat ranksort.

Usage:   lrankdata(inlist)
Returns: a list of length equal to inlist, containing rank scores
"""
  n = len(inlist)
  svec, ivec = shellsort(inlist)
  sumranks = 0
  dupcount = 0
  newlist = [0] * n
  for i in range(n):
    sumranks = sumranks + i
    dupcount = dupcount + 1
    if i == n - 1 or svec[i] <> svec[i + 1]:
      averank = sumranks / float(dupcount) + 1
      for j in range(i - dupcount + 1, i + 1):
        newlist[ivec[j]] = averank
      sumranks = 0
      dupcount = 0
  return newlist


def outputpairedstats(fname, writemode, name1, n1, m1, se1, min1, max1, name2,
                      n2, m2, se2, min2, max2, statname, stat, prob):
  """
Prints or writes to a file stats for two groups, using the name, n,
mean, sterr, min and max for each group, as well as the statistic name,
its value, and the associated p-value.

Usage:   outputpairedstats(fname,writemode,
                           name1,n1,mean1,stderr1,min1,max1,
                           name2,n2,mean2,stderr2,min2,max2,
                           statname,stat,prob)
Returns: None
"""
  suffix = ''  # for *s after the p-value
  try:
    x = prob.shape
    prob = prob[0]
  except:
    pass
  if prob < 0.001:
    suffix = '  ***'
  elif prob < 0.01:
    suffix = '  **'
  elif prob < 0.05:
    suffix = '  *'
  title = [['Name', 'N', 'Mean', 'SD', 'Min', 'Max']]
  lofl = title + [[name1, n1, round(m1, 3), round(
      math.sqrt(se1), 3), min1, max1], [name2, n2, round(m2, 3), round(
          math.sqrt(se2), 3), min2, max2]]
  if type(fname) <> StringType or len(fname) == 0:
    print
    print statname
    print
    pstat.printcc(lofl)
    print
    try:
      if stat.shape == ():
        stat = stat[0]
      if prob.shape == ():
        prob = prob[0]
    except:
      pass
    print 'Test statistic = ', round(stat, 3), '   p = ', round(prob, 3), suffix
    print
  else:
    file = open(fname, writemode)
    file.write('\n' + statname + '\n\n')
    file.close()
    writecc(lofl, fname, 'a')
    file = open(fname, 'a')
    try:
      if stat.shape == ():
        stat = stat[0]
      if prob.shape == ():
        prob = prob[0]
    except:
      pass
    file.write(pstat.list2string(['\nTest statistic = ', round(stat, 4),
                                  '   p = ', round(prob, 4), suffix, '\n\n']))
    file.close()
  return None


def lfindwithin(data):
  """
Returns an integer representing a binary vector, where 1=within-
subject factor, 0=between.
Input equals the entire data 2D list (i.e., 1889column 0=random factor, column -1=measured values (those two are skipped). 1890Note: input data is in |Stat format ... a list of lists ("2D list") with 1891one row per measured value, first column=subject identifier, last column= 1892score, one in-between column per factor (these columns contain level 1893designations on each factor). See also stats.anova.__doc__. 1894 1895Usage: lfindwithin(data) data in |Stat format 1896""" 1897 1898 numfact = len(data[0]) - 1 1899 withinvec = 0 1900 for col in range(1, numfact): 1901 examplelevel = pstat.unique(pstat.colex(data, col))[0] 1902 rows = pstat.linexand(data, col, examplelevel) # get 1 level of this factor 1903 factsubjs = pstat.unique(pstat.colex(rows, 0)) 1904 allsubjs = pstat.unique(pstat.colex(data, 0)) 1905 if len(factsubjs) == len(allsubjs): # fewer Ss than scores on this factor? 1906 withinvec = withinvec + (1 << col) 1907 return withinvec 1908 1909######################################################### 1910######################################################### 1911####### DISPATCH LISTS AND TUPLES TO ABOVE FCNS ######### 1912######################################################### 1913######################################################### 1914 1915## CENTRAL TENDENCY: 1916geometricmean = Dispatch((lgeometricmean, (ListType, TupleType)),) 1917harmonicmean = Dispatch((lharmonicmean, (ListType, TupleType)),) 1918mean = Dispatch((lmean, (ListType, TupleType)),) 1919median = Dispatch((lmedian, (ListType, TupleType)),) 1920medianscore = Dispatch((lmedianscore, (ListType, TupleType)),) 1921mode = Dispatch((lmode, (ListType, TupleType)),) 1922 1923## MOMENTS: 1924moment = Dispatch((lmoment, (ListType, TupleType)),) 1925variation = Dispatch((lvariation, (ListType, TupleType)),) 1926skew = Dispatch((lskew, (ListType, TupleType)),) 1927kurtosis = Dispatch((lkurtosis, (ListType, TupleType)),) 1928describe = Dispatch((ldescribe, (ListType, TupleType)),) 1929 1930## FREQUENCY STATISTICS: 1931itemfreq = Dispatch((litemfreq, (ListType, TupleType)),) 1932scoreatpercentile = Dispatch((lscoreatpercentile, (ListType, TupleType)),) 1933percentileofscore = Dispatch((lpercentileofscore, (ListType, TupleType)),) 1934histogram = Dispatch((lhistogram, (ListType, TupleType)),) 1935cumfreq = Dispatch((lcumfreq, (ListType, TupleType)),) 1936relfreq = Dispatch((lrelfreq, (ListType, TupleType)),) 1937 1938## VARIABILITY: 1939obrientransform = Dispatch((lobrientransform, (ListType, TupleType)),) 1940samplevar = Dispatch((lsamplevar, (ListType, TupleType)),) 1941samplestdev = Dispatch((lsamplestdev, (ListType, TupleType)),) 1942var = Dispatch((lvar, (ListType, TupleType)),) 1943stdev = Dispatch((lstdev, (ListType, TupleType)),) 1944sterr = Dispatch((lsterr, (ListType, TupleType)),) 1945sem = Dispatch((lsem, (ListType, TupleType)),) 1946z = Dispatch((lz, (ListType, TupleType)),) 1947zs = Dispatch((lzs, (ListType, TupleType)),) 1948 1949## TRIMMING FCNS: 1950trimboth = Dispatch((ltrimboth, (ListType, TupleType)),) 1951trim1 = Dispatch((ltrim1, (ListType, TupleType)),) 1952 1953## CORRELATION FCNS: 1954paired = Dispatch((lpaired, (ListType, TupleType)),) 1955pearsonr = Dispatch((lpearsonr, (ListType, TupleType)),) 1956spearmanr = Dispatch((lspearmanr, (ListType, TupleType)),) 1957pointbiserialr = Dispatch((lpointbiserialr, (ListType, TupleType)),) 1958kendalltau = Dispatch((lkendalltau, (ListType, TupleType)),) 1959linregress = Dispatch((llinregress, (ListType, TupleType)),) 1960 1961## INFERENTIAL STATS: 
1962ttest_1samp = Dispatch((lttest_1samp, (ListType, TupleType)),) 1963ttest_ind = Dispatch((lttest_ind, (ListType, TupleType)),) 1964ttest_rel = Dispatch((lttest_rel, (ListType, TupleType)),) 1965chisquare = Dispatch((lchisquare, (ListType, TupleType)),) 1966ks_2samp = Dispatch((lks_2samp, (ListType, TupleType)),) 1967mannwhitneyu = Dispatch((lmannwhitneyu, (ListType, TupleType)),) 1968ranksums = Dispatch((lranksums, (ListType, TupleType)),) 1969tiecorrect = Dispatch((ltiecorrect, (ListType, TupleType)),) 1970wilcoxont = Dispatch((lwilcoxont, (ListType, TupleType)),) 1971kruskalwallish = Dispatch((lkruskalwallish, (ListType, TupleType)),) 1972friedmanchisquare = Dispatch((lfriedmanchisquare, (ListType, TupleType)),) 1973 1974## PROBABILITY CALCS: 1975chisqprob = Dispatch((lchisqprob, (IntType, FloatType)),) 1976zprob = Dispatch((lzprob, (IntType, FloatType)),) 1977ksprob = Dispatch((lksprob, (IntType, FloatType)),) 1978fprob = Dispatch((lfprob, (IntType, FloatType)),) 1979betacf = Dispatch((lbetacf, (IntType, FloatType)),) 1980betai = Dispatch((lbetai, (IntType, FloatType)),) 1981erfcc = Dispatch((lerfcc, (IntType, FloatType)),) 1982gammln = Dispatch((lgammln, (IntType, FloatType)),) 1983 1984## ANOVA FUNCTIONS: 1985F_oneway = Dispatch((lF_oneway, (ListType, TupleType)),) 1986F_value = Dispatch((lF_value, (ListType, TupleType)),) 1987 1988## SUPPORT FUNCTIONS: 1989incr = Dispatch((lincr, (ListType, TupleType)),) 1990sum = Dispatch((lsum, (ListType, TupleType)),) 1991cumsum = Dispatch((lcumsum, (ListType, TupleType)),) 1992ss = Dispatch((lss, (ListType, TupleType)),) 1993summult = Dispatch((lsummult, (ListType, TupleType)),) 1994square_of_sums = Dispatch((lsquare_of_sums, (ListType, TupleType)),) 1995sumdiffsquared = Dispatch((lsumdiffsquared, (ListType, TupleType)),) 1996shellsort = Dispatch((lshellsort, (ListType, TupleType)),) 1997rankdata = Dispatch((lrankdata, (ListType, TupleType)),) 1998findwithin = Dispatch((lfindwithin, (ListType, TupleType)),) 1999 2000#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2001#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2002#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2003#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2004#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2005#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2006#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2007#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2008#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2009#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2010#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2011#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2012#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2013#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2014#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2015#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2016#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2017#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2018#============= THE ARRAY-VERSION OF THE STATS FUNCTIONS =============== 2019 2020try: # DEFINE THESE *ONLY* IF NUMERIC IS AVAILABLE 2021 import numpy as N 2022 import numpy.linalg as LA 

  #####################################
  ########  ACENTRAL TENDENCY  ########
  #####################################


  def ageometricmean(inarray, dimension=None, keepdims=0):
    """
Calculates the geometric mean of the values in the passed array.
That is: n-th root of (x1 * x2 * ... * xn). Defaults to ALL values in
the passed array. Use dimension=None to flatten array first. REMEMBER: if
dimension=0, it collapses over dimension 0 ('rows' in a 2D array) only, and
if dimension is a sequence, it collapses over all specified dimensions. If
keepdims is set to 1, the resulting array will have as many dimensions as
inarray, with only 1 'level' per dim that was collapsed over.

Usage: ageometricmean(inarray,dimension=None,keepdims=0)
Returns: geometric mean computed over dim(s) listed in dimension
"""
    inarray = N.array(inarray, N.float_)
    if dimension == None:
      inarray = N.ravel(inarray)
      size = len(inarray)
      mult = N.power(inarray, 1.0 / size)
      mult = N.multiply.reduce(mult)
    elif type(dimension) in [IntType, FloatType]:
      size = inarray.shape[dimension]
      mult = N.power(inarray, 1.0 / size)
      mult = N.multiply.reduce(mult, dimension)
      if keepdims == 1:
        shp = list(inarray.shape)
        shp[dimension] = 1
        mult = N.reshape(mult, shp)  # keep the collapsed dim as a length-1 axis
    else:  # must be a SEQUENCE of dims to average over
      dims = list(dimension)
      dims.sort()
      dims.reverse()
      size = N.array(N.multiply.reduce(N.take(inarray.shape, dims)), N.float_)
      mult = N.power(inarray, 1.0 / size)
      for dim in dims:
        mult = N.multiply.reduce(mult, dim)
      if keepdims == 1:
        shp = list(inarray.shape)
        for dim in dims:
          shp[dim] = 1
        mult = N.reshape(mult, shp)
    return mult

  def aharmonicmean(inarray, dimension=None, keepdims=0):
    """
Calculates the harmonic mean of the values in the passed array.
That is: n / (1/x1 + 1/x2 + ... + 1/xn). Defaults to ALL values in
the passed array. Use dimension=None to flatten array first. REMEMBER: if
dimension=0, it collapses over dimension 0 ('rows' in a 2D array) only, and
if dimension is a sequence, it collapses over all specified dimensions. If
keepdims is set to 1, the resulting array will have as many dimensions as
inarray, with only 1 'level' per dim that was collapsed over.
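For example, aharmonicmean(N.array([1., 2., 4.])) returns
3 / (1/1 + 1/2 + 1/4), roughly 1.714.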
2080 2081Usage: aharmonicmean(inarray,dimension=None,keepdims=0) 2082Returns: harmonic mean computed over dim(s) in dimension 2083""" 2084 inarray = inarray.astype(N.float_) 2085 if dimension == None: 2086 inarray = N.ravel(inarray) 2087 size = len(inarray) 2088 s = N.add.reduce(1.0 / inarray) 2089 elif type(dimension) in [IntType, FloatType]: 2090 size = float(inarray.shape[dimension]) 2091 s = N.add.reduce(1.0 / inarray, dimension) 2092 if keepdims == 1: 2093 shp = list(inarray.shape) 2094 shp[dimension] = 1 2095 s = N.reshape(s, shp) 2096 else: # must be a SEQUENCE of dims to average over 2097 dims = list(dimension) 2098 dims.sort() 2099 nondims = [] 2100 for i in range(len(inarray.shape)): 2101 if i not in dims: 2102 nondims.append(i) 2103 tinarray = N.transpose(inarray, nondims + dims) # put keep-dims first 2104 idx = [0] * len(nondims) 2105 if idx == []: 2106 size = len(N.ravel(inarray)) 2107 s = asum(1.0 / inarray) 2108 if keepdims == 1: 2109 s = N.reshape([s], N.ones(len(inarray.shape))) 2110 else: 2111 idx[0] = -1 2112 loopcap = N.array(tinarray.shape[0:len(nondims)]) - 1 2113 s = N.zeros(loopcap + 1, N.float_) 2114 while incr(idx, loopcap) <> -1: 2115 s[idx] = asum(1.0 / tinarray[idx]) 2116 size = N.multiply.reduce(N.take(inarray.shape, dims)) 2117 if keepdims == 1: 2118 shp = list(inarray.shape) 2119 for dim in dims: 2120 shp[dim] = 1 2121 s = N.reshape(s, shp) 2122 return size / s 2123 2124 def amean(inarray, dimension=None, keepdims=0): 2125 """ 2126Calculates the arithmatic mean of the values in the passed array. 2127That is: 1/n * (x1 + x2 + ... + xn). Defaults to ALL values in the 2128passed array. Use dimension=None to flatten array first. REMEMBER: if 2129dimension=0, it collapses over dimension 0 ('rows' in a 2D array) only, and 2130if dimension is a sequence, it collapses over all specified dimensions. If 2131keepdims is set to 1, the resulting array will have as many dimensions as 2132inarray, with only 1 'level' per dim that was collapsed over. 2133 2134Usage: amean(inarray,dimension=None,keepdims=0) 2135Returns: arithematic mean calculated over dim(s) in dimension 2136""" 2137 if inarray.dtype in [N.int_, N.short, N.ubyte]: 2138 inarray = inarray.astype(N.float_) 2139 if dimension == None: 2140 inarray = N.ravel(inarray) 2141 sum = N.add.reduce(inarray) 2142 denom = float(len(inarray)) 2143 elif type(dimension) in [IntType, FloatType]: 2144 sum = asum(inarray, dimension) 2145 denom = float(inarray.shape[dimension]) 2146 if keepdims == 1: 2147 shp = list(inarray.shape) 2148 shp[dimension] = 1 2149 sum = N.reshape(sum, shp) 2150 else: # must be a TUPLE of dims to average over 2151 dims = list(dimension) 2152 dims.sort() 2153 dims.reverse() 2154 sum = inarray * 1.0 2155 for dim in dims: 2156 sum = N.add.reduce(sum, dim) 2157 denom = N.array(N.multiply.reduce(N.take(inarray.shape, dims)), N.float_) 2158 if keepdims == 1: 2159 shp = list(inarray.shape) 2160 for dim in dims: 2161 shp[dim] = 1 2162 sum = N.reshape(sum, shp) 2163 return sum / denom 2164 2165 def amedian(inarray, numbins=1000): 2166 """ 2167Calculates the COMPUTED median value of an array of numbers, given the 2168number of bins to use for the histogram (more bins approaches finding the 2169precise median value of the array; default number of bins = 1000). From 2170G.W. Heiman's Basic Stats, or CRC Probability & Statistics. 2171NOTE: THIS ROUTINE ALWAYS uses the entire passed array (flattens it first). 
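For example, amedian(N.array([1., 2., 3., 4., 5.])) comes out at approximately
3.0; the exact value depends on the binning.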
2172 2173Usage: amedian(inarray,numbins=1000) 2174Returns: median calculated over ALL values in inarray 2175""" 2176 inarray = N.ravel(inarray) 2177 (hist, smallest, binsize, extras) = ahistogram(inarray, numbins, 2178 [min(inarray), max(inarray)]) 2179 cumhist = N.cumsum(hist) # make cumulative histogram 2180 otherbins = N.greater_equal(cumhist, len(inarray) / 2.0) 2181 otherbins = list(otherbins) # list of 0/1s, 1s start at median bin 2182 cfbin = otherbins.index(1) # get 1st(!) index holding 50%ile score 2183 LRL = smallest + binsize * cfbin # get lower read limit of that bin 2184 cfbelow = N.add.reduce(hist[0:cfbin]) # cum. freq. below bin 2185 freq = hist[cfbin] # frequency IN the 50%ile bin 2186 median = LRL + ( 2187 (len(inarray) / 2.0 - cfbelow) / float(freq)) * binsize # MEDIAN 2188 return median 2189 2190 def amedianscore(inarray, dimension=None): 2191 """ 2192Returns the 'middle' score of the passed array. If there is an even 2193number of scores, the mean of the 2 middle scores is returned. Can function 2194with 1D arrays, or on the FIRST dimension of 2D arrays (i.e., dimension can 2195be None, to pre-flatten the array, or else dimension must equal 0). 2196 2197Usage: amedianscore(inarray,dimension=None) 2198Returns: 'middle' score of the array, or the mean of the 2 middle scores 2199""" 2200 if dimension == None: 2201 inarray = N.ravel(inarray) 2202 dimension = 0 2203 inarray = N.sort(inarray, dimension) 2204 if inarray.shape[dimension] % 2 == 0: # if even number of elements 2205 indx = inarray.shape[dimension] / 2 # integer division correct 2206 median = N.asarray(inarray[indx] + inarray[indx - 1]) / 2.0 2207 else: 2208 indx = inarray.shape[dimension] / 2 # integer division correct 2209 median = N.take(inarray, [indx], dimension) 2210 if median.shape == (1,): 2211 median = median[0] 2212 return median 2213 2214 def amode(a, dimension=None): 2215 """ 2216Returns an array of the modal (most common) score in the passed array. 2217If there is more than one such score, ONLY THE FIRST is returned. 2218The bin-count for the modal values is also returned. Operates on whole 2219array (dimension=None), or on a given dimension. 2220 2221Usage: amode(a, dimension=None) 2222Returns: array of bin-counts for mode(s), array of corresponding modal values 2223""" 2224 2225 if dimension == None: 2226 a = N.ravel(a) 2227 dimension = 0 2228 scores = pstat.aunique(N.ravel(a)) # get ALL unique values 2229 testshape = list(a.shape) 2230 testshape[dimension] = 1 2231 oldmostfreq = N.zeros(testshape) 2232 oldcounts = N.zeros(testshape) 2233 for score in scores: 2234 template = N.equal(a, score) 2235 counts = asum(template, dimension, 1) 2236 mostfrequent = N.where(counts > oldcounts, score, oldmostfreq) 2237 oldcounts = N.where(counts > oldcounts, counts, oldcounts) 2238 oldmostfreq = mostfrequent 2239 return oldcounts, mostfrequent 2240 2241 def atmean(a, limits=None, inclusive=(1, 1)): 2242 """ 2243Returns the arithmetic mean of all values in an array, ignoring values 2244strictly outside the sequence passed to 'limits'. Note: either limit 2245in the sequence, or the value of limits itself, can be set to None. The 2246inclusive list/tuple determines whether the lower and upper limiting bounds 2247(respectively) are open/exclusive (0) or closed/inclusive (1). 
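For example, atmean(N.array([1., 2., 3., 10.]), (0, 5)) ignores the 10. and
returns 2.0.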

Usage: atmean(a,limits=None,inclusive=(1,1))
"""
    if a.dtype in [N.int_, N.short, N.ubyte]:
      a = a.astype(N.float_)
    if limits == None:
      return mean(a)
    assert type(limits) in [ListType, TupleType, N.ndarray
                           ], 'Wrong type for limits in atmean'
    if inclusive[0]:
      lowerfcn = N.greater_equal
    else:
      lowerfcn = N.greater
    if inclusive[1]:
      upperfcn = N.less_equal
    else:
      upperfcn = N.less
    if limits[0] > N.maximum.reduce(N.ravel(a)) or limits[1] < N.minimum.reduce(
        N.ravel(a)):
      raise ValueError, 'No array values within given limits (atmean).'
    elif limits[0] == None and limits[1] <> None:
      mask = upperfcn(a, limits[1])
    elif limits[0] <> None and limits[1] == None:
      mask = lowerfcn(a, limits[0])
    elif limits[0] <> None and limits[1] <> None:
      mask = lowerfcn(a, limits[0]) * upperfcn(a, limits[1])
    s = float(N.add.reduce(N.ravel(a * mask)))
    n = float(N.add.reduce(N.ravel(mask)))
    return s / n

  def atvar(a, limits=None, inclusive=(1, 1)):
    """
Returns the sample variance of values in an array (i.e., using N-1),
ignoring values strictly outside the sequence passed to 'limits'.
Note: either limit in the sequence, or the value of limits itself,
can be set to None. The inclusive list/tuple determines whether the lower
and upper limiting bounds (respectively) are open/exclusive (0) or
closed/inclusive (1). ASSUMES A FLAT ARRAY (OR ELSE PREFLATTENS).

Usage: atvar(a,limits=None,inclusive=(1,1))
"""
    a = a.astype(N.float_)
    if limits == None or limits == [None, None]:
      return avar(a)
    assert type(limits) in [ListType, TupleType, N.ndarray
                           ], 'Wrong type for limits in atvar'
    if inclusive[0]:
      lowerfcn = N.greater_equal
    else:
      lowerfcn = N.greater
    if inclusive[1]:
      upperfcn = N.less_equal
    else:
      upperfcn = N.less
    if limits[0] > N.maximum.reduce(N.ravel(a)) or limits[1] < N.minimum.reduce(
        N.ravel(a)):
      raise ValueError, 'No array values within given limits (atvar).'
    elif limits[0] == None and limits[1] <> None:
      mask = upperfcn(a, limits[1])
    elif limits[0] <> None and limits[1] == None:
      mask = lowerfcn(a, limits[0])
    elif limits[0] <> None and limits[1] <> None:
      mask = lowerfcn(a, limits[0]) * upperfcn(a, limits[1])

    a = N.compress(mask, a)  # squish out excluded values
    return avar(a)

  def atmin(a, lowerlimit=None, dimension=None, inclusive=1):
    """
Returns the minimum value of a, along dimension, including only values greater
than (or equal to, if inclusive=1) lowerlimit. If the limit is set to None,
all values in the array are used.

Usage: atmin(a,lowerlimit=None,dimension=None,inclusive=1)
"""
    if inclusive:
      lowerfcn = N.greater
    else:
      lowerfcn = N.greater_equal
    if dimension == None:
      a = N.ravel(a)
      dimension = 0
    if lowerlimit == None:
      lowerlimit = N.minimum.reduce(N.ravel(a)) - 11
    biggest = N.maximum.reduce(N.ravel(a))
    ta = N.where(lowerfcn(a, lowerlimit), a, biggest)
    return N.minimum.reduce(ta, dimension)

  def atmax(a, upperlimit, dimension=None, inclusive=1):
    """
Returns the maximum value of a, along dimension, including only values less
than (or equal to, if inclusive=1) upperlimit. If the limit is set to None,
a limit larger than the max value in the array is used.
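For example, atmax(N.array([1, 5, 3, 9]), 6) ignores the 9 and returns 5.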
2341 2342Usage: atmax(a,upperlimit,dimension=None,inclusive=1) 2343""" 2344 if inclusive: 2345 upperfcn = N.less 2346 else: 2347 upperfcn = N.less_equal 2348 if dimension == None: 2349 a = N.ravel(a) 2350 dimension = 0 2351 if upperlimit == None: 2352 upperlimit = N.maximum.reduce(N.ravel(a)) + 1 2353 smallest = N.minimum.reduce(N.ravel(a)) 2354 ta = N.where(upperfcn(a, upperlimit), a, smallest) 2355 return N.maximum.reduce(ta, dimension) 2356 2357 def atstdev(a, limits=None, inclusive=(1, 1)): 2358 """ 2359Returns the standard deviation of all values in an array, ignoring values 2360strictly outside the sequence passed to 'limits'. Note: either limit 2361in the sequence, or the value of limits itself, can be set to None. The 2362inclusive list/tuple determines whether the lower and upper limiting bounds 2363(respectively) are open/exclusive (0) or closed/inclusive (1). 2364 2365Usage: atstdev(a,limits=None,inclusive=(1,1)) 2366""" 2367 return N.sqrt(tvar(a, limits, inclusive)) 2368 2369 def atsem(a, limits=None, inclusive=(1, 1)): 2370 """ 2371Returns the standard error of the mean for the values in an array, 2372(i.e., using N for the denominator), ignoring values strictly outside 2373the sequence passed to 'limits'. Note: either limit in the sequence, 2374or the value of limits itself, can be set to None. The inclusive list/tuple 2375determines whether the lower and upper limiting bounds (respectively) are 2376open/exclusive (0) or closed/inclusive (1). 2377 2378Usage: atsem(a,limits=None,inclusive=(1,1)) 2379""" 2380 sd = tstdev(a, limits, inclusive) 2381 if limits == None or limits == [None, None]: 2382 n = float(len(N.ravel(a))) 2383 limits = [min(a) - 1, max(a) + 1] 2384 assert type(limits) in [ListType, TupleType, N.ndarray 2385 ], 'Wrong type for limits in atsem' 2386 if inclusive[0]: 2387 lowerfcn = N.greater_equal 2388 else: 2389 lowerfcn = N.greater 2390 if inclusive[1]: 2391 upperfcn = N.less_equal 2392 else: 2393 upperfcn = N.less 2394 if limits[0] > N.maximum.reduce(N.ravel(a)) or limits[1] < N.minimum.reduce( 2395 N.ravel(a)): 2396 raise ValueError, 'No array values within given limits (atsem).' 2397 elif limits[0] == None and limits[1] <> None: 2398 mask = upperfcn(a, limits[1]) 2399 elif limits[0] <> None and limits[1] == None: 2400 mask = lowerfcn(a, limits[0]) 2401 elif limits[0] <> None and limits[1] <> None: 2402 mask = lowerfcn(a, limits[0]) * upperfcn(a, limits[1]) 2403 term1 = N.add.reduce(N.ravel(a * a * mask)) 2404 n = float(N.add.reduce(N.ravel(mask))) 2405 return sd / math.sqrt(n) 2406 2407##################################### 2408############ AMOMENTS ############# 2409##################################### 2410 2411 def amoment(a, moment=1, dimension=None): 2412 """ 2413Calculates the nth moment about the mean for a sample (defaults to the 24141st moment). Generally used to calculate coefficients of skewness and 2415kurtosis. Dimension can equal None (ravel array first), an integer 2416(the dimension over which to operate), or a sequence (operate over 2417multiple dimensions). 
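For example, amoment(N.array([1., 2., 3., 4.]), 2) returns 1.25, the second
moment about the mean (the variance computed with an N denominator).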
2418 2419Usage: amoment(a,moment=1,dimension=None) 2420Returns: appropriate moment along given dimension 2421""" 2422 if dimension == None: 2423 a = N.ravel(a) 2424 dimension = 0 2425 if moment == 1: 2426 return 0.0 2427 else: 2428 mn = amean(a, dimension, 1) # 1=keepdims 2429 s = N.power((a - mn), moment) 2430 return amean(s, dimension) 2431 2432 def avariation(a, dimension=None): 2433 """ 2434Returns the coefficient of variation, as defined in CRC Standard 2435Probability and Statistics, p.6. Dimension can equal None (ravel array 2436first), an integer (the dimension over which to operate), or a 2437sequence (operate over multiple dimensions). 2438 2439Usage: avariation(a,dimension=None) 2440""" 2441 return 100.0 * asamplestdev(a, dimension) / amean(a, dimension) 2442 2443 def askew(a, dimension=None): 2444 """ 2445Returns the skewness of a distribution (normal ==> 0.0; >0 means extra 2446weight in left tail). Use askewtest() to see if it's close enough. 2447Dimension can equal None (ravel array first), an integer (the 2448dimension over which to operate), or a sequence (operate over multiple 2449dimensions). 2450 2451Usage: askew(a, dimension=None) 2452Returns: skew of vals in a along dimension, returning ZERO where all vals equal 2453""" 2454 denom = N.power(amoment(a, 2, dimension), 1.5) 2455 zero = N.equal(denom, 0) 2456 if type(denom) == N.ndarray and asum(zero) <> 0: 2457 print 'Number of zeros in askew: ', asum(zero) 2458 denom = denom + zero # prevent divide-by-zero 2459 return N.where(zero, 0, amoment(a, 3, dimension) / denom) 2460 2461 def akurtosis(a, dimension=None): 2462 """ 2463Returns the kurtosis of a distribution (normal ==> 3.0; >3 means 2464heavier in the tails, and usually more peaked). Use akurtosistest() 2465to see if it's close enough. Dimension can equal None (ravel array 2466first), an integer (the dimension over which to operate), or a 2467sequence (operate over multiple dimensions). 2468 2469Usage: akurtosis(a,dimension=None) 2470Returns: kurtosis of values in a along dimension, and ZERO where all vals equal 2471""" 2472 denom = N.power(amoment(a, 2, dimension), 2) 2473 zero = N.equal(denom, 0) 2474 if type(denom) == N.ndarray and asum(zero) <> 0: 2475 print 'Number of zeros in akurtosis: ', asum(zero) 2476 denom = denom + zero # prevent divide-by-zero 2477 return N.where(zero, 0, amoment(a, 4, dimension) / denom) 2478 2479 def adescribe(inarray, dimension=None): 2480 """ 2481Returns several descriptive statistics of the passed array. Dimension 2482can equal None (ravel array first), an integer (the dimension over 2483which to operate), or a sequence (operate over multiple dimensions). 2484 2485Usage: adescribe(inarray,dimension=None) 2486Returns: n, (min,max), mean, standard deviation, skew, kurtosis 2487""" 2488 if dimension == None: 2489 inarray = N.ravel(inarray) 2490 dimension = 0 2491 n = inarray.shape[dimension] 2492 mm = (N.minimum.reduce(inarray), N.maximum.reduce(inarray)) 2493 m = amean(inarray, dimension) 2494 sd = astdev(inarray, dimension) 2495 skew = askew(inarray, dimension) 2496 kurt = akurtosis(inarray, dimension) 2497 return n, mm, m, sd, skew, kurt 2498 2499##################################### 2500######## NORMALITY TESTS ########## 2501##################################### 2502 2503 def askewtest(a, dimension=None): 2504 """ 2505Tests whether the skew is significantly different from a normal 2506distribution. 
Dimension can equal None (ravel array first), an 2507integer (the dimension over which to operate), or a sequence (operate 2508over multiple dimensions). 2509 2510Usage: askewtest(a,dimension=None) 2511Returns: z-score and 2-tail z-probability 2512""" 2513 if dimension == None: 2514 a = N.ravel(a) 2515 dimension = 0 2516 b2 = askew(a, dimension) 2517 n = float(a.shape[dimension]) 2518 y = b2 * N.sqrt(((n + 1) * (n + 3)) / (6.0 * (n - 2))) 2519 beta2 = (3.0 * (n * n + 27 * n - 70) * (n + 1) * 2520 (n + 3)) / ((n - 2.0) * (n + 5) * (n + 7) * (n + 9)) 2521 W2 = -1 + N.sqrt(2 * (beta2 - 1)) 2522 delta = 1 / N.sqrt(N.log(N.sqrt(W2))) 2523 alpha = N.sqrt(2 / (W2 - 1)) 2524 y = N.where(y == 0, 1, y) 2525 Z = delta * N.log(y / alpha + N.sqrt((y / alpha)**2 + 1)) 2526 return Z, (1.0 - zprob(Z)) * 2 2527 2528 def akurtosistest(a, dimension=None): 2529 """ 2530Tests whether a dataset has normal kurtosis (i.e., 2531kurtosis=3(n-1)/(n+1)) Valid only for n>20. Dimension can equal None 2532(ravel array first), an integer (the dimension over which to operate), 2533or a sequence (operate over multiple dimensions). 2534 2535Usage: akurtosistest(a,dimension=None) 2536Returns: z-score and 2-tail z-probability, returns 0 for bad pixels 2537""" 2538 if dimension == None: 2539 a = N.ravel(a) 2540 dimension = 0 2541 n = float(a.shape[dimension]) 2542 if n < 20: 2543 print 'akurtosistest only valid for n>=20 ... continuing anyway, n=', n 2544 b2 = akurtosis(a, dimension) 2545 E = 3.0 * (n - 1) / (n + 1) 2546 varb2 = 24.0 * n * (n - 2) * (n - 3) / ((n + 1) * (n + 1) * (n + 3) * 2547 (n + 5)) 2548 x = (b2 - E) / N.sqrt(varb2) 2549 sqrtbeta1 = 6.0 * (n * n - 5 * n + 2) / ((n + 7) * (n + 9)) * N.sqrt( 2550 (6.0 * (n + 3) * (n + 5)) / (n * (n - 2) * (n - 3))) 2551 A = 6.0 + 8.0 / sqrtbeta1 * (2.0 / sqrtbeta1 + 2552 N.sqrt(1 + 4.0 / (sqrtbeta1**2))) 2553 term1 = 1 - 2 / (9.0 * A) 2554 denom = 1 + x * N.sqrt(2 / (A - 4.0)) 2555 denom = N.where(N.less(denom, 0), 99, denom) 2556 term2 = N.where( 2557 N.equal(denom, 0), term1, N.power( 2558 (1 - 2.0 / A) / denom, 1 / 3.0)) 2559 Z = (term1 - term2) / N.sqrt(2 / (9.0 * A)) 2560 Z = N.where(N.equal(denom, 99), 0, Z) 2561 return Z, (1.0 - zprob(Z)) * 2 2562 2563 def anormaltest(a, dimension=None): 2564 """ 2565Tests whether skew and/OR kurtosis of dataset differs from normal 2566curve. Can operate over multiple dimensions. Dimension can equal 2567None (ravel array first), an integer (the dimension over which to 2568operate), or a sequence (operate over multiple dimensions). 2569 2570Usage: anormaltest(a,dimension=None) 2571Returns: z-score and 2-tail probability 2572""" 2573 if dimension == None: 2574 a = N.ravel(a) 2575 dimension = 0 2576 s, p = askewtest(a, dimension) 2577 k, p = akurtosistest(a, dimension) 2578 k2 = N.power(s, 2) + N.power(k, 2) 2579 return k2, achisqprob(k2, 2) 2580 2581##################################### 2582###### AFREQUENCY FUNCTIONS ####### 2583##################################### 2584 2585 def aitemfreq(a): 2586 """ 2587Returns a 2D array of item frequencies. Column 1 contains item values, 2588column 2 contains their respective counts. Assumes a 1D array is passed. 2589@@@sorting OK? 
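For example, aitemfreq(N.array([1, 1, 2, 3, 3, 3])) returns the rows
(1,2), (2,1), (3,3): each unique score paired with its count.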
2590 2591Usage: aitemfreq(a) 2592Returns: a 2D frequency table (col [0:n-1]=scores, col n=frequencies) 2593""" 2594 scores = pstat.aunique(a) 2595 scores = N.sort(scores) 2596 freq = N.zeros(len(scores)) 2597 for i in range(len(scores)): 2598 freq[i] = N.add.reduce(N.equal(a, scores[i])) 2599 return N.array(pstat.aabut(scores, freq)) 2600 2601 def ascoreatpercentile(inarray, percent): 2602 """ 2603Usage: ascoreatpercentile(inarray,percent) 0<percent<100 2604Returns: score at given percentile, relative to inarray distribution 2605""" 2606 percent = percent / 100.0 2607 targetcf = percent * len(inarray) 2608 h, lrl, binsize, extras = histogram(inarray) 2609 cumhist = cumsum(h * 1) 2610 for i in range(len(cumhist)): 2611 if cumhist[i] >= targetcf: 2612 break 2613 score = binsize * ( 2614 (targetcf - cumhist[i - 1]) / float(h[i])) + (lrl + binsize * i) 2615 return score 2616 2617 def apercentileofscore(inarray, score, histbins=10, defaultlimits=None): 2618 """ 2619Note: result of this function depends on the values used to histogram 2620the data(!). 2621 2622Usage: apercentileofscore(inarray,score,histbins=10,defaultlimits=None) 2623Returns: percentile-position of score (0-100) relative to inarray 2624""" 2625 h, lrl, binsize, extras = histogram(inarray, histbins, defaultlimits) 2626 cumhist = cumsum(h * 1) 2627 i = int((score - lrl) / float(binsize)) 2628 pct = (cumhist[i - 1] + ((score - (lrl + binsize * i)) / float(binsize)) * 2629 h[i]) / float(len(inarray)) * 100 2630 return pct 2631 2632 def ahistogram(inarray, numbins=10, defaultlimits=None, printextras=1): 2633 """ 2634Returns (i) an array of histogram bin counts, (ii) the smallest value 2635of the histogram binning, and (iii) the bin width (the last 2 are not 2636necessarily integers). Default number of bins is 10. Defaultlimits 2637can be None (the routine picks bins spanning all the numbers in the 2638inarray) or a 2-sequence (lowerlimit, upperlimit). Returns all of the 2639following: array of bin values, lowerreallimit, binsize, extrapoints. 2640 2641Usage: ahistogram(inarray,numbins=10,defaultlimits=None,printextras=1) 2642Returns: (array of bin counts, bin-minimum, min-width, #-points-outside-range) 2643""" 2644 inarray = N.ravel(inarray) # flatten any >1D arrays 2645 if (defaultlimits <> None): 2646 lowerreallimit = defaultlimits[0] 2647 upperreallimit = defaultlimits[1] 2648 binsize = (upperreallimit - lowerreallimit) / float(numbins) 2649 else: 2650 Min = N.minimum.reduce(inarray) 2651 Max = N.maximum.reduce(inarray) 2652 estbinwidth = float(Max - Min) / float(numbins) + 1e-6 2653 binsize = (Max - Min + estbinwidth) / float(numbins) 2654 lowerreallimit = Min - binsize / 2.0 #lower real limit,1st bin 2655 bins = N.zeros(numbins) 2656 extrapoints = 0 2657 for num in inarray: 2658 try: 2659 if (num - lowerreallimit) < 0: 2660 extrapoints = extrapoints + 1 2661 else: 2662 bintoincrement = int((num - lowerreallimit) / float(binsize)) 2663 bins[bintoincrement] = bins[bintoincrement] + 1 2664 except: # point outside lower/upper limits 2665 extrapoints = extrapoints + 1 2666 if (extrapoints > 0 and printextras == 1): 2667 print '\nPoints outside given histogram range =', extrapoints 2668 return (bins, lowerreallimit, binsize, extrapoints) 2669 2670 def acumfreq(a, numbins=10, defaultreallimits=None): 2671 """ 2672Returns a cumulative frequency histogram, using the histogram function. 2673Defaultreallimits can be None (use all data), or a 2-sequence containing 2674lower and upper limits on values to include. 
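When defaultreallimits is None every point falls inside the histogram range,
so the last entry of the returned array equals len(a).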
2675 2676Usage: acumfreq(a,numbins=10,defaultreallimits=None) 2677Returns: array of cumfreq bin values, lowerreallimit, binsize, extrapoints 2678""" 2679 h, l, b, e = histogram(a, numbins, defaultreallimits) 2680 cumhist = cumsum(h * 1) 2681 return cumhist, l, b, e 2682 2683 def arelfreq(a, numbins=10, defaultreallimits=None): 2684 """ 2685Returns a relative frequency histogram, using the histogram function. 2686Defaultreallimits can be None (use all data), or a 2-sequence containing 2687lower and upper limits on values to include. 2688 2689Usage: arelfreq(a,numbins=10,defaultreallimits=None) 2690Returns: array of cumfreq bin values, lowerreallimit, binsize, extrapoints 2691""" 2692 h, l, b, e = histogram(a, numbins, defaultreallimits) 2693 h = N.array(h / float(a.shape[0])) 2694 return h, l, b, e 2695 2696##################################### 2697###### AVARIABILITY FUNCTIONS ##### 2698##################################### 2699 2700 def aobrientransform(*args): 2701 """ 2702Computes a transform on input data (any number of columns). Used to 2703test for homogeneity of variance prior to running one-way stats. Each 2704array in *args is one level of a factor. If an F_oneway() run on the 2705transformed data and found significant, variances are unequal. From 2706Maxwell and Delaney, p.112. 2707 2708Usage: aobrientransform(*args) *args = 1D arrays, one per level of factor 2709Returns: transformed data for use in an ANOVA 2710""" 2711 TINY = 1e-10 2712 k = len(args) 2713 n = N.zeros(k, N.float_) 2714 v = N.zeros(k, N.float_) 2715 m = N.zeros(k, N.float_) 2716 nargs = [] 2717 for i in range(k): 2718 nargs.append(args[i].astype(N.float_)) 2719 n[i] = float(len(nargs[i])) 2720 v[i] = var(nargs[i]) 2721 m[i] = mean(nargs[i]) 2722 for j in range(k): 2723 for i in range(n[j]): 2724 t1 = (n[j] - 1.5) * n[j] * (nargs[j][i] - m[j])**2 2725 t2 = 0.5 * v[j] * (n[j] - 1.0) 2726 t3 = (n[j] - 1.0) * (n[j] - 2.0) 2727 nargs[j][i] = (t1 - t2) / float(t3) 2728 check = 1 2729 for j in range(k): 2730 if v[j] - mean(nargs[j]) > TINY: 2731 check = 0 2732 if check <> 1: 2733 raise ValueError, 'Lack of convergence in obrientransform.' 2734 else: 2735 return N.array(nargs) 2736 2737 def asamplevar(inarray, dimension=None, keepdims=0): 2738 """ 2739Returns the sample standard deviation of the values in the passed 2740array (i.e., using N). Dimension can equal None (ravel array first), 2741an integer (the dimension over which to operate), or a sequence 2742(operate over multiple dimensions). Set keepdims=1 to return an array 2743with the same number of dimensions as inarray. 2744 2745Usage: asamplevar(inarray,dimension=None,keepdims=0) 2746""" 2747 if dimension == None: 2748 inarray = N.ravel(inarray) 2749 dimension = 0 2750 if dimension == 1: 2751 mn = amean(inarray, dimension)[:, N.NewAxis] 2752 else: 2753 mn = amean(inarray, dimension, keepdims=1) 2754 deviations = inarray - mn 2755 if type(dimension) == ListType: 2756 n = 1 2757 for d in dimension: 2758 n = n * inarray.shape[d] 2759 else: 2760 n = inarray.shape[dimension] 2761 svar = ass(deviations, dimension, keepdims) / float(n) 2762 return svar 2763 2764 def asamplestdev(inarray, dimension=None, keepdims=0): 2765 """ 2766Returns the sample standard deviation of the values in the passed 2767array (i.e., using N). Dimension can equal None (ravel array first), 2768an integer (the dimension over which to operate), or a sequence 2769(operate over multiple dimensions). Set keepdims=1 to return an array 2770with the same number of dimensions as inarray. 
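For example, asamplestdev(N.array([1., 2., 3., 4.])) returns sqrt(1.25),
about 1.118.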
2771 2772Usage: asamplestdev(inarray,dimension=None,keepdims=0) 2773""" 2774 return N.sqrt(asamplevar(inarray, dimension, keepdims)) 2775 2776 def asignaltonoise(instack, dimension=0): 2777 """ 2778Calculates signal-to-noise. Dimension can equal None (ravel array 2779first), an integer (the dimension over which to operate), or a 2780sequence (operate over multiple dimensions). 2781 2782Usage: asignaltonoise(instack,dimension=0): 2783Returns: array containing the value of (mean/stdev) along dimension, 2784 or 0 when stdev=0 2785""" 2786 m = mean(instack, dimension) 2787 sd = stdev(instack, dimension) 2788 return N.where(sd == 0, 0, m / sd) 2789 2790 def acov(x, y, dimension=None, keepdims=0): 2791 """ 2792Returns the estimated covariance of the values in the passed 2793array (i.e., N-1). Dimension can equal None (ravel array first), an 2794integer (the dimension over which to operate), or a sequence (operate 2795over multiple dimensions). Set keepdims=1 to return an array with the 2796same number of dimensions as inarray. 2797 2798Usage: acov(x,y,dimension=None,keepdims=0) 2799""" 2800 if dimension == None: 2801 x = N.ravel(x) 2802 y = N.ravel(y) 2803 dimension = 0 2804 xmn = amean(x, dimension, 1) # keepdims 2805 xdeviations = x - xmn 2806 ymn = amean(y, dimension, 1) # keepdims 2807 ydeviations = y - ymn 2808 if type(dimension) == ListType: 2809 n = 1 2810 for d in dimension: 2811 n = n * x.shape[d] 2812 else: 2813 n = x.shape[dimension] 2814 covar = N.sum(xdeviations * ydeviations) / float(n - 1) 2815 return covar 2816 2817 def avar(inarray, dimension=None, keepdims=0): 2818 """ 2819Returns the estimated population variance of the values in the passed 2820array (i.e., N-1). Dimension can equal None (ravel array first), an 2821integer (the dimension over which to operate), or a sequence (operate 2822over multiple dimensions). Set keepdims=1 to return an array with the 2823same number of dimensions as inarray. 2824 2825Usage: avar(inarray,dimension=None,keepdims=0) 2826""" 2827 if dimension == None: 2828 inarray = N.ravel(inarray) 2829 dimension = 0 2830 mn = amean(inarray, dimension, 1) 2831 deviations = inarray - mn 2832 if type(dimension) == ListType: 2833 n = 1 2834 for d in dimension: 2835 n = n * inarray.shape[d] 2836 else: 2837 n = inarray.shape[dimension] 2838 var = ass(deviations, dimension, keepdims) / float(n - 1) 2839 return var 2840 2841 def astdev(inarray, dimension=None, keepdims=0): 2842 """ 2843Returns the estimated population standard deviation of the values in 2844the passed array (i.e., N-1). Dimension can equal None (ravel array 2845first), an integer (the dimension over which to operate), or a 2846sequence (operate over multiple dimensions). Set keepdims=1 to return 2847an array with the same number of dimensions as inarray. 2848 2849Usage: astdev(inarray,dimension=None,keepdims=0) 2850""" 2851 return N.sqrt(avar(inarray, dimension, keepdims)) 2852 2853 def asterr(inarray, dimension=None, keepdims=0): 2854 """ 2855Returns the estimated population standard error of the values in the 2856passed array (i.e., N-1). Dimension can equal None (ravel array 2857first), an integer (the dimension over which to operate), or a 2858sequence (operate over multiple dimensions). Set keepdims=1 to return 2859an array with the same number of dimensions as inarray. 
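For example, asterr(N.array([1., 2., 3., 4.])) returns sqrt(5/3)/sqrt(4),
about 0.645.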

Usage: asterr(inarray,dimension=None,keepdims=0)
"""
    if dimension == None:
      inarray = N.ravel(inarray)
      dimension = 0
    return astdev(inarray, dimension,
                  keepdims) / float(N.sqrt(inarray.shape[dimension]))

  def asem(inarray, dimension=None, keepdims=0):
    """
Returns the standard error of the mean (i.e., using N) of the values
in the passed array. Dimension can equal None (ravel array first), an
integer (the dimension over which to operate), or a sequence (operate
over multiple dimensions). Set keepdims=1 to return an array with the
same number of dimensions as inarray.

Usage: asem(inarray,dimension=None,keepdims=0)
"""
    if dimension == None:
      inarray = N.ravel(inarray)
      dimension = 0
    if type(dimension) == ListType:
      n = 1
      for d in dimension:
        n = n * inarray.shape[d]
    else:
      n = inarray.shape[dimension]
    s = asamplestdev(inarray, dimension, keepdims) / N.sqrt(n - 1)
    return s

  def az(a, score):
    """
Returns the z-score of a given input score, given the array from which
that score came. Not appropriate for population calculations, nor for
arrays > 1D.

Usage: az(a, score)
"""
    z = (score - amean(a)) / asamplestdev(a)
    return z

  def azs(a):
    """
Returns a 1D array of z-scores, one for each score in the passed array,
computed relative to the passed array.

Usage: azs(a)
"""
    zscores = []
    for item in a:
      zscores.append(z(a, item))
    return N.array(zscores)

  def azmap(scores, compare, dimension=0):
    """
Returns an array of z-scores the shape of scores (e.g., [x,y]), compared to
array passed to compare (e.g., [time,x,y]). Assumes collapsing over dim 0
of the compare array.

Usage: azmap(scores, compare, dimension=0)
"""
    mns = amean(compare, dimension)
    sstd = asamplestdev(compare, 0)
    return (scores - mns) / sstd

  #####################################
  #######  ATRIMMING FUNCTIONS  #######
  #####################################

  ## deleted around() as it's in numpy now

  def athreshold(a, threshmin=None, threshmax=None, newval=0):
    """
Like Numeric.clip() except that values <threshmin or >threshmax are replaced
by newval instead of by threshmin/threshmax (respectively).

Usage: athreshold(a,threshmin=None,threshmax=None,newval=0)
Returns: a, with values <threshmin or >threshmax replaced with newval
"""
    mask = N.zeros(a.shape)
    if threshmin <> None:
      mask = mask + N.where(a < threshmin, 1, 0)
    if threshmax <> None:
      mask = mask + N.where(a > threshmax, 1, 0)
    mask = N.clip(mask, 0, 1)
    return N.where(mask, newval, a)

  def atrimboth(a, proportiontocut):
    """
Slices off the passed proportion of items from BOTH ends of the passed
array (i.e., with proportiontocut=0.1, slices 'leftmost' 10% AND
'rightmost' 10% of scores). You must pre-sort the array if you want
"proper" trimming. Slices off LESS if proportion results in a
non-integer slice index (i.e., conservatively slices off
proportiontocut).
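For example, atrimboth(N.arange(10), 0.1) drops one value from each end and
returns elements 1 through 8.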
2956 2957Usage: atrimboth (a,proportiontocut) 2958Returns: trimmed version of array a 2959""" 2960 lowercut = int(proportiontocut * len(a)) 2961 uppercut = len(a) - lowercut 2962 return a[lowercut:uppercut] 2963 2964 def atrim1(a, proportiontocut, tail='right'): 2965 """ 2966Slices off the passed proportion of items from ONE end of the passed 2967array (i.e., if proportiontocut=0.1, slices off 'leftmost' or 'rightmost' 296810% of scores). Slices off LESS if proportion results in a non-integer 2969slice index (i.e., conservatively slices off proportiontocut). 2970 2971Usage: atrim1(a,proportiontocut,tail='right') or set tail='left' 2972Returns: trimmed version of array a 2973""" 2974 if string.lower(tail) == 'right': 2975 lowercut = 0 2976 uppercut = len(a) - int(proportiontocut * len(a)) 2977 elif string.lower(tail) == 'left': 2978 lowercut = int(proportiontocut * len(a)) 2979 uppercut = len(a) 2980 return a[lowercut:uppercut] 2981 2982##################################### 2983##### ACORRELATION FUNCTIONS ###### 2984##################################### 2985 2986 def acovariance(X): 2987 """ 2988Computes the covariance matrix of a matrix X. Requires a 2D matrix input. 2989 2990Usage: acovariance(X) 2991Returns: covariance matrix of X 2992""" 2993 if len(X.shape) <> 2: 2994 raise TypeError, 'acovariance requires 2D matrices' 2995 n = X.shape[0] 2996 mX = amean(X, 0) 2997 return N.dot(N.transpose(X), X) / float(n) - N.multiply.outer(mX, mX) 2998 2999 def acorrelation(X): 3000 """ 3001Computes the correlation matrix of a matrix X. Requires a 2D matrix input. 3002 3003Usage: acorrelation(X) 3004Returns: correlation matrix of X 3005""" 3006 C = acovariance(X) 3007 V = N.diagonal(C) 3008 return C / N.sqrt(N.multiply.outer(V, V)) 3009 3010 def apaired(x, y): 3011 """ 3012Interactively determines the type of data in x and y, and then runs the 3013appropriated statistic for paired group data. 
3014 3015Usage: apaired(x,y) x,y = the two arrays of values to be compared 3016Returns: appropriate statistic name, value, and probability 3017""" 3018 samples = '' 3019 while samples not in ['i', 'r', 'I', 'R', 'c', 'C']: 3020 print '\nIndependent or related samples, or correlation (i,r,c): ', 3021 samples = raw_input() 3022 3023 if samples in ['i', 'I', 'r', 'R']: 3024 print '\nComparing variances ...', 3025 # USE O'BRIEN'S TEST FOR HOMOGENEITY OF VARIANCE, Maxwell & delaney, p.112 3026 r = obrientransform(x, y) 3027 f, p = F_oneway(pstat.colex(r, 0), pstat.colex(r, 1)) 3028 if p < 0.05: 3029 vartype = 'unequal, p=' + str(round(p, 4)) 3030 else: 3031 vartype = 'equal' 3032 print vartype 3033 if samples in ['i', 'I']: 3034 if vartype[0] == 'e': 3035 t, p = ttest_ind(x, y, None, 0) 3036 print '\nIndependent samples t-test: ', round(t, 4), round(p, 4) 3037 else: 3038 if len(x) > 20 or len(y) > 20: 3039 z, p = ranksums(x, y) 3040 print '\nRank Sums test (NONparametric, n>20): ', round( 3041 z, 4), round(p, 4) 3042 else: 3043 u, p = mannwhitneyu(x, y) 3044 print '\nMann-Whitney U-test (NONparametric, ns<20): ', round( 3045 u, 4), round(p, 4) 3046 3047 else: # RELATED SAMPLES 3048 if vartype[0] == 'e': 3049 t, p = ttest_rel(x, y, 0) 3050 print '\nRelated samples t-test: ', round(t, 4), round(p, 4) 3051 else: 3052 t, p = ranksums(x, y) 3053 print '\nWilcoxon T-test (NONparametric): ', round(t, 4), round(p, 4) 3054 else: # CORRELATION ANALYSIS 3055 corrtype = '' 3056 while corrtype not in ['c', 'C', 'r', 'R', 'd', 'D']: 3057 print '\nIs the data Continuous, Ranked, or Dichotomous (c,r,d): ', 3058 corrtype = raw_input() 3059 if corrtype in ['c', 'C']: 3060 m, b, r, p, see = linregress(x, y) 3061 print '\nLinear regression for continuous variables ...' 3062 lol = [ 3063 ['Slope', 'Intercept', 'r', 'Prob', 'SEestimate'], 3064 [round(m, 4), round(b, 4), round(r, 4), round(p, 4), round(see, 4)] 3065 ] 3066 pstat.printcc(lol) 3067 elif corrtype in ['r', 'R']: 3068 r, p = spearmanr(x, y) 3069 print '\nCorrelation for ranked variables ...' 3070 print "Spearman's r: ", round(r, 4), round(p, 4) 3071 else: # DICHOTOMOUS 3072 r, p = pointbiserialr(x, y) 3073 print '\nAssuming x contains a dichotomous variable ...' 3074 print 'Point Biserial r: ', round(r, 4), round(p, 4) 3075 print '\n\n' 3076 return None 3077 3078 def dices(x, y): 3079 """ 3080Calculates Dice's coefficient ... (2*number of common terms)/(number of terms in 3081x + 3082number of terms in y). Returns a value between 0 (orthogonal) and 1. 3083 3084Usage: dices(x,y) 3085""" 3086 import sets 3087 x = sets.Set(x) 3088 y = sets.Set(y) 3089 common = len(x.intersection(y)) 3090 total = float(len(x) + len(y)) 3091 return 2 * common / total 3092 3093 def icc(x, y=None, verbose=0): 3094 """ 3095Calculates intraclass correlation coefficients using simple, Type I sums of 3096squares. 3097If only one variable is passed, assumed it's an Nx2 matrix 3098 3099Usage: icc(x,y=None,verbose=0) 3100Returns: icc rho, prob ####PROB IS A GUESS BASED ON PEARSON 3101""" 3102 TINY = 1.0e-20 3103 if y: 3104 all = N.concatenate([x, y], 0) 3105 else: 3106 all = x + 0 3107 x = all[:, 0] 3108 y = all[:, 1] 3109 totalss = ass(all - mean(all)) 3110 pairmeans = (x + y) / 2. 
3111 withinss = ass(x - pairmeans) + ass(y - pairmeans) 3112 withindf = float(len(x)) 3113 betwdf = float(len(x) - 1) 3114 withinms = withinss / withindf 3115 betweenms = (totalss - withinss) / betwdf 3116 rho = (betweenms - withinms) / (withinms + betweenms) 3117 t = rho * math.sqrt(betwdf / ((1.0 - rho + TINY) * (1.0 + rho + TINY))) 3118 prob = abetai(0.5 * betwdf, 0.5, betwdf / (betwdf + t * t), verbose) 3119 return rho, prob 3120 3121 def alincc(x, y): 3122 """ 3123Calculates Lin's concordance correlation coefficient. 3124 3125Usage: alincc(x,y) where x, y are equal-length arrays 3126Returns: Lin's CC 3127""" 3128 x = N.ravel(x) 3129 y = N.ravel(y) 3130 covar = acov(x, y) * (len(x) - 1) / float(len(x)) # correct denom to n 3131 xvar = avar(x) * (len(x) - 1) / float(len(x)) # correct denom to n 3132 yvar = avar(y) * (len(y) - 1) / float(len(y)) # correct denom to n 3133 lincc = (2 * covar) / ((xvar + yvar) + ((amean(x) - amean(y))**2)) 3134 return lincc 3135 3136 def apearsonr(x, y, verbose=1): 3137 """ 3138Calculates a Pearson correlation coefficient and returns p. Taken 3139from Heiman's Basic Statistics for the Behav. Sci (2nd), p.195. 3140 3141Usage: apearsonr(x,y,verbose=1) where x,y are equal length arrays 3142Returns: Pearson's r, two-tailed p-value 3143""" 3144 TINY = 1.0e-20 3145 n = len(x) 3146 xmean = amean(x) 3147 ymean = amean(y) 3148 r_num = n * (N.add.reduce(x * y)) - N.add.reduce(x) * N.add.reduce(y) 3149 r_den = math.sqrt((n * ass(x) - asquare_of_sums(x)) * 3150 (n * ass(y) - asquare_of_sums(y))) 3151 r = (r_num / r_den) 3152 df = n - 2 3153 t = r * math.sqrt(df / ((1.0 - r + TINY) * (1.0 + r + TINY))) 3154 prob = abetai(0.5 * df, 0.5, df / (df + t * t), verbose) 3155 return r, prob 3156 3157 def aspearmanr(x, y): 3158 """ 3159Calculates a Spearman rank-order correlation coefficient. Taken 3160from Heiman's Basic Statistics for the Behav. Sci (1st), p.192. 3161 3162Usage: aspearmanr(x,y) where x,y are equal-length arrays 3163Returns: Spearman's r, two-tailed p-value 3164""" 3165 TINY = 1e-30 3166 n = len(x) 3167 rankx = rankdata(x) 3168 ranky = rankdata(y) 3169 dsq = N.add.reduce((rankx - ranky)**2) 3170 rs = 1 - 6 * dsq / float(n * (n**2 - 1)) 3171 t = rs * math.sqrt((n - 2) / ((rs + 1.0) * (1.0 - rs))) 3172 df = n - 2 3173 probrs = abetai(0.5 * df, 0.5, df / (df + t * t)) 3174 # probability values for rs are from part 2 of the spearman function in 3175 # Numerical Recipies, p.510. They close to tables, but not exact.(?) 3176 return rs, probrs 3177 3178 def apointbiserialr(x, y): 3179 """ 3180Calculates a point-biserial correlation coefficient and the associated 3181probability value. Taken from Heiman's Basic Statistics for the Behav. 3182Sci (1st), p.194. 
3183 3184Usage: apointbiserialr(x,y) where x,y are equal length arrays 3185Returns: Point-biserial r, two-tailed p-value 3186""" 3187 TINY = 1e-30 3188 categories = pstat.aunique(x) 3189 data = pstat.aabut(x, y) 3190 if len(categories) <> 2: 3191 raise ValueError, ('Exactly 2 categories required (in x) for ' 3192 'pointbiserialr().') 3193 else: # there are 2 categories, continue 3194 codemap = pstat.aabut(categories, N.arange(2)) 3195 recoded = pstat.arecode(data, codemap, 0) 3196 x = pstat.alinexand(data, 0, categories[0]) 3197 y = pstat.alinexand(data, 0, categories[1]) 3198 xmean = amean(pstat.acolex(x, 1)) 3199 ymean = amean(pstat.acolex(y, 1)) 3200 n = len(data) 3201 adjust = math.sqrt((len(x) / float(n)) * (len(y) / float(n))) 3202 rpb = (ymean - xmean) / asamplestdev(pstat.acolex(data, 1)) * adjust 3203 df = n - 2 3204 t = rpb * math.sqrt(df / ((1.0 - rpb + TINY) * (1.0 + rpb + TINY))) 3205 prob = abetai(0.5 * df, 0.5, df / (df + t * t)) 3206 return rpb, prob 3207 3208 def akendalltau(x, y): 3209 """ 3210Calculates Kendall's tau ... correlation of ordinal data. Adapted 3211from function kendl1 in Numerical Recipies. Needs good test-cases.@@@ 3212 3213Usage: akendalltau(x,y) 3214Returns: Kendall's tau, two-tailed p-value 3215""" 3216 n1 = 0 3217 n2 = 0 3218 iss = 0 3219 for j in range(len(x) - 1): 3220 for k in range(j, len(y)): 3221 a1 = x[j] - x[k] 3222 a2 = y[j] - y[k] 3223 aa = a1 * a2 3224 if (aa): # neither array has a tie 3225 n1 = n1 + 1 3226 n2 = n2 + 1 3227 if aa > 0: 3228 iss = iss + 1 3229 else: 3230 iss = iss - 1 3231 else: 3232 if (a1): 3233 n1 = n1 + 1 3234 else: 3235 n2 = n2 + 1 3236 tau = iss / math.sqrt(n1 * n2) 3237 svar = (4.0 * len(x) + 10.0) / (9.0 * len(x) * (len(x) - 1)) 3238 z = tau / math.sqrt(svar) 3239 prob = erfcc(abs(z) / 1.4142136) 3240 return tau, prob 3241 3242 def alinregress(*args): 3243 """ 3244Calculates a regression line on two arrays, x and y, corresponding to x,y 3245pairs. If a single 2D array is passed, alinregress finds dim with 2 levels 3246and splits data into x,y pairs along that dim. 3247 3248Usage: alinregress(*args) args=2 equal-length arrays, or one 2D array 3249Returns: slope, intercept, r, two-tailed prob, sterr-of-the-estimate, n 3250""" 3251 TINY = 1.0e-20 3252 if len(args) == 1: # more than 1D array? 3253 args = args[0] 3254 if len(args) == 2: 3255 x = args[0] 3256 y = args[1] 3257 else: 3258 x = args[:, 0] 3259 y = args[:, 1] 3260 else: 3261 x = args[0] 3262 y = args[1] 3263 n = len(x) 3264 xmean = amean(x) 3265 ymean = amean(y) 3266 r_num = n * (N.add.reduce(x * y)) - N.add.reduce(x) * N.add.reduce(y) 3267 r_den = math.sqrt((n * ass(x) - asquare_of_sums(x)) * 3268 (n * ass(y) - asquare_of_sums(y))) 3269 r = r_num / r_den 3270 z = 0.5 * math.log((1.0 + r + TINY) / (1.0 - r + TINY)) 3271 df = n - 2 3272 t = r * math.sqrt(df / ((1.0 - r + TINY) * (1.0 + r + TINY))) 3273 prob = abetai(0.5 * df, 0.5, df / (df + t * t)) 3274 slope = r_num / (float(n) * ass(x) - asquare_of_sums(x)) 3275 intercept = ymean - slope * xmean 3276 sterrest = math.sqrt(1 - r * r) * asamplestdev(y) 3277 return slope, intercept, r, prob, sterrest, n 3278 3279 def amasslinregress(*args): 3280 """ 3281Calculates a regression line on one 1D array (x) and one N-D array (y). 3282 3283Returns: slope, intercept, r, two-tailed prob, sterr-of-the-estimate, n 3284""" 3285 TINY = 1.0e-20 3286 if len(args) == 1: # more than 1D array? 
3287 args = args[0] 3288 if len(args) == 2: 3289 x = N.ravel(args[0]) 3290 y = args[1] 3291 else: 3292 x = N.ravel(args[:, 0]) 3293 y = args[:, 1] 3294 else: 3295 x = args[0] 3296 y = args[1] 3297 x = x.astype(N.float_) 3298 y = y.astype(N.float_) 3299 n = len(x) 3300 xmean = amean(x) 3301 ymean = amean(y, 0) 3302 shp = N.ones(len(y.shape)) 3303 shp[0] = len(x) 3304 x.shape = shp 3305 print x.shape, y.shape 3306 r_num = n * (N.add.reduce(x * y, 0)) - N.add.reduce(x) * N.add.reduce(y, 0) 3307 r_den = N.sqrt((n * ass(x) - asquare_of_sums(x)) * 3308 (n * ass(y, 0) - asquare_of_sums(y, 0))) 3309 zerodivproblem = N.equal(r_den, 0) 3310 r_den = N.where(zerodivproblem, 1, r_den 3311 ) # avoid zero-division in 1st place 3312 r = r_num / r_den # need to do this nicely for matrix division 3313 r = N.where(zerodivproblem, 0.0, r) 3314 z = 0.5 * N.log((1.0 + r + TINY) / (1.0 - r + TINY)) 3315 df = n - 2 3316 t = r * N.sqrt(df / ((1.0 - r + TINY) * (1.0 + r + TINY))) 3317 prob = abetai(0.5 * df, 0.5, df / (df + t * t)) 3318 3319 ss = float(n) * ass(x) - asquare_of_sums(x) 3320 s_den = N.where(ss == 0, 1, ss) # avoid zero-division in 1st place 3321 slope = r_num / s_den 3322 intercept = ymean - slope * xmean 3323 sterrest = N.sqrt(1 - r * r) * asamplestdev(y, 0) 3324 return slope, intercept, r, prob, sterrest, n 3325 3326##################################### 3327##### AINFERENTIAL STATISTICS ##### 3328##################################### 3329 3330 def attest_1samp(a, popmean, printit=0, name='Sample', writemode='a'): 3331 """ 3332Calculates the t-obtained for the independent samples T-test on ONE group 3333of scores a, given a population mean. If printit=1, results are printed 3334to the screen. If printit='filename', the results are output to 'filename' 3335using the given writemode (default=append). Returns t-value, and prob. 3336 3337Usage: attest_1samp(a,popmean,Name='Sample',printit=0,writemode='a') 3338Returns: t-value, two-tailed prob 3339""" 3340 if type(a) != N.ndarray: 3341 a = N.array(a) 3342 x = amean(a) 3343 v = avar(a) 3344 n = len(a) 3345 df = n - 1 3346 svar = ((n - 1) * v) / float(df) 3347 t = (x - popmean) / math.sqrt(svar * (1.0 / n)) 3348 prob = abetai(0.5 * df, 0.5, df / (df + t * t)) 3349 3350 if printit <> 0: 3351 statname = 'Single-sample T-test.' 3352 outputpairedstats(printit, writemode, 'Population', '--', popmean, 0, 0, 3353 0, name, n, x, v, N.minimum.reduce(N.ravel(a)), 3354 N.maximum.reduce(N.ravel(a)), statname, t, prob) 3355 return t, prob 3356 3357 def attest_ind(a, 3358 b, 3359 dimension=None, 3360 printit=0, 3361 name1='Samp1', 3362 name2='Samp2', 3363 writemode='a'): 3364 """ 3365Calculates the t-obtained T-test on TWO INDEPENDENT samples of scores 3366a, and b. From Numerical Recipies, p.483. If printit=1, results are 3367printed to the screen. If printit='filename', the results are output 3368to 'filename' using the given writemode (default=append). Dimension 3369can equal None (ravel array first), or an integer (the dimension over 3370which to operate on a and b). 
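For example, attest_ind(N.array([1., 2., 3.]), N.array([2., 3., 4.])) returns
a t of about -1.22 (df=4) along with the two-tailed p-value.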
3371 3372Usage: attest_ind (a,b,dimension=None,printit=0, 3373 Name1='Samp1',Name2='Samp2',writemode='a') 3374Returns: t-value, two-tailed p-value 3375""" 3376 if dimension == None: 3377 a = N.ravel(a) 3378 b = N.ravel(b) 3379 dimension = 0 3380 x1 = amean(a, dimension) 3381 x2 = amean(b, dimension) 3382 v1 = avar(a, dimension) 3383 v2 = avar(b, dimension) 3384 n1 = a.shape[dimension] 3385 n2 = b.shape[dimension] 3386 df = n1 + n2 - 2 3387 svar = ((n1 - 1) * v1 + (n2 - 1) * v2) / float(df) 3388 zerodivproblem = N.equal(svar, 0) 3389 svar = N.where(zerodivproblem, 1, svar) # avoid zero-division in 1st place 3390 t = (x1 - x2) / N.sqrt(svar * 3391 (1.0 / n1 + 1.0 / n2)) # N-D COMPUTATION HERE!!!!!! 3392 t = N.where(zerodivproblem, 1.0, t) # replace NaN/wrong t-values with 1.0 3393 probs = abetai(0.5 * df, 0.5, float(df) / (df + t * t)) 3394 3395 if type(t) == N.ndarray: 3396 probs = N.reshape(probs, t.shape) 3397 if probs.shape == (1,): 3398 probs = probs[0] 3399 3400 if printit <> 0: 3401 if type(t) == N.ndarray: 3402 t = t[0] 3403 if type(probs) == N.ndarray: 3404 probs = probs[0] 3405 statname = 'Independent samples T-test.' 3406 outputpairedstats(printit, writemode, name1, n1, x1, v1, 3407 N.minimum.reduce(N.ravel(a)), 3408 N.maximum.reduce(N.ravel(a)), name2, n2, x2, v2, 3409 N.minimum.reduce(N.ravel(b)), 3410 N.maximum.reduce(N.ravel(b)), statname, t, probs) 3411 return 3412 return t, probs 3413 3414 def ap2t(pval, df): 3415 """ 3416Tries to compute a t-value from a p-value (or pval array) and associated df. 3417SLOW for large numbers of elements(!) as it re-computes p-values 20 times 3418(smaller step-sizes) at which point it decides it's done. Keeps the signs 3419of the input array. Returns 1000 (or -1000) if t>100. 3420 3421Usage: ap2t(pval,df) 3422Returns: an array of t-values with the shape of pval 3423 """ 3424 pval = N.array(pval) 3425 signs = N.sign(pval) 3426 pval = abs(pval) 3427 t = N.ones(pval.shape, N.float_) * 50 3428 step = N.ones(pval.shape, N.float_) * 25 3429 print 'Initial ap2t() prob calc' 3430 prob = abetai(0.5 * df, 0.5, float(df) / (df + t * t)) 3431 print 'ap2t() iter: ', 3432 for i in range(10): 3433 print i, ' ', 3434 t = N.where(pval < prob, t + step, t - step) 3435 prob = abetai(0.5 * df, 0.5, float(df) / (df + t * t)) 3436 step = step / 2 3437 print 3438 # since this is an ugly hack, we get ugly boundaries 3439 t = N.where(t > 99.9, 1000, t) # hit upper-boundary 3440 t = t + signs 3441 return t #, prob, pval 3442 3443 def attest_rel(a, 3444 b, 3445 dimension=None, 3446 printit=0, 3447 name1='Samp1', 3448 name2='Samp2', 3449 writemode='a'): 3450 """ 3451Calculates the t-obtained T-test on TWO RELATED samples of scores, a 3452and b. From Numerical Recipies, p.483. If printit=1, results are 3453printed to the screen. If printit='filename', the results are output 3454to 'filename' using the given writemode (default=append). Dimension 3455can equal None (ravel array first), or an integer (the dimension over 3456which to operate on a and b). 3457 3458Usage: attest_rel(a,b,dimension=None,printit=0, 3459 name1='Samp1',name2='Samp2',writemode='a') 3460Returns: t-value, two-tailed p-value 3461""" 3462 if dimension == None: 3463 a = N.ravel(a) 3464 b = N.ravel(b) 3465 dimension = 0 3466 if len(a) <> len(b): 3467 raise ValueError, 'Unequal length arrays.' 
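    # (Explanatory comment added; not in the original source.)  The related-
    # samples t computed below uses the pairwise differences d = a - b:
    #     t = sum(d) / sqrt((n*sum(d**2) - sum(d)**2) / (n - 1))
    # which is algebraically equivalent to mean(d) / sem(d).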
3468 x1 = amean(a, dimension) 3469 x2 = amean(b, dimension) 3470 v1 = avar(a, dimension) 3471 v2 = avar(b, dimension) 3472 n = a.shape[dimension] 3473 df = float(n - 1) 3474 d = (a - b).astype('d') 3475 3476 denom = N.sqrt( 3477 (n * N.add.reduce(d * d, dimension) - N.add.reduce(d, dimension)**2) / 3478 df) 3479 zerodivproblem = N.equal(denom, 0) 3480 denom = N.where(zerodivproblem, 1, denom 3481 ) # avoid zero-division in 1st place 3482 t = N.add.reduce(d, dimension) / denom # N-D COMPUTATION HERE!!!!!! 3483 t = N.where(zerodivproblem, 1.0, t) # replace NaN/wrong t-values with 1.0 3484 probs = abetai(0.5 * df, 0.5, float(df) / (df + t * t)) 3485 if type(t) == N.ndarray: 3486 probs = N.reshape(probs, t.shape) 3487 if probs.shape == (1,): 3488 probs = probs[0] 3489 3490 if printit <> 0: 3491 statname = 'Related samples T-test.' 3492 outputpairedstats(printit, writemode, name1, n, x1, v1, 3493 N.minimum.reduce(N.ravel(a)), 3494 N.maximum.reduce(N.ravel(a)), name2, n, x2, v2, 3495 N.minimum.reduce(N.ravel(b)), 3496 N.maximum.reduce(N.ravel(b)), statname, t, probs) 3497 return 3498 return t, probs 3499 3500 def achisquare(f_obs, f_exp=None): 3501 """ 3502Calculates a one-way chi square for array of observed frequencies and returns 3503the result. If no expected frequencies are given, the total N is assumed to 3504be equally distributed across all groups. 3505@@@NOT RIGHT?? 3506 3507Usage: achisquare(f_obs, f_exp=None) f_obs = array of observed cell freq. 3508Returns: chisquare-statistic, associated p-value 3509""" 3510 3511 k = len(f_obs) 3512 if f_exp == None: 3513 f_exp = N.array([sum(f_obs) / float(k)] * len(f_obs), N.float_) 3514 f_exp = f_exp.astype(N.float_) 3515 chisq = N.add.reduce((f_obs - f_exp)**2 / f_exp) 3516 return chisq, achisqprob(chisq, k - 1) 3517 3518 def aks_2samp(data1, data2): 3519 """ 3520Computes the Kolmogorov-Smirnof statistic on 2 samples. Modified from 3521Numerical Recipies in C, page 493. Returns KS D-value, prob. Not ufunc- 3522like. 3523 3524Usage: aks_2samp(data1,data2) where data1 and data2 are 1D arrays 3525Returns: KS D-value, p-value 3526""" 3527 j1 = 0 # N.zeros(data1.shape[1:]) TRIED TO MAKE THIS UFUNC-LIKE 3528 j2 = 0 # N.zeros(data2.shape[1:]) 3529 fn1 = 0.0 # N.zeros(data1.shape[1:],N.float_) 3530 fn2 = 0.0 # N.zeros(data2.shape[1:],N.float_) 3531 n1 = data1.shape[0] 3532 n2 = data2.shape[0] 3533 en1 = n1 * 1 3534 en2 = n2 * 1 3535 d = N.zeros(data1.shape[1:], N.float_) 3536 data1 = N.sort(data1, 0) 3537 data2 = N.sort(data2, 0) 3538 while j1 < n1 and j2 < n2: 3539 d1 = data1[j1] 3540 d2 = data2[j2] 3541 if d1 <= d2: 3542 fn1 = (j1) / float(en1) 3543 j1 = j1 + 1 3544 if d2 <= d1: 3545 fn2 = (j2) / float(en2) 3546 j2 = j2 + 1 3547 dt = (fn2 - fn1) 3548 if abs(dt) > abs(d): 3549 d = dt 3550# try: 3551 en = math.sqrt(en1 * en2 / float(en1 + en2)) 3552 prob = aksprob((en + 0.12 + 0.11 / en) * N.fabs(d)) 3553 # except: 3554 # prob = 1.0 3555 return d, prob 3556 3557 def amannwhitneyu(x, y): 3558 """ 3559Calculates a Mann-Whitney U statistic on the provided scores and 3560returns the result. Use only when the n in each condition is < 20 and 3561you have 2 independent samples of ranks. REMEMBER: Mann-Whitney U is 3562significant if the u-obtained is LESS THAN or equal to the critical 3563value of U. 
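
Illustrative example (added; not part of the original docstring).  Note that
this array version also returns bigu/(n1*n2) as a third value:
    u, p, prop = amannwhitneyu(N.array([1, 4, 5, 7]), N.array([2, 3, 6, 8]))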
3564 3565Usage: amannwhitneyu(x,y) where x,y are arrays of values for 2 conditions 3566Returns: u-statistic, one-tailed p-value (i.e., p(z(U))) 3567""" 3568 n1 = len(x) 3569 n2 = len(y) 3570 ranked = rankdata(N.concatenate((x, y))) 3571 rankx = ranked[0:n1] # get the x-ranks 3572 ranky = ranked[n1:] # the rest are y-ranks 3573 u1 = n1 * n2 + (n1 * (n1 + 1)) / 2.0 - sum(rankx) # calc U for x 3574 u2 = n1 * n2 - u1 # remainder is U for y 3575 bigu = max(u1, u2) 3576 smallu = min(u1, u2) 3577 proportion = bigu / float(n1 * n2) 3578 T = math.sqrt(tiecorrect(ranked)) # correction factor for tied scores 3579 if T == 0: 3580 raise ValueError, 'All numbers are identical in amannwhitneyu' 3581 sd = math.sqrt(T * n1 * n2 * (n1 + n2 + 1) / 12.0) 3582 z = abs((bigu - n1 * n2 / 2.0) / sd) # normal approximation for prob calc 3583 return smallu, 1.0 - azprob(z), proportion 3584 3585 def atiecorrect(rankvals): 3586 """ 3587Tie-corrector for ties in Mann Whitney U and Kruskal Wallis H tests. 3588See Siegel, S. (1956) Nonparametric Statistics for the Behavioral 3589Sciences. New York: McGraw-Hill. Code adapted from |Stat rankind.c 3590code. 3591 3592Usage: atiecorrect(rankvals) 3593Returns: T correction factor for U or H 3594""" 3595 sorted, posn = ashellsort(N.array(rankvals)) 3596 n = len(sorted) 3597 T = 0.0 3598 i = 0 3599 while (i < n - 1): 3600 if sorted[i] == sorted[i + 1]: 3601 nties = 1 3602 while (i < n - 1) and (sorted[i] == sorted[i + 1]): 3603 nties = nties + 1 3604 i = i + 1 3605 T = T + nties**3 - nties 3606 i = i + 1 3607 T = T / float(n**3 - n) 3608 return 1.0 - T 3609 3610 def aranksums(x, y): 3611 """ 3612Calculates the rank sums statistic on the provided scores and returns 3613the result. 3614 3615Usage: aranksums(x,y) where x,y are arrays of values for 2 conditions 3616Returns: z-statistic, two-tailed p-value 3617""" 3618 n1 = len(x) 3619 n2 = len(y) 3620 alldata = N.concatenate((x, y)) 3621 ranked = arankdata(alldata) 3622 x = ranked[:n1] 3623 y = ranked[n1:] 3624 s = sum(x) 3625 expected = n1 * (n1 + n2 + 1) / 2.0 3626 z = (s - expected) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0) 3627 prob = 2 * (1.0 - azprob(abs(z))) 3628 return z, prob 3629 3630 def awilcoxont(x, y): 3631 """ 3632Calculates the Wilcoxon T-test for related samples and returns the 3633result. A non-parametric T-test. 3634 3635Usage: awilcoxont(x,y) where x,y are equal-length arrays for 2 conditions 3636Returns: t-statistic, two-tailed p-value 3637""" 3638 if len(x) <> len(y): 3639 raise ValueError, 'Unequal N in awilcoxont. Aborting.' 3640 d = x - y 3641 d = N.compress(N.not_equal(d, 0), d) # Keep all non-zero differences 3642 count = len(d) 3643 absd = abs(d) 3644 absranked = arankdata(absd) 3645 r_plus = 0.0 3646 r_minus = 0.0 3647 for i in range(len(absd)): 3648 if d[i] < 0: 3649 r_minus = r_minus + absranked[i] 3650 else: 3651 r_plus = r_plus + absranked[i] 3652 wt = min(r_plus, r_minus) 3653 mn = count * (count + 1) * 0.25 3654 se = math.sqrt(count * (count + 1) * (2.0 * count + 1.0) / 24.0) 3655 z = math.fabs(wt - mn) / se 3656 z = math.fabs(wt - mn) / se 3657 prob = 2 * (1.0 - zprob(abs(z))) 3658 return wt, prob 3659 3660 def akruskalwallish(*args): 3661 """ 3662The Kruskal-Wallis H-test is a non-parametric ANOVA for 3 or more 3663groups, requiring at least 5 subjects in each group. This function 3664calculates the Kruskal-Wallis H and associated p-value for 3 or more 3665independent samples. 
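
Illustrative example (added; not from the original author):
    h, p = akruskalwallish(N.array([6.1, 7.2, 5.9, 6.5, 7.0]),
                           N.array([5.1, 4.8, 5.5, 5.0, 5.3]),
                           N.array([7.9, 8.2, 7.5, 8.0, 7.7]))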

Usage:   akruskalwallish(*args)   args are separate arrays for 3+ conditions
Returns: H-statistic (corrected for ties), associated p-value
"""
    assert len(args) >= 3, 'Need at least 3 groups in stats.akruskalwallish()'
    args = list(args)
    n = [0] * len(args)
    n = map(len, args)
    all = []
    for i in range(len(args)):
      all = all + args[i].tolist()
    ranked = rankdata(all)
    T = tiecorrect(ranked)
    for i in range(len(args)):
      args[i] = ranked[0:n[i]]
      del ranked[0:n[i]]
    rsums = []
    for i in range(len(args)):
      rsums.append(sum(args[i])**2)
      rsums[i] = rsums[i] / float(n[i])
    ssbn = sum(rsums)
    totaln = sum(n)
    h = 12.0 / (totaln * (totaln + 1)) * ssbn - 3 * (totaln + 1)
    df = len(args) - 1
    if T == 0:
      raise ValueError, 'All numbers are identical in akruskalwallish'
    h = h / float(T)
    return h, chisqprob(h, df)

  def afriedmanchisquare(*args):
    """
Friedman Chi-Square is a non-parametric, one-way within-subjects ANOVA.
This function calculates the Friedman Chi-square test for repeated
measures and returns the result, along with the associated probability
value.  It assumes 3 or more repeated measures.  Three levels require a
minimum of 10 subjects in the study; four levels require 5 subjects per
level(??).

Usage:   afriedmanchisquare(*args)   args are separate arrays for 2+ conditions
Returns: chi-square statistic, associated p-value
"""
    k = len(args)
    if k < 3:
      raise ValueError, ('\nLess than 3 levels.  Friedman test not '
                         'appropriate.\n')
    n = len(args[0])
    data = apply(pstat.aabut, args)
    data = data.astype(N.float_)
    for i in range(len(data)):
      data[i] = arankdata(data[i])
    ssbn = asum(asum(args, 1)**2)
    chisq = 12.0 / (k * n * (k + 1)) * ssbn - 3 * n * (k + 1)
    return chisq, achisqprob(chisq, k - 1)

#####################################
#### APROBABILITY CALCULATIONS ####
#####################################

  def achisqprob(chisq, df):
    """
Returns the (1-tail) probability value associated with the provided chi-square
value and df.  Heavily modified from chisq.c in Gary Perlman's |Stat.  Can
handle multiple dimensions.
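
Illustrative example (added; not part of the original docstring):
    achisqprob(N.array([3.84, 6.63]), 1)   # approximately [0.05, 0.01]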
3729 3730Usage: achisqprob(chisq,df) chisq=chisquare stat., df=degrees of freedom 3731""" 3732 BIG = 200.0 3733 3734 def ex(x): 3735 BIG = 200.0 3736 exponents = N.where(N.less(x, -BIG), -BIG, x) 3737 return N.exp(exponents) 3738 3739 if type(chisq) == N.ndarray: 3740 arrayflag = 1 3741 else: 3742 arrayflag = 0 3743 chisq = N.array([chisq]) 3744 if df < 1: 3745 return N.ones(chisq.shape, N.float) 3746 probs = N.zeros(chisq.shape, N.float_) 3747 probs = N.where( 3748 N.less_equal(chisq, 0), 1.0, probs) # set prob=1 for chisq<0 3749 a = 0.5 * chisq 3750 if df > 1: 3751 y = ex(-a) 3752 if df % 2 == 0: 3753 even = 1 3754 s = y * 1 3755 s2 = s * 1 3756 else: 3757 even = 0 3758 s = 2.0 * azprob(-N.sqrt(chisq)) 3759 s2 = s * 1 3760 if (df > 2): 3761 chisq = 0.5 * (df - 1.0) 3762 if even: 3763 z = N.ones(probs.shape, N.float_) 3764 else: 3765 z = 0.5 * N.ones(probs.shape, N.float_) 3766 if even: 3767 e = N.zeros(probs.shape, N.float_) 3768 else: 3769 e = N.log(N.sqrt(N.pi)) * N.ones(probs.shape, N.float_) 3770 c = N.log(a) 3771 mask = N.zeros(probs.shape) 3772 a_big = N.greater(a, BIG) 3773 a_big_frozen = -1 * N.ones(probs.shape, N.float_) 3774 totalelements = N.multiply.reduce(N.array(probs.shape)) 3775 while asum(mask) <> totalelements: 3776 e = N.log(z) + e 3777 s = s + ex(c * z - a - e) 3778 z = z + 1.0 3779 # print z, e, s 3780 newmask = N.greater(z, chisq) 3781 a_big_frozen = N.where(newmask * N.equal(mask, 0) * a_big, s, 3782 a_big_frozen) 3783 mask = N.clip(newmask + mask, 0, 1) 3784 if even: 3785 z = N.ones(probs.shape, N.float_) 3786 e = N.ones(probs.shape, N.float_) 3787 else: 3788 z = 0.5 * N.ones(probs.shape, N.float_) 3789 e = 1.0 / N.sqrt(N.pi) / N.sqrt(a) * N.ones(probs.shape, N.float_) 3790 c = 0.0 3791 mask = N.zeros(probs.shape) 3792 a_notbig_frozen = -1 * N.ones(probs.shape, N.float_) 3793 while asum(mask) <> totalelements: 3794 e = e * (a / z.astype(N.float_)) 3795 c = c + e 3796 z = z + 1.0 3797 # print '#2', z, e, c, s, c*y+s2 3798 newmask = N.greater(z, chisq) 3799 a_notbig_frozen = N.where(newmask * N.equal(mask, 0) * (1 - a_big), 3800 c * y + s2, a_notbig_frozen) 3801 mask = N.clip(newmask + mask, 0, 1) 3802 probs = N.where( 3803 N.equal(probs, 1), 1, N.where( 3804 N.greater(a, BIG), a_big_frozen, a_notbig_frozen)) 3805 return probs 3806 else: 3807 return s 3808 3809 def aerfcc(x): 3810 """ 3811Returns the complementary error function erfc(x) with fractional error 3812everywhere less than 1.2e-7. Adapted from Numerical Recipies. Can 3813handle multiple dimensions. 3814 3815Usage: aerfcc(x) 3816""" 3817 z = abs(x) 3818 t = 1.0 / (1.0 + 0.5 * z) 3819 ans = t * N.exp(-z * z - 1.26551223 + t * (1.00002368 + t * ( 3820 0.37409196 + t * (0.09678418 + t * (-0.18628806 + t * ( 3821 0.27886807 + t * (-1.13520398 + t * (1.48851587 + t * ( 3822 -0.82215223 + t * 0.17087277))))))))) 3823 return N.where(N.greater_equal(x, 0), ans, 2.0 - ans) 3824 3825 def azprob(z): 3826 """ 3827Returns the area under the normal curve 'to the left of' the given z value. 3828Thus, 3829 for z<0, zprob(z) = 1-tail probability 3830 for z>0, 1.0-zprob(z) = 1-tail probability 3831 for any z, 2.0*(1.0-zprob(abs(z))) = 2-tail probability 3832Adapted from z.c in Gary Perlman's |Stat. Can handle multiple dimensions. 
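
Illustrative example (added; not part of the original docstring):
    azprob(N.array([-1.96, 0.0, 1.96]))   # roughly [0.025, 0.5, 0.975]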
3833 3834Usage: azprob(z) where z is a z-value 3835""" 3836 3837 def yfunc(y): 3838 x = ((((((( 3839 ((((((-0.000045255659 * y + 0.000152529290) * y - 0.000019538132) * y 3840 - 0.000676904986) * y + 0.001390604284) * y - 0.000794620820) * y 3841 - 0.002034254874) * y + 0.006549791214) * y - 0.010557625006) * y + 3842 0.011630447319) * y - 0.009279453341) * y + 0.005353579108) * y - 3843 0.002141268741) * y + 0.000535310849) * y + 0.999936657524 3844 return x 3845 3846 def wfunc(w): 3847 x = ((((((((0.000124818987 * w - 0.001075204047) * w + 0.005198775019) * w 3848 - 0.019198292004) * w + 0.059054035642) * w - 0.151968751364) * 3849 w + 0.319152932694) * w - 0.531923007300) * w + 3850 0.797884560593) * N.sqrt(w) * 2.0 3851 return x 3852 3853 Z_MAX = 6.0 # maximum meaningful z-value 3854 x = N.zeros(z.shape, N.float_) # initialize 3855 y = 0.5 * N.fabs(z) 3856 x = N.where(N.less(y, 1.0), wfunc(y * y), yfunc(y - 2.0)) # get x's 3857 x = N.where(N.greater(y, Z_MAX * 0.5), 1.0, x) # kill those with big Z 3858 prob = N.where(N.greater(z, 0), (x + 1) * 0.5, (1 - x) * 0.5) 3859 return prob 3860 3861 def aksprob(alam): 3862 """ 3863Returns the probability value for a K-S statistic computed via ks_2samp. 3864Adapted from Numerical Recipies. Can handle multiple dimensions. 3865 3866Usage: aksprob(alam) 3867""" 3868 if type(alam) == N.ndarray: 3869 frozen = -1 * N.ones(alam.shape, N.float64) 3870 alam = alam.astype(N.float64) 3871 arrayflag = 1 3872 else: 3873 frozen = N.array(-1.) 3874 alam = N.array(alam, N.float64) 3875 arrayflag = 1 3876 mask = N.zeros(alam.shape) 3877 fac = 2.0 * N.ones(alam.shape, N.float_) 3878 sum = N.zeros(alam.shape, N.float_) 3879 termbf = N.zeros(alam.shape, N.float_) 3880 a2 = N.array(-2.0 * alam * alam, N.float64) 3881 totalelements = N.multiply.reduce(N.array(mask.shape)) 3882 for j in range(1, 201): 3883 if asum(mask) == totalelements: 3884 break 3885 exponents = (a2 * j * j) 3886 overflowmask = N.less(exponents, -746) 3887 frozen = N.where(overflowmask, 0, frozen) 3888 mask = mask + overflowmask 3889 term = fac * N.exp(exponents) 3890 sum = sum + term 3891 newmask = N.where( 3892 N.less_equal( 3893 abs(term), (0.001 * termbf)) + N.less( 3894 abs(term), 1.0e-8 * sum), 1, 0) 3895 frozen = N.where(newmask * N.equal(mask, 0), sum, frozen) 3896 mask = N.clip(mask + newmask, 0, 1) 3897 fac = -fac 3898 termbf = abs(term) 3899 if arrayflag: 3900 return N.where( 3901 N.equal(frozen, -1), 1.0, frozen) # 1.0 if doesn't converge 3902 else: 3903 return N.where( 3904 N.equal(frozen, -1), 1.0, frozen)[0] # 1.0 if doesn't converge 3905 3906 def afprob(dfnum, dfden, F): 3907 """ 3908Returns the 1-tailed significance level (p-value) of an F statistic 3909given the degrees of freedom for the numerator (dfR-dfF) and the degrees 3910of freedom for the denominator (dfF). Can handle multiple dims for F. 3911 3912Usage: afprob(dfnum, dfden, F) where usually dfnum=dfbn, dfden=dfwn 3913""" 3914 if type(F) == N.ndarray: 3915 return abetai(0.5 * dfden, 0.5 * dfnum, dfden / (1.0 * dfden + dfnum * F)) 3916 else: 3917 return abetai(0.5 * dfden, 0.5 * dfnum, dfden / float(dfden + dfnum * F)) 3918 3919 def abetacf(a, b, x, verbose=1): 3920 """ 3921Evaluates the continued fraction form of the incomplete Beta function, 3922betai. (Adapted from: Numerical Recipies in C.) Can handle multiple 3923dimensions for x. 
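
Note (added; not part of the original docstring): abetai calls this routine
only for x < (a+1)/(a+b+2.0); otherwise it uses the symmetry relation
I_x(a,b) = 1 - I_(1-x)(b,a), so the continued fraction is always evaluated
in its rapidly converging region.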
3924 3925Usage: abetacf(a,b,x,verbose=1) 3926""" 3927 ITMAX = 200 3928 EPS = 3.0e-7 3929 3930 arrayflag = 1 3931 if type(x) == N.ndarray: 3932 frozen = N.ones(x.shape, 3933 N.float_) * -1 #start out w/ -1s, should replace all 3934 else: 3935 arrayflag = 0 3936 frozen = N.array([-1]) 3937 x = N.array([x]) 3938 mask = N.zeros(x.shape) 3939 bm = az = am = 1.0 3940 qab = a + b 3941 qap = a + 1.0 3942 qam = a - 1.0 3943 bz = 1.0 - qab * x / qap 3944 for i in range(ITMAX + 1): 3945 if N.sum(N.ravel(N.equal(frozen, -1))) == 0: 3946 break 3947 em = float(i + 1) 3948 tem = em + em 3949 d = em * (b - em) * x / ((qam + tem) * (a + tem)) 3950 ap = az + d * am 3951 bp = bz + d * bm 3952 d = -(a + em) * (qab + em) * x / ((qap + tem) * (a + tem)) 3953 app = ap + d * az 3954 bpp = bp + d * bz 3955 aold = az * 1 3956 am = ap / bpp 3957 bm = bp / bpp 3958 az = app / bpp 3959 bz = 1.0 3960 newmask = N.less(abs(az - aold), EPS * abs(az)) 3961 frozen = N.where(newmask * N.equal(mask, 0), az, frozen) 3962 mask = N.clip(mask + newmask, 0, 1) 3963 noconverge = asum(N.equal(frozen, -1)) 3964 if noconverge <> 0 and verbose: 3965 print 'a or b too big, or ITMAX too small in Betacf for ', noconverge, ' elements' 3966 if arrayflag: 3967 return frozen 3968 else: 3969 return frozen[0] 3970 3971 def agammln(xx): 3972 """ 3973Returns the gamma function of xx. 3974 Gamma(z) = Integral(0,infinity) of t^(z-1)exp(-t) dt. 3975Adapted from: Numerical Recipies in C. Can handle multiple dims ... but 3976probably doesn't normally have to. 3977 3978Usage: agammln(xx) 3979""" 3980 coeff = [76.18009173, -86.50532033, 24.01409822, -1.231739516, 3981 0.120858003e-2, -0.536382e-5] 3982 x = xx - 1.0 3983 tmp = x + 5.5 3984 tmp = tmp - (x + 0.5) * N.log(tmp) 3985 ser = 1.0 3986 for j in range(len(coeff)): 3987 x = x + 1 3988 ser = ser + coeff[j] / x 3989 return -tmp + N.log(2.50662827465 * ser) 3990 3991 def abetai(a, b, x, verbose=1): 3992 """ 3993Returns the incomplete beta function: 3994 3995 I-sub-x(a,b) = 1/B(a,b)*(Integral(0,x) of t^(a-1)(1-t)^(b-1) dt) 3996 3997where a,b>0 and B(a,b) = G(a)*G(b)/(G(a+b)) where G(a) is the gamma 3998function of a. The continued fraction formulation is implemented 3999here, using the betacf function. (Adapted from: Numerical Recipies in 4000C.) Can handle multiple dimensions. 4001 4002Usage: abetai(a,b,x,verbose=1) 4003""" 4004 TINY = 1e-15 4005 if type(a) == N.ndarray: 4006 if asum(N.less(x, 0) + N.greater(x, 1)) <> 0: 4007 raise ValueError, 'Bad x in abetai' 4008 x = N.where(N.equal(x, 0), TINY, x) 4009 x = N.where(N.equal(x, 1.0), 1 - TINY, x) 4010 4011 bt = N.where(N.equal(x, 0) + N.equal(x, 1), 0, -1) 4012 exponents = (gammln(a + b) - gammln(a) - gammln(b) + a * N.log(x) + b * 4013 N.log(1.0 - x)) 4014 # 746 (below) is the MAX POSSIBLE BEFORE OVERFLOW 4015 exponents = N.where(N.less(exponents, -740), -740, exponents) 4016 bt = N.exp(exponents) 4017 if type(x) == N.ndarray: 4018 ans = N.where( 4019 N.less(x, (a + 1) / (a + b + 2.0)), bt * abetacf(a, b, x, verbose) / 4020 float(a), 1.0 - bt * abetacf(b, a, 1.0 - x, verbose) / float(b)) 4021 else: 4022 if x < (a + 1) / (a + b + 2.0): 4023 ans = bt * abetacf(a, b, x, verbose) / float(a) 4024 else: 4025 ans = 1.0 - bt * abetacf(b, a, 1.0 - x, verbose) / float(b) 4026 return ans 4027 4028##################################### 4029####### AANOVA CALCULATIONS ####### 4030##################################### 4031 4032 import numpy.linalg, operator 4033 LA = numpy.linalg 4034 4035 def aglm(data, para): 4036 """ 4037Calculates a linear model fit ... 
anova/ancova/lin-regress/t-test/etc.  Taken
from:
  Peterson et al. Statistical limitations in functional neuroimaging
  I. Non-inferential methods and statistical models.  Phil Trans Royal Soc
  Lond B 354: 1239-1260.

Usage:   aglm(data,para)
Returns: statistic, p-value ???
"""
    if len(para) <> len(data):
      print 'data and para must be same length in aglm'
      return
    n = len(para)
    p = pstat.aunique(para)
    x = N.zeros((n, len(p)))  # design matrix
    for l in range(len(p)):
      x[:, l] = N.equal(para, p[l])
    b = N.dot(
        N.dot(
            LA.inv(N.dot(
                N.transpose(x), x)),  # i.e., b=inv(X'X)X'Y
            N.transpose(x)),
        data)
    diffs = (data - N.dot(x, b))
    s_sq = 1. / (n - len(p)) * N.dot(N.transpose(diffs), diffs)

    if len(p) == 2:  # ttest_ind
      c = N.array([1, -1])
      df = n - 2
      fact = asum(1.0 / asum(x, 0))  # i.e., 1/n1 + 1/n2 + 1/n3 ...
      t = N.dot(c, b) / N.sqrt(s_sq * fact)
      probs = abetai(0.5 * df, 0.5, float(df) / (df + t * t))
      return t, probs

  def aF_oneway(*args):
    """
Performs a 1-way ANOVA, returning an F-value and probability given
any number of groups.  From Heiman, pp.394-7.

Usage:   aF_oneway(*args)    where *args is 2 or more arrays, one per
                             treatment group
Returns: f-value, probability
"""
    na = len(args)  # ANOVA on 'na' groups, each in its own array
    means = [0] * na
    vars = [0] * na
    ns = [0] * na
    alldata = []
    tmp = map(N.array, args)
    means = map(amean, tmp)
    vars = map(avar, tmp)
    ns = map(len, args)
    alldata = N.concatenate(args)
    bign = len(alldata)
    sstot = ass(alldata) - (asquare_of_sums(alldata) / float(bign))
    ssbn = 0
    for a in args:
      ssbn = ssbn + asquare_of_sums(N.array(a)) / float(len(a))
    ssbn = ssbn - (asquare_of_sums(alldata) / float(bign))
    sswn = sstot - ssbn
    dfbn = na - 1
    dfwn = bign - na
    msb = ssbn / float(dfbn)
    msw = sswn / float(dfwn)
    f = msb / msw
    prob = fprob(dfbn, dfwn, f)
    return f, prob

  def aF_value(ER, EF, dfR, dfF):
    """
Returns an F-statistic given the following:
        ER  = error associated with the null hypothesis (the Restricted model)
        EF  = error associated with the alternate hypothesis (the Full model)
        dfR = degrees of freedom for the Restricted model
        dfF = degrees of freedom associated with the Full model
"""
    return ((ER - EF) / float(dfR - dfF) / (EF / float(dfF)))

  def outputfstats(Enum, Eden, dfnum, dfden, f, prob):
    Enum = round(Enum, 3)
    Eden = round(Eden, 3)
    dfnum = round(dfnum, 3)
    dfden = round(dfden, 3)
    f = round(f, 3)
    prob = round(prob, 3)
    suffix = ''  # for *s after the p-value
    if prob < 0.001:
      suffix = ' ***'
    elif prob < 0.01:
      suffix = ' **'
    elif prob < 0.05:
      suffix = ' *'
    title = [['EF/ER', 'DF', 'Mean Square', 'F-value', 'prob', '']]
    lofl = title + [[Enum, dfnum, round(Enum / float(dfnum), 3), f, prob, suffix],
                    [Eden, dfden, round(Eden / float(dfden), 3), '', '', '']]
    pstat.printcc(lofl)
    return

  def F_value_multivariate(ER, EF, dfnum, dfden):
    """
Returns an F-statistic given the following:
        ER  = error associated with the null hypothesis (the Restricted model)
        EF  = error associated with the alternate hypothesis (the Full model)
        dfR = degrees of freedom for the Restricted model
        dfF = degrees of freedom associated with the
Restricted model 4142where ER and EF are matrices from a multivariate F calculation. 4143""" 4144 if type(ER) in [IntType, FloatType]: 4145 ER = N.array([[ER]]) 4146 if type(EF) in [IntType, FloatType]: 4147 EF = N.array([[EF]]) 4148 n_um = (LA.det(ER) - LA.det(EF)) / float(dfnum) 4149 d_en = LA.det(EF) / float(dfden) 4150 return n_um / d_en 4151 4152##################################### 4153####### ASUPPORT FUNCTIONS ######## 4154##################################### 4155 4156 def asign(a): 4157 """ 4158Usage: asign(a) 4159Returns: array shape of a, with -1 where a<0 and +1 where a>=0 4160""" 4161 a = N.asarray(a) 4162 if ((type(a) == type(1.4)) or (type(a) == type(1))): 4163 return a - a - N.less(a, 0) + N.greater(a, 0) 4164 else: 4165 return N.zeros(N.shape(a)) - N.less(a, 0) + N.greater(a, 0) 4166 4167 def asum(a, dimension=None, keepdims=0): 4168 """ 4169An alternative to the Numeric.add.reduce function, which allows one to 4170(1) collapse over multiple dimensions at once, and/or (2) to retain 4171all dimensions in the original array (squashing one down to size. 4172Dimension can equal None (ravel array first), an integer (the 4173dimension over which to operate), or a sequence (operate over multiple 4174dimensions). If keepdims=1, the resulting array will have as many 4175dimensions as the input array. 4176 4177Usage: asum(a, dimension=None, keepdims=0) 4178Returns: array summed along 'dimension'(s), same _number_ of dims if keepdims=1 4179""" 4180 if type(a) == N.ndarray and a.dtype in [N.int_, N.short, N.ubyte]: 4181 a = a.astype(N.float_) 4182 if dimension == None: 4183 s = N.sum(N.ravel(a)) 4184 elif type(dimension) in [IntType, FloatType]: 4185 s = N.add.reduce(a, dimension) 4186 if keepdims == 1: 4187 shp = list(a.shape) 4188 shp[dimension] = 1 4189 s = N.reshape(s, shp) 4190 else: # must be a SEQUENCE of dims to sum over 4191 dims = list(dimension) 4192 dims.sort() 4193 dims.reverse() 4194 s = a * 1.0 4195 for dim in dims: 4196 s = N.add.reduce(s, dim) 4197 if keepdims == 1: 4198 shp = list(a.shape) 4199 for dim in dims: 4200 shp[dim] = 1 4201 s = N.reshape(s, shp) 4202 return s 4203 4204 def acumsum(a, dimension=None): 4205 """ 4206Returns an array consisting of the cumulative sum of the items in the 4207passed array. Dimension can equal None (ravel array first), an 4208integer (the dimension over which to operate), or a sequence (operate 4209over multiple dimensions, but this last one just barely makes sense). 4210 4211Usage: acumsum(a,dimension=None) 4212""" 4213 if dimension == None: 4214 a = N.ravel(a) 4215 dimension = 0 4216 if type(dimension) in [ListType, TupleType, N.ndarray]: 4217 dimension = list(dimension) 4218 dimension.sort() 4219 dimension.reverse() 4220 for d in dimension: 4221 a = N.add.accumulate(a, d) 4222 return a 4223 else: 4224 return N.add.accumulate(a, dimension) 4225 4226 def ass(inarray, dimension=None, keepdims=0): 4227 """ 4228Squares each value in the passed array, adds these squares & returns 4229the result. Unfortunate function name. :-) Defaults to ALL values in 4230the array. Dimension can equal None (ravel array first), an integer 4231(the dimension over which to operate), or a sequence (operate over 4232multiple dimensions). Set keepdims=1 to maintain the original number 4233of dimensions. 
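
Illustrative example (added; not part of the original docstring):
    ass(N.array([1., 2., 3.]))   # 1 + 4 + 9 = 14.0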
4234 4235Usage: ass(inarray, dimension=None, keepdims=0) 4236Returns: sum-along-'dimension' for (inarray*inarray) 4237""" 4238 if dimension == None: 4239 inarray = N.ravel(inarray) 4240 dimension = 0 4241 return asum(inarray * inarray, dimension, keepdims) 4242 4243 def asummult(array1, array2, dimension=None, keepdims=0): 4244 """ 4245Multiplies elements in array1 and array2, element by element, and 4246returns the sum (along 'dimension') of all resulting multiplications. 4247Dimension can equal None (ravel array first), an integer (the 4248dimension over which to operate), or a sequence (operate over multiple 4249dimensions). A trivial function, but included for completeness. 4250 4251Usage: asummult(array1,array2,dimension=None,keepdims=0) 4252""" 4253 if dimension == None: 4254 array1 = N.ravel(array1) 4255 array2 = N.ravel(array2) 4256 dimension = 0 4257 return asum(array1 * array2, dimension, keepdims) 4258 4259 def asquare_of_sums(inarray, dimension=None, keepdims=0): 4260 """ 4261Adds the values in the passed array, squares that sum, and returns the 4262result. Dimension can equal None (ravel array first), an integer (the 4263dimension over which to operate), or a sequence (operate over multiple 4264dimensions). If keepdims=1, the returned array will have the same 4265NUMBER of dimensions as the original. 4266 4267Usage: asquare_of_sums(inarray, dimension=None, keepdims=0) 4268Returns: the square of the sum over dim(s) in dimension 4269""" 4270 if dimension == None: 4271 inarray = N.ravel(inarray) 4272 dimension = 0 4273 s = asum(inarray, dimension, keepdims) 4274 if type(s) == N.ndarray: 4275 return s.astype(N.float_) * s 4276 else: 4277 return float(s) * s 4278 4279 def asumdiffsquared(a, b, dimension=None, keepdims=0): 4280 """ 4281Takes pairwise differences of the values in arrays a and b, squares 4282these differences, and returns the sum of these squares. Dimension 4283can equal None (ravel array first), an integer (the dimension over 4284which to operate), or a sequence (operate over multiple dimensions). 4285keepdims=1 means the return shape = len(a.shape) = len(b.shape) 4286 4287Usage: asumdiffsquared(a,b) 4288Returns: sum[ravel(a-b)**2] 4289""" 4290 if dimension == None: 4291 inarray = N.ravel(a) 4292 dimension = 0 4293 return asum((a - b)**2, dimension, keepdims) 4294 4295 def ashellsort(inarray): 4296 """ 4297Shellsort algorithm. Sorts a 1D-array. 4298 4299Usage: ashellsort(inarray) 4300Returns: sorted-inarray, sorting-index-vector (for original array) 4301""" 4302 n = len(inarray) 4303 svec = inarray * 1.0 4304 ivec = range(n) 4305 gap = n / 2 # integer division needed 4306 while gap > 0: 4307 for i in range(gap, n): 4308 for j in range(i - gap, -1, -gap): 4309 while j >= 0 and svec[j] > svec[j + gap]: 4310 temp = svec[j] 4311 svec[j] = svec[j + gap] 4312 svec[j + gap] = temp 4313 itemp = ivec[j] 4314 ivec[j] = ivec[j + gap] 4315 ivec[j + gap] = itemp 4316 gap = gap / 2 # integer division needed 4317# svec is now sorted input vector, ivec has the order svec[i] = vec[ivec[i]] 4318 return svec, ivec 4319 4320 def arankdata(inarray): 4321 """ 4322Ranks the data in inarray, dealing with ties appropritely. Assumes 4323a 1D inarray. Adapted from Gary Perlman's |Stat ranksort. 
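
Illustrative example (added; not part of the original docstring); tied values
receive the mean of the ranks they span:
    arankdata(N.array([40, 10, 30, 10]))   # -> [4.0, 1.5, 3.0, 1.5]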
4324 4325Usage: arankdata(inarray) 4326Returns: array of length equal to inarray, containing rank scores 4327""" 4328 n = len(inarray) 4329 svec, ivec = ashellsort(inarray) 4330 sumranks = 0 4331 dupcount = 0 4332 newarray = N.zeros(n, N.float_) 4333 for i in range(n): 4334 sumranks = sumranks + i 4335 dupcount = dupcount + 1 4336 if i == n - 1 or svec[i] <> svec[i + 1]: 4337 averank = sumranks / float(dupcount) + 1 4338 for j in range(i - dupcount + 1, i + 1): 4339 newarray[ivec[j]] = averank 4340 sumranks = 0 4341 dupcount = 0 4342 return newarray 4343 4344 def afindwithin(data): 4345 """ 4346Returns a binary vector, 1=within-subject factor, 0=between. Input 4347equals the entire data array (i.e., column 1=random factor, last 4348column = measured values. 4349 4350Usage: afindwithin(data) data in |Stat format 4351""" 4352 numfact = len(data[0]) - 2 4353 withinvec = [0] * numfact 4354 for col in range(1, numfact + 1): 4355 rows = pstat.linexand(data, col, pstat.unique(pstat.colex(data, 1))[0] 4356 ) # get 1 level of this factor 4357 if len(pstat.unique(pstat.colex(rows, 0))) < len( 4358 rows): # if fewer subjects than scores on this factor 4359 withinvec[col - 1] = 1 4360 return withinvec 4361 4362 ######################################################### 4363 ######################################################### 4364 ###### RE-DEFINE DISPATCHES TO INCLUDE ARRAYS ######### 4365 ######################################################### 4366 ######################################################### 4367 4368 ## CENTRAL TENDENCY: 4369 geometricmean = Dispatch( 4370 (lgeometricmean, (ListType, TupleType)), (ageometricmean, (N.ndarray,))) 4371 harmonicmean = Dispatch( 4372 (lharmonicmean, (ListType, TupleType)), (aharmonicmean, (N.ndarray,))) 4373 mean = Dispatch((lmean, (ListType, TupleType)), (amean, (N.ndarray,))) 4374 median = Dispatch((lmedian, (ListType, TupleType)), (amedian, (N.ndarray,))) 4375 medianscore = Dispatch( 4376 (lmedianscore, (ListType, TupleType)), (amedianscore, (N.ndarray,))) 4377 mode = Dispatch((lmode, (ListType, TupleType)), (amode, (N.ndarray,))) 4378 tmean = Dispatch((atmean, (N.ndarray,))) 4379 tvar = Dispatch((atvar, (N.ndarray,))) 4380 tstdev = Dispatch((atstdev, (N.ndarray,))) 4381 tsem = Dispatch((atsem, (N.ndarray,))) 4382 4383 ## VARIATION: 4384 moment = Dispatch((lmoment, (ListType, TupleType)), (amoment, (N.ndarray,))) 4385 variation = Dispatch( 4386 (lvariation, (ListType, TupleType)), (avariation, (N.ndarray,))) 4387 skew = Dispatch((lskew, (ListType, TupleType)), (askew, (N.ndarray,))) 4388 kurtosis = Dispatch( 4389 (lkurtosis, (ListType, TupleType)), (akurtosis, (N.ndarray,))) 4390 describe = Dispatch( 4391 (ldescribe, (ListType, TupleType)), (adescribe, (N.ndarray,))) 4392 4393 ## DISTRIBUTION TESTS 4394 4395 skewtest = Dispatch( 4396 (askewtest, (ListType, TupleType)), (askewtest, (N.ndarray,))) 4397 kurtosistest = Dispatch( 4398 (akurtosistest, (ListType, TupleType)), (akurtosistest, (N.ndarray,))) 4399 normaltest = Dispatch( 4400 (anormaltest, (ListType, TupleType)), (anormaltest, (N.ndarray,))) 4401 4402 ## FREQUENCY STATS: 4403 itemfreq = Dispatch( 4404 (litemfreq, (ListType, TupleType)), (aitemfreq, (N.ndarray,))) 4405 scoreatpercentile = Dispatch( 4406 (lscoreatpercentile, (ListType, TupleType)), (ascoreatpercentile, 4407 (N.ndarray,))) 4408 percentileofscore = Dispatch( 4409 (lpercentileofscore, (ListType, TupleType)), (apercentileofscore, 4410 (N.ndarray,))) 4411 histogram = Dispatch( 4412 (lhistogram, (ListType, TupleType)), (ahistogram, 
(N.ndarray,))) 4413 cumfreq = Dispatch( 4414 (lcumfreq, (ListType, TupleType)), (acumfreq, (N.ndarray,))) 4415 relfreq = Dispatch( 4416 (lrelfreq, (ListType, TupleType)), (arelfreq, (N.ndarray,))) 4417 4418 ## VARIABILITY: 4419 obrientransform = Dispatch( 4420 (lobrientransform, (ListType, TupleType)), (aobrientransform, 4421 (N.ndarray,))) 4422 samplevar = Dispatch( 4423 (lsamplevar, (ListType, TupleType)), (asamplevar, (N.ndarray,))) 4424 samplestdev = Dispatch( 4425 (lsamplestdev, (ListType, TupleType)), (asamplestdev, (N.ndarray,))) 4426 signaltonoise = Dispatch((asignaltonoise, (N.ndarray,)),) 4427 var = Dispatch((lvar, (ListType, TupleType)), (avar, (N.ndarray,))) 4428 stdev = Dispatch((lstdev, (ListType, TupleType)), (astdev, (N.ndarray,))) 4429 sterr = Dispatch((lsterr, (ListType, TupleType)), (asterr, (N.ndarray,))) 4430 sem = Dispatch((lsem, (ListType, TupleType)), (asem, (N.ndarray,))) 4431 z = Dispatch((lz, (ListType, TupleType)), (az, (N.ndarray,))) 4432 zs = Dispatch((lzs, (ListType, TupleType)), (azs, (N.ndarray,))) 4433 4434 ## TRIMMING FCNS: 4435 threshold = Dispatch((athreshold, (N.ndarray,)),) 4436 trimboth = Dispatch( 4437 (ltrimboth, (ListType, TupleType)), (atrimboth, (N.ndarray,))) 4438 trim1 = Dispatch((ltrim1, (ListType, TupleType)), (atrim1, (N.ndarray,))) 4439 4440 ## CORRELATION FCNS: 4441 paired = Dispatch((lpaired, (ListType, TupleType)), (apaired, (N.ndarray,))) 4442 lincc = Dispatch((llincc, (ListType, TupleType)), (alincc, (N.ndarray,))) 4443 pearsonr = Dispatch( 4444 (lpearsonr, (ListType, TupleType)), (apearsonr, (N.ndarray,))) 4445 spearmanr = Dispatch( 4446 (lspearmanr, (ListType, TupleType)), (aspearmanr, (N.ndarray,))) 4447 pointbiserialr = Dispatch( 4448 (lpointbiserialr, (ListType, TupleType)), (apointbiserialr, (N.ndarray,))) 4449 kendalltau = Dispatch( 4450 (lkendalltau, (ListType, TupleType)), (akendalltau, (N.ndarray,))) 4451 linregress = Dispatch( 4452 (llinregress, (ListType, TupleType)), (alinregress, (N.ndarray,))) 4453 4454 ## INFERENTIAL STATS: 4455 ttest_1samp = Dispatch( 4456 (lttest_1samp, (ListType, TupleType)), (attest_1samp, (N.ndarray,))) 4457 ttest_ind = Dispatch( 4458 (lttest_ind, (ListType, TupleType)), (attest_ind, (N.ndarray,))) 4459 ttest_rel = Dispatch( 4460 (lttest_rel, (ListType, TupleType)), (attest_rel, (N.ndarray,))) 4461 chisquare = Dispatch( 4462 (lchisquare, (ListType, TupleType)), (achisquare, (N.ndarray,))) 4463 ks_2samp = Dispatch( 4464 (lks_2samp, (ListType, TupleType)), (aks_2samp, (N.ndarray,))) 4465 mannwhitneyu = Dispatch( 4466 (lmannwhitneyu, (ListType, TupleType)), (amannwhitneyu, (N.ndarray,))) 4467 tiecorrect = Dispatch( 4468 (ltiecorrect, (ListType, TupleType)), (atiecorrect, (N.ndarray,))) 4469 ranksums = Dispatch( 4470 (lranksums, (ListType, TupleType)), (aranksums, (N.ndarray,))) 4471 wilcoxont = Dispatch( 4472 (lwilcoxont, (ListType, TupleType)), (awilcoxont, (N.ndarray,))) 4473 kruskalwallish = Dispatch( 4474 (lkruskalwallish, (ListType, TupleType)), (akruskalwallish, (N.ndarray,))) 4475 friedmanchisquare = Dispatch( 4476 (lfriedmanchisquare, (ListType, TupleType)), (afriedmanchisquare, 4477 (N.ndarray,))) 4478 4479 ## PROBABILITY CALCS: 4480 chisqprob = Dispatch( 4481 (lchisqprob, (IntType, FloatType)), (achisqprob, (N.ndarray,))) 4482 zprob = Dispatch((lzprob, (IntType, FloatType)), (azprob, (N.ndarray,))) 4483 ksprob = Dispatch((lksprob, (IntType, FloatType)), (aksprob, (N.ndarray,))) 4484 fprob = Dispatch((lfprob, (IntType, FloatType)), (afprob, (N.ndarray,))) 4485 betacf = Dispatch((lbetacf, 
(IntType, FloatType)), (abetacf, (N.ndarray,))) 4486 betai = Dispatch((lbetai, (IntType, FloatType)), (abetai, (N.ndarray,))) 4487 erfcc = Dispatch((lerfcc, (IntType, FloatType)), (aerfcc, (N.ndarray,))) 4488 gammln = Dispatch((lgammln, (IntType, FloatType)), (agammln, (N.ndarray,))) 4489 4490 ## ANOVA FUNCTIONS: 4491 F_oneway = Dispatch( 4492 (lF_oneway, (ListType, TupleType)), (aF_oneway, (N.ndarray,))) 4493 F_value = Dispatch( 4494 (lF_value, (ListType, TupleType)), (aF_value, (N.ndarray,))) 4495 4496 ## SUPPORT FUNCTIONS: 4497 incr = Dispatch((lincr, (ListType, TupleType, N.ndarray)),) 4498 sum = Dispatch((lsum, (ListType, TupleType)), (asum, (N.ndarray,))) 4499 cumsum = Dispatch((lcumsum, (ListType, TupleType)), (acumsum, (N.ndarray,))) 4500 ss = Dispatch((lss, (ListType, TupleType)), (ass, (N.ndarray,))) 4501 summult = Dispatch( 4502 (lsummult, (ListType, TupleType)), (asummult, (N.ndarray,))) 4503 square_of_sums = Dispatch( 4504 (lsquare_of_sums, (ListType, TupleType)), (asquare_of_sums, (N.ndarray,))) 4505 sumdiffsquared = Dispatch( 4506 (lsumdiffsquared, (ListType, TupleType)), (asumdiffsquared, (N.ndarray,))) 4507 shellsort = Dispatch( 4508 (lshellsort, (ListType, TupleType)), (ashellsort, (N.ndarray,))) 4509 rankdata = Dispatch( 4510 (lrankdata, (ListType, TupleType)), (arankdata, (N.ndarray,))) 4511 findwithin = Dispatch( 4512 (lfindwithin, (ListType, TupleType)), (afindwithin, (N.ndarray,))) 4513 4514###################### END OF NUMERIC FUNCTION BLOCK ##################### 4515 4516###################### END OF STATISTICAL FUNCTIONS ###################### 4517 4518except ImportError: 4519 pass 4520
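
# Added note (not from the original author): once the NumPy block above has
# loaded, the re-defined dispatch table routes each unprefixed name by the
# type of its argument, e.g.
#   rankdata([40, 10, 30, 10])            # list/tuple argument -> lrankdata
#   rankdata(N.array([40, 10, 30, 10]))   # ndarray argument    -> arankdata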