The numpy statistics subpackage contains several elementary statistical analysis functions.
The functions average(data), var(data), and std(data) all take a sequence of data values and return the statistical quantity indicated by their name.
Examples:
In [4]: from numpy import * In [6]: sqrt(arange(100.)) Out[6]: array([ 0. , 1. , 1.41421356, 1.73205081, 2. , 2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. , 3.16227766, 3.31662479, 3.46410162, 3.60555128, 3.74165739, 3.87298335, 4. , 4.12310563, 4.24264069, 4.35889894, 4.47213595, 4.58257569, 4.69041576, 4.79583152, 4.89897949, 5. , 5.09901951, 5.19615242, 5.29150262, 5.38516481, 5.47722558, 5.56776436, 5.65685425, 5.74456265, 5.83095189, 5.91607978, 6. , 6.08276253, 6.164414 , 6.244998 , 6.32455532, 6.40312424, 6.4807407 , 6.55743852, 6.63324958, 6.70820393, 6.78232998, 6.8556546 , 6.92820323, 7. , 7.07106781, 7.14142843, 7.21110255, 7.28010989, 7.34846923, 7.41619849, 7.48331477, 7.54983444, 7.61577311, 7.68114575, 7.74596669, 7.81024968, 7.87400787, 7.93725393, 8. , 8.06225775, 8.1240384 , 8.18535277, 8.24621125, 8.30662386, 8.36660027, 8.42614977, 8.48528137, 8.54400375, 8.60232527, 8.66025404, 8.71779789, 8.77496439, 8.83176087, 8.88819442, 8.94427191, 9. , 9.05538514, 9.11043358, 9.16515139, 9.21954446, 9.2736185 , 9.32737905, 9.38083152, 9.43398113, 9.48683298, 9.53939201, 9.59166305, 9.64365076, 9.69535971, 9.74679434, 9.79795897, 9.8488578 , 9.89949494, 9.94987437]) In [7]: print average(sqrt(arange(100.))) 6.61462947103 In [8]: print std(sqrt(arange(100.))) 2.39722275998 |
The submodule histogram defines a histogram object.
numpy.histogram(data,bins=number_of_bins,range=(data_low,data_high)) takes a data sequence and calculates a histogram by dividing the data's range into the specified number of bins, counting how many data values fall within each bin interval. An optional third argument is a tuple specifying the range to be analyzed (lower bound data_low and upper bound data_high).
The resulting histogram consists of two arrays (of rank 1) representing a sequence of bins, with each bin described by its center and its count.
Example:
In [21]: from numpy import histogram In [22]: hg = histogram(sqrt(arange(100.)), 10) In [23]: print hg (array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]), array([ 0. , 0.99498744, 1.98997487, 2.98496231, 3.97994975, 4.97493719, 5.96992462, 6.96491206, 7.9598995 , 8.95488693, 9.94987437])) |
There are 10 histogram bins.
The elements of the first array are the counts of the number of data items in each bin, and the elements of the second indicate the bin value range, giving the low value for each bin. That is, in the above example, there is a bin with data range (3.97994975,4.97493719) and count 9.
A common data analysis technique for times series is to estimate the frequency components. Mathematically, this is done using the Fourier transform, mathematically; numerically, with the Fast Fourier transform (FFT).
You set this up with
In [1]: from numpy import * In [2]: from numpy.fft import * |
The FFT is estimated using the function fft(data,n=None,axis=-1).
Here's an example where we generate a waveform with two sine waves.
In [3]: x = arange(64.0) In [4]: y = sin(x/3) + cos(x/5) In [5]: print y [ 1. 1.30726127 1.5394308 1.6668066 1.66864461 1.53571026 1.27165518 0.89305302 0.4280731 -0.08608209 -0.6067148 -1.08977817 -1.49419621 -1.78590325 -1.94117726 -1.94891677 -1.81162417 -1.54499643 -1.17617391 -0.7408407 -0.27949239 0.16672578 0.560164 0.87035526 1.07685723 1.17095629 1.15606789 1.04681136 0.86688311 0.64598274 0.41614918 0.20792581 0.04678916 -0.04975761 -0.07410214 -0.02924059 0.0717784 0.20761356 0.35138783 0.47412246 0.5484515 0.55218675 0.4713187 0.30210095 0.05196685 -0.26084242 -0.60891547 -0.95840818 -1.27259117 -1.51582361 -1.65751878 -1.67566314 -1.55949548 -1.31104093 -0.94531715 -0.48917279 0.02113347 0.54336808 1.03341605 1.44960415 1.75679921 1.92984989 1.95600997 1.83609022] In [6]: abs(fft(y)[0:32]) Out[6]: array([ 6.18571506, 7.08187202, 39.9364864 , 24.3758143 , 15.59164636, 5.63184873, 3.35647352, 2.37098615, 1.83040175, 1.49316891, 1.26476944, 1.10094631, 0.97837048, 0.88365397, 0.80860264, 0.74794032, 0.69812731, 0.65670695, 0.6219252 , 0.59249929, 0.56747197, 0.54611656, 0.52787358, 0.51230736, 0.49907573, 0.48790847, 0.47859185, 0.47095727, 0.46487304, 0.46023825, 0.45697846, 0.4550425 ]) |
Slicing was used to extract only the first 32 data samples. Since the FFT is symmetric for data represented by real (rather than complex) values, the last half of the FFT is a mirror to the first. And so, we only needed the first half.
Also, we took the absolute value abs() of the FFT to look at only the magnitude of the frequency components. In other words, we calculated the power spectral density—the power in each frequency bin.
It's a bit hard to see what's going on in the output FFT, such as locating the original two sine waves, which should appear as two, separate peaks. This is partly due to our using a too-short time series. It's also partly due to having to look at the raw numbers. We'll come back to this shortly, when we learn how to plot data.
To convert from frequencies back to time values use the inverse FFT ifft(data).
Both functions take an optional second argument specifying the length of the data sequence (by default the length of the array). If this is larger than the array, an appropriate number of zeros are added to make the effective length a power of 2. For improved efficiency, therefore, the data array length should be a power of 2. An optional third argument specifies the dimension along which the transform is performed.
Finally, the function fft2(a) performs an FFT of two-dimensional data. An optional second argument specifies the shape (a tuple of length two) of the subarray to be transformed; by default the whole array is transformed. An optional third argument (also a tuple of length two) specifies the two dimensions for the transformation.
The scipy.optimize package provides leastsq(model, parameters, args) to perform a general nonlinear least-squares fit, using the Levenberg-Marquardt algorithm.
The argument model specifies the function that you want to fit. It is called with three arguments: The first is a tuple containing the model parameters, the second is the first element of a data point (see below), and the third is the independent variable. The return value must be a number.
The argument parameters is a tuple of initial values for the fit parameters. The speed of convergence of the method depends on the quality of the initial values, so they should be chosen with care.
The argument args is a tuple of two lists giving the data to which the model is fit. The first list consists of the data values that the model function is supposed to approximate. The second list gives the independent variables for the data values.
Example:
First we define a model—an exponential function—and then create some data. These get passed to the least squares fitting function.
In [1]: from scipy.optimize import leastsq In [2]: import numpy In [3]: def exponential_residuals(parameters,y,x): ...: a = parameters[0] ...: b = parameters[1] ...: err = y - a*numpy.exp(-b/x) ...: return err ...: In [4]: x = [100, 200, 300, 400] In [5]: data = [4.999e-8, 5.307e+2, 1.289e+6, 6.559e+7] In [6]: p0 = [1e13, 4700] In [7]: fit = leastsq(exponential_residuals,p0,args=(data,x)) In [8]: print "Estimated parameters:",fit[0] Estimated parameters: [ 70965687.75740071, 5166737.06740553] In [9]: err = 0 In [10]: for i in xrange(4): ...: e = exponential_residuals([ 8.64155171e+12, 4.71546779e+03],data[i],x[i]) ...: err += e*e ...: ...: In [11]: print "Fit error:", err Fit error: 281642.49000000005 |
NumPy's random module provides support for generating random numbers, including multi-dimensional arrays of them.
Random numbers find their way into many simulation algorithms representing the element of chance or noisy measurement data. Since an algorithm is used to generate the numbers, they are not truly “random” but rather have statistical properties that emulate what we think “truly random” is.
In fact, it's very hard to generate good approximates of random numbers. Many of the random number generators are equivalent to chaotic dynamical systems and these typically have some structure somewhere. But I digress ... only temporarily.
Like most random number packages, the random module allows the user to pick a seed value that initializes the random number generator to a known state. Picking a different seed each time a program starts, for example, varies the randomness across the runs. If you need an identical series of “random” numbers from run to run, however, simply fix the seed value to be the same each run.
To use the random, import it as follows:
In [1]: from numpy import * In [2]: from numpy.random import * |
Then set the seed value like this:
In [3]: seed(2345) |
If no arguments are specified, values will be substituted based on the system clock— a common and useful default. Then each time you start the program you'll get a different realization from the random number generator. Typically, seed() is only called at the start of a session or when you want to restart the generated sequence.
Random numbers are specified by their distribution. Different distributions are used to simulate different natural processes. One common distribution is the uniform distribution. This is used to model equally random choices between upper and lower values. Another common choice is the normal or Gaussian distribution.
The random module computes sequences with either of these distributions as well as others, including binomial, Poisson, chi-square, F, gamma, beta, and exponential.
All of the distribution functions accept a shape tuple specifying the desired output matrix size. In addition, some functions also require information about the desired distribution. The following code computes a 3-by-3 matrix of normally distributed random numbers with a mean of zero and a variance of 1.0.
In [4]: a = normal(0.0,1.0,(3,3)) In [5]: a Out[5]: array([[-0.95129875, 1.7687718 , -1.14182693], [ 0.71075487, 0.51095064, 1.14902943], [-0.53846023, -0.73664509, -0.07599648]]) |
See the NumPy manual for further information.
The numpy package also contains operations that are particularly useful for dealing with matrices (arrays of rank 2) and vectors (arrays of rank 1).
In [16]: from numpy import * In [17]: a = array([1., 2., 3.]) In [18]: b = array([0., 1., 0.]) In [19]: dot(a,b) Out[19]: 2.0 |
It can also be used for a vector and a matrix (vector-matrix product):
In [22]: m = array([[1.,0.,0.],[0.,1.,0.],[0.,0.,1.]]) In [23]: print m [[ 1. 0. 0.] [ 0. 1. 0.] [ 0. 0. 1.]] In [24]: dot(a,m) Out[24]: array([ 1., 2., 3.]) In [25]: dot(m,a) Out[25]: array([ 1., 2., 3.]) |
Finally, it also applies to two matrices giving the matrix product.
In [28]: mm = array([[1.,2.,3.],[4.,5.,6.],[7.,8.,9.]]) In [29]: dot(m,mm) Out[29]: array([[ 1., 2., 3.], [ 4., 5., 6.], [ 7., 8., 9.]]) |
Note that the last dimension of m must have the same length as the first dimension of mm.
In [30]: transpose(mm) Out[30]: array([[ 1., 4., 7.], [ 2., 5., 8.], [ 3., 6., 9.]]) |
This function can actually be applied to arrays of any rank and in general reverses the order of the dimensions.
In [31]: mmm = array([[1.,2.,3.],[4.,5.,6.]]) In [32]: transpose(mmm) Out[32]: array([[ 1., 4.], [ 2., 5.], [ 3., 6.]]) |
In [35]: identity(5) Out[35]: array([[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1]]) |
In [36]: ravel(mmm) Out[36]: array([ 1., 2., 3., 4., 5., 6.]) |
Note that the order of elements is defined by the first index varying most slowly. So, for example, the elements of a 2 x 2 array are returned in the order (0,0), (0,1), (1,0), (1,1).
In [37]: reshape(mmm,(3,2)) Out[37]: array([[ 1., 2.], [ 3., 4.], [ 5., 6.]]) |
This is useful when constructing special matrices. The shape must lead to the same total number of elements.
In [39]: resize(mmm,(3,3)) Out[39]: array([[ 1., 2., 3.], [ 4., 5., 6.], [ 1., 2., 3.]]) |
The NumPy submodule numpy.linalg contains common matrix operations, where the matrices are represented by two-dimensional arrays. Some operations require square matrices and so check their input accordingly. All operations are available for real and for complex matrices. The linear algebra functions are based on time-tested and efficient LAPACK routines.
Let's set up the environment and create a matrix so that we can try the routines:
In [1]: import numpy as N In [2]: import numpy.linalg as LA In [3]: a = N.reshape(N.arange(25.), (5, 5)) + N.identity(5) In [4]: print a [[ 1. 1. 2. 3. 4.] [ 5. 7. 7. 8. 9.] [ 10. 11. 13. 13. 14.] [ 15. 16. 17. 19. 19.] [ 20. 21. 22. 23. 25.]] |
The function solve(A, B) solves a system of linear equations with a square nonsingular matrix A and a righthand-side vector B. That is, it finds x such that A x = B. Several righthand-side vectors can be treated simultaneously by making B a two-dimensional array, that is, a sequence of vectors. The resulting x is then a matrix.
In particular, the function inv(A) calculates the inverse of the square nonsingular matrix A by calling solve(A, B) with a suitable B—the identity matrix. The resulting matrix x is the inverse of A.
In [5]: inv_a = LA.inv(a) In [6]: print inv_a [[ 0.20634921 -0.52380952 -0.25396825 0.01587302 0.28571429] [-0.5026455 0.63492063 -0.22751323 -0.08994709 0.04761905] [-0.21164021 -0.20634921 0.7989418 -0.1957672 -0.19047619] [ 0.07936508 -0.04761905 -0.17460317 0.6984127 -0.42857143] [ 0.37037037 0.11111111 -0.14814815 -0.40740741 0.33333333]] |
We can test that this is, in fact, the inverse by printing the largest magnitude element of a * a^{-1} - identity(5):
In [7]: print "Inversion error:", \ ...: N.maximum.reduce(N.fabs(N.ravel(N.dot(a, inv_a)-N.identity(5)))) Inversion error: 1.33226762955e-15 |
Note how the long command line was continued on to the next by using a backslash.
The function eigvals(A) returns the eigenvalues of the square matrix A.
In [8]: print LA.eigvals(a) [ 64.91164992 -2.91164992 1. 1. 1. ] |
The function eig(A) returns both eigenvalues and eigenvectors—the latter as a two-dimensional array.
In [9]: print LA.eig(a)[0] [ 64.91164992 -2.91164992 1. 1. 1. ] In [10]: print LA.eig(a)[1] [[-0.0851802 0.67779864 -0.39557464 -0.20718894 0.21892898] [-0.23825372 0.36348873 0.21266495 -0.25222325 -0.06325102] [-0.39132723 0.04917881 0.15377748 0.53859125 0.04797398] [-0.54440074 -0.2651311 0.63674874 0.508243 -0.78191084] [-0.69747425 -0.57944101 -0.60761653 -0.58742207 0.57825889]] |
The function det(A) returns the determinant of the square matrix A.
In [11]: print LA.det(a) -189.0 |
The function svd(A) returns the singular value decomposition of matrix A: three arrays V, S, and WT whose matrix product is the original matrix A = V S WT. V and WT are unitary matrices. The returned S is a vector (rank-1 array) of singular values, the diagonal elements of the singular-value matrix S. This function is mainly used to check whether (and in what way) a matrix is ill-conditioned.
In [12]: V, S, W = LA.svd(a) In [13]: print V [[ -7.25057693e-02 -7.71195769e-01 6.32455532e-01 4.17141669e-16 2.30147128e-16] [ -2.30109765e-01 -4.97040739e-01 -6.32455532e-01 5.35732959e-01 1.13974543e-01] [ -3.87713761e-01 -2.22885709e-01 -3.16227766e-01 -7.99262221e-01 2.47345714e-01] [ -5.45317758e-01 5.12693211e-02 1.11022302e-16 -8.67443548e-03 -8.36615057e-01] [ -7.02921754e-01 3.25424351e-01 3.16227766e-01 2.72203697e-01 4.75294800e-01]] In [14]: print S [ 70.81579622 2.66889607 1. 1. 1. ] In [15]: print W [[ -3.86049372e-01 -4.15649728e-01 -4.45250085e-01 -4.74850441e-01 -5.04450797e-01] [ 6.71539934e-01 3.56700580e-01 4.18612254e-02 -2.72978129e-01 -5.87817483e-01] [ 6.32455532e-01 -6.32455532e-01 -3.16227766e-01 3.33066907e-16 3.16227766e-01] [ -0.00000000e+00 5.35732959e-01 -7.99262221e-01 -8.67443548e-03 2.72203697e-01] [ 0.00000000e+00 1.13974543e-01 2.47345714e-01 -8.36615057e-01 4.75294800e-01]] |
The function lstsq(A, b, cutoff) returns the least-squares solution x to an overdetermined system of linear equations that minimizes the residual of || A x - b ||. The (optional) third argument cutoff indicates the threshold for the range of singular values; it defaults to cutoff = 10-10. There are four return values: (i) the least-squares solution itself x, (ii) the sum of the squared residuals (i.e., the quantity minimized by the solution), (iii) the rank of matrix A, and (iv) the singular values of A in descending order.
In [16]: b = [1.,1.,1.,1.,1.] In [17]: x, resid, rank, s = LA.lstsq(a,b) In [18]: print x [-0.26984127 -0.13756614 -0.00529101 0.12698413 0.25925926] In [19]: print resid [] In [20]: print rank 5 In [21]: print s [ 70.81579622 2.66889607 1. 1. 1. ] |
In general, resid is returned as an empty array when the rank of a is less than or equal to the number of columns of a or greater than or equal to the number of columns.
In [21]: from numpy import * In [22]: from scipy.optimize import fsolve In [23]: from math import pi In [24]: def func(x): ....: return (2*x*cos(x) - sin(x))*cos(x) - x + pi/4.0 ....: In [25]: print fsolve(func,0.0) [ 0.95284786] |
Given a map f(x), the fixed points are values of x that map to themselves: x = f(x).
Here's how we can find the two fixed points of the logistic map at r = 4.0. We find the root of x - f(x).
In [13]: def LogisticMap(x): ....: return x - 4.0 * x * (1.0 - x) ....: In [14]: print fsolve(LogisticMap,0.05) Out[14]: [ 0.0] In [15]: print fsolve(LogisticMap,.5) Out[15]: [ 0.75] |
Note how I changed the range in which to look for the root in order for the routine to find the fixed point at x = 3/4.
There are additional modules for scientific applications that not covered here which provide, for example, statistical parameter estimation and function interpolation. See the NumPy manual and SciPy documentation.
There is no single, standard full-featured plotting module for Python. In fact, there are multiple choices—including Chaco, GnuPlot, matplotlib, MayaVi, VRML, and visual Python—each with its own features. We'll work with several, choosing whichever meets on our needs.
I started with ipython --pylab, this reads the ipythonrc-pylab start-up script in the .ipythonrc subdirectory. With this, I don't have to import the various SciPy and matplotlib functions one by one.
If you didn't start up that way, use
In [1]: from pylab import * |
Here's the plot() function. Let's find out what it does by typing plot? at the iPython prompt:
In [2]: plot? String Form: |
And lots more documentation. Well developed objects in Python typically have lots of information about themselves, which is accessed via the `?' query in iPython. Page through the documentation using the space-bar and exit by typing `q'.
plot() has many options, as you can see. Read over the documentation to check these out.
Let's try an example:
In [3]: plot([1,2,3], [1,2,3], 'go-', label='line 1',linewidth=2) Out[3]: [<:matplotlib.lines.Line2D instance at 0x3415058>] In [4]: plot([1,2,3], [1,4,9], 'rs', label='line 2') Out[4]: [<matplotlib.lines.Line2D instance at 0x3415cd8>] In [5]: axis([0, 4, 0, 10]) Out[5]: [0, 4, 0, 10] In [6]: legend() Out[6]: <matplotlib.legend.Legend instance at 0x3415e40> In [7]: show() |
On most platforms, you have to show() the plot to get a window on the screen that displays what you've built up with the preceding commands.
show() waits for you to exit, which under Unix (at least) happens when you close the graphics window.
Curious about what axis and legend do? Then type axis? and legend?. Again, lots of documentation.
Let's combine the Histogram function from the Statistics module with plot() to take a look at where the iterates of the logistic map visit the interval.
In [1]: from numpy import * In [2]: from pylab import * In [3]: from numpy import histogram In [4]: xlist = zeros((20),float) In [5]: ylist = zeros((20),float) In [6]: its = zeros((1000),float) In [7]: x = 0.3 In [8]: for i in xrange(1000): ...: x = 4.0 * x * (1.0 - x) ...: its[i] = x ...: ...: In [9]: hg = histogram(its,20) In [10]: for i in xrange(20): ....: xlist[i] = hg[1][i] ....: ylist[i] = hg[0][i] ....: ....: In [11]: plot(xlist,ylist) Out[11]: [<matplotlib.lines.Line2D instance at 0x3182378>] In [12]: show() |
Looks like the iterates of the logistic map spend most of their time near 0.0 and near 1.0.
Similarly, we can develop a a better view of the FFT example above. Rather than a list of numbers, we look at its plot:
In [1]: from numpy import * In [2]: from numpy.fft import * In [3]: from pylab import * In [4]: x = arange(128.0) In [5]: y = sin(x/3)+cos(x/5) In [6]: plot(x,y) Out[6]: [<matplotlib.lines.Line2D instance at 0x35988f0>] In [7]: show() |
Quit the plotting window.
Now let's take the FFT:
In [9]: plot(x[0:64],abs(fft(y)[0:64])) Out[9]: [<matplotlib.lines.Line2D instance at 0x3739c60^gt;] In [10]: show() |
The two frequencies are much easier to see with a longer time series and using a graphics display compared to the list of numbers we had before.
We can also work with two-dimensional (matrix) data.
In [1]: from numpy import * In [2]: from numpy.fft import * In [3]: from pylab import * In [4]: x = arange(64.0) In [5]: y = sin(x/3) + cos(x/5) In [6]: z = cos(x/2) In [7]: a = add.outer(y,z) In [8]: print a Out[8]: array([[ 2. , 1.87758256, 1.54030231, ..., 1.60905598, 1.91474236, 1.99646791], [ 2.30726127, 2.18484384, 1.84756358, ..., 1.91631725, 2.22200363, 2.30372918], [ 2.5394308 , 2.41701336, 2.0797331 , ..., 2.14848677, 2.45417315, 2.5358987 ], ..., [ 2.92984989, 2.80743245, 2.47015219, ..., 2.53890586, 2.84459224, 2.92631779], [ 2.95600997, 2.83359253, 2.49631227, ..., 2.56506594, 2.87075233, 2.95247788], [ 2.83609022, 2.71367279, 2.37639253, ..., 2.4451462 , 2.75083258, 2.83255813]]) In [9]: a.shape Out[9]: (64, 64) In [10]: imshow(a) Out[10]: <matplotlib.image.AxesImage instance at 0x2d7dda0> In [11]: show() |
We can then take the 2D FFT:
In [12]: a_fft = abs(fft2(a)) In [13]: imshow(a_fft) Out[13]: <matplotlib.image.AxesImage instance at 0x2d96c88> In [14]: show() |
You can download a truckload of matplotlib examples.