Scipy: curve fitting

This page deals with fitting in python, mostly in the sense of least-squares fitting (but not limited to it). As with many other things in python and scipy, fitting routines are scattered across several places and are not always easy to find or learn to use. Here I cover numpy's polyfit and scipy's least squares and orthogonal distance regression functions.

Polynomial fitting

Polynomial fitting is one of the simplest cases, and one used often. The quick and easy way to do it in python is with numpy's polyfit. It's fast, reliable and simple to use. So why would you want more? One reason is if you want the errors on the fitted coefficients: numpy's polyfit doesn't compute them. A more complete (if more complicated) option is scipy's odrpack (Orthogonal Distance Regression).

numpy polyfit

This is the simplest option. We will fit a polynomial (in this case of degree 3) to the data arrays x, y. First, let's get some random data:

import numpy as N

x = N.arange(-20,20)
y = 0.5*x**3 + 1.5*x**2 - 2.5*x + 69 + N.random.normal(scale=200,size=len(x))

y is in this case a cubic polynomial with coefficients [0.5, 1.5, -2.5, 69], plus some random noise. Now, fitting a degree 3 polynomial is quite easy:

fit = N.polyfit(x,y,3)

In my case, the fit array contained:

array([   0.49374829,    1.15468274,   -5.62868479,  139.4280685 ])

Compare with our initial values of [0.5, 1.5, -2.5, 69]. Not bad, but not exactly there because of the added noise. You can visualize the results using polyval:

y_new = N.polyval(fit,x)

# assuming matplotlib's plot is available, e.g. from matplotlib.pyplot import plot
plot(x,y,'b-')
plot(x,y_new,'r-')

[figure fit 1: the data (blue) and the fitted polynomial (red)]

odrpack

If the above is not enough for you and you need something more serious, then ODRPACK is the way to go. In my case I needed the errors on the fitting parameters, so numpy's polyfit was not an option. Scipy's odrpack is possibly one of the most overlooked gems in the whole Scipy distribution: it's incredibly powerful and flexible, yet its documentation is limited to the docstrings of the functions. It is a wrapper around the powerful ODRPACK FORTRAN library, where ODR stands for Orthogonal Distance Regression. You can see it as an extension of least squares, where you can even feed in the errors of your data and have them taken into account in the fitting.
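
For example, here is a minimal sketch of how the data errors come in (assuming x_err and y_err are arrays with the 1-sigma uncertainties of x and y). Odrpack's RealData takes the errors directly and converts them into weights for the fit:

from scipy.odr import odrpack as odr
from scipy.odr import models

mydata = odr.RealData(x, y, sx=x_err, sy=y_err)  # errors become fit weights
myodr  = odr.ODR(mydata, models.polynomial(3))
fit    = myodr.run()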

In this case we will use mode 2 of odrpack, which is least squares. The other modes (0 and 1) are implicit and explicit ODR; if you want to use those I suggest reading the documentation. Odrpack requires a few setup steps before a fit can be done, so I have a small program that does them for me and extracts only what I need (fit coefficients and errors). It works with a syntax similar to numpy's polyfit:

from scipy.odr import odrpack as odr
from scipy.odr import models

def poly_lsq(x,y,n,verbose=False,itmax=200):
    ''' Performs a polynomial least squares fit to the data,
    with errors! Uses scipy odrpack, but for least squares.

    IN:
       x,y (arrays) - data to fit
       n (int)      - polynomial order
       verbose      - can be 0,1,2 for different levels of output
                      (False or True are the same as 0 or 1)
       itmax (int)  - optional maximum number of iterations

    OUT:
       coeff - polynomial coefficients, highest order first
               (same convention as numpy's polyfit)
       err   - standard error (1-sigma) on the coefficients

    --Tiago, 20071114
    '''

    # http://www.scipy.org/doc/api_docs/SciPy.odr.odrpack.html
    # see models.py for the ready-made models

    func   = models.polynomial(n)
    mydata = odr.Data(x, y)
    myodr  = odr.ODR(mydata, func, maxit=itmax)

    # Set type of fit to least-squares:
    myodr.set_job(fit_type=2)
    if verbose == 2: myodr.set_iprint(final=2)

    fit = myodr.run()

    # Display results:
    if verbose: fit.pprint()

    if fit.stopreason[0] == 'Iteration limit reached':
        print('(WWW) poly_lsq: Iteration limit reached, result not reliable!')

    # Results and errors (reversed so that the highest order comes first)
    coeff = fit.beta[::-1]
    err   = fit.sd_beta[::-1]

    return coeff, err

(also available from the code section: poly_lsq.py)

This function can be used as in the above example of numpy's polyfit, and returns the coefficients and errors.
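
For instance, with the x, y arrays from the polyfit example above:

coeff, err = poly_lsq(x, y, 3)
print(coeff)   # highest order first, same convention as polyfit
print(err)     # 1-sigma errors on each coefficient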

General least-squares

Real life is not only made of polynomials, so you will eventually want to fit other functions. Again, I present two methods for fitting arbitrary 1D functions: a simple one (scipy's least squares) and a more powerful one (scipy's odrpack).

Scipy least squares

The optimize module of scipy has a least squares function, optimize.leastsq. This (you guessed it) is a wrapper around the MINPACK FORTRAN library, and uses a modified Levenberg-Marquardt algorithm. Let's do an example of fitting a gaussian function. First, define a generic gaussian:

import numpy as N

def gaussian(B,x):
    ''' Returns the gaussian function for B = [mean, stdev, area, offset] '''
    return B[3]+B[2]/(B[1]*N.sqrt(2*N.pi))*N.exp(-((x-B[0])**2/(2*B[1]**2)))

Now, to use scipy's least squares you need to define an error function. This function should return the difference between some data and the function you want to fit. In this case, with the gaussian function:

def errfunc(p,x,y):
    return y-gaussian(p,x)

Warning: the error function has to return y - function. Do not try to be smart and return the squared difference: leastsq squares and sums the residuals internally, so it will not work!

Note the order of the arguments: p (the array with the parameters) must come first in errfunc.

Let's get some random data and see this in action.

from scipy import optimize

x = N.arange(-2,2,0.01)
# parameters of our gaussian
p = [0.5,0.55,1.5,0.5]
y = gaussian(p,x) + N.random.normal(scale=0.02,size=len(x))

# initial estimate of parameters 
p0 = [1., 1., 1., 1.]
fit = optimize.leastsq(errfunc,p0,args=(x,y))

Now the fitted gaussian parameters are in fit, which in my case contains:

(array([ 0.50949099,  0.53044489,  1.4721773 ,  0.50440252]), 1)

Yes, it's in fact a tuple. The first item is the array with the results, and the second is an integer flag: 1 (or 2, 3, 4) means the fit converged, while other values signal a problem.
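
Incidentally, you can get the errors on the parameters from leastsq too, by asking for the full output. Here is a sketch (the covariance matrix returned by leastsq must be scaled by the residual variance to give the errors):

p_fit, cov, info, mesg, ier = optimize.leastsq(errfunc, p0, args=(x, y),
                                               full_output=True)
if ier in (1, 2, 3, 4) and cov is not None:  # the fit converged
    s_sq  = (errfunc(p_fit, x, y)**2).sum() / (len(y) - len(p0))
    p_err = N.sqrt(N.diag(cov * s_sq))       # 1-sigma errors on the parameters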

So the fit was quite good. Let's see it graphically:

plot(x,y,'b-')
plot(x,gaussian(fit[0],x),'r-')

[figure fit 2: the data (blue) and the fitted gaussian (red)]

Least squares with odrpack

Least squares fitting of a general function with odrpack is a bit different from optimize.leastsq. Once you have the function you want to fit, you create a model object based on that function, a data object based on the data you want to fit, and then an odr object that does the fitting and gives the results. Here's a simple example:

from scipy.odr import odrpack as odr
from scipy.odr import models

def my_function(p, x):
    # (some code...)
    return result

my_model = odr.Model(my_function)
my_data = odr.Data(x,y)

# beta0, the initial guess for the parameters, is required for a custom model
my_odr = odr.ODR(my_data, my_model, beta0=p0)

# fit type 2 for least squares
my_odr.set_job(fit_type=2)

fit = my_odr.run()

And now the fitting results (parameters, errors, correlation matrix, and debugging output) are in the object fit. The coefficients are stored in fit.beta, and the errors in the coefficients in fit.sd_beta. This is a very generic approach to least squares fitting with odrpack.
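
To make this concrete, here is the template filled in for a straight line (just a sketch; the linear function and its initial guess are illustrative, not part of odrpack):

def linear(p, x):
    return p[0]*x + p[1]

my_odr = odr.ODR(odr.Data(x, y), odr.Model(linear), beta0=[1., 1.])
my_odr.set_job(fit_type=2)   # least squares again
fit = my_odr.run()
print(fit.beta)      # fitted slope and intercept
print(fit.sd_beta)   # their standard errors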

You don't always have to define an odr.Model(). You can see in the polynomial example above that I used models.polynomial(n). This is provided by odrpack and gives you an odr.Model() instance for a particular type of function. There are a few more defined in models.py (look inside the odr folder of the scipy distribution, in python's site-packages).

When you do define an odr.Model(), it is useful to add the derivatives of the function with respect to the parameters and to the data (in this case, x). If you have an analytical expression for your function this is recommended, as it saves time by not computing numerical derivatives and is more precise. Here is a more useful example, with a code that I use to fit gaussians. I define the same gaussian function as before:

from scipy.odr import odrpack as odr
from scipy.odr import models
import math
import numpy as N

def gaussian(B,x):
    ''' Returns the gaussian function for B = [mean, stdev, area, offset] '''
    return B[3]+B[2]/(B[1]*math.sqrt(2*math.pi))*N.exp(-((x-B[0])**2/(2*B[1]**2)))


def gauss_lsq(x,y,verbose=False,itmax=200,iparams=None):
    ''' Performs a gaussian least squares fit to the data,
    with errors! Uses scipy odrpack, but for least squares.'''

    def _gauss_fjd(B,x):
        # Analytical derivative of the gaussian with respect to x
        return -(x-B[0])/B[1]**2 * (gaussian(B,x)-B[3])

    def _gauss_fjb(B,x):
        # Analytical derivatives of the gaussian with respect to the parameters
        core = gaussian(B,x)-B[3]   # the gaussian without the offset
        _ret = N.concatenate(( (x-B[0])/B[1]**2 * core,
                               ((x-B[0])**2/B[1]**2 - 1)/B[1] * core,
                               core/B[2],
                               N.ones(x.shape, float) ))
        _ret.shape = (4,) + x.shape
        return _ret

    # Centre data on mean(x) (gives a better conditioned matrix)
    mx = N.mean(x)
    x2 = x - mx

    if iparams is None:
        # automatic guess of the gaussian's initial parameters (saves iterations)
        iparams = N.array([x2[N.argmax(y)], N.std(y),
                           math.sqrt(2*math.pi)*N.std(y)*N.max(y), 1.])

    gauss  = odr.Model(gaussian, fjacd=_gauss_fjd, fjacb=_gauss_fjb)

    mydata = odr.Data(x2, y)
    myodr  = odr.ODR(mydata, gauss, beta0=iparams, maxit=itmax)

    # Set type of fit to least-squares:
    myodr.set_job(fit_type=2)
    if verbose == 2: myodr.set_iprint(final=2)

    fit = myodr.run()

    # Display results:
    if verbose:
        fit.pprint()
        print('Re-centered Beta: [%f  %f  %f  %f]' %
              (fit.beta[0]+mx, fit.beta[1], fit.beta[2], fit.beta[3]))

    itlim = False
    if fit.stopreason[0] == 'Iteration limit reached':
        itlim = True
        print('(WWW) gauss_lsq: Iteration limit reached, result not reliable!')

    # Results and errors
    coeff = fit.beta
    coeff[0] += mx  # recentre on the original axis
    err   = fit.sd_beta

    return coeff, err, itlim

(again, also available from the code: gaussian_fit.py)

You can see that I define the function's derivatives in _gauss_fjd and _gauss_fjb. Pay attention to their format, in particular the shape of the returned arrays, if you want to build them for another function. This program makes fitting much easier: given a set of data x, y you just do:

fit,errors,itlim = gauss_lsq(x,y)

and you have your results.
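
As before, you can check the result graphically (fit holds [mean, stdev, area, offset], with the mean already recentred on the original x axis):

plot(x,y,'b-')
plot(x,gaussian(fit,x),'r-')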