Python code solutions - mdepasca/miniature-adventure GitHub Wiki

Pandas

To read an ASCII file

I used read_csv, a function with many parameters.

The file is SIMGEN_PUBLIC_DES.DUMP from the Supernova Photometric Classification Challenge. I want to read just the supernova ID and its type. Here is a chunk of it:

NVAR: 36 
VARNAMES:  CID GENTYPE SNTYPE NON1A_INDEX GENZ HOSTZ....
SN:  857494 3 -9 1 3.9452e-01 3.7062e-01 3.6300e-02 ....
SN:  238436 3 -9 1 2.5966e-01 2.6396e-01 3.0200e-02 ....
SN:  870263 3 -9 1 8.2972e-01 8.2412e-01 2.0200e-02 ....
SN:  419054 3 -9 1 7.2242e-01 7.3212e-01 1.5300e-02 ....
SN:  463559 3 -9 1 8.8864e-01 9.0714e-01 1.7600e-02 ....
SN:  368063 3 -9 1 7.5189e-01 7.5669e-01 1.4300e-02 ....
SN:  105238 3 -9 1 6.0254e-01 5.9714e-01 2.2000e-02 ....
SN:  812941 3 -9 1 4.7147e-01 5.0657e-01 2.4900e-02 ....

Things to do are:

skip the first line
keep the second line as the header (i.e. the column names)
select only the CID and GENTYPE columns.

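import pandas as pd

path = 'SIMGEN_PUBLIC_DES.DUMP'  # adjust to where the file is stored

# Use the second line as header and read only columns 1 and 2 (CID, GENTYPE)
dump = pd.read_csv(path, sep=' ', skiprows=0, header=1,
                   usecols=[1, 2], skipinitialspace=True, engine='c')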

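sep = ' ' is specified because its default is ','
skiprows, given as an integer, is the number of lines to skip at the start of the file (a list would give zero-indexed line numbers instead). With 0 nothing is skipped; the first line is dropped because it comes before the header row
header is the zero-indexed number of the row to use as header, so 1 means the second line
usecols specifies the indices of the columns to read
skipinitialspace skips spaces that follow a delimiter. It is used because there is more than one space before the CID column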
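A minimal sketch of the call above, run on a short in-memory sample instead of the real file. The sample text (two data rows, six columns, `NVAR: 6`) is made up for illustration and only follows the layout of the excerpt; `io.StringIO` stands in for `path`, and `sample_dump` is a hypothetical name.

```python
import io
import pandas as pd

# Made-up sample in the same layout as the excerpt (not the real file)
sample = (
    "NVAR: 6\n"
    "VARNAMES:  CID GENTYPE SNTYPE NON1A_INDEX GENZ HOSTZ\n"
    "SN:  857494 3 -9 1 3.9452e-01 3.7062e-01\n"
    "SN:  238436 3 -9 1 2.5966e-01 2.6396e-01\n"
)

# Same arguments as in the call above; StringIO replaces the file path
sample_dump = pd.read_csv(io.StringIO(sample), sep=' ', skiprows=0, header=1,
                          usecols=[1, 2], skipinitialspace=True, engine='c')

print(list(sample_dump.columns))  # names of the two selected columns
print(len(sample_dump))           # number of data rows read
```

Column 0 holds the `VARNAMES:` and `SN:` tags, which is why the wanted columns sit at indices 1 and 2 rather than 0 and 1.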

The problem is that the type associated with each column is not numeric but the generic object type.

In [1]: dump.dtypes
Out[1]:
CID        object
GENTYPE    object
dtype: object
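The notes do not say why the columns come out as object. One common cause, assumed here purely for illustration, is that at least one entry in the column is not a number. The sketch below uses a made-up two-column text (the names `text`, `df` and `bad_rows` are hypothetical) to show the effect and one way to locate such entries with `pd.to_numeric`.

```python
import io
import pandas as pd

# Hypothetical two-column text; 'x' is a deliberately non-numeric entry
text = "a b\n1 2\n3 x\n"
df = pd.read_csv(io.StringIO(text), sep=' ')

print(df.dtypes)  # 'a' is read as numeric, 'b' is not

# Entries that cannot be parsed become NaN, which marks the offending rows
as_number = pd.to_numeric(df['b'], errors='coerce')
bad_rows = df[as_number.isna()]
print(bad_rows)
```

Note that this check also flags entries that were already missing, so the rows it returns still need a look by eye.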

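The object type is subtle: it is the generic type pandas falls back to when a column cannot be stored as a single numeric type, typically because it holds strings. To convert it to something else, a specific function has to be used. convert_objects did this in the pandas version used here; it was later deprecated and removed from pandas, where pd.to_numeric is the replacement.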

In [60]: dump = dump.convert_objects(convert_numeric=True, copy=False)

In [61]: dump.dtypes
Out[61]: 
CID        float64
GENTYPE    float64
dtype: object
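On a pandas release where `convert_objects` is no longer available, the same conversion can be sketched with `pd.to_numeric`. The small DataFrame below is a made-up stand-in for `dump`, with the numbers deliberately stored as strings; `converted` is a hypothetical name.

```python
import pandas as pd

# Made-up stand-in for the DataFrame read above: numbers stored as strings
dump = pd.DataFrame({'CID': ['857494', '238436'],
                     'GENTYPE': ['3', '3']}, dtype=object)

# Convert column by column; errors='coerce' turns unparsable entries into NaN
converted = dump.apply(pd.to_numeric, errors='coerce')

print(converted.dtypes)
```

With `errors='coerce'`, a column that contains unparsable entries ends up as a float column holding NaN in their place, so it is worth counting the NaN values afterwards.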