Python code solutions - mdepasca/miniature-adventure GitHub Wiki
Pandas
To read an ASCII file
I used read_csv, a function with many parameters.
The file is SIMGEN_PUBLIC_DES.DUMP from the Supernova Photometric Classification Challenge. I want to read just the supernova ID and its type. Here is a chunk of it:
NVAR: 36
VARNAMES: CID GENTYPE SNTYPE NON1A_INDEX GENZ HOSTZ....
SN: 857494 3 -9 1 3.9452e-01 3.7062e-01 3.6300e-02 ....
SN: 238436 3 -9 1 2.5966e-01 2.6396e-01 3.0200e-02 ....
SN: 870263 3 -9 1 8.2972e-01 8.2412e-01 2.0200e-02 ....
SN: 419054 3 -9 1 7.2242e-01 7.3212e-01 1.5300e-02 ....
SN: 463559 3 -9 1 8.8864e-01 9.0714e-01 1.7600e-02 ....
SN: 368063 3 -9 1 7.5189e-01 7.5669e-01 1.4300e-02 ....
SN: 105238 3 -9 1 6.0254e-01 5.9714e-01 2.2000e-02 ....
SN: 812941 3 -9 1 4.7147e-01 5.0657e-01 2.4900e-02 ....
Things to do are:
- skip the first line
- keep the second as header (aka column names)
- select only the CID and GENTYPE columns.
import pandas as pd
dump = pd.read_csv(path, sep=' ', skiprows=0, header=1,
                   usecols=[1, 2], skipinitialspace=True, engine='c')
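- sep=' ' is specified because its default is ','
- skiprows is the number of lines to skip at the start of the file (or, given a list, the zero-indexed line numbers to skip); with 0 nothing is skipped
- header is the zero-indexed number of the row to use as header; lines above it are discarded, which is what drops the first line here
- usecols specifies the zero-indexed columns to read; column 0 holds the VARNAMES: and SN: labels, so CID and GENTYPE are columns 1 and 2
- skipinitialspace skips the spaces that follow a delimiter. It is used because there is more than one space before the CID column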
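The call above can be tried on a small in-memory sample. This is only a minimal sketch: the sample text is a shortened copy of the lines shown earlier (the elided trailing columns are left out, and the extra spaces after the first field are an assumption about the real layout), and io.StringIO stands in for the real file path.

```python
import io
import pandas as pd

# Shortened stand-in for SIMGEN_PUBLIC_DES.DUMP (assumption: only the first
# few columns are kept, and several spaces follow the leading label).
sample = (
    "NVAR: 36\n"
    "VARNAMES:  CID GENTYPE SNTYPE NON1A_INDEX GENZ HOSTZ\n"
    "SN:    857494 3 -9 1 3.9452e-01 3.7062e-01\n"
    "SN:    238436 3 -9 1 2.5966e-01 2.6396e-01\n"
)

# Same parameters as the call above; only the input source differs.
dump = pd.read_csv(io.StringIO(sample), sep=' ', skiprows=0, header=1,
                   usecols=[1, 2], skipinitialspace=True, engine='c')

print(dump)
print(dump.dtypes)
```

On a clean sample like this one pandas should infer numeric columns directly, so an object dtype on the full file presumably comes from content that this excerpt does not show.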
The problem is that the type associated with each column is not numeric but the generic object.
In [1]: dump.dtypes
Out[1]:
CID object
GENTYPE object
dtype: object
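The object type is subtle: it is a generic container, typically holding strings or mixed values. To convert it to something else, a specific function has to be used. In the pandas version used here that function is convert_objects, which has since been deprecated and removed from pandas.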
In [60]: dump = dump.convert_objects(convert_numeric=True, copy=False)
In [61]: dump.dtypes
Out[61]:
CID float64
GENTYPE float64
dtype: object
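On recent pandas versions, where convert_objects is no longer available, the same conversion can be done with pd.to_numeric. The following is a minimal sketch, not the method used above: the small frame is a hypothetical stand-in for dump, with the numbers held as strings.

```python
import pandas as pd

# Hypothetical stand-in for `dump`: numeric values stored as strings.
dump = pd.DataFrame({'CID': ['857494', '238436'],
                     'GENTYPE': ['3', '3']})

# Convert every column; errors='coerce' turns unparseable entries into NaN
# instead of raising an exception.
dump = dump.apply(pd.to_numeric, errors='coerce')

print(dump.dtypes)
```

With errors='coerce', any entry that cannot be parsed becomes NaN, and a column containing NaN is stored as float64. That may be why the session above reports float64 rather than an integer type, although the source does not confirm it.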