09 02 Importing data from other file types - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

  • Excel spreadsheets
  • MATLAB files
  • SAS files
  • Stata files
  • HDF5 files

Pickled files

  • File type native to Python
    • used to store dictionaries, NumPy arrays, etc.
  • Motivation: many datatypes for which it isn’t obvious how to store them
  • Pickled files are serialized
    • Serialize = convert object to bytestream
# Import pickle package
import pickle

# Open pickle file and load data: d
with open('data.pkl', 'rb') as file: # r: read, b: binary
    d = pickle.load(file)
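
A minimal round-trip sketch of serialization, which shows how a file like `data.pkl` could be produced in the first place. The dictionary contents and the temporary file path are made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical object to store; any picklable Python object works
original = {'June': '69.4', 'Aug': '85', 'nums': [1, 2, 3]}

# Write to a temporary directory so the working directory is left untouched
path = os.path.join(tempfile.mkdtemp(), 'example.pkl')

# Serialize the object to a bytestream: w = write, b = binary
with open(path, 'wb') as f:
    pickle.dump(original, f)

# Deserialize it again: r = read, b = binary
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(type(restored))
print(restored == original)
```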

Excel spreadsheets

  • load the spreadsheet
  • get the sheet names
  • import sheets by name or by index
import pandas as pd

# Assign spreadsheet filename: file
file = 'battledeath.xlsx'

# Load spreadsheet: xls
xls = pd.ExcelFile(file)

# Print sheet names
print(xls.sheet_names) #  ['2002', '2004']

# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')

# Print the head of the DataFrame df1
print(df1.head())

# Load a sheet into a DataFrame by index: df2
df2 = xls.parse(0)

# Print the head of the DataFrame df2
print(df2.head())

Customizing your spreadsheet import

  • usecols=[0]: a list of column indices to parse; [0] keeps only the first column (recent pandas versions no longer accept a bare integer such as usecols=0)
  • skiprows=[0]: a list of row indices to skip (0-indexed)
  • names=[]: a list of names to use for the columns
# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])
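
To try these three parameters without a spreadsheet at hand, the sketch below uses `pd.read_csv` on an in-memory string instead of `xls.parse`. This is a swapped-in reader, not the Excel code path, chosen because `read_csv` accepts the same `usecols`, `skiprows` and `names` arguments. The data is made up.

```python
import io
import pandas as pd

# Made-up stand-in for a sheet: one title row, then two columns of data
raw = io.StringIO(
    "Some title,Some year\n"
    "Albania,0.13\n"
    "Brazil,0.47\n"
)

# Keep only the first column, skip the title row, and name the column
df = pd.read_csv(raw, usecols=[0], skiprows=[0], names=['Country'])

print(df)
```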

Importing SAS/Stata files using pandas

Importing SAS files

# Import sas7bdat package
from sas7bdat import SAS7BDAT

# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()

Using read_stata to import Stata files

# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')
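
If no `.dta` file is available, pandas can write one itself, which gives a quick way to try `pd.read_stata`. A minimal sketch with made-up data and a temporary file path:

```python
import os
import tempfile
import pandas as pd

# Made-up data standing in for a real Stata dataset
df_out = pd.DataFrame({
    'country': ['Albania', 'Brazil'],
    'area': [1.5, 2.5],
})

# Write a .dta file to a temporary directory, leaving out the index column
path = os.path.join(tempfile.mkdtemp(), 'toy.dta')
df_out.to_stata(path, write_index=False)

# Read it back into a DataFrame
df_in = pd.read_stata(path)

print(df_in)
```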

Importing HDF5 files

  • Hierarchical Data Format version 5
  • Standard for storing large quantities of numerical data
  • Datasets can be hundreds of gigabytes or terabytes
  • HDF5 can scale to exabytes

Using h5py to import HDF5 files

# Import packages
import numpy as np
import h5py

# Assign filename: file
file = 'LIGO_data.hdf5'

# Load file: data
data = h5py.File(file, 'r')

# Print the datatype of the loaded file
print(type(data))  #<class 'h5py._hl.files.File'>

# Print the keys of the file
for key in data.keys():
    print(key)
#meta
#quality
#strain

print(type(data['meta']))  # <class 'h5py._hl.group.Group'>

Extracting data from your HDF5 file

# Get the HDF5 group: group
group = data['strain']

# Check out keys of group
for key in group.keys():
    print(key)

# Set variable equal to time series data: strain
strain = np.array(data['strain']['Strain'])

# Set number of time points to sample: num_samples
num_samples = 10000

# Set time vector
time = np.arange(0, 1, 1/num_samples)

# Plot data (plt comes from matplotlib)
import matplotlib.pyplot as plt
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()
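
The file / group / dataset hierarchy can also be explored on a small file built by hand. The sketch below assumes h5py is installed. It writes a toy file whose group and dataset names only mimic the LIGO layout (the values are made up), then reads it back.

```python
import os
import tempfile
import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), 'toy.hdf5')

# Build a tiny file: two groups, each holding one dataset (made-up values)
with h5py.File(path, 'w') as f:
    f.create_group('strain').create_dataset('Strain', data=np.linspace(0.0, 1.0, 5))
    f.create_group('meta').create_dataset('Duration', data=4096)

# Reopen read-only and walk down the hierarchy
with h5py.File(path, 'r') as f:
    top_keys = sorted(f.keys())                    # names of the top-level groups
    is_group = isinstance(f['strain'], h5py.Group)
    dset = f['strain']['Strain']                   # an h5py Dataset, still on disk
    shape = dset.shape
    values = dset[:]                               # slicing copies into a NumPy array

print(top_keys, is_group, shape)
print(values)
```

Slicing with `[:]` inside the `with` block matters here: once the file is closed, the dataset handle can no longer be read, while the NumPy copy remains usable.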

Importing MATLAB files

Loading .mat files

# Import package
import scipy.io

# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')

# Print the datatype of mat
print(type(mat))  # <class 'dict'>
  • keys = MATLAB variable names
  • values = objects assigned to variables
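
The key/value structure can be checked with a round trip through `scipy.io.savemat`. This is a sketch with a made-up variable name and array. Besides the saved variable, the loaded dictionary also carries a few metadata entries whose names start with a double underscore.

```python
import os
import tempfile
import numpy as np
import scipy.io

# Made-up array saved under the hypothetical MATLAB variable name 'expr'
arr = np.arange(6).reshape(2, 3)
path = os.path.join(tempfile.mkdtemp(), 'toy.mat')
scipy.io.savemat(path, {'expr': arr})

# Load the file back: the result is a plain dict
mat = scipy.io.loadmat(path)

# Separate the MATLAB variables from the metadata entries
user_keys = [k for k in mat if not k.startswith('__')]

print(sorted(mat.keys()))
print(user_keys)
print(type(mat['expr']), mat['expr'].shape)
```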