09 01 Introduction and Flat Files - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

Text Files

Importing entire text files

# Open a file: file
file = open('moby_dick.txt', 'r')

# Print it
print(file.read())

# Close file
file.close()

Context Manager and Importing text files line by line

  • You can bind the variable file with a context manager construct: with open('huck_finn.txt') as file:
  • Inside this construct, file is bound to the file object returned by open('huck_finn.txt'), so printing the first line of the file to the shell takes only:
with open('huck_finn.txt') as file:
    print(file.readline())
  • No need to close the file explicitly: the context manager closes it when the with block exits, even if an error occurs
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
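
Calling readline() repeatedly works for a handful of lines. For more, a file object can also be iterated directly, one line per loop step. The sketch below assumes a throwaway file (the name sample.txt and its contents are placeholders, not course data) so that it runs anywhere:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The file name and its contents are placeholders, not course data.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\nthird line\nfourth line\n')

# Iterate over the file object: lines are read lazily, one per loop step
first_three = []
with open(path) as file:
    for i, line in enumerate(file):
        if i == 3:
            break
        # Strip the trailing newline that every line carries
        first_three.append(line.rstrip('\n'))

print(first_three)

# The context manager has already closed the file at this point
print(file.closed)
```

Stripping the newline avoids the blank line that print(file.readline()) leaves between lines, since print() adds its own newline.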

Import Flat Files

  • File extension
    • .csv
    • .txt
    • Delimiters: commas, tabs

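To illustrate what the delimiter does, here is a minimal sketch that parses the same made-up table twice, once comma-separated and once tab-separated, using the standard library csv module. The helper parse() and the sample rows are hypothetical and only serve the illustration:

```python
import csv
import io

# The same table written with two different delimiters (made-up data)
comma_text = 'name,age\nAda,36\nAlan,41\n'
tab_text = 'name\tage\nAda\t36\nAlan\t41\n'


def parse(text, delimiter):
    """Return the rows of a delimited text block as lists of strings."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


comma_rows = parse(comma_text, ',')
tab_rows = parse(tab_text, '\t')

print(comma_rows)

# Once the right delimiter is given, both files yield identical rows
print(comma_rows == tab_rows)
```

Note that csv.reader returns every field as a string; converting columns to numbers is left to the caller or to NumPy and pandas.
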
Using NumPy to import flat files

  • NumPy arrays: standard for storing numerical data
  • Essential for other packages: e.g. scikit-learn
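  • loadtxt(): raises an error when the columns hold mixed data types (e.g. floats and strings) and the default float dtype is used
  • genfromtxt(): handles mixed types; with dtype=None it infers a type per column and returns a structured array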
# Import numpy
import numpy as np

# Assign the filename: file
file = 'digits_header.txt'

# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])

# Print data
print(data)
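
The difference between loadtxt() and genfromtxt() on mixed-type data can be sketched as follows. The tab-delimited text and its column names (id, label, score) are invented for the example and read from an in-memory buffer rather than from a course file:

```python
import io

import numpy as np

# Made-up tab-delimited data with a header and one string column
text = 'id\tlabel\tscore\n1\ta\t0.5\n2\tb\t1.5\n'

# loadtxt() with its default float dtype cannot convert the string column
try:
    np.loadtxt(io.StringIO(text), delimiter='\t', skiprows=1)
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True
print(loadtxt_failed)

# Restricting loadtxt() to the numeric columns works
numeric = np.loadtxt(io.StringIO(text), delimiter='\t', skiprows=1, usecols=[0, 2])
print(numeric)

# genfromtxt() with dtype=None infers a type per column;
# names=True takes the field names from the header row
mixed = np.genfromtxt(io.StringIO(text), delimiter='\t', names=True,
                      dtype=None, encoding='utf-8')
print(mixed.dtype.names)
print(mixed['score'])
```

Fields of the structured array are accessed by name, as in mixed['score'], rather than by column index.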

Using pandas to import flat files as DataFrames

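  • sep: the delimiter, e.g. '\t'
  • comment='#': character that starts a comment; the rest of the line is ignored
  • na_values='nothing': string(s) to treat as NA/NaN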
# Import pandas
import pandas as pd

# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, header=None, nrows=5)

# Build a numpy array from the DataFrame: data_array
data_array = data.values

# Print the datatype of data_array to the shell
print(type(data_array))  # <class 'numpy.ndarray'>
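
The sep, comment and na_values arguments listed above can be tried together in a small sketch. The data below is invented and read from an in-memory buffer; the comment line and the 'nothing' placeholder are assumptions chosen to exercise each argument:

```python
import io

import pandas as pd

# Made-up tab-delimited data with a comment line and a missing-value marker
text = (
    '# exported by a hypothetical tool\n'
    'id\tvalue\n'
    '1\t10\n'
    '2\tnothing\n'
    '3\t30\n'
)

df = pd.read_csv(
    io.StringIO(text),
    sep='\t',             # fields are tab-separated
    comment='#',          # skip everything after a '#'
    na_values='nothing',  # treat the string 'nothing' as NaN
)
print(df)

# Count the missing values that na_values produced
n_missing = int(df['value'].isna().sum())
print(n_missing)

# to_numpy() is an alternative to the .values attribute
data_array = df.to_numpy()
print(type(data_array))
```

Because one entry is NaN, pandas stores the value column as floats rather than integers.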