09 01 Introduction and Flat Files - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki
Text Files
Importing entire text files
# Open a file: file
file = open('moby_dick.txt', 'r')
# Print it
print(file.read())
# Close file
file.close()
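The open/read/close pattern above can be tried without the book file. This is a minimal sketch: the temporary file `sample.txt` and its contents are made up for illustration, and the `closed` attribute is used only to show when the file handle is released.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
# (the file name and contents are illustrative, not from the course data).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
out = open(path, 'w')
out.write('Call me Ishmael.\nSome years ago...\n')
out.close()

# Open in read-only mode and read the whole file into one string.
file = open(path, 'r')
text = file.read()
print(file.closed)   # the file is still open at this point

# Close the file explicitly.
file.close()
print(file.closed)   # now the file is closed

print(text)
```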
Context Manager and Importing text files line by line
- You can bind a variable file by using a context manager construct:
with open('huck_finn.txt') as file:
- While still within this construct, the variable file is bound to open('huck_finn.txt'); thus, to print the first line of the file to the shell, all the code you need to execute is:
with open('huck_finn.txt') as file:
    print(file.readline())
- There is no need to close the file explicitly: the context manager closes it when the block exits.
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
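As a small sketch of what the context manager does, the snippet below reads one line with `readline()`, iterates over the remaining lines, and then checks that the file was closed on leaving the `with` block. The temporary file and its four lines are assumptions made up for the example.

```python
import os
import tempfile

# Write a throwaway four-line file (illustrative contents only).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('line one\nline two\nline three\nline four\n')

with open(path) as file:
    # readline() returns the next line, including its trailing newline.
    first = file.readline()
    # Iterating over the file object yields the remaining lines one by one.
    rest = [line.rstrip('\n') for line in file]

print(repr(first))
print(rest)
# Leaving the with block closed the file; no explicit close() was needed.
print(file.closed)
```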
Import Flat Files
- File extensions
  - .csv
  - .txt
- Delimiters: commas, tabs
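To illustrate what the delimiter changes, here is a short sketch that parses the same two-row table once as comma-separated and once as tab-separated text. It uses the standard library's `csv` module rather than the packages covered in these notes, and the sample rows are invented.

```python
import csv
import io

# The same table written with two different delimiters (made-up data).
comma_text = 'name,age\nAda,36\n'
tab_text = 'name\tage\nAda\t36\n'

# csv.reader splits each line on the given delimiter.
comma_rows = list(csv.reader(io.StringIO(comma_text), delimiter=','))
tab_rows = list(csv.reader(io.StringIO(tab_text), delimiter='\t'))

print(comma_rows)
print(tab_rows)
```

Both calls should produce the same list of rows; only the character used to split each line differs.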
Using NumPy to import flat files
- NumPy arrays: standard for storing numerical data
- Essential for other packages: e.g. scikit-learn
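- loadtxt(): raises an error with its default settings when the file mixes data types (e.g. strings alongside floats)
- genfromtxt(): handles mixed types; with dtype=None it infers the type of each column and returns a structured array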
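The difference between the two functions can be sketched on a tiny in-memory file with one string column and one numeric column. The column names and values are made up, and `io.StringIO` stands in for a real file on disk.

```python
import io
import numpy as np

# A mixed-type flat file held in memory (illustrative data).
text = 'name,score\nalice,1.5\nbob,2.0\n'

# loadtxt converts every field to float by default,
# so the string column triggers a ValueError.
try:
    np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1)
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True
print(loadtxt_failed)

# genfromtxt with dtype=None infers one type per column;
# names=True takes the field names from the header row.
data = np.genfromtxt(io.StringIO(text), delimiter=',',
                     names=True, dtype=None, encoding='utf-8')
print(data.dtype.names)
print(data['score'])
```

Fields of the structured array are accessed by name, as with `data['score']` here.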
# Import numpy
import numpy as np
# Assign the filename: file
file = 'digits_header.txt'
# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])
# Print data
print(data)
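To see what `skiprows` and `usecols` do without the course data file, the same call can be run on a small in-memory table. This is a sketch: the three-column table is invented and `io.StringIO` stands in for the file.

```python
import io
import numpy as np

# Tab-delimited text with a header row and three columns (made-up values).
text = 'a\tb\tc\n1\t2\t3\n4\t5\t6\n'

# skiprows=1 skips the header; usecols=[0, 2] keeps the first and third columns.
data = np.loadtxt(io.StringIO(text), delimiter='\t', skiprows=1, usecols=[0, 2])

print(data)
print(data.shape)
```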
Using pandas to import flat files as DataFrames
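Useful read_csv() arguments:
- sep: the delimiter, e.g. '\t'
- comment='#': the character that marks the start of a comment
- na_values='nothing': the string(s) to recognize as NA/NaN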
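A minimal sketch of these three arguments working together, assuming a small tab-delimited table with a comment line and the string 'nothing' used as a missing-value marker. The data is invented and `io.StringIO` stands in for a file.

```python
import io
import pandas as pd

# Tab-delimited text with a comment line and a missing-value marker (made-up data).
text = (
    '# exported by a hypothetical tool\n'
    'id\tvalue\n'
    '1\t3.5\n'
    '2\tnothing\n'
    '3\t7.0\n'
)

# sep sets the delimiter, comment skips the '#' line,
# and na_values turns the string 'nothing' into NaN.
df = pd.read_csv(io.StringIO(text), sep='\t', comment='#', na_values='nothing')

print(df)
print(df['value'].isna().tolist())
```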
# Import pandas
import pandas as pd
# Assign the filename: file
file = 'digits.csv'
# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, header=None, nrows=5)
# Build a numpy array from the DataFrame: data_array
data_array = data.values
# Print the datatype of data_array to the shell
print(type(data_array))  # <class 'numpy.ndarray'>