Day 3 (9 10 2019): Data Manipulation with Pandas - Ajarlin/Data-Science GitHub Wiki
[Key Data Structures]
The two core pandas data structures used throughout these notes
Data Frames
- a 2D Data structure to hold data identified by rows and columns.
Series
- a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)
Representation of DataFrames
- Column names: Should be unique (pandas permits duplicate names, but they make column selection ambiguous)
- Left: The index values are all unique and numeric, acting as a row number
- Right: The index values are named and non-unique
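The two kinds of index described above can be reproduced with a small sketch. The frame names `numbered` and `labeled` and the sample scores are made up for illustration and do not come from the original figure.

```python
import pandas as pd

# Default index: unique integers 0..n-1 that act as row numbers.
numbered = pd.DataFrame({"midterm": [67, 89, 90]})

# Named index in which one label ("adg2") appears twice.
labeled = pd.DataFrame(
    {"midterm": [67, 89, 90]},
    index=["adg111", "adg2", "adg2"],
)

print(numbered.index.is_unique)  # True
print(labeled.index.is_unique)   # False

# Selecting a repeated label returns every row that carries it.
print(labeled.loc["adg2"])
```

With a non-unique index, `.loc` can return several rows for one label, which is worth keeping in mind before treating the index as a key.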
Create Data Frames
[Create a Data frame]
import pandas as pd
import numpy as np
tmp = pd.DataFrame()
tmp
[Create a pandas Series]
import pandas as pd
tmp = pd.Series([2,3,3,4])
tmp
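Because a Series is a labeled array, it can also carry its own index and a name. The following is a minimal sketch; the labels and the name `midterm` are illustrative assumptions rather than part of the original example.

```python
import pandas as pd

# A Series with explicit labels instead of the default 0..n-1 index.
scores = pd.Series([67, 89, 90], index=["adg111", "adg2", "adg3"], name="midterm")

print(scores["adg2"])   # label-based lookup
print(scores.iloc[0])   # position-based lookup
print(scores.dtype)     # element type chosen by pandas
```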
[Reading a CSV file to create a Data Frame]
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.__version__
%matplotlib inline
df=pd.read_csv("filename.csv")
df.head()
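To try `read_csv` without a file on disk, the same call can read from an in-memory buffer. This is only a sketch: the CSV text below is hypothetical sample data standing in for "filename.csv".

```python
import io
import pandas as pd

# Hypothetical CSV content used in place of a real file.
csv_text = "netId,midterm,Finals\nadg111,67,72\nadg2,89,91\nadg3,90,85\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))  # first two rows
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # type pandas inferred for each column
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm the file was parsed as expected.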
Basic Operations
[Appending Data Frames]
* Combine two Data Frames
first = pd.DataFrame(np.random.randn(5,4))
second = pd.DataFrame(np.random.randn(5,4))
pd.concat([first, second])
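A short sketch of the common `pd.concat` variants, assuming the same two 5x4 frames as above. The result names `stacked`, `renumbered`, and `side_by_side` are illustrative.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame(np.random.randn(5, 4))
second = pd.DataFrame(np.random.randn(5, 4))

# Stack rows; the original index labels 0..4 appear twice.
stacked = pd.concat([first, second])

# Stack rows and assign a fresh 0..9 index.
renumbered = pd.concat([first, second], ignore_index=True)

# Place the frames next to each other instead (5 rows, 8 columns).
side_by_side = pd.concat([first, second], axis=1)

print(stacked.shape, renumbered.shape, side_by_side.shape)
```

Note that `pd.concat` takes a list of frames as its first argument, not the frames as separate arguments.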
[Merging Data Frames]
- merge two data frames by a key
left = pd.DataFrame({'netId': ['adg111', 'adg2', 'adg3'], 'midterm': [67, 89, 90]})
right = pd.DataFrame({'netId': ['adg111', 'adg2', 'adg3'], 'midterm': [67, 89, 90]})
# Both frames have a 'midterm' column, so the result holds midterm_x and midterm_y
pd.merge(left, right, on="netId")
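Merging is more informative when the two frames hold different columns and only partly overlapping keys. The sketch below uses made-up data (a `Finals` column and an extra student `adg4`) to contrast the default inner join with an outer join.

```python
import pandas as pd

# Hypothetical data: the frames share the key column but not all keys.
left = pd.DataFrame({"netId": ["adg111", "adg2", "adg3"], "midterm": [67, 89, 90]})
right = pd.DataFrame({"netId": ["adg2", "adg3", "adg4"], "Finals": [91, 85, 78]})

# Default (inner) join keeps only keys present in both frames.
inner = pd.merge(left, right, on="netId")

# Outer join keeps every key and fills missing values with NaN.
outer = pd.merge(left, right, on="netId", how="outer")

print(inner)
print(outer)
```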
Indexing
- Column Selection: Extract a single column (a Series) or several columns (a Data Frame). This is known as "indexing by columns"
select a column with df["column_name"]
df1 = df['midterm']
df2 = df[['midterm', 'Finals']]
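The difference between the two forms of column selection shows up in the type of the result. A minimal sketch, assuming a small frame with hypothetical `midterm` and `Finals` scores:

```python
import pandas as pd

# Hypothetical grade data for illustration.
df = pd.DataFrame({
    "netId": ["adg111", "adg2", "adg3"],
    "midterm": [67, 89, 90],
    "Finals": [72, 91, 85],
})

one_column = df["midterm"]               # a single label returns a Series
two_columns = df[["midterm", "Finals"]]  # a list of labels returns a DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```

Selecting several columns needs the inner list; `df['midterm', 'Finals']` is treated as a lookup of one column whose name is a tuple and raises a `KeyError` here.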
Memory Optimization
- Large files: Under 100 MB is typically fine
- Very large files: Multiple gigabytes can be a problem. Consider other tools such as Apache Spark, or read the data in "chunks", one at a time
- How pandas manages memory: Numerical values are represented as NumPy ndarrays.
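The points above can be sketched in a few lines: reading in chunks keeps only part of the data in memory at once, and choosing a smaller numeric dtype shrinks the underlying NumPy array. The CSV text, the chunk size of 4, and the choice of `int8` are assumptions made for this small example; real files need a dtype wide enough for their values.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical CSV with 10 rows and midterm scores 60..69.
csv_text = "netId,midterm\n" + "".join(f"adg{i},{60 + i}\n" for i in range(10))

# 1) Chunked reading: process 4 rows at a time and keep running totals.
total = 0
rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["midterm"].sum()
    rows += len(chunk)
mean_midterm = total / rows

# 2) Smaller dtypes: integers are read as int64 by default (8 bytes each).
df = pd.read_csv(io.StringIO(csv_text))
before = df["midterm"].memory_usage(index=False)
df["midterm"] = df["midterm"].astype(np.int8)  # safe here: all values fit in -128..127
after = df["midterm"].memory_usage(index=False)

print(mean_midterm)
print(before, after)                  # bytes used by the column's values
print(type(df["midterm"].to_numpy())) # the column is backed by a NumPy ndarray
```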