Day 3 (9 10 2019): Data Manipulation with Pandas - Ajarlin/Data-Science GitHub Wiki

[Key Data Structures]

The two core pandas types that cover most everyday work

Data Frames

  • a 2D data structure that holds data identified by row and column labels.

Series

  • a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)
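A minimal sketch of what "labeled" and "any data type" mean for a Series. The labels and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical labels: each value can be looked up by its label.
s = pd.Series([2, 3, 3, 4], index=["a", "b", "c", "d"])
print(s["b"])  # 3

# Mixed Python objects are allowed; pandas stores them with dtype "object".
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)  # object
```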

Representation of DataFrames

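  • Column names: should be unique. pandas permits duplicates, but they make column selection ambiguous
  • Left (numeric index): the index values are all unique and numeric, acting as a row number
  • Right (named index): the index values are named labels and may repeat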
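The two index styles can be sketched as follows. The frame contents are hypothetical; only the index behaviour matters here.

```python
import pandas as pd

# Default index: unique integers 0..n-1 that act as row numbers.
numeric = pd.DataFrame({"midterm": [67, 89, 90]})

# Named index with a repeated label ("adg2" appears twice).
named = pd.DataFrame({"midterm": [67, 89, 90]}, index=["adg111", "adg2", "adg2"])

print(numeric.index.is_unique)  # True
print(named.index.is_unique)    # False

# Selecting a repeated label returns every matching row.
print(named.loc["adg2"])
```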

Create Data Frames

[Create a Data frame]

    import pandas as pd
    import numpy as np
    tmp = pd.DataFrame()
    tmp
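An empty frame is mostly useful as a starting point. As a rough sketch, a DataFrame can also be built directly from a dictionary of columns. The column names and values here are invented for illustration.

```python
import pandas as pd

# Hypothetical data: each dict key becomes a column name,
# each list becomes that column's values.
grades = pd.DataFrame({
    "netId": ["adg111", "adg2", "adg3"],
    "midterm": [67, 89, 90],
})

print(grades.shape)   # (number of rows, number of columns)
print(grades.dtypes)  # one dtype per column
```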

[Create a pandas Series]

    import pandas as pd
    tmp = pd.Series([2,3,3,4])
    tmp

[Reading a CSV file to create a Data Frame]

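    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt  # the plotting API lives in matplotlib.pyplot
    print(pd.__version__)            # double underscores on each side
    # Jupyter/IPython magic: render plots inline (remove in a plain .py script)
    %matplotlib inline
    df = pd.read_csv("filename.csv")
    df.head()  # first five rows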
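To try read_csv without a file on disk, here is a small sketch that reads from an in-memory string instead. The CSV content is hypothetical and stands in for "filename.csv".

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for "filename.csv".
csv_text = "netId,midterm,Finals\nadg111,67,72\nadg2,89,91\nadg3,90,85\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))  # first two rows
print(df.dtypes)   # column types inferred from the data
```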

Basic Operations

[Appending Data Frames]

  • Combine two DataFrames

    first = pd.DataFrame(np.random.randn(5,4))
    second = pd.DataFrame(np.random.randn(5,4))
    pd.concat([first, second])  # concat is a pandas function and takes a list of frames
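A short sketch of how the result changes with two common pd.concat options, ignore_index and axis, using the same kind of random frames as above.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame(np.random.randn(5, 4))
second = pd.DataFrame(np.random.randn(5, 4))

# Default: stack rows; the original row labels 0..4 appear twice.
stacked = pd.concat([first, second])

# ignore_index=True renumbers the rows 0..9.
renumbered = pd.concat([first, second], ignore_index=True)

# axis=1 places the frames side by side instead of stacking them.
side_by_side = pd.concat([first, second], axis=1)

print(stacked.shape, renumbered.shape, side_by_side.shape)
```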

[Merging Data Frames]

  • Merge two DataFrames on a key column

    left = pd.DataFrame({'netId': ['adg111', 'adg2', 'adg3'], 'midterm': [67, 89, 90]})
    right = pd.DataFrame({'netId': ['adg111', 'adg2', 'adg3'], 'midterm': [67, 89, 90]})
    # both frames have a 'midterm' column, so the result holds midterm_x and midterm_y
    pd.merge(left, right, on="netId")
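Merging is more informative when the two frames carry different columns and only partly share keys. The sketch below uses hypothetical midterm and final tables to compare the default inner join with how="outer".

```python
import pandas as pd

# Hypothetical tables: midterm and final scores keyed by netId.
midterms = pd.DataFrame({"netId": ["adg111", "adg2", "adg3"], "midterm": [67, 89, 90]})
finals = pd.DataFrame({"netId": ["adg2", "adg3", "adg4"], "Finals": [91, 85, 78]})

# Inner join (the default): only netIds present in both frames.
inner = pd.merge(midterms, finals, on="netId")

# Outer join: every netId, with NaN where a score is missing.
outer = pd.merge(midterms, finals, on="netId", how="outer")

print(inner)
print(outer)
```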

Indexing

  • Column Selection: Extract a single column (a Series) or several columns (a DataFrame). This is known as "indexing by columns"

Select columns with df["column_name"]

    df1 = df['midterm']               # one label returns a Series
    df2 = df[['midterm', 'Finals']]   # a list of labels returns a DataFrame
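A self-contained sketch of column selection, showing the type each form returns. The frame below is hypothetical.

```python
import pandas as pd

# Hypothetical grade table.
df = pd.DataFrame({
    "netId": ["adg111", "adg2", "adg3"],
    "midterm": [67, 89, 90],
    "Finals": [72, 91, 85],
})

one = df["midterm"]              # single label -> Series
two = df[["midterm", "Finals"]]  # list of labels -> DataFrame

print(type(one).__name__)  # Series
print(type(two).__name__)  # DataFrame
```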

Memory Optimization

  • Large files: under 100 MB is typically fine
  • Very large files: multiple gigabytes can be a problem. Use a different tool such as Apache Spark, or read the data in "chunks", one at a time
  • How pandas manages memory: it represents numerical values as NumPy ndarrays.
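Both ideas can be sketched in a few lines: reading in chunks with the chunksize parameter of read_csv, and shrinking a NumPy-backed numeric column with a smaller dtype. The CSV text is generated here as a stand-in for a large file, and int8 is only safe because these made-up scores stay below 128.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file too large to load at once.
csv_text = "netId,midterm\n" + "".join(f"adg{i},{60 + i % 40}\n" for i in range(1000))

# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is held in memory at a time.
total = 0
rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    total += chunk["midterm"].sum()
    rows += len(chunk)
print(rows, total / rows)  # row count and mean, computed chunk by chunk

# Numeric columns are stored as NumPy arrays, so a smaller dtype uses less memory.
df = pd.read_csv(io.StringIO(csv_text))
before = df["midterm"].memory_usage(index=False)
after = df["midterm"].astype("int8").memory_usage(index=False)
print(before, after)  # bytes used before and after downcasting
```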