03 03 Slicing and Indexing - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

Explicit indexes

  • .set_index()
  • .reset_index() and .reset_index(drop=True) to reset index to default and delete the entire column
  • .loc[] to subset
  • .sort_index()

problems

  • index values are just data
  • index violate "tidy data" principles. index values don't get their own columns
  • need to learn two syntax

Subsetting with .loc[]

# Make a list of cities to subset on
cities = ["Moscow","Saint Petersburg"]

# Subset temperatures using square brackets
print(temperatures[temperatures["city"].isin(cities)])

# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])

Setting multi-level indexes

# Index temperatures by country & city
temperatures_ind = temperatures.set_index(["country","city"])

# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil","Rio De Janeiro"),("Pakistan","Lahore")]

# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])

Sorting by index values

# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())

# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level="city"))

# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=["country","city"],ascending=[True, False]))

Slcing and subsetting with .loc and .iloc

  • You can only slice an index if the index is sorted (using .sort_index()).
  • To slice at the outer level, first and last can be strings.
  • To slice at inner levels, first and last should be tuples.
  • If you pass a single slice to .loc[], it will slice the rows.
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Subset rows from Pakistan to Russia
print(temperatures_srt.loc["Pakistan":"Russia"])

# Try to subset rows from Lahore to Moscow
print(temperatures_srt.loc["Lahore":"Moscow"]) # return nothing but doesn't report bugs

# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[("Pakistan","Lahore"):("Russia","Moscow")])

Slicing in both directions

# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[("India","Hyderabad"):("Iraq","Baghdad")])

# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:,"date":"avg_temp_c"])

# Subset in both directions at once
print(temperatures_srt.loc[("India","Hyderabad"):("Iraq","Baghdad"),"date":"avg_temp_c"])

Slicing time series

# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
# Note that because the date isn't set as an index, a condition that contains only a year, such as df["date"] == "2009", will check 
# if the date is equal to the first day of the first month of the year (e.g. 2009-01-01), 
# rather than checking whether the date occurs within the given year. 
# We recommend writing out the full date when using Boolean conditions (e.g. 2009-12-31).
temperatures_bool = temperatures[(temperatures["date"] >= "2010-01-01") & (temperatures["date"] <= "2011-12-31")]
print(temperatures_bool)

# Set date as an index
temperatures_ind = temperatures.set_index("date")

# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc["2010":"2011"])

# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc["2010-08":"2011-02"])

Subsetting by row/column number

# Get 23rd row, 2nd column (index 22, 1)
print(temperatures.iloc[22,1])

# Use slicing to get the first 5 rows
print(temperatures.iloc[:5])

# Use slicing to get columns 3 to 4
print(temperatures.iloc[:,2:4])

# Use slicing in both directions at once
print(temperatures.iloc[:5,2:4])

Working with pivot table

  • .loc[] + slicing is a power combo
  • the axis argument: dogs_height_by_breed_vs_color.mean(axis = "index") index is default value, which means 'calculate across rows) , also we can set axis = "columns"
# Add a year column to temperatures
# You can access the components of a date (year, month and day) using code of the form dataframe["column"].dt.component. 
# For example, the month component is dataframe["column"].dt.month, and the year component is 
# dataframe["column"].dt.year.
temperatures["year"] = temperatures["date"].dt.year

# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table(values="avg_temp_c",index=["country","city"],columns="year")

# See the result
print(temp_by_country_city_vs_year)

# Subset for Egypt to India
print(temp_by_country_city_vs_year.loc["Egypt":"India"])

# Subset for Egypt, Cairo to India, Delhi
print(temp_by_country_city_vs_year.loc[("Egypt","Cairo"):("India","Delhi")])

# Subset in both directions at once
print(temp_by_country_city_vs_year.loc[("Egypt","Cairo"):("India","Delhi"),"2005":"2010"])

# Get the worldwide mean temp by year
# means are calculated for each year, across cities (index) 
mean_temp_by_year = temp_by_country_city_vs_year.mean()

# Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis = "columns")

# Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])