03 03 Slicing and Indexing - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki
Explicit indexes
.set_index()
.reset_index()
and.reset_index(drop=True)
to reset index to default and delete the entire column.loc[]
to subset.sort_index()
problems
- index values are just data
- index violate "tidy data" principles. index values don't get their own columns
- need to learn two syntax
.loc[]
Subsetting with # Make a list of cities to subset on
cities = ["Moscow","Saint Petersburg"]
# Subset temperatures using square brackets
print(temperatures[temperatures["city"].isin(cities)])
# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])
Setting multi-level indexes
# Index temperatures by country & city
temperatures_ind = temperatures.set_index(["country","city"])
# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil","Rio De Janeiro"),("Pakistan","Lahore")]
# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])
Sorting by index values
# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())
# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level="city"))
# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=["country","city"],ascending=[True, False]))
Slcing and subsetting with .loc and .iloc
- You can only slice an index if the index is sorted (using .sort_index()).
- To slice at the outer level, first and last can be strings.
- To slice at inner levels, first and last should be tuples.
- If you pass a single slice to .loc[], it will slice the rows.
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()
# Subset rows from Pakistan to Russia
print(temperatures_srt.loc["Pakistan":"Russia"])
# Try to subset rows from Lahore to Moscow
print(temperatures_srt.loc["Lahore":"Moscow"]) # return nothing but doesn't report bugs
# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[("Pakistan","Lahore"):("Russia","Moscow")])
Slicing in both directions
# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[("India","Hyderabad"):("Iraq","Baghdad")])
# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:,"date":"avg_temp_c"])
# Subset in both directions at once
print(temperatures_srt.loc[("India","Hyderabad"):("Iraq","Baghdad"),"date":"avg_temp_c"])
Slicing time series
# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
# Note that because the date isn't set as an index, a condition that contains only a year, such as df["date"] == "2009", will check
# if the date is equal to the first day of the first month of the year (e.g. 2009-01-01),
# rather than checking whether the date occurs within the given year.
# We recommend writing out the full date when using Boolean conditions (e.g. 2009-12-31).
temperatures_bool = temperatures[(temperatures["date"] >= "2010-01-01") & (temperatures["date"] <= "2011-12-31")]
print(temperatures_bool)
# Set date as an index
temperatures_ind = temperatures.set_index("date")
# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc["2010":"2011"])
# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc["2010-08":"2011-02"])
Subsetting by row/column number
# Get 23rd row, 2nd column (index 22, 1)
print(temperatures.iloc[22,1])
# Use slicing to get the first 5 rows
print(temperatures.iloc[:5])
# Use slicing to get columns 3 to 4
print(temperatures.iloc[:,2:4])
# Use slicing in both directions at once
print(temperatures.iloc[:5,2:4])
Working with pivot table
.loc[]
+ slicing is a power combo- the axis argument:
dogs_height_by_breed_vs_color.mean(axis = "index")
index is default value, which means 'calculate across rows) , also we can setaxis = "columns"
# Add a year column to temperatures
# You can access the components of a date (year, month and day) using code of the form dataframe["column"].dt.component.
# For example, the month component is dataframe["column"].dt.month, and the year component is
# dataframe["column"].dt.year.
temperatures["year"] = temperatures["date"].dt.year
# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table(values="avg_temp_c",index=["country","city"],columns="year")
# See the result
print(temp_by_country_city_vs_year)
# Subset for Egypt to India
print(temp_by_country_city_vs_year.loc["Egypt":"India"])
# Subset for Egypt, Cairo to India, Delhi
print(temp_by_country_city_vs_year.loc[("Egypt","Cairo"):("India","Delhi")])
# Subset in both directions at once
print(temp_by_country_city_vs_year.loc[("Egypt","Cairo"):("India","Delhi"),"2005":"2010"])
# Get the worldwide mean temp by year
# means are calculated for each year, across cities (index)
mean_temp_by_year = temp_by_country_city_vs_year.mean()
# Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])
# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis = "columns")
# Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])