03 01 Transforming Data - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki
When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.
- head() returns the first few rows (the “head” of the DataFrame).
- info() shows information on each of the columns, such as the data type and number of missing values.
- shape returns the number of rows and columns of the DataFrame.
- describe() calculates a few summary statistics for each column.
# Print the head of the homelessness data
print(homelessness.head())
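# Print information about homelessness
# (info() prints its report itself and returns None, so the outer print() adds the trailing "None")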
print(homelessness.info())
# Print the shape of homelessness
print(homelessness.shape)
# Print a description of homelessness
print(homelessness.describe())
<script.py> output:
               region       state  individuals  family_members  state_pop
0  East South Central     Alabama       2570.0           864.0    4887681
1             Pacific      Alaska       1434.0           582.0     735139
2            Mountain     Arizona       7259.0          2606.0    7158024
3  West South Central    Arkansas       2280.0           432.0    3009733
4             Pacific  California     109008.0         20964.0   39461588
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 5 columns):
region            51 non-null object
state             51 non-null object
individuals       51 non-null float64
family_members    51 non-null float64
state_pop         51 non-null int64
dtypes: float64(2), int64(1), object(2)
memory usage: 2.4+ KB
None
(51, 5)
       individuals  family_members  state_pop
count       51.000          51.000  5.100e+01
mean      7225.784        3504.882  6.406e+06
std      15991.025        7805.412  7.327e+06
min        434.000          75.000  5.776e+05
25%       1446.500         592.000  1.777e+06
50%       3082.000        1482.000  4.461e+06
75%       6781.500        3196.000  7.341e+06
max     109008.000       52070.000  3.946e+07
To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:
- values: A two-dimensional NumPy array of values.
- columns: An index of columns: the column names.
- index: An index for the rows: either row numbers or row names.
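As a quick illustration of the three components, here is a minimal sketch. The three-row `sample` DataFrame is a stand-in built from values in the `head()` output above, not the full course dataset.

```python
import pandas as pd

# Stand-in for the homelessness data: three rows copied from the head() output
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],
        "individuals": [2570.0, 1434.0, 7259.0],
        "state_pop": [4887681, 735139, 7158024],
    }
)

# The data as a two-dimensional NumPy array
print(sample.values)
# The column labels
print(sample.columns)
# The row labels (default integer positions here)
print(sample.index)
```

Because `sample` was built without an explicit index, its row labels are simply 0, 1, 2.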
Sorting rows
- one column
df.sort_values("breed")
- multiple columns
df.sort_values(["breed", "weight_kg"])
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True, False])
# Print the top few rows
print(homelessness_reg_fam.head())
<script.py> output:
                region      state  individuals  family_members  state_pop
13  East North Central   Illinois       6752.0          3891.0   12723071
35  East North Central       Ohio       6929.0          3320.0   11676341
22  East North Central   Michigan       5209.0          3142.0    9984072
49  East North Central  Wisconsin       2740.0          2167.0    5807406
14  East North Central    Indiana       3776.0          1482.0    6695497
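To see how the `ascending` list pairs up with the sort columns, here is a small self-contained sketch. The `dogs` DataFrame and its rows are made up for illustration; only the column names `breed` and `weight_kg` come from the snippets above.

```python
import pandas as pd

# Made-up rows; only the column names "breed" and "weight_kg" appear in the notes
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper"],
        "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
        "weight_kg": [25, 23, 29, 20],
    }
)

# Sort by breed ascending, then by weight descending within each breed
dogs_sorted = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])
print(dogs_sorted)
```

`sort_values()` returns a new DataFrame and leaves `dogs` unchanged. The rows keep their original index labels, which is why the index in sorted output looks shuffled.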
Subsetting columns
# Select only the individuals and state columns, in that order
ind_state = homelessness[["individuals", "state"]]
# Print the head of the result
print(ind_state.head())
<script.py> output:
   individuals       state
0       2570.0     Alabama
1       1434.0      Alaska
2       7259.0     Arizona
3       2280.0    Arkansas
4     109008.0  California
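One detail worth checking when subsetting columns is the difference between single and double square brackets. The sketch below uses a two-row stand-in DataFrame (values taken from the output above) to show what each form returns.

```python
import pandas as pd

# Two-row stand-in for the homelessness data
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska"],
        "individuals": [2570.0, 1434.0],
    }
)

one_col = sample["state"]       # single brackets with a name: a Series
one_col_df = sample[["state"]]  # brackets around a list of names: a DataFrame

print(type(one_col).__name__)
print(type(one_col_df).__name__)
```

The outer brackets do the subsetting and the inner brackets build a list of column names, so the double-bracket form gives a DataFrame even when the list holds a single name.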
Subsetting rows
# Filter for rows where family_members is less than 1000
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]
# See the result
print(fam_lt_1k_pac)
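The filter above combines two boolean masks. As a rough sketch of what happens step by step, the example below builds each mask separately on a four-row stand-in DataFrame (rows copied from the `head()` output) and combines them with `&` and `|`.

```python
import pandas as pd

# Four rows copied from the head() output shown earlier
sample = pd.DataFrame(
    {
        "region": ["Pacific", "Mountain", "West South Central", "Pacific"],
        "state": ["Alaska", "Arizona", "Arkansas", "California"],
        "family_members": [582.0, 2606.0, 432.0, 20964.0],
    }
)

# Each comparison yields a boolean Series with one entry per row
small = sample["family_members"] < 1000
pacific = sample["region"] == "Pacific"

# & keeps rows where both masks are True; | keeps rows where either is True
both = sample[small & pacific]
either = sample[small | pacific]

print(both)
print(either)
```

When the conditions are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than comparison operators such as `<` and `==`.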
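Subsetting rows by categorical variables
.isin() keeps the rows whose value in a column matches any item in a list, which avoids chaining several == comparisons with |.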
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]
# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]
# See the result
print(mojave_homelessness)
<script.py> output:
      region       state  individuals  family_members  state_pop
2   Mountain     Arizona       7259.0          2606.0    7158024
4    Pacific  California     109008.0         20964.0   39461588
28  Mountain      Nevada       7058.0           486.0    3027341
44  Mountain        Utah       1904.0           972.0    3153550
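The mask returned by `.isin()` can also be inverted with `~` to keep the rows that are not in the list. A minimal sketch, using a four-row stand-in DataFrame rather than the full dataset:

```python
import pandas as pd

# Stand-in rows; the canu list is the one defined in the notes above
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Arizona", "California", "Nevada"],
        "region": ["East South Central", "Mountain", "Pacific", "Mountain"],
    }
)
canu = ["California", "Arizona", "Nevada", "Utah"]

# True for rows whose state appears in canu
in_mojave = sample["state"].isin(canu)

print(sample[in_mojave])   # rows whose state is in the list
print(sample[~in_mojave])  # ~ inverts the mask: rows whose state is not in the list
```

List items with no matching row ("Utah" in this stand-in) are simply ignored.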
Adding new columns
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]
# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]
# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)
# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]
# See the result
print(result)
<script.py> output:
                   state  indiv_per_10k
8   District of Columbia         53.738
11                Hawaii         29.079
4             California         27.624
37                Oregon         26.636
28                Nevada         23.314
47            Washington         21.829
32              New York         20.392
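The four steps above (add a column, filter, sort, select) can also be written as one method chain. This is an alternative style, not the approach used in the exercise: the sketch swaps in `assign()`, `query()` and `.loc`, and runs on a four-row stand-in DataFrame whose values are copied from the outputs above.

```python
import pandas as pd

# Stand-in rows with values copied from the outputs shown earlier
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Nevada", "California"],
        "individuals": [2570.0, 1434.0, 7058.0, 109008.0],
        "state_pop": [4887681, 735139, 3027341, 39461588],
    }
)

result = (
    sample
    # Add the new column without modifying sample itself
    .assign(indiv_per_10k=lambda d: 10000 * d["individuals"] / d["state_pop"])
    # Keep rows where the new column exceeds 20
    .query("indiv_per_10k > 20")
    # Highest rate first
    .sort_values("indiv_per_10k", ascending=False)
    # Keep only the two columns of interest
    .loc[:, ["state", "indiv_per_10k"]]
)
print(result)
```

Unlike the `homelessness["indiv_per_10k"] = ...` assignment, `assign()` returns a new DataFrame, so `sample` is left without the extra column.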