03 01 Transforming Data - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

Inspecting a DataFrame

When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

  • head() returns the first few rows (the “head” of the DataFrame).
  • info() shows information on each of the columns, such as the data type and the number of non-missing (non-null) values.
  • shape is an attribute (no parentheses) that holds the number of rows and columns of the DataFrame as a tuple.
  • describe() calculates a few summary statistics for each numeric column.
# Print the head of the homelessness data
print(homelessness.head())

# Print information about homelessness
# (info() prints its report itself and returns None, so print() also emits a trailing "None")
print(homelessness.info())

# Print the shape of homelessness
print(homelessness.shape)

# Print a description of homelessness
print(homelessness.describe())

<script.py> output:
            region       state  individuals  family_members  state_pop
0  East South Central     Alabama       2570.0           864.0    4887681
1             Pacific      Alaska       1434.0           582.0     735139
2            Mountain     Arizona       7259.0          2606.0    7158024
3  West South Central    Arkansas       2280.0           432.0    3009733
4             Pacific  California     109008.0         20964.0   39461588
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 5 columns):
region            51 non-null object
state             51 non-null object
individuals       51 non-null float64
family_members    51 non-null float64
state_pop         51 non-null int64
dtypes: float64(2), int64(1), object(2)
memory usage: 2.4+ KB
None

(51, 5)

       individuals  family_members  state_pop
count       51.000          51.000  5.100e+01
mean      7225.784        3504.882  6.406e+06
std      15991.025        7805.412  7.327e+06
min        434.000          75.000  5.776e+05
25%       1446.500         592.000  1.777e+06
50%       3082.000        1482.000  4.461e+06
75%       6781.500        3196.000  7.341e+06
max     109008.000       52070.000  3.946e+07

To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

  • values: A two-dimensional NumPy array of values.
  • columns: An index of columns: the column names.
  • index: An index for the rows: either row numbers or row names.
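As a minimal sketch of these three attributes, the snippet below builds a small stand-in DataFrame and prints each component. The three rows are copied from the head() output above; the variable name `sample` is made up for illustration and is not part of the course data.

```python
import pandas as pd

# A three-row stand-in for the homelessness data
# (values copied from the head() output above; `sample` is a hypothetical name)
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],
        "individuals": [2570.0, 1434.0, 7259.0],
        "state_pop": [4887681, 735139, 7158024],
    }
)

# .values: the data as a two-dimensional NumPy array
print(sample.values)

# .columns: an Index holding the column names
print(sample.columns)

# .index: an Index for the rows (default row numbers here)
print(sample.index)
```

Because the columns mix strings and numbers, the array returned by .values has the generic object dtype.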

Sorting and Subsetting

  • Sort by one column: df.sort_values("breed")
  • Sort by multiple columns (the first column takes priority, later ones break ties): df.sort_values(["breed", "weight_kg"])
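The two sort_values() patterns above can be tried on a small DataFrame. The sketch below assumes a made-up `dogs` table (the names, breeds, and weights are invented for illustration) and also shows the `ascending` list form for mixing sort directions.

```python
import pandas as pd

# Hypothetical dog data, invented for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Max", "Lucy", "Cooper"],
        "breed": ["Poodle", "Labrador", "Poodle", "Labrador"],
        "weight_kg": [24, 29, 17, 31],
    }
)

# Sort by one column (ascending by default)
by_breed = dogs.sort_values("breed")

# Sort by breed, then by weight within each breed
by_breed_weight = dogs.sort_values(["breed", "weight_kg"])

# Mixed directions: breed ascending, weight descending
by_breed_heaviest = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])

print(by_breed_weight)
print(by_breed_heaviest)
```

sort_values() returns a new DataFrame and leaves `dogs` unchanged, which is why each result is assigned to its own variable.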

Sorting

# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True, False])

# Print the top few rows
print(homelessness_reg_fam.head())

<script.py> output:
                    region      state  individuals  family_members  state_pop
    13  East North Central   Illinois       6752.0          3891.0   12723071
    35  East North Central       Ohio       6929.0          3320.0   11676341
    22  East North Central   Michigan       5209.0          3142.0    9984072
    49  East North Central  Wisconsin       2740.0          2167.0    5807406
    14  East North Central    Indiana       3776.0          1482.0    6695497

Subsetting

Subsetting columns

# Select only the individuals and state columns, in that order
ind_state = homelessness[["individuals","state"]]

# Print the head of the result
print(ind_state.head())

<script.py> output:
       individuals       state
    0       2570.0     Alabama
    1       1434.0      Alaska
    2       7259.0     Arizona
    3       2280.0    Arkansas
    4     109008.0  California

Subsetting rows

# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]

# See the result
print(fam_lt_1k_pac)
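To make the filter above easier to follow, here is a minimal sketch that builds the two conditions as separate boolean masks before combining them. It assumes a four-row stand-in DataFrame whose values are copied from the outputs in this section; the names `sample`, `small_family`, `in_pacific`, and `kept` are made up for illustration.

```python
import pandas as pd

# Stand-in rows copied from the outputs in this section (`sample` is a hypothetical name)
sample = pd.DataFrame(
    {
        "region": ["Pacific", "Mountain", "Pacific", "Mountain"],
        "state": ["Alaska", "Arizona", "California", "Nevada"],
        "family_members": [582.0, 2606.0, 20964.0, 486.0],
    }
)

# Each comparison yields a boolean Series with one True/False per row
small_family = sample["family_members"] < 1000
in_pacific = sample["region"] == "Pacific"

# & combines the two masks element-wise; only rows where both are True are kept
kept = sample[small_family & in_pacific]
print(kept)
```

When the conditions are written inline instead, each one needs its own parentheses, because & binds more tightly than < and == in Python.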

Subsetting rows by categorical variables
Use .isin() to keep the rows whose value in a column matches any item in a list.

# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# See the result
print(mojave_homelessness)

<script.py> output:
          region       state  individuals  family_members  state_pop
    2   Mountain     Arizona       7259.0          2606.0    7158024
    4    Pacific  California     109008.0         20964.0   39461588
    28  Mountain      Nevada       7058.0           486.0    3027341
    44  Mountain        Utah       1904.0           972.0    3153550

New Columns

Combo-attack: add a column, filter rows, sort, and select columns in one workflow

# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"] 

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state","indiv_per_10k"]]

# See the result
print(result)

<script.py> output:
                       state  indiv_per_10k
    8   District of Columbia         53.738
    11                Hawaii         29.079
    4             California         27.624
    37                Oregon         26.636
    28                Nevada         23.314
    47            Washington         21.829
    32              New York         20.392
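The same four steps can also be written as a single method chain. This is an alternative style, not the approach used in the exercise above, and it is sketched here on a four-row stand-in DataFrame (values copied from the outputs in this section; `sample` is a made-up name). It relies on assign() to add the column and on .loc with a callable to filter on it.

```python
import pandas as pd

# Stand-in rows copied from the outputs in this section (`sample` is a hypothetical name)
sample = pd.DataFrame(
    {
        "state": ["Alabama", "Arizona", "California", "Nevada"],
        "individuals": [2570.0, 7259.0, 109008.0, 7058.0],
        "state_pop": [4887681, 7158024, 39461588, 3027341],
    }
)

result = (
    sample
    # Add homeless individuals per 10k state population as a new column
    .assign(indiv_per_10k=lambda d: 10000 * d["individuals"] / d["state_pop"])
    # Keep rows where the new column is greater than 20
    .loc[lambda d: d["indiv_per_10k"] > 20]
    # Sort by the new column, highest first
    .sort_values("indiv_per_10k", ascending=False)
    # Select the two columns of interest
    [["state", "indiv_per_10k"]]
)
print(result)
```

Unlike the step-by-step version, assign() returns a new DataFrame, so `sample` itself does not gain the indiv_per_10k column.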