Exploratory Data Analysis (EDA) - setiamanlhc/python-snippet-code GitHub Wiki

Count number of record in Dataframe

#Total row in DF
len(df.index)

#Count for Female only
len(df[df['gender_label']=='Female'].index )

#count for specific column. SQL Statement same as "select bmi_risk, count(1) from table group by bmi_risk"
df['bmi_risk'].value_counts()

#using Groupby to count
df.groupby(['smoke','alco']).size()

**Display data structure **

df.info()

Display all the columns in the list This list of columns will be useful for you to copy and paste to select part of columns instead of typing one by one.

df.columns

output of the df.columns from superstore dataset.

Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
       'Customer ID', 'Customer Name', 'Segment', 'Country/Region', 'City',
       'State', 'Postal Code', 'Region', 'Product ID', 'Category',
       'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount',
       'Profit'],
      dtype='object')
df.describe()

Display the top 6 records. This is like select using LIMIT n in postgres or select TOP n in T-SQL

df.head(6)

Display the last 10 records.

df.tail(10)

Display random sampling of 10 records.

df.sample(10)

Extract Dataframe with Null values

df[ df['user_id'].isnull()]

Get statistic of your dataset.

df.describe()

Sort Dataframe

df.sort_values(by=['col1', 'col2'], ascending=True)

Using Groupby

df.groupby(['col1']).sum()
df.groupby(['col1']).mean()
df.groupby(['col1']).median()

Using GroupBy with multiple columns on GroupBy and Calculation culumns

calculate average api_hi and api_lo from combination of active and cardio fields.

df.groupby(['active','cardio'])[['ap_hi','ap_lo']].mean()

Group column is for reference only, it is not created from groupby function.

Group active cardio ap_hi ap_lo
1 no no 119.418 419 78.232473
2 no yes 132.217714 84.076955
3 yes no 119.235340 78.094759
4 Yes yes 133.803466 84.746746

Another example using multiple function

# Calculate the min, average, median, max total_stays for each of the market_segment_type
df_pivot.groupby(['market_segment'])[['row_total']].agg(['min','mean','median','max'])
Script Description
df['Category'].unique() select distinct category
df['Category'].nunique() select count distinct category.
Return number of distinct records
df['Category'].value_counts() select Category and return total count of the category. Similar to
Select category,count(1) From superstore Group by category
df['Category'].value_counts(normalize=True) * 100
Output:
Office Supplies 60.296178
Furniture 21.222734
Technology 18.481089
Get ratio of the category record

EDA Using Chart

Correlation in table format

df.corr()

create heatmap

matrix_corr = df.corr()
sns.heatmap(matrix_corr, annot=True)

Create pairplot

sns.pairplot(df)

Scatterplot for Visualizing Relationship between 2 Variables

sns.scatterplot('Father', 'Height', hue='Gender', data=df)
⚠️ **GitHub.com Fallback** ⚠️