Exploratory Data Analysis (EDA) - setiamanlhc/python-snippet-code GitHub Wiki
#Total row in DF
len(df.index)
#Count for Female only
len(df[df['gender_label']=='Female'].index )
#count for specific column. SQL Statement same as "select bmi_risk, count(1) from table group by bmi_risk"
df['bmi_risk'].value_counts()
#using Groupby to count
df.groupby(['smoke','alco']).size()**Display data structure **
df.info()Display all the columns in the list This list of columns will be useful for you to copy and paste to select part of columns instead of typing one by one.
df.columnsoutput of the df.columns from superstore dataset.
Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
'Customer ID', 'Customer Name', 'Segment', 'Country/Region', 'City',
'State', 'Postal Code', 'Region', 'Product ID', 'Category',
'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount',
'Profit'],
dtype='object')df.describe()Display the top 6 records. This is like select using LIMIT n in postgres or select TOP n in T-SQL
df.head(6)Display the last 10 records.
df.tail(10)Display random sampling of 10 records.
df.sample(10)df[ df['user_id'].isnull()]Get statistic of your dataset.
df.describe()df.sort_values(by=['col1', 'col2'], ascending=True)df.groupby(['col1']).sum()
df.groupby(['col1']).mean()
df.groupby(['col1']).median()calculate average api_hi and api_lo from combination of active and cardio fields.
df.groupby(['active','cardio'])[['ap_hi','ap_lo']].mean()Group column is for reference only, it is not created from groupby function.
| Group | active | cardio | ap_hi | ap_lo |
|---|---|---|---|---|
| 1 | no | no | 119.418 419 | 78.232473 |
| 2 | no | yes | 132.217714 | 84.076955 |
| 3 | yes | no | 119.235340 | 78.094759 |
| 4 | Yes | yes | 133.803466 | 84.746746 |
Another example using multiple function
# Calculate the min, average, median, max total_stays for each of the market_segment_type
df_pivot.groupby(['market_segment'])[['row_total']].agg(['min','mean','median','max'])| Script | Description |
|---|---|
| df['Category'].unique() | select distinct category |
| df['Category'].nunique() | select count distinct category. Return number of distinct records |
| df['Category'].value_counts() | select Category and return total count of the category. Similar to Select category,count(1) From superstore Group by category |
| df['Category'].value_counts(normalize=True) * 100 Output: Office Supplies 60.296178 Furniture 21.222734 Technology 18.481089 |
Get ratio of the category record |
df.corr()matrix_corr = df.corr()
sns.heatmap(matrix_corr, annot=True)sns.pairplot(df)sns.scatterplot('Father', 'Height', hue='Gender', data=df)