7.3.1.Exploratory Data Analysis - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki
Exploratory Data Analysis (EDA)
- Preliminary step in data analysis to:
- Summarize main characteristics of the data
- Gain better understanding of the data set
- Uncover relationships between variables
- Extract important variables
- Question: "What are the characteristics which have the most impact on the car price?"
Descriptive Statistics
- Describe basic features of data
- Giving short summaries about the sample and measures of the data
Descriptive Statistics - describe()
- Summarize statistics using the pandas **describe()** method: **df.describe()**
Question
What happens if the method describe() is applied to a dataframe with NaN values?
~~An error will occur~~
~~All the statistics calculated using NaN values will also be NaN~~
**NaN values will be excluded**
Correct
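To check the answer above, here is a minimal sketch. The dataframe is a made-up toy example (not the course's car data set) with one missing price.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({"price": [10000.0, 20000.0, np.nan, 30000.0]})

summary = df.describe()
print(summary)

# count only includes the non-missing rows, and the mean skips the NaN
count = summary.loc["count", "price"]
mean = summary.loc["mean", "price"]
```

The count is lower than the number of rows, which is a quick way to spot columns with missing values.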
Descriptive Statistics - value_counts()
- Summarize categorical data using the **value_counts()** method
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
# In pandas >= 2.0 the counts column is named 'count', so rename that key instead
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
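A minimal sketch of the same idea on made-up data. It assumes a toy `drive-wheels` column and names the counts column directly through `to_frame(name=...)`, which avoids depending on what the column is called by default.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({"drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"]})

# Count how often each category occurs and name the resulting column explicitly
drive_wheels_counts = df["drive-wheels"].value_counts().to_frame(name="value_counts")
drive_wheels_counts.index.name = "drive-wheels"
print(drive_wheels_counts)
```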
Descriptive Statistics - Scatter Plot
- Each observation represented as a point
- A scatter plot shows the relationship between two variables
- Predictor/independent variables on x-axis
- Target/dependent variables on y-axis
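The axis convention above can be sketched with matplotlib. The numbers are made up for illustration, and the `Agg` backend is only selected so the snippet runs without a display; in a notebook you would normally drop that line and call `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt

# Hypothetical toy data; the values are made up for illustration.
engine_size = [97, 109, 130, 152, 183, 209]    # predictor / independent variable
price = [7000, 9500, 13500, 16500, 24000, 31000]  # target / dependent variable

fig, ax = plt.subplots()
points = ax.scatter(engine_size, price)  # one point per observation
ax.set_xlabel("engine-size")  # predictor on the x-axis
ax.set_ylabel("price")        # target on the y-axis
ax.set_title("Scatter plot of engine-size vs. price")
```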
GroupBy in Python
Grouping data
- Use the pandas **dataframe.groupby()** method:
- Can be applied to categorical variables
- Group data into categories
- Single or multiple variables
groupby()
- Example
df_test = df[['drive-wheels', 'body-style', 'price']]
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
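A runnable sketch of the example above, assuming a small made-up dataframe in place of the car data set. Note the double brackets: selecting several columns requires passing a list to `df[...]`.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "fwd"],
    "body-style": ["sedan", "sedan", "sedan", "hatchback", "hatchback"],
    "price": [10000.0, 12000.0, 20000.0, 18000.0, 8000.0],
})

# Select the columns of interest, then average price within each category pair
df_test = df[["drive-wheels", "body-style", "price"]]
df_grp = df_test.groupby(["drive-wheels", "body-style"], as_index=False).mean()
print(df_grp)
```

With `as_index=False` the grouping variables stay as ordinary columns, which makes the result easier to pivot afterwards.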
Question
How would you use the groupby function to find the average "price" of each car based on "body-style"?
**df[['price','body-style']].groupby(['body-style'],as_index= False).mean()**
~~df.groupby(['price'],as_index= False).mean()~~
~~mean(df.groupby(['price','body-style'],as_index= False))~~
Correct
pivot()
- Pandas method
- One variable displayed along the columns and the other variable displayed along the rows.
df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')
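A minimal sketch of the pivot step, assuming a made-up grouped dataframe with the same three columns. Because no `values` argument is given, the resulting columns form a two-level index whose first level is `price`.

```python
import pandas as pd

# Hypothetical grouped data; the values are made up for illustration.
df_grp = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd"],
    "body-style": ["hatchback", "sedan", "hatchback", "sedan"],
    "price": [8000.0, 11000.0, 18000.0, 20000.0],
})

# Rows become drive-wheels, columns become body-style
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style")
print(df_pivot)

# Cells are addressed by row label and the (value, column-category) pair
rwd_sedan_price = df_pivot.loc["rwd", ("price", "sedan")]
```

Passing `values="price"` as well would give plain single-level column labels instead.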
Heatmap
- Plot target variable over multiple variables
plt.pcolor(df_pivot, cmap='RdBu')
plt.colorbar()
plt.show()
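The default heatmap has numeric tick positions rather than category names. A sketch of one way to label the axes follows, assuming a small made-up pivot table; the `Agg` backend is only chosen so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical pivot table; the values are made up for illustration.
df_pivot = pd.DataFrame(
    [[8000.0, 11000.0], [18000.0, 20000.0]],
    index=pd.Index(["fwd", "rwd"], name="drive-wheels"),
    columns=pd.Index(["hatchback", "sedan"], name="body-style"),
)

fig, ax = plt.subplots()
mesh = ax.pcolor(df_pivot.to_numpy(), cmap="RdBu")

# Put ticks at the cell centres and label them with the category names
ax.set_xticks(np.arange(df_pivot.shape[1]) + 0.5)
ax.set_yticks(np.arange(df_pivot.shape[0]) + 0.5)
ax.set_xticklabels(df_pivot.columns)
ax.set_yticklabels(df_pivot.index)
fig.colorbar(mesh)
```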
Question
Select the appropriate description of a pivot table:
**A pivot table has one variable displayed along the columns and the other variable displayed along the rows.**
~~A pivot table contains statistical information for each column~~
Correct
Correlation
What is Correlation?
- Measures to what extent different variables are interdependent.
- For example:
- Lung cancer → Smoking
- Rain → Umbrella
- Correlation doesn't imply causation
Correlation - Positive Linear Relationship
- Correlation between two features (engine-size and price)
sns.regplot(x='engine-size', y='price', data=df)
plt.ylim(0, )
Correlation - Negative Linear Relationship
- Correlation between two features (highway-mpg and price)
sns.regplot(x='highway-mpg', y='price', data=df)
plt.ylim(0, )
Correlation - Weak Linear Relationship
- Weak correlation between two features (peak-rpm and price)
sns.regplot(x='peak-rpm', y='price', data=df)
plt.ylim(0, )
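The three cases above can also be compared numerically. This is a sketch on made-up values chosen to mimic a positive, a negative, and a weak relationship; it uses pandas' `DataFrame.corr()`, which computes Pearson correlation by default.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({
    "engine-size": [90, 110, 130, 150, 180, 200],
    "highway-mpg": [43, 38, 33, 29, 25, 20],
    "peak-rpm": [5000, 5800, 4800, 5500, 5200, 5400],
    "price": [7000, 9500, 13000, 17000, 24000, 30000],
})

# Pearson correlation of every column with price
corr_with_price = df.corr()["price"]
print(corr_with_price)
```

In this toy data, engine-size comes out strongly positive, highway-mpg strongly negative, and peak-rpm close to zero.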
Correlation - Statistics
Pearson Correlation
- Measures the strength of the correlation between two features
  - Correlation coefficient
  - P-value
- Correlation coefficient
  - Close to +1: Large positive relationship
  - Close to -1: Large negative relationship
  - Close to 0: No relationship
- Strong correlation:
  - Correlation coefficient close to 1 or -1
  - P-value less than 0.001
- P-value
  - P-value < 0.001: Strong certainty in the result
  - P-value < 0.05: Moderate certainty in the result
  - P-value < 0.1: Weak certainty in the result
  - P-value > 0.1: No certainty in the result
Pearson Correlation
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
- Pearson correlation: 0.81
- P-value: 9.35e-48
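A self-contained sketch of the same call, assuming SciPy is installed and using made-up horsepower and price values, so the numbers will differ from the course result quoted above.

```python
from scipy import stats

# Hypothetical toy data; the values are made up for illustration.
horsepower = [68, 88, 102, 115, 145, 182]
price = [7000, 9500, 13000, 17000, 24000, 30000]

# pearsonr returns the correlation coefficient and the two-sided p-value
pearson_coef, p_value = stats.pearsonr(horsepower, price)
print("Pearson correlation:", pearson_coef)
print("P-value:", p_value)
```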
Correlation - Heatmap
Association between two categorical variables: Chi-Square
Categorical variables
- Use the Chi-square Test for Association (denoted as $\chi^2$)
- The test is intended to test how likely it is that an observed distribution is due to chance.
Chi-square Test for Association
- The Chi-square test tests the null hypothesis that the variables are independent.
- The Chi-square test does not tell you the type of relationship that exists between the two variables, only that a relationship exists.
Categorical variables
scipy.stats.chi2_contingency(cont_table, correction=True)
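A sketch of how `cont_table` might be built and passed to the test. The two categorical columns and their values are made up for illustration, and `pd.crosstab` is one assumed way of producing the contingency table.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({
    "fuel-type": ["gas"] * 6 + ["diesel"] * 4,
    "aspiration": ["std", "std", "std", "std", "std", "turbo",
                   "turbo", "turbo", "turbo", "std"],
})

# Contingency table: observed counts for each pair of categories
cont_table = pd.crosstab(df["fuel-type"], df["aspiration"])

# Returns the test statistic, p-value, degrees of freedom, and expected counts
chi2, p_value, dof, expected = stats.chi2_contingency(cont_table, correction=True)
print(cont_table)
print("chi2:", chi2, "p-value:", p_value, "dof:", dof)
```

A small p-value would suggest rejecting the null hypothesis of independence; with counts this small the result is only illustrative.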
Lesson Summary
In this lesson, you have learned how to:
Describe Exploratory Data Analysis: By summarizing the main characteristics of the data and extracting valuable insights.
Compute basic descriptive statistics: Calculate the mean, median, and mode using Python and use them as a basis for understanding the distribution of the data.
Create data groups: How and why you put continuous data in groups and how to visualize them.
Define correlation as the linear association between two numerical variables: Use Pearson correlation as a measure of the correlation between two continuous variables.
Define the association between two categorical variables: Understand how to find the association of two variables using the Chi-square test for association and how to interpret them.