7.3.1.Exploratory Data Analysis - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki

Exploratory Data Analysis (EDA)

Preliminary step in data analysis to:
- Summarize main characteristics of the data
- Gain better understanding of the data set
- Uncover relationships between variables
- Extract important variables
Question: "What are the characteristics which have the most impact on the car price?"

Descriptive Statistics

Describe the basic features of the data
Give short summaries of the sample and of measures of the data

Descriptive Statistics - `describe()`

Summarize statistics using the pandas **describe()** method: **df.describe()**
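
As a rough illustration of what `describe()` returns, the sketch below uses a small made-up table. The column values are hypothetical placeholders, not rows from the course data set. It shows the default numeric summary and the summary pandas produces for a text column.

```python
import pandas as pd

# Hypothetical toy data standing in for the car data set
df = pd.DataFrame({
    "engine-size": [130, 152, 109, 136],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
    "drive-wheels": ["rwd", "rwd", "fwd", "4wd"],
})

# By default only numeric columns are summarized:
# count, mean, std, min, 25%, 50%, 75%, max
numeric_summary = df.describe()

# Calling describe() on a text column gives count, unique, top, freq instead
categorical_summary = df["drive-wheels"].describe()

print(numeric_summary)
print(categorical_summary)
```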

Question

What happens if the describe() method is applied to a dataframe with NaN values?

~~an error will occur~~
~~all the statistics calculated using NaN values will also be NaN~~
NaN values will be excluded

Correct
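
The answer above can be checked with a small sketch. The single `price` column and its values are made up for illustration; one entry is deliberately missing.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({"price": [10000.0, np.nan, 20000.0, 30000.0]})

summary = df.describe()

# count covers only the three non-missing rows, and the mean is
# computed over those three values rather than returning NaN
print(summary.loc["count", "price"])
print(summary.loc["mean", "price"])
```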

Descriptive Statistics - `value_counts()`

Summarize categorical data by using the value_counts() method

drive_wheels_counts=df['drive-wheels'].value_counts().to_frame()

drive_wheels_counts.rename(columns={'drive-wheels':'value_counts'}, inplace=True)
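
A self-contained variant of the same idea is sketched below on made-up data. It names the counts column before converting to a frame, which is one way to avoid depending on the default column name (that default differs between pandas versions). This is an alternative arrangement, not the course's exact code.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"]})

# Count how many rows fall in each category (sorted by count, descending)
counts = df["drive-wheels"].value_counts()

# Name the Series first, then convert it to a one-column DataFrame
drive_wheels_counts = counts.rename("value_counts").to_frame()
drive_wheels_counts.index.name = "drive-wheels"

print(drive_wheels_counts)
```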

Descriptive Statistics - Scatter Plot

Each observation represented as a point
A scatter plot shows the relationship between two variables

Predictor/independent variables on x-axis
Target/dependent variables on y-axis
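
A minimal sketch of such a plot with matplotlib follows, assuming `engine-size` as the predictor and `price` as the target. The data points are invented for illustration, and the `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical observations: one point per car
df = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 181],
    "price": [7000, 9500, 13500, 16500, 22000],
})

fig, ax = plt.subplots()
# Predictor on the x-axis, target on the y-axis
ax.scatter(df["engine-size"], df["price"])
ax.set_xlabel("engine-size")
ax.set_ylabel("price")
ax.set_title("Engine size vs. price")
```

In an interactive session, `plt.show()` would display the figure.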

GroupBy in Python

Grouping data

Use the pandas **DataFrame.groupby()** method:
- Can be applied on categorical variables
- Group data into categories
- Single or multiple variables

`groupby()` - Example

df_test = df[['drive-wheels', 'body-style', 'price']]
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
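
To see what the grouped result looks like, here is the same call applied to a small made-up table (the rows are hypothetical). Each distinct combination of `drive-wheels` and `body-style` becomes one row holding the mean price of that group.

```python
import pandas as pd

# Hypothetical rows standing in for the car data set
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "rwd"],
    "body-style": ["sedan", "sedan", "sedan", "hatchback", "hatchback"],
    "price": [8000.0, 10000.0, 20000.0, 15000.0, 17000.0],
})

# Keep only the grouping columns and the column to aggregate
df_test = df[["drive-wheels", "body-style", "price"]]

# as_index=False keeps the group keys as ordinary columns
df_grp = df_test.groupby(["drive-wheels", "body-style"], as_index=False).mean()

print(df_grp)
```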

Question

How would you use the groupby function to find the average "price" of each car based on "body-style" ?

**df[['price','body-style']].groupby(['body-style'],as_index= False).mean()**
~~df.groupby(['price'],as_index= False).mean()~~
~~mean(df.groupby(['price','body-style'],as_index= False))~~

Correct

Pandas method - `pivot()`

One variable is displayed along the columns and the other variable is displayed along the rows.

df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')
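
The reshaping can be sketched on a small made-up grouped table. Passing `values="price"` is a choice made here to get flat column labels; the call in the notes omits it. Combinations that do not occur in the data come out as NaN.

```python
import pandas as pd

# Hypothetical grouped result: one row per (drive-wheels, body-style) pair
df_grp = pd.DataFrame({
    "drive-wheels": ["fwd", "rwd", "rwd"],
    "body-style": ["sedan", "hatchback", "sedan"],
    "price": [9000.0, 16000.0, 20000.0],
})

# Rows are drive-wheels, columns are body-style, cells are mean price
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style", values="price")

print(df_pivot)
```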

Heatmap

Plot target variable over multiple variables

plt.pcolor(df_pivot, cmap='RdBu')
plt.colorbar()
plt.show()
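
A self-contained version of the heatmap is sketched below, assuming a small hand-built pivot table with invented values. The tick-label lines are an addition for readability and are not part of the snippet above; the `Agg` backend is used only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical pivot table: mean price by drive-wheels (rows) and body-style (columns)
df_pivot = pd.DataFrame(
    [[9000.0, 7500.0], [16000.0, 20000.0]],
    index=pd.Index(["fwd", "rwd"], name="drive-wheels"),
    columns=pd.Index(["hatchback", "sedan"], name="body-style"),
)

fig, ax = plt.subplots()
mesh = ax.pcolor(df_pivot.to_numpy(), cmap="RdBu")

# Centre one tick on each cell and label it with the category name
ax.set_xticks([i + 0.5 for i in range(df_pivot.shape[1])])
ax.set_xticklabels(list(df_pivot.columns))
ax.set_yticks([i + 0.5 for i in range(df_pivot.shape[0])])
ax.set_yticklabels(list(df_pivot.index))

# The colorbar maps colors back to price values
fig.colorbar(mesh, ax=ax)
```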

Question

Select the appropriate description of a pivot table:

A pivot table has one variable displayed along the columns and the other variable displayed along the rows.
~~A pivot table contains statistical information for each column~~

Correct

Correlation

What is Correlation?

Measures to what extent different variables are interdependent.
For example:
- Lung cancer → Smoking
- Rain → Umbrella
Correlation doesn't imply causation

Correlation - Positive Linear Relationship

Correlation between two features (engine-size and price)

sns.regplot(x='engine-size', y='price', data=df)
plt.ylim(0, )

Correlation - Negative Linear Relationship

Correlation between two features (highway-mpg and price)

sns.regplot(x='highway-mpg', y='price', data=df)
plt.ylim(0, )

Correlation - Weak Linear Relationship

Weak correlation between two features (peak-rpm and price)

sns.regplot(x='peak-rpm', y='price', data=df)
plt.ylim(0, )

Correlation - Statistics

Pearson Correlation

Measure the strength of the correlation between two features
- Correlation coefficient
- P-value
Correlation coefficient
- Close to +1: Large Positive relationship
- Close to -1: Large Negative relationship
- Close to 0: No relationship
Strong Correlation:
- Correlation coefficient close to 1 or -1
- P value less than 0.001
P-value
- P-value < 0.001 Strong certainty in the result
- P-value < 0.05 Moderate certainty in the result
- P-value < 0.1 Weak certainty in the result
- P-value > 0.1 No certainty in the result
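
For reference, the standard definition of the Pearson correlation coefficient for $n$ paired observations $(x_i, y_i)$ is given below. The symbols $x_i$, $y_i$, $\bar{x}$, and $\bar{y}$ (the sample means) are notation introduced here, not taken from the notes.

$r = \displaystyle\frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^n (y_i-\bar{y})^2}}$

The numerator is positive when the two variables tend to be above or below their means together and negative when they move in opposite directions; the denominator scales the result to the range from -1 to +1.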

Pearson Correlation

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])

Pearson correlation: 0.81
P-value: 9.35e-48
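
A runnable sketch of the same call on made-up arrays follows. The numbers are hypothetical, so the resulting coefficient and P-value will not match the figures quoted above. The manual calculation is included only to show that `pearsonr` agrees with the textbook definition of the coefficient.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations
horsepower = np.array([68, 88, 102, 111, 154, 182, 207], dtype=float)
price = np.array([6500, 8200, 13950, 13495, 16500, 23875, 32500], dtype=float)

# Returns the correlation coefficient and the two-sided p-value
pearson_coef, p_value = stats.pearsonr(horsepower, price)

# Same coefficient computed directly from deviations about the means
dx = horsepower - horsepower.mean()
dy = price - price.mean()
manual_coef = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print("Pearson correlation:", pearson_coef)
print("P-value:", p_value)
```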

Correlation - Heatmap
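
One common starting point for a correlation heatmap is the pairwise correlation matrix from `DataFrame.corr()`, which uses Pearson correlation by default. The sketch below computes it for three invented numeric columns; the resulting matrix could then be passed to a plotting call such as `plt.pcolor`. This is an assumed workflow, since the notes do not show the code behind this heading.

```python
import pandas as pd

# Hypothetical numeric columns
df = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 181],
    "highway-mpg": [37, 34, 30, 25, 22],
    "price": [7000.0, 9500.0, 13500.0, 16500.0, 22000.0],
})

# Square matrix of pairwise Pearson coefficients; the diagonal is 1
corr = df.corr()

print(corr)
```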

Association between two categorical variables: Chi-Square

Categorical variables

Use the Chi-square Test for Association (denoted as $\chi^2$)
The test is intended to test how likely it is that an observed distribution is due to chance.

Chi-square Test for Association

The Chi-square test checks the null hypothesis that the variables are independent.
The Chi-square test does not tell you the type of relationship that exists between the two variables, only whether there is evidence that a relationship exists.

Categorical variables

$\chi^2 = \displaystyle\sum_{i=1}^n \frac {(O_i-E_i)^2} {E_i}$

scipy.stats.chi2_contingency(cont_table, correction=True)
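
A minimal end-to-end sketch is given below, assuming the contingency table is built with `pd.crosstab` from two categorical columns. The column names and rows are made up for illustration. Under independence, each expected count is (row total × column total) / grand total, which is what the returned `expected` array contains.

```python
import pandas as pd
from scipy import stats

# Hypothetical categorical data: 10 cars
df = pd.DataFrame({
    "fuel-type": ["gas"] * 6 + ["diesel"] * 4,
    "aspiration": ["std", "std", "std", "std", "turbo", "turbo",
                   "std", "turbo", "turbo", "turbo"],
})

# Observed counts for every combination of the two variables
cont_table = pd.crosstab(df["fuel-type"], df["aspiration"])

# correction=True applies Yates' continuity correction for 2x2 tables
chi2, p_value, dof, expected = stats.chi2_contingency(cont_table, correction=True)

print(cont_table)
print("chi2:", chi2, "p-value:", p_value, "dof:", dof)
print(expected)
```

With only ten rows the expected counts are small, so the P-value here should not be read as a meaningful result; the point is the shape of the inputs and outputs.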

Lesson Summary

In this lesson, you have learned how to:

Describe Exploratory Data Analysis: By summarizing the main characteristics of the data and extracting valuable insights.

Compute basic descriptive statistics: Calculate the mean, median, and mode using Python and use them as a basis for understanding the distribution of the data.

Create data groups: How and why you put continuous data in groups and how to visualize them.

Define correlation as the linear association between two numerical variables: Use Pearson correlation as a measure of the correlation between two continuous variables.

Define the association between two categorical variables: Understand how to find the association between two variables using the Chi-square test for association and how to interpret the result.
