7.3.1.Exploratory Data Analysis - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki

Exploratory Data Analysis (EDA)

  • Preliminary step in data analysis to:
    • Summarize main characteristics of the data
    • Gain better understanding of the data set
    • Uncover relationships between variables
    • Extract important variables
  • Question: "What are the characteristics which have the most impact on the car price?"

Descriptive Statistics

  • Describe basic features of data
  • Give short summaries of the sample and its measures

Descriptive Statistics - describe()

  • Summarize statistics with the pandas **describe()** method: **df.describe()**

Question

What happens if the describe() method is applied to a dataframe with NaN values?

  • ~~an error will occur~~
  • ~~all the statistics calculated using NaN values will also be NaN~~
  • **NaN values will be excluded**

Correct
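
The answer above can be checked on a small example. This is a minimal sketch on hypothetical toy data (the frame `df_demo` and its values are made up for illustration, only the column names follow the car data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
df_demo = pd.DataFrame({
    "price": [10000.0, 20000.0, np.nan, 30000.0],
    "horsepower": [100.0, 150.0, 200.0, np.nan],
})

summary = df_demo.describe()
print(summary)

# 'count' reports only the non-missing entries, and the mean skips NaN
print(summary.loc["count", "price"])
print(summary.loc["mean", "price"])
```

The `count` row is a quick way to spot columns with missing values, since it will be lower than the number of rows.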

Descriptive Statistics - value_counts()

  • Summarize categorical data by using the value_counts() method
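drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
# On pandas < 2.0 the counts column is named 'drive-wheels'; on pandas >= 2.0 it is already named 'count'
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)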
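
A self-contained sketch of the same idea, assuming a hypothetical toy frame `df_demo`. Here the Series is named before it is converted to a frame, which is one way to get the same column name regardless of the pandas version:

```python
import pandas as pd

# Hypothetical toy data: six cars and their drive-wheel type
df_demo = pd.DataFrame({"drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"]})

# Count how often each category occurs (sorted by count, descending)
counts = df_demo["drive-wheels"].value_counts()

# Name the Series first so the resulting column is always 'value_counts'
counts_frame = counts.rename("value_counts").to_frame()
counts_frame.index.name = "drive-wheels"
print(counts_frame)
```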

Descriptive Statistics - Scatter Plot

  • Each observation is represented as a point
  • Scatter plots show the relationship between two variables
  1. Predictor/independent variables on x-axis
  2. Target/dependent variables on y-axis

GroupBy in Python

Grouping data

  • Use the pandas **DataFrame.groupby()** method:
    • Can be applied on categorical variables
    • Group data into categories
    • Single or multiple variables

groupby() - Example

df_test = df[['drive-wheels', 'body-style', 'price']]
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
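
To see what the grouped result looks like, here is a minimal sketch on hypothetical toy data (the rows in `df_demo` are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the car data set
df_demo = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "rwd"],
    "body-style": ["sedan", "sedan", "sedan", "hatchback", "hatchback"],
    "price": [10000.0, 12000.0, 20000.0, 15000.0, 17000.0],
})

# Double brackets select several columns at once
df_test = df_demo[["drive-wheels", "body-style", "price"]]

# One row per (drive-wheels, body-style) combination, holding the mean price
df_grp = df_test.groupby(["drive-wheels", "body-style"], as_index=False).mean()
print(df_grp)
```

With `as_index=False` the grouping keys stay as ordinary columns, which keeps the result easy to pivot or plot afterwards.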

Question

How would you use the groupby function to find the average "price" of each car based on "body-style" ?

  • **df[['price','body-style']].groupby(['body-style'],as_index= False).mean()**
  • ~~df.groupby(['price'],as_index= False).mean()~~
  • ~~mean(df.groupby(['price','body-style'],as_index= False))~~

Correct

Pandas method - pivot()

  • One variable displayed along the columns and the other variable displayed along the rows.
df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')
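
A minimal sketch of the pivot step, assuming a hypothetical grouped frame. Unlike the call above, this sketch also passes `values='price'` so the result has plain column labels instead of a two-level column index:

```python
import pandas as pd

# Hypothetical grouped data: mean price per (drive-wheels, body-style) pair
df_grp = pd.DataFrame({
    "drive-wheels": ["fwd", "rwd", "rwd"],
    "body-style": ["sedan", "hatchback", "sedan"],
    "price": [11000.0, 16000.0, 20000.0],
})

# Rows: drive-wheels, columns: body-style, cells: mean price
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style", values="price")
print(df_pivot)

# Combinations that never occur in the data (here fwd/hatchback) show up as NaN
print(df_pivot.isna())
```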

Heatmap

  • Plot target variable over multiple variables
import matplotlib.pyplot as plt

plt.pcolor(df_pivot, cmap='RdBu')
plt.colorbar()
plt.show()
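
The plot above has no category labels on its axes. A sketch of one way to add them, assuming a hypothetical, fully populated pivot table (the values are invented) and the non-interactive Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical pivot table: mean price by drive-wheels (rows) and body-style (columns)
df_pivot = pd.DataFrame(
    {"hatchback": [8000.0, 16000.0], "sedan": [11000.0, 20000.0]},
    index=pd.Index(["fwd", "rwd"], name="drive-wheels"),
)

fig, ax = plt.subplots()
mesh = ax.pcolor(df_pivot.to_numpy(), cmap="RdBu")

# Cell centres sit at 0.5, 1.5, ... so the ticks are shifted by half a cell
ax.set_xticks([i + 0.5 for i in range(df_pivot.shape[1])])
ax.set_xticklabels(df_pivot.columns)
ax.set_yticks([i + 0.5 for i in range(df_pivot.shape[0])])
ax.set_yticklabels(df_pivot.index)

fig.colorbar(mesh, ax=ax)
fig.canvas.draw()
```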

Question

Select the appropriate description of a pivot table:

  • **A pivot table has one variable displayed along the columns and the other variable displayed along the rows.**
  • ~~A pivot table contains statistical information for each column~~

Correct

Correlation

What is Correlation?

  • Measures to what extent different variables are interdependent.
  • For example:
    • Lung cancer → Smoking
    • Rain → Umbrella
  • Correlation doesn't imply causation

Correlation - Positive Linear Relationship

  • Correlation between two features (engine-size and price)
import seaborn as sns

sns.regplot(x='engine-size', y='price', data=df)
plt.ylim(0, )

Correlation - Negative Linear Relationship

  • Correlation between two features (highway-mpg and price)
sns.regplot(x='highway-mpg', y='price', data=df)
plt.ylim(0, )

Correlation - Weak Correlation

  • Weak correlation between two features (peak-rpm and price)
sns.regplot(x='peak-rpm', y='price', data=df)
plt.ylim(0, )

Correlation - Statistics

Pearson Correlation

  • Measure the strength of the correlation between two features

    • Correlation coefficient
    • P-value
  • Correlation coefficient

    • Close to +1: Large Positive relationship
    • Close to -1: Large Negative relationship
    • Close to 0: No linear relationship
  • Strong Correlation:

    • Correlation coefficient close to 1 or -1
    • P value less than 0.001
  • P-value

    • P-value < 0.001 Strong certainty in the result
    • P-value < 0.05 Moderate certainty in the result
    • P-value < 0.1 Weak certainty in the result
    • P-value > 0.1 No certainty in the result

Pearson Correlation

from scipy import stats

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
  • Pearson correlation: 0.81
  • P-value: 9.35e-48
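
A self-contained sketch of the same call on hypothetical toy arrays (the numbers are invented, so the output will not match the values above). It also recomputes the coefficient from its definition as a sanity check:

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: horsepower and price for six cars
horsepower = np.array([70.0, 90.0, 110.0, 130.0, 160.0, 200.0])
price = np.array([7000.0, 9500.0, 10500.0, 14000.0, 17500.0, 24000.0])

pearson_coef, p_value = stats.pearsonr(horsepower, price)

# Pearson r by hand: sum of products of deviations,
# divided by the square root of the product of the sums of squared deviations
x_dev = horsepower - horsepower.mean()
y_dev = price - price.mean()
manual_coef = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

print("Pearson correlation:", pearson_coef)
print("Manual computation: ", manual_coef)
print("P-value:", p_value)
```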

Correlation - Heatmap
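
A correlation heatmap is typically drawn from the matrix of pairwise correlation coefficients. A minimal sketch of computing that matrix with `DataFrame.corr()`, assuming hypothetical toy data (the values in `df_demo` are invented):

```python
import pandas as pd

# Hypothetical toy data: three numerical features for six cars
df_demo = pd.DataFrame({
    "engine-size": [90, 110, 130, 150, 180, 210],
    "highway-mpg": [40, 36, 33, 30, 25, 20],
    "price": [7000, 9500, 10500, 14000, 17500, 24000],
})

# Pairwise Pearson correlation between all numerical columns
corr_matrix = df_demo.corr()
print(corr_matrix)
```

Each cell holds the coefficient for one pair of variables, and the diagonal is 1 because every variable is perfectly correlated with itself.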

Association between two categorical variables: Chi-Square

Categorical variables

  • Use the Chi-square Test for Association (denoted as $\chi^2$)
  • The test is intended to test how likely it is that an observed distribution is due to chance.

Chi-square Test for Association

  • The Chi-square test evaluates the null hypothesis that the variables are independent.
  • The Chi-square test does not tell you the type of relationship that exists between the two variables, only that a relationship exists.

Categorical variables

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed count and $E_i$ is the expected count in cell $i$ of the contingency table.

scipy.stats.chi2_contingency(cont_table, correction=True)
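
A minimal sketch of the test on a hypothetical 2x2 contingency table (the counts are invented). Note that with `correction=True` SciPy applies Yates' continuity correction when there is one degree of freedom, so this sketch uses `correction=False` to match the plain formula above:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = drive-wheels (fwd, rwd), columns = body-style (sedan, hatchback)
cont_table = np.array([[30, 10],
                       [15, 25]])

chi2, p_value, dof, expected = stats.chi2_contingency(cont_table, correction=False)

# Same statistic by hand: expected count = row total * column total / grand total
row_totals = cont_table.sum(axis=1, keepdims=True)
col_totals = cont_table.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / cont_table.sum()
chi2_manual = ((cont_table - expected_manual) ** 2 / expected_manual).sum()

print("chi2:", chi2, "manual:", chi2_manual)
print("p-value:", p_value, "degrees of freedom:", dof)
```

A small p-value is evidence against the null hypothesis of independence. It does not say how the two variables are related.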

Lesson Summary

In this lesson, you have learned how to:

Describe Exploratory Data Analysis: By summarizing the main characteristics of the data and extracting valuable insights.

Compute basic descriptive statistics: Calculate the mean, median, and mode using Python and use them as a basis for understanding the distribution of the data.

Create data groups: How and why you put continuous data in groups and how to visualize them.

Define correlation as the linear association between two numerical variables: Use Pearson correlation as a measure of the correlation between two continuous variables.

Define the association between two categorical variables: Understand how to find the association between two variables using the Chi-square test for association and how to interpret the result.