Data analysis and visualization - sparklabnyc/resources GitHub Wiki

Data analysis

What is Data Analysis?

Data analysis is the process of systematically applying statistical and/or logical techniques to describe, summarize, and evaluate data. It helps reduce the risks inherent in decision-making by providing useful insights and statistics, often presented in charts, images, tables, and graphs.

Types of Data Analysis

Descriptive Analysis: Descriptive analysis involves summarizing and describing the main features of a dataset. It focuses on organizing and presenting the data in a meaningful way, often using measures such as mean, median, mode, and standard deviation. It provides an overview of the data and helps identify patterns or trends.
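
As a minimal sketch of these summary measures, the snippet below uses Python's standard `statistics` module on a small made-up sample. The `values` list is illustrative data only, not taken from any real project.

```python
import statistics

# Hypothetical sample: daily step counts (in thousands) for one week.
values = [4.0, 7.5, 7.5, 8.0, 9.0, 10.0, 12.0]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "stdev": statistics.stdev(values),  # sample standard deviation (divides by n - 1)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Comparing the mean with the median is a quick way to spot skew: a mean noticeably above the median suggests a few large values are pulling it up.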

Inferential Analysis: Inferential analysis aims to make inferences or predictions about a larger population based on sample data. It involves applying statistical techniques such as hypothesis testing, confidence intervals, and regression analysis. It helps generalize findings from a sample to a larger population.
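
One way to illustrate generalizing from a sample is a confidence interval for a mean. The sketch below assumes a normal approximation and uses a made-up `sample`; the helper name `mean_confidence_interval` is hypothetical. For small samples a t-distribution would usually be more appropriate.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `sample`."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value from the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error

# Hypothetical measurements from a sample of ten units.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 4.7, 5.9, 5.2]
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

Raising `confidence` widens the interval, which reflects the trade-off between certainty and precision.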

Exploratory Data Analysis (EDA): EDA focuses on exploring and understanding the data without preconceived hypotheses. It involves visualizations, summary statistics, and data profiling techniques to uncover patterns, relationships, and interesting features. It helps generate hypotheses for further analysis.

Diagnostic Analysis: Diagnostic analysis aims to understand the cause-and-effect relationships within the data. It investigates the factors or variables that contribute to specific outcomes or behaviors. Techniques such as regression analysis, ANOVA (Analysis of Variance), or correlation analysis are commonly used in diagnostic analysis. These techniques show association on their own; supporting a causal claim usually also requires a suitable study design or additional assumptions.

Predictive Analysis: Predictive analysis involves using historical data to make predictions or forecasts about future outcomes. It utilizes statistical modeling techniques, machine learning algorithms, and time series analysis to identify patterns and build predictive models. It is often used for forecasting sales, predicting customer behavior, or estimating risk.

Prescriptive Analysis: Prescriptive analysis goes beyond predictive analysis by recommending actions or decisions based on the predictions. It combines historical data, optimization algorithms, and business rules to provide actionable insights and optimize outcomes. It helps in decision-making and resource allocation.

Make sure to "clean" or preprocess your data for effective analysis and visualization.
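
What "cleaning" means depends on the dataset, but a rough sketch of some typical steps (trimming whitespace, normalizing case, dropping missing or non-numeric values, removing duplicates) might look like the following. The field names `name` and `value`, the helper `clean_records`, and the rows themselves are hypothetical.

```python
def clean_records(rows):
    """Return cleaned rows: normalized names, numeric values, no blanks or duplicates."""
    cleaned = []
    seen = set()
    for row in rows:
        name = (row.get("name") or "").strip().lower()
        raw_value = (row.get("value") or "").strip()
        if not name or not raw_value:
            continue  # drop rows with a missing field
        try:
            value = float(raw_value)
        except ValueError:
            continue  # drop rows whose value is not numeric
        key = (name, value)
        if key in seen:
            continue  # drop exact duplicates after normalization
        seen.add(key)
        cleaned.append({"name": name, "value": value})
    return cleaned

# Hypothetical raw rows, as they might arrive from a spreadsheet export.
raw_rows = [
    {"name": " Site A ", "value": "12.5"},
    {"name": "site a", "value": "12.5"},   # duplicate once normalized
    {"name": "Site B", "value": ""},       # missing value
    {"name": "Site C", "value": "n/a"},    # non-numeric value
    {"name": "Site D", "value": " 7 "},
]
cleaned_rows = clean_records(raw_rows)
print(cleaned_rows)
```

Silently dropping rows is a design choice made here for brevity; in practice it is often better to log or count what was removed so the cleaning step can be reported.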

Statistical analysis

Statistics is a branch of mathematics that is used to summarize data and compare differences seen between samples and groups. Statistical tests can be used to check whether patterns you observe are likely to be real or simply due to random variation.

Statistical tests are used to compare values between groups. The result of a statistical test is often a "p-value," which is the probability of seeing a difference at least as large as the one observed if there were truly no difference between the groups (the "null hypothesis"). A small p-value means the observed data would be unusual under random variation alone; it is not the probability that the result itself is due to chance.

Statistical tests often make assumptions that need to be met for their results to be valid, and low p-values (conventionally below 0.05) are generally used to determine if something is "statistically significant."
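
To make the idea of a p-value concrete, here is a minimal sketch of a two-sided permutation test for a difference in means. It repeatedly shuffles the group labels and counts how often the shuffled difference is at least as large as the observed one. The function name `permutation_p_value`, the default number of permutations, the seed, and the two groups are all assumptions chosen for illustration.

```python
import random

def _mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical measurements from two groups.
control = [1, 2, 3, 4, 5, 6]
treated = [11, 12, 13, 14, 15, 16]
p_value = permutation_p_value(control, treated)
print(f"estimated p-value: {p_value:.4f}")
```

A permutation test makes fewer distributional assumptions than a t-test, at the cost of more computation.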

Below are some of the common types of tests:

  • Comparing quantitative measurements between two populations
    • Examples: t-test, z-test, proportion tests, regression, permutation testing
  • Comparing measurements or counts across more than two qualitative categories
    • Examples: ANOVA (quantitative measurements), chi-squared tests such as goodness of fit (counts per category)
  • Summarizing many quantitative variables measured on observations that fall under different qualitative categories
    • Example: Principal component analysis (strictly a dimensionality-reduction method rather than a significance test)

FYI: Many statistical tests assume certain conditions (for example normality, equal variances, and independent observations).

Visualization

Data visualization is the process of creating summaries of datasets in the form of images or graphics. There are two central approaches to data visualization:

  • Exploratory data visualization is used to understand patterns or trends in data. These may depict averages, most common values, variability, and range of data.
  • Explanatory data visualization is used to communicate data to an audience. This can be used to compare values between populations, or to depict trends or relationships between variables.

By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

The common types of data visualization are listed below:

  • Bar chart: used to compare a quantitative value across categories
  • Pie chart: used to display the proportions of a whole
  • Line chart: used to show information that changes over time
  • Scatterplot: displays values for typically two variables for a set of data. A trend line, when added, summarizes the overall relationship between the two variables
  • Box plot: a method for graphically depicting groups of numerical data through their quartiles
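
As an illustration of what a box plot encodes, the sketch below computes the quartiles and flags points beyond 1.5 times the interquartile range, a common convention for drawing whiskers and outliers. The helper `box_plot_summary` and the `measurements` list are hypothetical, and plotting libraries may use slightly different quartile conventions than `statistics.quantiles`.

```python
import statistics

def box_plot_summary(data):
    """Return the values a box plot typically displays for `data`."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the box are commonly drawn as outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "outliers": outliers,
    }

# Hypothetical measurements with one unusually large value.
measurements = [12, 15, 14, 10, 18, 21, 13, 16, 45]
box = box_plot_summary(measurements)
print(box)
```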

Common statistical analysis models and methods

At Spark Lab NYC, the key method that we use is Bayesian modelling, which is covered in its own section of the wiki.

Other commonly used approaches:

  • Regression
  • Decision tree
  • Multi-linear models
  • Spatial analysis
  • Principal Component Analysis
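
As a small illustration of the first approach in the list, the sketch below fits a straight line by ordinary least squares using only the standard library. The function name `fit_line` and the data points are made up for this example; real analyses would normally use a statistics package that also reports uncertainty.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.1, 7.0, 9.2, 10.8]
slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept  # prediction for a new x value
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction at x=6: {predicted:.2f}")
```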

Figures and graphs

There is no single recipe for creating your own figures, since the right choice depends on the findings and the project. I suggest looking at similar research as a basis for your figures. Standard best practices apply: label everything, choose a visualization type that suits the data, use consistent colors and scales, include error bars or confidence intervals, and avoid overly complex diagrams that might distort interpretation.

Tools and resources

Here are some commonly used tools for data cleaning, statistical analysis, and visualization.