01.Association01.Exploratory analysis - sporedata/researchdesigneR GitHub Wiki

1. Use cases: in which situations should I use this method?

  1. To understand the dataset and the patient sample better, e.g., the frequency and percentage of each categorical variable and the mean or median of each continuous variable (usually accompanied by its standard deviation or interquartile range, respectively).
  2. To describe the burden of disease.
  3. To examine relationships between variables, frequencies that reveal near-zero-variance (nzv) variables, and patterns of missing data that inform imputation.
  4. To search for mistakes in the data.
  5. To look at distributions to check assumptions.
  6. To choose appropriate models.
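
The first use cases above can be sketched in a few lines of base R. This is a minimal illustration, not a prescribed workflow: the `patients` data frame, its column names, and its values are hypothetical, and the "dominant share" check is only a simple stand-in for a formal near-zero-variance screen.

```r
# Hypothetical toy data: names and values are illustrative only.
patients <- data.frame(
  sex  = c("F", "M", "F", "F", "M", NA),
  age  = c(34, 51, 47, NA, 62, 29),
  site = c("A", "A", "A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Categorical variable: frequency and percentage, with missing values shown explicitly.
sex_counts  <- table(patients$sex, useNA = "ifany")
sex_percent <- round(100 * prop.table(sex_counts), 1)

# Continuous variable: mean with SD, and median with IQR.
age_summary <- c(
  mean   = mean(patients$age, na.rm = TRUE),
  sd     = sd(patients$age, na.rm = TRUE),
  median = median(patients$age, na.rm = TRUE),
  iqr    = IQR(patients$age, na.rm = TRUE)
)

# Number of missing values per column.
missing_per_column <- colSums(is.na(patients))

# Share of the most common value per column; values close to 1 suggest near-zero variance.
dominant_share <- sapply(patients, function(x) max(table(x)) / sum(!is.na(x)))

print(sex_counts)
print(sex_percent)
print(round(age_summary, 2))
print(missing_per_column)
print(round(dominant_share, 2))
```

Here `site` has the highest dominant share, which is the kind of variable a near-zero-variance check would flag for review.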

A very important aspect of exploratory analysis is to always perform the analysis again once variables are transformed.
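
One way to make this re-analysis routine is to wrap the summary in a small function and call it before and after each transformation. The sketch below assumes a simulated right-skewed variable and a log transformation; the helper name `describe_numeric` and the variable `length_of_stay` are hypothetical.

```r
# Hypothetical helper that returns the same summary every time it is called.
describe_numeric <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  c(
    n        = length(x),
    mean     = mean(x),
    median   = median(x),
    sd       = sd(x),
    # Moment-based skewness; roughly 0 for a symmetric distribution.
    skewness = mean(centered^3) / mean(centered^2)^1.5
  )
}

set.seed(42)
# Simulated right-skewed variable (illustrative only).
length_of_stay <- rlnorm(200, meanlog = 1.5, sdlog = 0.8)

before <- describe_numeric(length_of_stay)
after  <- describe_numeric(log(length_of_stay))

# Side-by-side comparison of the variable before and after the transformation.
print(round(rbind(before, after), 2))
```

Comparing the two rows shows whether the transformation had the intended effect, for example whether the skewness moved toward zero.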

2. Input: what kind of data does the method require?

  1. There are no prerequisites other than having data, although specific methods have data-type requirements.

3. Algorithm: how does the method work?

Model mechanics

Describing in words

  1. Exploratory analyses involve measures and plots that assist in better understanding the data prior to more complex modeling.
  2. They should be repeated every time the data are modified; otherwise, subsequent models may behave unexpectedly.

In general, datasets display variables in columns and patient records in rows. A patient's information can be contained in a single row, but in some situations it spans multiple rows per patient. Examples include patients with multiple encounters (one row per encounter) and encounters with multiple events (one row for each lab exam, medication, etc.).
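
Before summarizing, it helps to check which of these layouts a dataset uses. The following base R sketch assumes a hypothetical encounter-level table (`encounters`, with made-up columns `patient_id`, `encounter_date`, and `systolic_bp`), counts rows per patient, and collapses the data to one row per patient.

```r
# Hypothetical encounter-level data: one row per encounter, several rows per patient.
encounters <- data.frame(
  patient_id     = c(1, 1, 2, 3, 3, 3),
  encounter_date = as.Date(c(
    "2020-01-05", "2020-03-10", "2020-02-01",
    "2020-01-20", "2020-04-02", "2020-06-15"
  )),
  systolic_bp    = c(130, 126, 118, 145, 150, 139)
)

# How many rows does each patient contribute?
rows_per_patient <- table(encounters$patient_id)

# TRUE means the data are not in a one-row-per-patient layout.
has_repeated_rows <- any(duplicated(encounters$patient_id))

# Collapse to one row per patient: number of encounters and mean systolic pressure.
per_patient <- data.frame(
  patient_id   = as.numeric(names(rows_per_patient)),
  n_encounters = as.vector(rows_per_patient),
  mean_sbp     = as.vector(tapply(encounters$systolic_bp, encounters$patient_id, mean))
)

print(rows_per_patient)
print(has_repeated_rows)
print(per_patient)
```

Whether to summarize at the row level or the patient level depends on the research question; the point of the check is to make that choice explicit.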

Describing in images

  • Exploratory data analysis could work as a lighthouse, shedding light on potential avenues for further detailed research.

Describing with code

  • Tables:
    • gt is a package that emphasizes simple functions for everyday display-table needs, such as modifying the locations of outputs, formatting text and background shading, and adding titles, subtitles, captions, etc. [4].
    • kableExtra builds on the kable output from the knitr package. It helps to build common complex tables and to manipulate table styles [5].
    • formattable adds colors, background shading, bars, and other features that move tables beyond hulking blocks of text. It also applies formatting to vectors and data frames to convey more information through an easier, richer, and more flexible presentation [6].
    • DT adds filtering, pagination, scrolling, and sorting to tables, helping readers make sense of your results [7].
    • reactable is similar to DT, but more visually appealing [8].
    • flextable is a package for tables in documents knitted to Word. It provides a straightforward syntax to shape the content and format of tables [9].
    • huxtable produces tables with a friendly, modern interface. Cells can span multiple rows and/or columns, and can be manipulated with standard R subsetting or dplyr functions [10].
    • rhandsontable offers Excel-like features such as data validation, sorting, grouping, data binding, formula support, and column ordering. Users can even edit the data in tables made with this package [11].
    • pixiedust enables you to remove excess decimal places, display p-values without scientific notation, and more [12].
    • skimr is a package for summary statistics [13].
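
As a minimal sketch of how a summary table might be produced, the code below builds a small data frame of means and standard deviations and renders it with `knitr::kable()` when knitr is installed, falling back to a plain print otherwise. The `toy` data and the caption are hypothetical, and the other packages listed above offer richer styling than this example shows.

```r
# Hypothetical toy data for illustration.
toy <- data.frame(
  age = c(34, 51, 47, 62, 29),
  bmi = c(22.4, 30.1, 27.8, 31.5, 24.9)
)

# One row per variable, with its mean and standard deviation.
summary_table <- data.frame(
  variable = names(toy),
  mean     = unname(sapply(toy, mean)),
  sd       = unname(sapply(toy, sd))
)

if (requireNamespace("knitr", quietly = TRUE)) {
  # kable() returns a formatted table; digits controls rounding of numeric columns.
  print(knitr::kable(
    summary_table,
    format  = "markdown",
    digits  = 1,
    caption = "Summary of continuous variables"
  ))
} else {
  # Fallback when knitr is not available.
  print(summary_table, digits = 3, row.names = FALSE)
}
```

Keeping the summary in a data frame first means the same object can be passed to any of the table packages above.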

Data science packages

  • Some GUIs are available, but they do not support reproducible analyses.
  • patchwork, a package for composing multiple plots [1].
  • janitor, a package for examining and cleaning dirty data [2].
  • wesanderson, a color palette generator for plots in R [3].
  • Exploratory analysis often indicates the need to categorize numeric variables. The package cutpointr provides tools to determine optimal cutpoints.
  • skimr, a package to generate an overview of a dataset [13].
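
To illustrate what an "optimal cutpoint" means, the base R sketch below picks the threshold that maximizes Youden's J (sensitivity + specificity - 1) and then categorizes the variable. It does not use cutpointr and is not that package's implementation; the `biomarker` and `outcome` vectors are hypothetical, and Youden's J is only one of several possible criteria.

```r
# Hypothetical data: a numeric biomarker and a binary outcome (1 = event).
biomarker <- c(1.2, 1.8, 2.1, 2.5, 3.0, 3.4, 3.9, 4.3, 4.8, 5.5)
outcome   <- c(0,   0,   0,   0,   1,   1,   0,   1,   1,   1)

# Candidate cutpoints: every observed value; "positive" means biomarker >= cutpoint.
candidates <- sort(unique(biomarker))

youden <- sapply(candidates, function(cutpoint) {
  predicted   <- biomarker >= cutpoint
  sensitivity <- sum(predicted & outcome == 1) / sum(outcome == 1)
  specificity <- sum(!predicted & outcome == 0) / sum(outcome == 0)
  sensitivity + specificity - 1
})

# Cutpoint with the largest Youden's J (the first one if there are ties).
best_cutpoint <- candidates[which.max(youden)]

# Categorize the numeric variable at the chosen cutpoint.
biomarker_group <- cut(
  biomarker,
  breaks = c(-Inf, best_cutpoint, Inf),
  right  = FALSE,
  labels = c("low", "high")
)

print(best_cutpoint)
print(table(biomarker_group, outcome))
```

A cutpoint chosen this way is tuned to the sample at hand, so it is worth re-checking on new data before relying on it.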

Suggested companion methods

  • Exploratory analysis supports every other method. Subsequent analyses rest on assumptions about the data, so you need to know the distribution of the numeric variables and the proportions of the categorical ones before applying them.

Learning materials

  1. Books
  2. Articles

4. Output: how do I interpret this method's results?

Tables, plots, and their interpretation

Some examples of plots and tables are available for visualization.

Metaphors

Lighthouse: exploratory data analysis can be instrumental for quality control, pointing towards potential issues or ideas that analysts can then investigate in detail.

Reporting guidelines

5. SporeData-specific

Data science functions

  • sdatools::histogram
  • sdatools::boxPlot
  • sdatools::scatterPlot
  • sdatools::barPlot
  • sdatools::stackedBarPlot
  • sdatools::piratePlot
  • sdatools::likertPlot
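
The arguments of the sdatools functions are not documented on this page, so the sketch below does not call them. Instead, it shows base R graphics that produce the same basic kinds of plot (histogram, box plot, scatter plot, bar plot) on a simulated dataset; the `toy` data frame and its columns are hypothetical.

```r
set.seed(1)
# Simulated data for illustration only.
toy <- data.frame(
  age   = round(rnorm(100, mean = 50, sd = 12)),
  group = sample(c("control", "treatment"), 100, replace = TRUE)
)
toy$sbp <- 100 + 0.6 * toy$age + rnorm(100, sd = 8)

# Histogram: distribution of one numeric variable.
hist(toy$age, main = "Age", xlab = "Years")

# Box plot: a numeric variable compared across groups.
boxplot(sbp ~ group, data = toy, ylab = "Systolic BP")

# Scatter plot: relationship between two numeric variables.
plot(toy$age, toy$sbp, xlab = "Age", ylab = "Systolic BP")

# Bar plot: frequencies of a categorical variable.
barplot(table(toy$group), ylab = "Patients")
```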

References

[1] Pedersen TL. patchwork: The Composer of Plots. R package version. 2019;1(0):410.
[2] Firke S. Janitor: Simple tools for examining and cleaning dirty data. R package version. 2018;1(1).
[3] Ram K, Wickham H. wesanderson: A Wes Anderson palette generator. R package version 0.3.6.
[4] Iannone R, Cheng J, Schloerke B. gt: Easily Create Presentation-Ready Display Tables.
[5] Zhu H. kableExtra: Construct complex table with ‘kable’ and pipe syntax. R package version 1.1.0 (2019).
[6] Ren K, Russell K. formattable: Create ‘Formattable’ Data Structures. R package version 0.2.0.1. 2016.
[7] Xie Y, Cheng J, Tan X. DT: A wrapper of the Javascript library ‘Datatables’. R package version 0.4. 2018.
[8] Lin G. reactable: Interactive Data Tables for R. R package.
[9] Gohel D. Flextable: Functions for tabular reporting.
[10] Hugh-Jones D. huxtable: Easily create and style tables for LaTeX, HTML and other formats. R package version 4.7.1.
[11] Owen J. rhandsontable: Interface to the ‘Handsontable.js’ library. R package version 0.3.7. 2018.
[12] Nutter B. pixiedust: Tables so Beautifully Fine-Tuned You Will Believe It's Magic. R package.
[13] Quinn M. skimr - A frictionless, pipeable approach to dealing with summary statistics: a way to generate an overview of a dataset.
