How to Import Data for Analysis - JonasEngstrom/overleaf-article-template GitHub Wiki

The most compelling reason to use R Markdown for writing articles is the ability to write text, analyze data programmatically, and produce graphics all in the same document. For this reason you will most likely want to import external datasets into your document. Here is one suggestion for how to do this in a convenient manner.

Use Data Packages

If you will be using the same data in several studies, or if your data requires cleaning and formatting before you use it (which, let’s be honest, most data does), I highly recommend storing it in a data package using lazy loading, as it allows you to write the import script once and be done with it.
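
As a rough illustration of what such a data package consists of, the sketch below builds a minimal package skeleton using base R only. The package name `studydata`, the dataset `patients`, and the inline data frame standing in for a raw file are all made up for this example; a real import script would read and clean your own files.

```r
# Hypothetical package name and location, for illustration only.
pkg_dir <- file.path(tempdir(), "studydata")
dir.create(file.path(pkg_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Minimal DESCRIPTION file; LazyData requests lazy loading of the datasets.
writeLines(c(
  "Package: studydata",
  "Title: Cleaned Data for Several Studies",
  "Version: 0.0.1",
  "Description: Holds cleaned datasets shared between articles.",
  "LazyData: true"
), file.path(pkg_dir, "DESCRIPTION"))

# Import and clean once. A tiny made-up data frame stands in for the raw file.
raw <- data.frame(
  id = c(1, 2, 3),
  weight_kg = c("70", "82", NA),
  stringsAsFactors = FALSE
)
patients <- raw[!is.na(raw$weight_kg), ]            # drop incomplete rows
patients$weight_kg <- as.numeric(patients$weight_kg) # fix the column type

# Store the cleaned object in data/, where R looks for package datasets.
save(patients, file = file.path(pkg_dir, "data", "patients.rda"))
```

Once the cleaned object lives in the package, documents that load the package should be able to refer to `patients` directly, without repeating the cleaning steps.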

Think About Privacy

Some data comes from public sources, some data should only be shared with a limited group of people, and some data should be kept completely private. For our purposes we will treat the first two cases the same.

Data That Can Be Shared

Data that can be shared can be added to a project as a Git submodule by running the following code:

git submodule add <address to data>

Provided that the root directory of the submodule above is the root directory of an R data package, it can be loaded in an R Markdown document by including the following code in the document:

library(devtools)
load_all('<name of submodule>')

The advantage of this method is that the submodule is pinned to a specific commit, so no changes are introduced in the data unintentionally.
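
If you want the rendered article to state which commit of the data was used, one option is to ask Git for it from within R. The helper below is only a sketch: `submodule_commit` is a hypothetical name, and it assumes that the `git` command is available on the machine rendering the document.

```r
# Hypothetical helper: returns the commit hash currently checked out in the
# Git repository at `path`, or NA if it cannot be determined.
submodule_commit <- function(path) {
  out <- tryCatch(
    suppressWarnings(
      system2("git", c("-C", shQuote(path), "rev-parse", "HEAD"),
              stdout = TRUE, stderr = FALSE)
    ),
    error = function(e) character(0)  # e.g. git could not be started
  )
  status <- attr(out, "status")       # set by system2() on non-zero exit
  if (length(out) != 1 || (!is.null(status) && status != 0)) {
    return(NA_character_)
  }
  as.character(out)
}
```

Calling `submodule_commit('<name of submodule>')` in a code chunk or inline expression would then print the hash next to your results.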

Data That Cannot Be Shared

The disadvantage of the method described above is that if a third party performs a recursive clone of the article repository, the data submodule will be included. To get around this, a data package kept locally, for example on an encrypted drive, can be referenced from the R Markdown document. If we assume that we have an encrypted disk image called Data mounted, and that its root directory is the root directory of an R data package, we can load it by adding the following code to our R Markdown document:

library(devtools)
load_all('/Volumes/Data')

The advantage of this method is that the R Markdown document can be shared. The disk image will need to be mounted when working with the document, but the data is kept locally and could also be shared on physical drives or in encrypted form with other authors.
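
Because the document will fail to knit when the disk image is not mounted, it can help to check for the package first and stop with a clear message. The following is a minimal sketch; the function names `package_root_available` and `load_private_data` are invented for this example, and the check simply looks for a DESCRIPTION file at the given path.

```r
# Hypothetical helper: TRUE if `path` looks like the root of an R package.
package_root_available <- function(path) {
  file.exists(file.path(path, "DESCRIPTION"))
}

# Hypothetical wrapper around load_all() with a more helpful error message.
load_private_data <- function(path = "/Volumes/Data") {
  if (!package_root_available(path)) {
    stop("No data package found at '", path,
         "'. Is the disk image mounted?", call. = FALSE)
  }
  devtools::load_all(path)
}
```

Calling `load_private_data()` at the top of the document would then replace the two lines above, assuming the same mount point.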

When working with potentially sensitive data in this manner, it is important to keep any information that should not be publicly available out of the R Markdown document itself, for example by dropping columns with personally identifiable data before displaying output in the document. Be careful!
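
Dropping identifying columns could look something like the sketch below. The data frame and its column names are entirely made up for illustration; which columns count as identifying depends on your own data.

```r
# Made-up example data; in practice this would come from the data package.
patients <- data.frame(
  name = c("A. Example", "B. Example"),
  personal_id = c("ID-0001", "ID-0002"),
  age = c(54, 61),
  systolic_bp = c(132, 145)
)

# Columns that must never appear in the rendered document (an assumption).
identifying_columns <- c("name", "personal_id")

# Keep only the columns that are not listed as identifying.
deidentified <- patients[, !(names(patients) %in% identifying_columns),
                         drop = FALSE]

# Only display output derived from the de-identified copy.
summary(deidentified)
```

Working from a de-identified copy for everything that is printed or plotted makes it harder to leak a column by accident.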