Data Management

This page will guide you through getting started with data and managing it effectively.

Planning for data

Data Management Plans (DMPs) are documents that outline how data will be collected, stored, and analyzed over the course of a research project. They are typically created in the early stages and may be required by funders and institutions. Typical elements include a description of the data, its format and metadata, and how the data will be stored, secured, and backed up.

Planning for your data management needs and activities will help ensure that:

  • You have adequate technological resources
  • Your data will be free from versioning errors and gaps in documentation
  • Your data is backed up and safe from sudden loss or corruption
  • You can meet legal and ethical requirements
  • You can share your finalised data publicly
  • Your data will remain accessible and comprehensible in the future

What do research funders expect?

Many funders expect you to prepare a data management plan when applying for a grant.

Formats: In planning a research project, it is important that you consider which file formats you will use to store your data.

Here are some common file formats for most data types (a short example of writing to an open format follows the list):

  • Textual data - XML, TXT, HTML, PDF
  • Tabular data - CSV
  • Databases - CSV, XML
  • Images - TIFF, PNG, JPEG
  • Audio - FLAC, WAV, MP3
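
For instance, here is a minimal Python sketch (using pandas; the file name and values are made up for illustration) showing how tabular data can be written to and read back from CSV, an open format that stays readable without proprietary software:

```python
import pandas as pd

# A small example table (made-up values for illustration)
df = pd.DataFrame({"site": ["A", "B"], "pm25": [8.1, 12.4]})

# CSV is plain text, so the data remains readable without special software
df.to_csv("air_quality_2024.csv", index=False)

# Reading it back requires nothing beyond a standard CSV reader
df_check = pd.read_csv("air_quality_2024.csv")
print(df_check)
```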

Intellectual Property Rights (IPR): Failure to clarify rights at the start of a research project can lead to unexpected limitations on your research, its dissemination, future related projects, and the attribution of credit.

Questions to help you start with the DMP

  1. Project, experiment, and data description
  • What’s the purpose of the research?
  • What is the data? How and in what format will the data be collected? Is it numerical data, image data, text sequences, or modeling data?
  • How much data will be generated for this research?
  • Are you using data that someone else produced?
  2. Documentation, organization, and storage
  • What documentation will you be creating in order to make the data understandable by other researchers?
  • Are you using a file format that is standard to your field?
  • What directory and file naming convention will be used?
  • What tools or software are required to read or view the data?
  3. Access, sharing, and re-use
  • Who has the right to manage this data? Is it the responsibility of the PI, student, lab, institution, or funding agency?
  • What data will be shared, when, and how?
  • Does sharing the data raise privacy, ethical, or confidentiality concerns? Do you have a plan to protect or anonymize data, if needed?
  • Who holds intellectual property rights for the data and other information created by the project? Will any copyrighted or licensed material be used? Do you have permission to use/disseminate this material?
  • Are there any licensing-related restrictions on data sharing associated with this grant?
  • Will this research be published in a journal that requires the underlying data to accompany articles?
  • Will you permit re-use, redistribution, or the creation of new tools, services, data sets, or products?
  4. Archiving
  • How will you be archiving the data? Will you be storing it in an archive or repository for long-term access?
  • How will you prepare data for preservation or data sharing?
  • Are software or tools needed to use the data? Will these be archived?
  • How long should the data be retained? 1 year, 3-5 years, 10 years, or forever?

Finding data

Finding the right data for your research project is easiest when you have a plan.

  1. Define your needs

Before you start searching, clearly state what you are looking for; your data management plan (DMP) is a good starting point.

  2. Identify potential data sources. Think about who might collect the information you need:
  • Government agencies: often collect data on demographics, economics, health, and more.
  • Nonprofit organizations: may collect data on specific issues and populations that they work with
  • Private groups: might collect data relevant to their business/field
  • Academic researchers: often share research datasets collected for their studies and publications
  3. Start your search

There are many places to search for data:

  • Data search engines, archives, and databases: run a simple search to check for available data
  • Research data repositories: These platforms curate collections of datasets from various sources (e.g., ICPSR, data.gov, re3data.org)
  • Agencies: if the data was collected by an agency, it may be available on the agency's website
  • Publications: look for articles and reports related to your project

Organising data

Best Practices for Organization and File Names

Organize your data by developing a directory structure for your files.

File names should be unique, descriptive, and applied consistently. Avoid special characters, spaces, and periods in file names.

Avoid relying on the directory structure to describe file contents; each file should be understandable from its file name alone (see the sketch below).
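
As an illustration, a minimal Python sketch that checks file names against one possible convention (the pattern and example names are assumptions for illustration, not a lab standard):

```python
import re

# One possible convention: project_description_YYYYMMDD_vNN.ext
# (lowercase, underscores instead of spaces, no special characters,
#  and no periods other than the one before the extension)
PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*_\d{8}_v\d{2}\.[a-z0-9]+$")

def is_valid_filename(name: str) -> bool:
    """Return True if the file name follows the convention above."""
    return bool(PATTERN.match(name))

print(is_valid_filename("sparklab_pm25_20240115_v01.csv"))  # True
print(is_valid_filename("Final Data (new).v2.csv"))         # False: spaces, parentheses, extra periods
```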

See here for an example from the lab

Cleaning data

What is data cleaning?

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If the data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. In short, data cleaning removes data that does not belong in your dataset.

Before you begin cleaning, keep a copy of the raw dataset. If you make an error during the cleaning stage, you can always go back to the original without losing important information.

Common issues that cleaning addresses:

  • Mismatched/Incomplete metadata
  • Inconsistent formatting
  • Bias

How to clean data

While the techniques used for data cleaning may vary according to the types of data, you can follow these basic steps.

Step 1: Remove duplicate or irrelevant observations

When you combine datasets from multiple places, scrape data, or receive data from clients or multiple departments, there are opportunities to create duplicates. De-duplication is one of the largest areas to consider in this process: keeping duplicates can cause irregular and false results. Irrelevant observations are records that do not fit the specific problem you are trying to analyze; remove them when you find them.
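
A minimal pandas sketch of this step (the columns and the notion of "enrolled" IDs are hypothetical):

```python
import pandas as pd

# Hypothetical combined dataset containing one exact duplicate row
df = pd.DataFrame({
    "id":    [1, 2, 2, 3],
    "value": [10, 20, 20, 15],
})

# Count exact duplicate rows before removing them
print(df.duplicated().sum(), "duplicate row(s) found")

# Keep the first occurrence of each duplicated row
df = df.drop_duplicates(keep="first").reset_index(drop=True)

# Irrelevant observations: filter out records outside the analysis scope,
# e.g. keep only IDs enrolled in the (hypothetical) study
enrolled = {1, 2}
df = df[df["id"].isin(enrolled)]
```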

Step 2: Fix structural errors

Structural errors arise when you measure or transfer data and end up with strange naming conventions, typos, or incorrect capitalization. These inconsistencies can produce mislabeled categories or classes.
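
A minimal sketch of fixing such inconsistencies in pandas (the column and alias mapping are made up):

```python
import pandas as pd

# Hypothetical column with several spellings of the same category
df = pd.DataFrame({"borough": ["Bronx", "bronx ", "BRONX", "Staten Is.", "Staten Island"]})

# Normalize whitespace and case so trivial variants collapse to one label
df["borough"] = df["borough"].str.strip().str.lower()

# Map known aliases or typos to a canonical spelling
aliases = {"staten is.": "staten island"}
df["borough"] = df["borough"].replace(aliases)

print(df["borough"].unique())  # ['bronx' 'staten island']
```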

Step 3: Filter unwanted outliers

Often, a few observations will appear not to fit the data you are analyzing. If you have a legitimate reason to remove an outlier, such as improper data entry, doing so will improve the quality of your analysis. Be conservative, though: removing legitimate outliers can bias the data. Remember, just because an outlier exists doesn’t mean it is incorrect.
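
One common approach is to flag, rather than automatically delete, values beyond 1.5 times the interquartile range; a minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical measurements with one suspicious value
s = pd.Series([8.2, 9.1, 7.8, 8.5, 9.4, 812.0])

# Flag values beyond 1.5 * IQR from the quartiles -- flag first, decide later
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

print(s[outliers])  # inspect flagged values before removing anything

# Remove only with a documented reason (e.g. a known data-entry error)
s_clean = s[~outliers]
```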

Step 4: Handle missing data

You can’t ignore missing data, because many algorithms will not accept missing values. There are a few ways to deal with this. First, you can drop the observations that have missing values. Second, you can impute the missing values based on other observations, though this weakens integrity because you are working from assumptions rather than observations. Third, you can record the values as explicitly null and adapt your analysis code to handle them. None of these options is ideal, but all are worth considering.
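
A minimal pandas sketch of the three options (column names and values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps
df = pd.DataFrame({"temp": [21.0, np.nan, 22.5, np.nan, 23.1]})

# Option 1: drop observations with missing values
dropped = df.dropna()

# Option 2: impute from other observations (here, the column median);
# note this builds an assumption into the data
imputed = df.fillna({"temp": df["temp"].median()})

# Option 3: keep explicit nulls and record where they are, so downstream
# code can handle them deliberately
df["temp_missing"] = df["temp"].isna()
```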

Step 5: Validate

At the end of the data cleaning process, you should be able to answer these questions as part of basic validation (a short validation sketch follows the list):

  • Does the data make sense?
  • Does the data follow the appropriate rules for its field?
  • Does it prove or disprove your working theory, or bring any insight to light?
  • Can you find trends in the data to help you form your next theory?
  • If not, is that because of a data quality issue?
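
A minimal sketch of rule-based validation in Python (the rules and columns are illustrative; yours will depend on your field):

```python
import pandas as pd

# Hypothetical cleaned dataset
df = pd.DataFrame({"age": [34, 29, 41], "pm25": [8.1, 12.4, 9.7]})

# Encode the field's rules as explicit checks
assert df["age"].between(0, 120).all(), "age outside plausible range"
assert (df["pm25"] >= 0).all(), "negative concentration found"
assert not df.duplicated().any(), "duplicates survived cleaning"
assert df.notna().all().all(), "unexpected missing values"

print("basic validation passed")
```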

Incorrect or “dirty” data leads to false conclusions, poor findings, and a loss of integrity.

Here is a workflow from Guo M, Wang Y, Yang Q, Li R, Zhao Y, Li C, Zhu M, Cui Y, Jiang X, Sheng S, Li Q, Gao R. Normal Workflow and Key Strategies for Data Cleaning Toward Real-World Data: Viewpoint. Interact J Med Res. 2023 Sep 21;12:e44310. doi: 10.2196/44310. PMID: 37733421; PMCID: PMC10557005. (I recommend their guide; Figure 3 of the paper diagrams the workflow.)

Archiving and sharing data

Accessing your data

Most data can be accessed remotely through shared repositories. For private data, you will usually be granted access after approval, and the data will then be transferred to you directly or delivered on a USB drive.

Storage

Choosing the right way to store your data can help you work more flexibly, easily, and quickly. You may be required by your PI or funder to store your data in a particular place, or you may have more choices available.

Sharing

College Virtual Private Network: A VPN will usually allow you to access files securely, save new files and versions, and remotely reach any folder you can access on-site. You can also share data through repositories, secure email, or other file-transfer services.

Selecting a data repository

Choosing the type of repository that best suits your data ultimately depends on your research area, the sensitivity of the data, and the features of the repository. Make sure to secure your repository if the project involves confidential data.

Elements of a data citation

A citation for a data set is very similar to that for a research publication. The basic elements are: author(s), date of publication, title, publisher and/or distributor, and persistent identifier (such as a DOI).
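
For example, a generic template with placeholder values (not a real dataset):

Author, A. (2024). Title of dataset (Version 1.0) [Data set]. Publisher. https://doi.org/10.xxxx/xxxxx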

Data ethics

Human subjects in research

U.S.-based research: The U.S. Department of Health and Human Services (DHHS) sets the requirements for what is considered Human Subjects Research, and has codified these requirements in Title 45, Subtitle A, Subchapter A, Part 46 of the Code of Federal Regulations (45 CFR 46), more commonly known as the Common Rule. The Common Rule is heavily influenced by the 1979 Belmont Report, published by the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. If you are working with human subjects, you will become very familiar with the ethical considerations in both of these documents.

In order to share human subjects data with the broader research community (often referred to as data publishing), you must address data sharing in your IRB form. If you do not, you will likely need to seek re-consent from your participants as well as amend your IRB protocol, which isn't always easy and is further complicated if your protocol is closed.

Some sample consent language: If you choose to be in this study, data collected from you and all the other people who take part may be stored long-term in a repository following the completion of the study. Any personal information that could identify you will be removed or changed before files are shared with other researchers or results are made public. The removal of this information allows your data to be used without anyone knowing which person in the study it comes from.

Confidentiality

Removing identifiers from human subjects data is essential prior to publishing. There are both direct and indirect identifiers, and some require explicit removal or sufficient masking in order to be released. The Health Insurance Portability and Accountability Act (HIPAA) lists 18 identifiers; for data to be considered fully de-identified, all 18 must be removed or expertly masked. Even when data does not come from a health record, it still needs to be protected to the same ethical standards to reduce deductive disclosure risk. A de-identification sketch follows.
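
A minimal pandas sketch of the idea (the columns are hypothetical, and the identifier list here is illustrative, not the full set of 18 HIPAA identifiers):

```python
import pandas as pd

# Hypothetical dataset mixing identifiers with study variables
df = pd.DataFrame({
    "name": ["A. Smith"],
    "mrn":  ["12345"],   # medical record number: a direct identifier
    "zip":  ["10032"],   # geographic codes can be indirect identifiers
    "age":  [47],
    "pm25": [9.3],
})

# Remove direct identifiers outright
direct_identifiers = ["name", "mrn"]
deidentified = df.drop(columns=direct_identifiers)

# Indirect identifiers may need masking rather than removal,
# e.g. coarsening ZIP codes to their first three digits
deidentified["zip"] = deidentified["zip"].str[:3]
```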

Data Management Checklist

1. Plan
Begin planning for data management when your research project starts: draft a data management plan and create a repository to store data and documentation.

2. Organize
Name and organize data according to their purpose and contents. Store them logically in your repository.

3. Document
Document information: data, metadata, variables, and contextual information. This will help others understand and interpret your data and its usage within your research project.

4. Store and secure
Keep your original data files safe, know your data's risks and secure them, back up your data in multiple locations, and practice version control.

5. Share
When appropriate, share your data.