Data Curation Workflow - psu-libraries/library_data_services GitHub Wiki
The general workflow is straightforward. I think of it as six stages. That said, the first two stages vary the most: depending on the data they can be fairly complicated, so it is sometimes necessary to be adaptive and creative there.
1. Ingest
There are three primary ways we get content into ScholarSphere: through the User Interface, facilitated through Box, and via manual transfer.
Ingest via the User Interface
This is the most common and preferred method. Users can either upload files from their local disks or use a cloud service such as Box; Box has slightly larger upload limits. Currently, uploads will sometimes fail silently. ScholarSphere generates a weekly report of all uploads that you can use to see what has been uploaded.
Facilitated Ingest
Often users will ask for help uploading files, either because they had problems with the UI or because they simply feel they need help. The easiest approach is to ask them to upload the content to Box and give you access. Then you can look at the contents, file sizes, and number of files. Sometimes some reorganization will help the upload process; sometimes it takes a couple of tries. Facilitated ingests are initiated either when you notice that uploads have failed or when the user contacts you for help.
Manual Transfer
For very large datasets, with many files or very large files, the easiest thing to do is to transfer the files manually. With the recent migration from storing binaries in Fedora to storing them on a file system, this should be even easier. The details vary with each instance, depending on what the content is, where the user has it stored, etc. Sometimes the files exist on a server where they can be downloaded; sometimes a hard disk needs to be walked around. Manual transfers are initiated either when you notice that uploads have failed or when the user contacts you for help.
2. Verification
Verify that the files were uploaded successfully. For very large datasets you might verify only a subset of files unless you find failures. A failed upload will still get a record and in many ways appear successful; however, the file size will be zero, and it will usually not have file characterization data associated with it.
NOTE: Automating this would be a good thing.
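As a starting point for automating that check, here is a minimal sketch. It assumes the uploaded files have been mirrored to a local directory you can walk; the function name and the zero-byte heuristic are illustrative, not part of ScholarSphere's API.

```python
from pathlib import Path

def find_suspect_uploads(root):
    """Walk a directory of uploaded files and flag likely failed uploads.

    A failed upload typically still gets a record but has a file size
    of zero, so zero-byte files are the simplest thing to check for.
    """
    suspects = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_size == 0:
            suspects.append(path)
    return suspects
```

For very large datasets you could run this over a subsample of the tree instead of the whole thing, mirroring the manual approach above.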
You can also do a quick scan of the quality of the data with regard to its organization and documentation. Note any unusual file formats, or formats that will require attention (e.g., Excel files that need conversion to CSV). Also note whether a README is uploaded with the content, whether there are data dictionaries, and whether any analysis code is included.
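Parts of that quick scan can be scripted as well. A minimal sketch follows; the set of formats flagged as needing attention is an illustrative assumption on my part, not a fixed policy.

```python
from collections import Counter
from pathlib import Path

# Formats that typically need attention during curation (illustrative list)
NEEDS_ATTENTION = {".xlsx", ".xls", ".mat", ".sav"}

def quick_scan(root):
    """Summarize file formats and note whether a README is present."""
    paths = [p for p in Path(root).rglob("*") if p.is_file()]
    extensions = Counter(p.suffix.lower() for p in paths)
    return {
        "extensions": dict(extensions),
        "needs_attention": sorted(e for e in extensions if e in NEEDS_ATTENTION),
        "has_readme": any(p.name.lower().startswith("readme") for p in paths),
    }
```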
3. Triage and Planning
Datasets can be categorized into three groups:
- Datasets that require little or no attention; the user has done an adequate job organizing and describing their data.
- Datasets where the user has done a reasonable job organizing and curating their data, but that still require significant curation.
- Datasets that are nearly beyond hope due to lack of organization, use of very obscure file types, etc.
Group 2 is by far the most common and also where you should spend most of your effort.
It is often useful to open a GitHub ticket and start making a plan there, noting what changes are recommended for the dataset. Also create the README, data dictionary, and any other additional files at this point.
Do some research on the depositor's work. Look at their faculty page (if they are faculty) or their lab page if they are a graduate student. Get a sense of what the overall goal of their research is. Sometimes it helps to scan a related paper or two as well.
4. Curation
This is where most of the work lies, and it is also the most variable stage. The most common tasks are listed below; however, each dataset can have unique issues or needs, and part of the fun and challenge of data curation rests in those areas. You get the best response rate and the most cooperation when you can do a lot of the work for the user. This includes writing as much of the README and data dictionaries as possible, as well as doing any file conversions. While this might seem like a lot of work, users often incorporate what they have learned into subsequent submissions. Failure to respond to curation requests is likely a result of both lack of time on the part of the researcher and, more importantly, lack of knowledge and skill about good data practices. Providing a strong start for the researcher to work with overcomes both challenges.
README
If a README has not been supplied, write one for the researcher. A template can be found here, along with a couple of other templates. The goal of the README is to provide a good overview of the data that will give another user enough information to confidently reuse it. While a README is often written last in a project, best practice suggests that the researcher should write it first and update it often; this gives their team a good way to understand what is going on in the project. Using this logic will often strike home with researchers.
Write as much of the README as you can. Clearly denote areas where the researcher should fill in missing information.
Data Dictionaries
Most data will have some sort of tabular component. All tabular data (CSV, Excel, etc.) should have a data dictionary. At minimum it should consist of the column headings with a description of what each column contains, the units, and often notes about how the data was collected. Other data that can benefit from a data dictionary include hierarchical data (JSON, XML, HDF5) and complex file structures.
Additionally, you can sometimes take this opportunity to rename columns. Column names should be short and descriptive and use complete words. Well-named columns help others understand the data and alleviate the need to constantly refer to the data dictionary.
Column names should not contain spaces or special characters, and they should not start with a number. Snake case is also preferred over camel case. Following these rules maximizes portability between different statistical and programming languages.
NOTE: Renaming the column names can prevent analysis scripts from running. Either edit the analysis scripts or don't change the names in this case.
```
## Good
this_says_variable   # Snake case is readable and should work in every programming language
variable_1

## Acceptable
ThisSaysVariable     # Will need to be downcased for many programming languages
thisSaysVariable     # Slightly less readable

## Bad
1var                 # Starts with a number
t                    # Not descriptive
var(1)               # Contains special characters
```
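If you do rename columns, a small helper can keep the conversions consistent across a dataset. Here is a sketch; the `col_` prefix for names that start with a digit is my own convention, not a standard.

```python
import re

def to_snake_case(name):
    """Convert a column name to portable snake_case."""
    # Replace spaces and special characters with underscores
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    # Insert underscores at camelCase boundaries: "ThisSays" -> "This_Says"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    name = name.lower().strip("_")
    # Column names should not start with a number
    if name and name[0].isdigit():
        name = "col_" + name
    return name
```

For example, `to_snake_case("ThisSaysVariable")` yields `this_says_variable`, and `var(1)` becomes `var_1`.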
Verify the Files
Open all the files (or, for very large datasets, a subsample) and verify that they can be opened. Document in the README what software can be used to open the files, including version numbers.
NOTE: We can also put copies of some software into ScholarSphere if that makes sense to do. Just link to that software from the README.
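Opening every file by hand does not scale, so part of this check can be scripted for common text formats. A sketch that only covers CSV and JSON follows; binary formats still need to be opened by hand in the appropriate software.

```python
import csv
import json
from pathlib import Path

def verify_file(path):
    """Return (ok, message) after attempting to read a file.

    CSV and JSON files are actually parsed; anything else just gets a
    non-empty check, which catches the zero-byte failed uploads.
    """
    path = Path(path)
    try:
        if path.suffix.lower() == ".csv":
            with path.open(newline="") as f:
                rows = list(csv.reader(f))
            return (len(rows) > 0, f"{len(rows)} rows")
        if path.suffix.lower() == ".json":
            with path.open() as f:
                json.load(f)
            return (True, "valid JSON")
        return (path.stat().st_size > 0, "non-empty file")
    except Exception as exc:  # parse errors, encoding problems, etc.
        return (False, str(exc))
```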
Rerun Analyses
Most of the time, users will not submit their analyses with their data. Funding agencies and publishers are starting to require this more often, so we will probably see it more and more. It is also something researchers don't always like to do. If they don't submit analysis code, it is good to ask for it (but be prepared to be refused).
If you have access to the analysis code, try to rerun it. Document any dependencies or non-obvious steps needed to run it. Ideally there should be one master script (often named main.xx) that runs the entire analysis and loads any dependencies; in practice, neither is common. Sometimes it is possible to rewrite the code enough to get there, though. I also read through the code and add comments where appropriate, and sometimes some rewriting and reorganization can be done too.
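When you do rewrite or reorganize, the target shape is a single entry point that runs every step in order. A minimal sketch of what such a master script might look like is below; the step names and their contents are placeholders for whatever the analysis actually does.

```python
def load_data():
    """Step 1: read the raw inputs (placeholder)."""
    return [1, 2, 3]

def clean_data(raw):
    """Step 2: tidy the raw inputs (placeholder)."""
    return [x * 2 for x in raw]

def analyze(clean):
    """Step 3: produce the final result (placeholder)."""
    return sum(clean)

def main():
    """Run the entire analysis, start to finish, in one call."""
    raw = load_data()
    clean = clean_data(raw)
    return analyze(clean)

if __name__ == "__main__":
    print(main())
```

The point of the pattern is that another user can reproduce the whole analysis with one command, without guessing at the order of the steps.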
DOI
This is typically done automatically. However, ensure that the user has obtained one where appropriate (there is almost never a good reason not to get one).
Other Things
As noted earlier, not all contingencies can be accounted for. Look for things that should be documented or changed to make the data more understandable, reproducible and reusable.
5. Request Information and Changes
Write a friendly email to the depositor. The following format seems to work reasonably well and gets good response rates:
- Thank the user for their deposit.
- Compliment them. At the very least they have funding for their research so there is something interesting about it. If they have done things that are good, point that out.
- State that ScholarSphere currently reviews and curates data submissions. This both helps meet funder and publisher requirements and should increase the impact of their work.
- List 3 - 5 (NO MORE THAN 5!!) suggested improvements. If you are sending back a README and data dictionaries, point out that you have written as much as you can, but that their input would help a lot. Also make it clear that you can do the file transformations for them. If you have made naming or organizational changes, propose them for the depositor's approval (phrases like "With your permission, I can make the following changes for you" or "With your permission, I will substitute these modified files for you" work well).
- Thank them again, assure them that nothing about their work is abnormal (i.e., it's not bad), and encourage them to contact you with any questions.
The overall tenor should be positive and helpful. Often depositors will respond favorably; sometimes they will push back on some changes. That is OK: this is their work, and we are just facilitating improvements.
6. Verification and Finalization
If you get a favorable response and the depositor agrees to and makes the changes, verify that the dataset is as complete as you can get it. I don't like too much back-and-forth; researchers have typically moved on to other things at this point and have a limited attention span for this. Once you have gotten the dataset as good as it can be, close the GitHub ticket (if you opened one) with any final notes. DO NOT put negative comments in the GitHub ticket; tickets are open and searchable.