Considerations for Curating Metadata in Public Datasets - tbcgit/omdctk GitHub Wiki

When curating public datasets, we may encounter duplicates or separated parts of the same sample, so that the number of samples does not match the expected number. Exploring the information associated with the publication, and reviewing the names and content of the original uploaded fastq files, is critical to understanding what is going on. This is not a trivial issue, since most people want to include each sample only once and as a whole. It can take different forms and occur for a variety of reasons.

Considerations within study projects

Multiple files belonging to the same sample accession:
- Multiple sequencing technologies for the same sample. In this case, we are talking about samples that come from the same source material and have been processed to obtain the same type of data, but have been sequenced with different technologies. For instance, we could find a sample with a single data type (AMPLICON + METAGENOMICS) but two run accessions, one sequenced with Illumina MiSeq and the other with Oxford Nanopore GridION. This is not a cause for concern, but it should be taken into account so that you keep only the type of data you want to work with.
- Duplicated files. In some rare cases, we may find files associated with the same sample under different run accessions that are in fact exactly the same file. The only way to check this is to examine the original uploaded files. This is a real concern and will be a problem if it is not fixed. Clarification may even be needed from the authors of the study publication.
- Multiple data types for the same sample. In this case, we are talking about samples that come from the same source material, but have been processed to obtain different data types. For example, we could find a sample with two run accessions, one for 16S data (AMPLICON + METAGENOMICS) and the other for whole genome data (WGS + METAGENOMICS). This is not a cause for concern, but it should be taken into account so that you keep only the type of data you want to work with.
- Multiple sequencer outputs from the same sample. In some cases, the sequencer generated several fastq files from the same sample (lanes or runs). Common practice is to merge these files to reconstruct the whole sample. It is important to check the original uploaded files to verify this, as well as the read count tables of the publication (if available). In principle, this is not a matter of concern as long as it can be confirmed, but it will involve additional work. Clarification may even be needed from the authors of the study publication. The optional programs Make Treatment Template ENA, Treat Metadata ENA and Treat Fastqs have been specially designed to deal with this type of issue.
- Technical replicates. In this case, we are talking about samples that come from the same source material and have been processed to obtain the same type of data with the same sequencing technology. It is important to check the original uploaded files to verify this, as well as the information available in the publication. In principle, this is not a matter of concern as long as it can be confirmed, but it will involve additional work. Clarification may even be needed from the authors of the study publication.
- Multiple post quality control files. Several files resulting from quality control may be associated with the same sample, including surviving and orphan sequences. In many cases, the uploaded files are the post quality control files instead of the raw sequence files, and some people upload all files generated in this process. It is important to check the original uploaded files to verify this, as well as the read count tables of the publication (if available). Keep in mind that orphan sequence files tend to have a much lower number of reads than the main files. In principle, this is not a cause for concern as long as it can be confirmed, but it will involve additional work. Clarification may even be needed from the authors of the publication.
- PAIRED files uploaded as SINGLE files. In some rare cases, we may find that the two PAIRED fastq files of a sample were submitted under different run accessions. It is important to check the original uploaded files to verify this, as well as the information available in the publication. This can be a problem. If the original submitted fastq files are available, both files can be used; if they are not, it is advisable to use only R1, which usually has better quality than R2. Clarification may even be needed from the authors of the study publication.
- Fewer sample accessions than expected. On other occasions, the study publication may report more samples than there are sample accessions in the project. This could indicate that the authors uploaded the samples in an unconventional way or that some files are missing. For instance, they may have used the sample accession to identify the individual, so that each associated run accession in fact corresponds to a different sample from the same individual. It is important to check the original uploaded files and the metadata columns of interest (such as sample_alias) to verify this, as well as the information available in the publication. These cases could be a problem. Clarification may even be needed from the authors of the study publication.
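
As a rough first pass over the cases above, the runs of a project can be grouped by sample accession and compared field by field. The following is only an illustrative sketch, not how the toolkit's own programs work. It assumes one metadata row per run accession with the columns `sample_accession`, `run_accession`, `instrument_platform`, `library_strategy` and `fastq_md5`; the function name `flag_multi_run_samples` and the accession values are hypothetical. The hints it returns are starting points for manual review, not conclusions.

```python
from collections import defaultdict


def flag_multi_run_samples(rows):
    """Return {sample_accession: [hints]} for samples with more than one run.

    `rows` is an iterable of dicts, one per run accession. The column names
    used here are assumptions about the metadata table.
    """
    runs_by_sample = defaultdict(list)
    for row in rows:
        runs_by_sample[row["sample_accession"]].append(row)

    flagged = {}
    for sample, runs in runs_by_sample.items():
        if len(runs) < 2:
            continue  # one run per sample is the expected situation
        hints = []
        if len({r["instrument_platform"] for r in runs}) > 1:
            hints.append("multiple sequencing technologies")
        if len({r["library_strategy"] for r in runs}) > 1:
            hints.append("multiple data types")
        checksums = [r["fastq_md5"] for r in runs if r["fastq_md5"]]
        if len(set(checksums)) < len(checksums):
            hints.append("duplicated files")
        if not hints:
            # Same technology and data type, different content: lanes,
            # technical replicates, QC outputs or split pairs. Only the
            # uploaded files and the publication can tell these apart.
            hints.append("check files and publication")
        flagged[sample] = hints
    return flagged


# Placeholder accessions and checksums, for illustration only.
rows = [
    {"sample_accession": "SAMPLE_A", "run_accession": "RUN_1",
     "instrument_platform": "ILLUMINA", "library_strategy": "AMPLICON", "fastq_md5": "aa"},
    {"sample_accession": "SAMPLE_A", "run_accession": "RUN_2",
     "instrument_platform": "OXFORD_NANOPORE", "library_strategy": "AMPLICON", "fastq_md5": "bb"},
    {"sample_accession": "SAMPLE_B", "run_accession": "RUN_3",
     "instrument_platform": "ILLUMINA", "library_strategy": "AMPLICON", "fastq_md5": "cc"},
    {"sample_accession": "SAMPLE_B", "run_accession": "RUN_4",
     "instrument_platform": "ILLUMINA", "library_strategy": "AMPLICON", "fastq_md5": "cc"},
    {"sample_accession": "SAMPLE_C", "run_accession": "RUN_5",
     "instrument_platform": "ILLUMINA", "library_strategy": "AMPLICON", "fastq_md5": "dd"},
    {"sample_accession": "SAMPLE_D", "run_accession": "RUN_6",
     "instrument_platform": "ILLUMINA", "library_strategy": "AMPLICON", "fastq_md5": "ee"},
    {"sample_accession": "SAMPLE_D", "run_accession": "RUN_7",
     "instrument_platform": "ILLUMINA", "library_strategy": "AMPLICON", "fastq_md5": "ff"},
]
flagged = flag_multi_run_samples(rows)
for sample, hints in flagged.items():
    print(sample, hints)
```

Samples that end up with the generic hint are the ones that most need a look at the original uploaded files, since metadata alone cannot separate lanes from technical replicates or quality control outputs.
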
Multiple files belonging to different sample accessions but related to the same sample. In these cases, we find more sample accessions than expected. They are more difficult to identify and must be checked manually, through a thorough examination of the corresponding metadata and the study publication. Metadata columns of interest (such as sample_alias) and the original uploaded files are key to identifying these cases. This can happen for several of the reasons described above:
- Multiple sequencing technologies for the same sample
- Duplicated files
- Multiple data types for the same sample
- Multiple sequencer outputs from the same sample
- Technical replicates
- Multiple post quality control files
- PAIRED files uploaded as SINGLE files
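
One possible way to build a shortlist for this manual check is to look for different sample accessions that share the same `sample_alias`. The sketch below is a minimal illustration under assumptions: the rows carry `sample_accession` and `sample_alias` columns, aliases only differ by case or surrounding whitespace, and the function name `accessions_sharing_alias` is hypothetical. Real projects often encode the relationship differently (for example with lane or replicate suffixes), so an empty result does not mean there are no related accessions.

```python
from collections import defaultdict


def accessions_sharing_alias(rows):
    """Map each normalized sample_alias to its sample accessions, keeping
    only aliases that appear under more than one accession."""
    accessions_by_alias = defaultdict(set)
    for row in rows:
        # Assumed normalization: ignore case and surrounding whitespace.
        alias = row["sample_alias"].strip().lower()
        if alias:
            accessions_by_alias[alias].add(row["sample_accession"])
    return {
        alias: sorted(accessions)
        for alias, accessions in accessions_by_alias.items()
        if len(accessions) > 1
    }


# Placeholder accessions and aliases, for illustration only.
rows = [
    {"sample_accession": "SAMPLE_A", "sample_alias": "patient01_stool"},
    {"sample_accession": "SAMPLE_B", "sample_alias": "Patient01_stool "},
    {"sample_accession": "SAMPLE_C", "sample_alias": "patient02_stool"},
]
shared = accessions_sharing_alias(rows)
print(shared)
```
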

The optional program Check Metadata ENA has been specially designed to help the user to identify the above cases.

Considerations between study projects

Duplicated files between different study projects. We could also find duplicated samples between related study projects. This happens when authors generate new projects by re-uploading samples from previous projects. These cases must be treated with extreme caution and need to be checked manually. Here, identification is much more difficult and requires a thorough examination of the corresponding metadata and the associated study publications. Metadata columns of interest (such as sample_alias) and the original uploaded files are key to identifying these cases.
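
When the original uploaded files of the related projects have been downloaded, comparing checksums is one simple way to spot exact re-uploads. The following sketch is an assumption-laden illustration rather than part of the toolkit: `file_md5` and `duplicates_between_projects` are hypothetical helper names, and the input is assumed to be a mapping from a project identifier to local file paths. Note that a matching checksum confirms identical files, but a different checksum does not rule out identical reads, because recompressing or renaming reads changes the bytes.

```python
import hashlib
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large fastq files are never loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicates_between_projects(files_by_project):
    """Return {md5: [(project, path), ...]} for files seen in more than one project.

    `files_by_project` maps a project identifier to a list of local paths.
    """
    seen = defaultdict(list)
    for project, paths in files_by_project.items():
        for path in paths:
            seen[file_md5(path)].append((project, path))
    return {
        md5: hits
        for md5, hits in seen.items()
        if len({project for project, _ in hits}) > 1
    }
```

A call such as `duplicates_between_projects({"PROJECT_1": paths_1, "PROJECT_2": paths_2})`, with the two path lists filled in by the user, returns only the checksums shared across projects; duplicates inside a single project are deliberately left out, since they belong to the within-project checks.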