Adding new datasets - borenstein-lab/microbiome-metabolome-curated-data GitHub Wiki

Contributions of new paired microbiome-metabolome datasets (from human fecal samples) to this collection are warmly welcomed. New datasets must be associated with published studies and must meet our inclusion criteria. Data files should follow the same naming conventions and file formats as the other datasets, as detailed below. New datasets are submitted through Git pull requests. We recommend following the steps below.

Please contact us if you have any questions or need assistance.

1. Fork the resource repository to your personal GitHub account to create your own copy. To do so, navigate to the main resource page, click the "Fork" button at the top right, and follow the steps in the window that opens.
2. Clone your fork to obtain a local copy (e.g., run git clone https://github.com/<YourUserName>/microbiome-metabolome-curated-data from Git Bash).
3. Create a new branch for your changes by running git checkout -b new_dataset_<dataset-name> upstream/master. This command assumes a remote named upstream that points to the original repository. A fresh clone of your fork only has origin, so first run git remote add upstream https://github.com/borenstein-lab/microbiome-metabolome-curated-data followed by git fetch upstream.
4. Create a folder within ./data/processed_data with the new dataset name, following the convention <FIRST-AUTHOR>_<COHORT-DESCRIPTION>_<PUBLICATION-YEAR>.
5. Process the metagenomics data using either the pipeline described in the metagenomics processing notes or a custom pipeline. If you use a custom pipeline, please use the GTDB database, version R06-RS202, for taxonomy assignment so that genus and species names are comparable with the other datasets. Save genus relative abundances in a tab-delimited file named genera.tsv and species relative abundances in species.tsv. List sample names in the first column, under the column name Sample. See here for an example of such a file.
6. Organize your processed metabolomics data into an mtb.tsv file (metabolite levels, identified or not), with the first column listing sample names as above. See example here. A second file, named mtb.map.tsv, should hold additional information for each metabolite present in mtb.tsv, as in the example here. We specifically recommend adding an "HMDB" column and a "KEGG" column with HMDB/KEGG IDs wherever possible. For untargeted metabolomics data, please include all additional information available per peak. Notes about how current metabolomic datasets were organized are available in the Metabolomics processing notes section.
7. Organize sample and subject metadata in a tab-delimited file named metadata.tsv. Include any available metadata, and in particular all fields listed here.
8. Place all text files you created in the dataset folder from step 4. Overall, the folder should include the following files: metadata.tsv, mtb.tsv, mtb.map.tsv, genera.tsv, and, optionally for shotgun datasets, species.tsv.
9. Add detailed information about the dataset, the associated publication, and the metagenomics and metabolomics processing to the "Supplementary Tables" Excel file located at ./docs/Supplementary Tables.xlsx.
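
Before committing, the folder layout described above can be sanity-checked from the command line. The following is a minimal sketch, not part of the repository: check_dataset_dir is a hypothetical helper name, and the folder-name pattern (at least three underscore-separated parts ending in a four-digit year) is only one assumed reading of the naming convention. It checks that the required files exist and that genera.tsv, species.tsv (if present), and mtb.tsv have Sample as their first column name. It does not inspect metadata.tsv or mtb.map.tsv beyond their presence.

```shell
#!/bin/sh
# Sketch only: check_dataset_dir is a hypothetical helper, not part of the repository.
# The function body runs in a subshell (parentheses), so its variables stay local
# and "exit" only sets the function's return status.
check_dataset_dir() (
  dir="${1%/}"
  name="$(basename "$dir")"
  rc=0

  # Assumed reading of <FIRST-AUTHOR>_<COHORT-DESCRIPTION>_<PUBLICATION-YEAR>:
  # at least three underscore-separated parts, ending in a four-digit year.
  if ! printf '%s\n' "$name" | grep -Eq '^[^_]+_.+_[0-9]{4}$'; then
    echo "folder name does not match the naming convention: $name"
    rc=1
  fi

  # Files required for every dataset (species.tsv is optional).
  for f in metadata.tsv mtb.tsv mtb.map.tsv genera.tsv; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing file: $f"
      rc=1
    fi
  done

  # Tables whose first column should be named Sample.
  for f in genera.tsv species.tsv mtb.tsv; do
    [ -f "$dir/$f" ] || continue
    # First tab-separated field of the header line, without any trailing CR.
    header="$(head -n 1 "$dir/$f" | cut -f 1 | tr -d '\r')"
    if [ "$header" != "Sample" ]; then
      echo "first column of $f is '$header', expected 'Sample'"
      rc=1
    fi
  done

  exit "$rc"
)
```

After sourcing this definition, calling check_dataset_dir ./data/processed_data/<dataset-name> prints one line per problem found and returns a non-zero status if any check fails. A header written with commas instead of tabs is also reported, because the whole header line is then read as the first field.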

Commit and push your changes by running the following commands from Git Bash:

git add .
git commit -m "Adding dataset <dataset-name>"
git push -u origin new_dataset_<dataset-name>

Lastly, create the pull request: navigate to the original repository's Pull Requests tab, where GitHub should automatically suggest creating a pull request from your new branch. Click on "Compare & pull request".