DataManipulation - BGIGPD/BestPractices4Pathogenomics GitHub Wiki
Rapid Introduction
Guideline
1 Given question: How has pathogenomics developed in Africa?
1.1 Investigation
https://www.cell.com/cell/current
Africa in the era of pathogen genomics: Unlocking data barriers
Genomic-informed pathogen surveillance in Africa: opportunities and challenges
Early transmission of SARS-CoV-2 in South Africa: An epidemiological and phylogenetic report
Data availability: The SARS-CoV-2 genome sequences generated in this study were deposited in the GISAID database (https://www.gisaid.org/) under the following accession IDs: EPI_ISL_421572, EPI_ISL_421573, EPI_ISL_421574, EPI_ISL_421575, EPI_ISL_421576, EPI_ISL_436684, EPI_ISL_436685, EPI_ISL_436686, EPI_ISL_436687. In addition, raw short and long reads were submitted to the Short Read Archive (SRA) and can be accessed under BioProject Accession: PRJNA636748.
1.2 Collect public data
Find PRJNA636748 in the SRA Run Selector.
Download the Metadata file and save it under the folder PRJNA636748.
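Once the metadata file is saved, it can be loaded into R for a first look. The sketch below is only an illustration: the file name SraRunTable.txt and the helper read_run_table() are assumptions (use the name your download actually has), and the file is assumed to be comma-separated.

```r
library(readr)

# Hypothetical helper: read the metadata table downloaded from the Run Selector.
# Assumes a comma-separated file with one row per run.
read_run_table <- function(path) {
  read_csv(path, show_col_types = FALSE)
}

# Assumed location and file name; adjust to match your download.
metadata_path <- file.path("PRJNA636748", "SraRunTable.txt")

if (file.exists(metadata_path)) {
  runs <- read_run_table(metadata_path)
  print(dim(runs))    # number of runs and number of metadata columns
  print(names(runs))  # check the real column names before relying on them
}
```

Printing the column names first is worthwhile, because the available metadata fields differ between projects.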
2 Data Manipulation via RStudio
Open the R Markdown file PRJNA636748/DataManipulation.Rmd using RStudio.
Learn more about the R packages that will be used. These packages are not mandatory, but they are powerful and follow a modern style of R programming.
- dplyr: for data manipulation, part of the tidyverse
- readr: for data import, also part of the tidyverse
- ggplot2: for data visualization; it predates the tidyverse and is now one of its core packages
  - learn more about ggplot2: https://ggplot2-book.org
The tidyverse is a collection of R packages designed for data science. It brings together a cohesive set of functions that share an underlying design philosophy and work together smoothly.
https://www.tidyverse.org
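To give a feel for how these packages fit together, here is a minimal sketch that summarises a table with dplyr and plots the summary with ggplot2. The data are made up for illustration, and the column names (Run, Platform, Bases) are assumptions, not necessarily those in the downloaded metadata.

```r
library(dplyr)
library(ggplot2)

# Toy data standing in for the run metadata; all values are invented.
runs <- tibble(
  Run      = c("run1", "run2", "run3", "run4"),
  Platform = c("ILLUMINA", "ILLUMINA", "OXFORD_NANOPORE", "OXFORD_NANOPORE"),
  Bases    = c(120e6, 80e6, 300e6, 500e6)
)

# dplyr: one verb per step, chained with the pipe.
platform_summary <- runs %>%
  group_by(Platform) %>%
  summarise(n_runs = n(), mean_bases = mean(Bases), .groups = "drop")

# ggplot2: map columns to aesthetics, then add a geometry layer.
p <- ggplot(platform_summary, aes(x = Platform, y = mean_bases)) +
  geom_col() +
  labs(x = "Sequencing platform", y = "Mean bases per run")

print(platform_summary)
print(p)
```

The same pattern (import, transform, visualise) applies to the real metadata once the toy tibble is replaced by the imported table.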
Task
Practice the code in PRJNA636748/DataManipulation.Rmd. Choose a question and select samples that help you answer it, e.g.:
- How do results differ between sequencing platforms?
- What are the differences among geographic locations?
- Does sequencing volume affect the consistency of the results?
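As one possible way to select samples for the platform question, the sketch below keeps only the samples that were sequenced on more than one platform, so that the same material is compared across platforms. The data are invented, and the column names BioSample and Platform are assumptions that should be checked against the real metadata.

```r
library(dplyr)

# Toy run table; values are invented for illustration.
runs <- tibble(
  Run       = c("r1", "r2", "r3", "r4", "r5"),
  BioSample = c("s1", "s1", "s2", "s3", "s3"),
  Platform  = c("ILLUMINA", "OXFORD_NANOPORE", "ILLUMINA",
                "ILLUMINA", "OXFORD_NANOPORE")
)

# Keep runs from samples that appear on at least two platforms.
paired_runs <- runs %>%
  group_by(BioSample) %>%
  filter(n_distinct(Platform) > 1) %>%
  ungroup()

print(paired_runs)
```

Here sample s2 is dropped because it was sequenced on a single platform only. The other questions could be approached the same way by grouping on a different column.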
Keep the following principles in mind:
- Variable control: Make sure only one variable is compared at a time.
- Replication: Both biological and technical replicates are needed to control random error.
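Both principles can be checked on a selected subset before running any comparison. The following is a rough sketch on invented data: it lists which metadata columns vary within the selection and counts the runs per level of the compared variable. The column names are assumptions, and the count alone does not distinguish biological from technical replicates.

```r
library(dplyr)

# Toy selection; values and column names are invented for illustration.
selected <- tibble(
  Run          = c("r1", "r2", "r3", "r4"),
  Platform     = c("ILLUMINA", "ILLUMINA",
                   "OXFORD_NANOPORE", "OXFORD_NANOPORE"),
  geo_loc_name = rep("South Africa", 4)
)

# Variable control: count distinct values in every column except the run ID.
distinct_counts <- selected %>%
  summarise(across(-Run, n_distinct))

# Columns with more than one distinct value are the ones that vary.
varying_cols <- names(distinct_counts)[unlist(distinct_counts) > 1]
print(varying_cols)

# Replication: number of runs for each level of the compared variable.
replicates <- selected %>%
  count(Platform, name = "n_runs")
print(replicates)
```

If varying_cols contains more than the one variable under study, the selection mixes several effects and should be narrowed further.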
Push your R Markdown file with the updated statistics and visualizations. Remember to add a description in README.md.