DataManipulation - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Rapid Introduction

Guideline

1 Given question: How has pathogenomics developed in Africa?

1.1 Investigation

https://www.cell.com/cell/current
Cell 50th

Africa in the era of pathogen genomics: Unlocking data barriers

Genomic-informed pathogen surveillance in Africa: opportunities and challenges

Early transmission of SARS-CoV-2 in South Africa: An epidemiological and phylogenetic report

Data availability: The SARS-CoV-2 genome sequences generated in this study were deposited in the GISAID database (https://www.gisaid.org/) under the following accession IDs: EPI_ISL_421572, EPI_ISL_421573, EPI_ISL_421574, EPI_ISL_421575, EPI_ISL_421576, EPI_ISL_436684, EPI_ISL_436685, EPI_ISL_436686, EPI_ISL_436687. In addition, raw short and long reads were submitted to the Short Read Archive (SRA) and can be accessed under BioProject Accession: PRJNA636748.

1.2 Collect public data

Open the SRA Run Selector and look up BioProject PRJNA636748.

Download the metadata and save it in the folder PRJNA636748.
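Once the metadata is saved, a first look at it could go as follows. This is a minimal sketch, not the code in DataManipulation.Rmd: the file name `SraRunTable.txt` and the column names `Run`, `Platform` and `Bases` are assumptions about a typical Run Selector export, and the stand-in rows are made up so that the snippet also runs before the real file is downloaded.

```r
library(readr)

# Assumed location of the Run Selector export (comma-separated text).
# The file name is hypothetical; adjust it to whatever you saved.
metadata_path <- file.path("PRJNA636748", "SraRunTable.txt")

# Made-up stand-in rows, used only when the real file is not present.
toy_csv <- I("Run,Platform,Bases\nRUN_A,ILLUMINA,1000\nRUN_B,OXFORD_NANOPORE,2500\n")

# Read the real file if it exists, otherwise the stand-in text.
metadata_source <- if (file.exists(metadata_path)) metadata_path else toy_csv
run_table <- read_csv(metadata_source, show_col_types = FALSE)

# Inspect the available columns and the number of runs before choosing a question.
print(colnames(run_table))
print(nrow(run_table))
```

Checking the column names first matters because the fields in the export depend on what the submitters provided.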

2 Data Manipulation via RStudio

Open the R Markdown file PRJNA636748/DataManipulation.Rmd in RStudio.

Learn more about the R packages that will be used. These packages are not mandatory, but they are powerful and follow a modern style.

  • dplyr for data manipulation, part of the tidyverse;
  • readr for data import, also part of the tidyverse;
  • ggplot2 for data visualization, a popular package that predates the tidyverse and is now part of it.

The tidyverse is a collection of R packages designed for data science. It brings together a cohesive set of functions that share an underlying design philosophy and work together smoothly.
https://www.tidyverse.org
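To show how these packages fit together, here is a small sketch that summarises runs per platform with dplyr and prepares a plot with ggplot2. The data frame holds made-up illustrative values, and the column names `Run`, `Platform` and `Bases` are assumed to mirror a typical Run Selector table rather than taken from the course material.

```r
library(dplyr)
library(ggplot2)

# Made-up illustrative rows; replace with the table read from your metadata file.
runs <- data.frame(
  Run = c("RUN_A", "RUN_B", "RUN_C", "RUN_D"),
  Platform = c("ILLUMINA", "ILLUMINA", "OXFORD_NANOPORE", "OXFORD_NANOPORE"),
  Bases = c(1000, 3000, 2000, 6000)
)

# dplyr: one row per platform with the run count and the mean sequencing volume.
platform_summary <- runs %>%
  group_by(Platform) %>%
  summarise(n_runs = n(), mean_bases = mean(Bases), .groups = "drop")

print(platform_summary)

# ggplot2: distribution of sequencing volume per platform.
p <- ggplot(runs, aes(x = Platform, y = Bases)) +
  geom_boxplot() +
  labs(x = "Sequencing platform", y = "Bases per run")
```

In an R Markdown chunk, writing `p` on its own line renders the plot.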

Task

Practice the code in PRJNA636748/DataManipulation.Rmd. Choose a question and select samples that help you answer it, for example:

  • How do sequencing platforms differ?
  • What differences are there among geographic locations?
  • Does sequencing volume affect the result consistency?

Keep the following principles in mind:

  • Variable control: Make sure only one variable is compared at a time.
  • Replication: Both biological and technical replicates are needed to control random error.
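One way to apply both principles when selecting samples is sketched below. It assumes, hypothetically, that the metadata has `Platform` and `geo_loc_name` columns. The rows and site names are made up for illustration, and the threshold of two runs per group is an arbitrary minimum, not a recommendation from the course.

```r
library(dplyr)

# Made-up illustrative rows; site names and run IDs are placeholders.
runs <- data.frame(
  Run = paste0("RUN_", 1:6),
  Platform = c("ILLUMINA", "ILLUMINA", "OXFORD_NANOPORE",
               "OXFORD_NANOPORE", "ILLUMINA", "OXFORD_NANOPORE"),
  geo_loc_name = c("Site1", "Site1", "Site1", "Site1", "Site2", "Site2")
)

# Variable control: hold the location fixed so that only Platform varies.
controlled <- runs %>%
  filter(geo_loc_name == "Site1")

# Replication: count runs per platform within the controlled subset.
replicate_counts <- controlled %>%
  count(Platform, name = "n_runs")

# Keep only groups with at least two runs (arbitrary minimum for this sketch).
enough_replicates <- replicate_counts %>%
  filter(n_runs >= 2)

print(enough_replicates)
```

Counting runs this way only shows how many runs fall in each group. Whether they are biological or technical replicates has to be read from the sample descriptions.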

Push your R Markdown file with the updated statistics and visualizations. Remember to add a description to README.md.