Introduction - ncbi/workshop-asm-ngs-2022 GitHub Wiki

Introduction

Learning Objectives

  • How to access SRA data in GCP's BigQuery
  • How to run common queries against SRA metadata in BigQuery
  • Different methods for retrieving SRA data in the Cloud
  • One approach to assessing reference genome coverage
  • How to query SARS-CoV-2 SRA data using precalculated variant calling results in BigQuery

Background Knowledge

  • General familiarity with writing SQL queries
    • A Google search for "SQL basics" will turn up a number of decent tutorials
  • General familiarity with Next-Generation Sequencing (NGS) data
  • General familiarity with variant calling and the VCF file format
    • The VCF specification can be found here

Help Documentation

  • General documentation on finding and downloading SRA data can be found here
  • Documentation on the SRA Taxonomy Analysis Tool (STAT) can be found here
  • General documentation on SRA cloud resources can be found here
  • Documentation on NCBI's SARS-CoV-2 Variant Calling Pipeline can be found here
  • GCP BigQuery documentation can be found here
  • Documentation on the SRA Toolkit can be found at the associated GitHub page
  • minimap2 documentation can be found here
  • samtools documentation can be found here
  • jq documentation can be found here
  • gnuplot documentation can be found here

Other Resources

  • The AWS Open Data Program can be found here
  • Documentation on GCP Public Datasets can be found here
  • AWS Athena documentation can be found here
  • The initial publication on STAT can be found here