Exercises - ncbi/workshop-asm-ngs-2022 GitHub Wiki
Exercise 1: Review metadata table contents and help docs
Add nih-sra-datastore to your datasets
You will want to pin the SRA dataset to your BigQuery Console to make it easier to access and explore the available metadata. Click the “Add Data” button on the upper left side of the screen, in the Explorer panel.
Next, select “Pin a project by name”, paste “nih-sra-datastore” into the Project name box and click Pin. It will now appear on the left side of the page in the Explorer panel.
Inspect table schema, details, and preview
Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra’ entry to reveal the metadata table. Click on the metadata table to open it in the main window.
You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table.
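The information shown in the Schema and Details tabs can also be read programmatically. The following is a minimal sketch, assuming the `google-cloud-bigquery` Python client is installed and that you have authenticated with Google Cloud; the helper name `summarize_table` is hypothetical, and the table ID `nih-sra-datastore.sra.metadata` is assembled from the project, dataset, and table names used in this exercise.

```python
def summarize_table(client, table_id):
    """Collect roughly what the Schema and Details tabs display for one table.

    `client` is expected to behave like google.cloud.bigquery.Client.
    """
    table = client.get_table(table_id)
    return {
        "table_id": table_id,
        "num_rows": table.num_rows,   # row count, as on the Details tab
        "modified": table.modified,   # last-modified time, as on the Details tab
        # field name and type for each column, as on the Schema tab
        "fields": [(field.name, field.field_type) for field in table.schema],
    }


def main():
    # Imported here so the helper above can be used without the package installed.
    # Requires the google-cloud-bigquery package and valid credentials.
    from google.cloud import bigquery

    client = bigquery.Client()
    info = summarize_table(client, "nih-sra-datastore.sra.metadata")
    print(info["num_rows"], "rows; last modified", info["modified"])
    for name, field_type in info["fields"]:
        print(f"{name}\t{field_type}")
```

Call `main()` from an authenticated environment to print the field list. Reading table metadata this way does not run a query.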
Compare BigQuery table contents to run selector
The SRA Run Selector allows users to search for SRA records via accession of Studies, Samples, Experiments, or Runs. Multiple query terms can be included by using a comma-separated list of accessions. The requested records are displayed in a table format with one Run per row. Information common to all records in the query is displayed at the top under the 'Common Fields' section, while remaining metadata fields that vary between runs are included in the table underneath.
Some metadata fields can be used as filters to limit your output to a subset of the full query result. Once you have filtered and selected your Runs of interest, the dataset can be downloaded as a list of Run accessions or as a table of Runs including the metadata fields. Note that the BigQuery metadata table includes information from the SRA record as well as the associated BioSample and BioProject records.
Compare what you see in the BigQuery table to what is available in Run Selector, such as for PRJNA839090. Note that there are more fields available in BigQuery than in Run Selector.
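To view the same BioProject in BigQuery, you can run a query against the metadata table. The sketch below only builds the SQL text, which you can paste into the BigQuery Console query editor. It assumes the metadata table has a column named `bioproject` holding the BioProject accession; confirm the column name in the Schema tab before running it. The helper name `build_bioproject_query` is hypothetical.

```python
import re


def build_bioproject_query(bioproject, table="nih-sra-datastore.sra.metadata", limit=100):
    """Return SQL selecting rows for one BioProject accession.

    The column name `bioproject` is an assumption; check the Schema tab.
    """
    # Accept only accession-like strings so nothing unexpected ends up in the SQL.
    if not re.fullmatch(r"PRJ[A-Z]{2}\d+", bioproject):
        raise ValueError(f"not a BioProject accession: {bioproject!r}")
    return (
        "SELECT *\n"
        f"FROM `{table}`\n"
        f"WHERE bioproject = '{bioproject}'\n"
        f"LIMIT {int(limit)}"
    )


print(build_bioproject_query("PRJNA839090"))
```

The `LIMIT` clause keeps the result small for browsing; it does not reduce the amount of data BigQuery scans.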
Reference existing help documentation and ask any questions you might have
Looking at what is available in BigQuery, noting differences between what is available in Run Selector and in BigQuery, and referencing our help documentation, do you still have any questions?
Exercise 2: Review taxonomy table and help docs
Intro to nested interval indexing
For reference, see Tropashko, 2004. Briefly, for tree-like hierarchical structures, each node in the tree can be encoded as a tuple, T1: (a, b), such that a < b. If we ensure that each child tuple, T2: (c, d), falls within the range of T1, such that a <= c < b and a <= d < b, we can use these indices to quickly and easily identify all nodes above and below a given node.
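The containment idea can be illustrated with a small sketch. It assigns integer intervals using a depth-first counter, which is a simplification and not Tropashko's own encoding, and then finds the nodes below and above a given node by interval containment. The tree and node names are made up for illustration.

```python
def assign_intervals(children, root):
    """Give every node an (a, b) tuple such that child intervals nest inside the parent's."""
    intervals = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter                      # value taken on entering the node
        for child in children.get(node, []):
            visit(child)
        counter += 1
        intervals[node] = (left, counter)   # value taken on leaving the node

    visit(root)
    return intervals


def descendants(intervals, node):
    """Nodes whose interval (c, d) falls inside the interval (a, b) of `node`."""
    a, b = intervals[node]
    return sorted(n for n, (c, d) in intervals.items()
                  if n != node and a <= c and d < b)


def ancestors(intervals, node):
    """Nodes whose interval (a, b) contains the interval (c, d) of `node`."""
    c, d = intervals[node]
    return sorted(n for n, (a, b) in intervals.items()
                  if n != node and a <= c and d < b)


# Toy tree: root has children A and B; A has children A1 and A2.
tree = {"root": ["A", "B"], "A": ["A1", "A2"]}
iv = assign_intervals(tree, "root")
print(iv["A"])                 # (2, 7)
print(descendants(iv, "A"))    # ['A1', 'A2']
print(ancestors(iv, "A1"))     # ['A', 'root']
```

Both lookups are plain range comparisons, so no recursive walk of the tree is needed once the intervals exist.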
Inspect table schema, details, and preview
Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra_tax_analysis_tool’ entry to reveal the constituent tables. Click on the taxonomy table to open it in the main window.
You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table.
Compare BigQuery table contents to the taxonomy web page
Compare what you see in the BigQuery table to what is available in NCBI Taxonomy. Note that there is more information available on the website than in BigQuery.
Reference existing help documentation and ask any questions you might have
Looking at what is available in BigQuery, noting differences between what is available in Taxonomy and in BigQuery, and referencing our help documentation, do you still have any questions?
Exercise 3: Review STAT results table and help docs
Inspect table schema, details, and preview
Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra_tax_analysis_tool’ entry to reveal the constituent tables. Click on the STAT results table to open it in the main window.
You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table. Note that the ileft and iright values are unique per SRA record and that each record may have multiple rows, one for each taxon detected.
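Because ileft and iright are assigned per SRA record, interval comparisons are only meaningful among rows from the same record. The sketch below shows this with invented rows. The accessions `RUN_A` and `RUN_B`, the taxon names, the interval values, and the column names `acc` and `name` are all illustrative assumptions, and the comparison follows the nesting rule from the nested interval introduction in Exercise 2.

```python
def taxa_under(rows, acc, name):
    """Names of taxa nested under `name`, using only the rows of record `acc`."""
    record_rows = [r for r in rows if r["acc"] == acc]   # restrict to one record first
    target = next(r for r in record_rows if r["name"] == name)
    a, b = target["ileft"], target["iright"]
    return sorted(r["name"] for r in record_rows
                  if r is not target and a <= r["ileft"] and r["iright"] < b)


# Invented rows: one row per taxon per record.
rows = [
    {"acc": "RUN_A", "name": "Bacteria",    "ileft": 1, "iright": 6},
    {"acc": "RUN_A", "name": "Escherichia", "ileft": 2, "iright": 3},
    {"acc": "RUN_A", "name": "Salmonella",  "ileft": 4, "iright": 5},
    {"acc": "RUN_B", "name": "Bacteria",    "ileft": 1, "iright": 4},
    {"acc": "RUN_B", "name": "Bacillus",    "ileft": 2, "iright": 3},
]

print(taxa_under(rows, "RUN_A", "Bacteria"))   # ['Escherichia', 'Salmonella']
print(taxa_under(rows, "RUN_B", "Bacteria"))   # ['Bacillus']
```

In this toy data the interval (2, 3) belongs to a different taxon in each record, which is why the rows are filtered by accession before any interval comparison.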
Compare BigQuery table contents to the run analysis web page
The SRA Run Analysis page allows users to see a representation of the results of STAT for each record in SRA, to assess what taxa might be present in the record.
Compare what you see in the BigQuery table to what is available on the run analysis page, such as for SRR21803617. Note that there are more fields available in BigQuery than on the run analysis page.
Reference existing help documentation and ask any questions you might have
Looking at what is available in BigQuery, noting differences between what is available on the run analysis web page and in BigQuery, and referencing our help documentation, do you still have any questions?
Exercise 4: Review VCF results table and help docs
Find nih-sequence-read in the bigquery-public-data dataset
As before, pin the bigquery-public-data project to your BigQuery Console to make it easier to access and explore the available data. Click the “Add Data” button on the upper left side of the screen, in the Explorer panel, select “Pin a project by name”, paste “bigquery-public-data” into the Project name box, and click Pin. It will now appear on the left side of the page in the Explorer panel. Expand it and scroll until you find “nih-sequence-read”.
Inspect table schema, details, and preview
Click the triangle ‘expand node’ next to the ‘sra’ entry to reveal the metadata table. Click on the metadata table to open it in the main window.
You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table.
Reference existing help documentation and ask any questions you might have
Looking at what is available in BigQuery, and referencing our help documentation, do you still have any questions?