Exercises - ncbi/workshop-asm-ngs-2022 GitHub Wiki

Exercise 1: Review metadata table contents and help docs

Add nih-sra-datastore to your datasets

You will want to pin the SRA dataset to your BigQuery Console to make it easier to access and explore the available metadata. Click the “Add Data” button on the upper left side of the screen, in the Explorer panel.

bigquery-pin

Next, select “Pin a project by name”, paste “nih-sra-datastore” into the Project name box and click Pin. It will now appear on the left side of the page in the Explorer panel.

Inspect table schema, details, and preview

Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra’ entry to reveal the metadata table. Click on the metadata table to open it in the main window.

bq_metadata_details

You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table.
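The information shown in the Schema and Details tabs can also be read programmatically. The following is a minimal sketch, assuming the google-cloud-bigquery Python client is installed and default credentials are configured. The table path `nih-sra-datastore.sra.metadata` is inferred from the project, dataset, and table names in the steps above, and `format_schema` and `describe_table` are hypothetical helper names used only for illustration.

```python
from typing import Iterable, List, Tuple


def format_schema(fields: Iterable[Tuple[str, str]]) -> List[str]:
    """Render (name, type) pairs as aligned text lines, similar to the Schema tab."""
    fields = list(fields)
    width = max((len(name) for name, _ in fields), default=0)
    return [f"{name.ljust(width)}  {ftype}" for name, ftype in fields]


def describe_table(table_id: str = "nih-sra-datastore.sra.metadata") -> None:
    """Print roughly what the Details and Schema tabs show for one table.

    The default table_id is an assumption based on the names in this exercise.
    """
    # Imported lazily so format_schema works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default project and credentials
    table = client.get_table(table_id)
    print(f"rows: {table.num_rows}, last modified: {table.modified}")
    for line in format_schema((f.name, f.field_type) for f in table.schema):
        print(line)
```

Calling `describe_table()` from an authenticated environment should list each field with its type; the Preview tab remains the quickest way to look at actual rows.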

Compare BigQuery table contents to Run Selector

The SRA Run Selector allows users to search for SRA records by the accession of a Study, Sample, Experiment, or Run. Multiple query terms can be included as a comma-separated list of accessions. The requested records are displayed as a table with one Run per row. Information common to all records in the query is displayed at the top under the 'Common Fields' section, while the remaining metadata fields, which vary between Runs, are shown in the table underneath.

Some metadata fields can be used as filters to limit your output to a subset of the full query result. Once you have filtered and selected your Runs of interest, the dataset can be downloaded as a list of Run accessions or as a table of Runs including the metadata fields. Note that the BigQuery metadata table includes information from the SRA record as well as from the associated BioSample and BioProject records.

Compare what you see in the BigQuery table to what is available in Run Selector, for example for PRJNA839090. Note that there are more fields available in BigQuery than in Run Selector.

run_selector
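To put the BigQuery side of this comparison next to the Run Selector view, the rows for a single BioProject can be pulled with a parameterized query. This is a sketch under assumptions: the column name `bioproject` and the table path `nih-sra-datastore.sra.metadata` are not given in this exercise and should be checked against the Schema tab, and `bioproject_query` and `fetch_runs` are hypothetical helper names.

```python
from typing import Any, Dict, List


def bioproject_query(table_id: str = "nih-sra-datastore.sra.metadata") -> str:
    """Build a query selecting all metadata rows for one BioProject.

    Assumes the table has a column named `bioproject` (verify in the Schema tab).
    """
    return f"SELECT * FROM `{table_id}` WHERE bioproject = @bioproject"


def fetch_runs(bioproject: str) -> List[Dict[str, Any]]:
    """Run the query and return one dict per Run (one row per Run)."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("bioproject", "STRING", bioproject)
        ]
    )
    rows = client.query(bioproject_query(), job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

For example, `fetch_runs("PRJNA839090")` would return the rows to compare against the Run Selector table for the same BioProject; the keys of each dict are the BigQuery field names.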

Reference existing help documentation and ask any questions you might have

After looking at what is available in BigQuery, noting the differences between Run Selector and BigQuery, and consulting our help documentation, do you still have any questions?


Exercise 2: Review taxonomy table and help docs

Intro to nested interval indexing

For reference, see Tropashko, 2004. Briefly, for tree-like hierarchical structures, each node in the tree can be encoded as a tuple, T1: (a, b), such that a < b. If we ensure that each child tuple, T2: (c, d), falls within the range of T1, such that a <= c < b and a <= d < b, we can use these indices to quickly and easily identify all nodes above and below a given node.
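The containment rule above can be illustrated with a small, self-contained sketch. It uses a simple depth-first integer numbering rather than the rational-number encoding described by Tropashko, and the toy tree and its interval values are invented for illustration; they are not the values stored in the BigQuery tables. The helper names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


def assign_intervals(children: Dict[str, List[str]], root: str) -> Dict[str, Interval]:
    """Number nodes depth-first so each node's interval encloses its subtree."""
    intervals: Dict[str, Interval] = {}
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        left = counter          # assigned on the way down
        counter += 1
        for child in children.get(node, []):
            visit(child)
        intervals[node] = (left, counter)  # right end assigned on the way up
        counter += 1

    visit(root)
    return intervals


def contains(outer: Interval, inner: Interval) -> bool:
    """True if inner = (c, d) falls within outer = (a, b): a <= c < b and a <= d < b."""
    a, b = outer
    c, d = inner
    return a <= c < b and a <= d < b


def descendants(intervals: Dict[str, Interval], node: str) -> List[str]:
    """All nodes below `node`, found by interval comparison alone."""
    return [n for n, iv in intervals.items() if contains(intervals[node], iv)]


def ancestors(intervals: Dict[str, Interval], node: str) -> List[str]:
    """All nodes above `node`, found by interval comparison alone."""
    return [n for n, iv in intervals.items() if contains(iv, intervals[node])]


# Toy tree (parent -> children); not the real NCBI taxonomy.
tree = {"root": ["Bacteria", "Eukaryota"], "Eukaryota": ["Homo", "Mus"]}
iv = assign_intervals(tree, "root")
print(iv["Eukaryota"])                      # (3, 8)
print(sorted(descendants(iv, "Eukaryota")))  # ['Homo', 'Mus']
print(sorted(ancestors(iv, "Homo")))         # ['Eukaryota', 'root']
```

Because a node's right end fails the strict `d < b` test against itself, a node is never reported as its own ancestor or descendant. Both lookups are plain range comparisons, which is why this kind of index works well as a filter in a query.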

Inspect table schema, details, and preview

Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra_tax_analysis_tool’ entry to reveal the constituent tables. Click on the taxonomy table to open it in the main window.

bq_tax_schema

You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table.

Compare BigQuery table contents to the taxonomy web page

Compare what you see in the BigQuery table to what is available in NCBI Taxonomy. Note that there is more information available on the website than in BigQuery.

taxonomy1

taxonomy2

Reference existing help documentation and ask any questions you might have

Looking at what is available in BigQuery, noting differences between what is available in Taxonomy and in BigQuery, and referencing our help documentation, do you still have any questions?


Exercise 3: Review STAT results table and help docs

Inspect table schema, details, and preview

Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra_tax_analysis_tool’ entry to reveal the constituent tables. Click on the STAT results table to open it in the main window.

bq_stat_preview

You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table. Note that the ileft and iright values are unique per SRA record and that each record may have multiple rows, one for each taxon detected.

Compare BigQuery table contents to the run analysis web page

The SRA Run Analysis page allows users to see a representation of the results of STAT for each record in SRA, to assess what taxa might be present in the record.

analysis_tab

Compare what you see in the BigQuery table to what is available on the run analysis page, for example for SRR21803617. Note that there are more fields available in BigQuery than on the run analysis page.
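To see the BigQuery rows for the same Run shown on the run analysis page, a query can be generated and pasted into the BigQuery Console editor. This is only a sketch: the table name `tax_analysis` and the column names `acc` and `total_count` are assumptions not stated in this exercise and should be confirmed in the Schema tab. `stat_query` is a hypothetical helper; it checks the accession format before placing it in the query text.

```python
import re

# Run accessions are assumed to look like SRR/ERR/DRR followed by digits.
_RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def stat_query(
    run_accession: str,
    table_id: str = "nih-sra-datastore.sra_tax_analysis_tool.tax_analysis",
) -> str:
    """Build console-ready SQL listing every taxon row for one Run.

    The default table_id and the `acc` / `total_count` columns are assumptions.
    """
    if not _RUN_ACCESSION.fullmatch(run_accession):
        raise ValueError(f"not a Run accession: {run_accession!r}")
    return (
        "SELECT *\n"
        f"FROM `{table_id}`\n"
        f"WHERE acc = '{run_accession}'\n"
        "ORDER BY total_count DESC"
    )


print(stat_query("SRR21803617"))
```

The result should contain one row per taxon detected in the Run, which can then be compared line by line with the run analysis page.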

Reference existing help documentation and ask any questions you might have

Looking at what is available in BigQuery, noting differences between what is available on the run analysis web page and in BigQuery, and referencing our help documentation, do you still have any questions?


Exercise 4: Review VCF results table and help docs

Find nih-sequence-read in the bigquery-public-data project

As before, pin the bigquery-public-data project to your BigQuery Console to make it easier to access and explore the available data. Click the “Add Data” button on the upper left side of the screen, in the Explorer panel, select “Pin a project by name”, paste “bigquery-public-data” into the Project name box, and click Pin. It will now appear on the left side of the page in the Explorer panel. Expand it and scroll until you find “nih-sequence-read”.

variations

Inspect table schema, details, and preview

Click the triangle ‘expand node’ next to the ‘nih-sequence-read’ entry to reveal the variations table. Click on the variations table to open it in the main window.

variations_schema

You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table.

Reference existing help documentation and ask any questions you might have

Looking at what is available in BigQuery, and referencing our help documentation, do you still have any questions?