Dataset Releases - AndersenLab/CAENDR GitHub Wiki

Overview

This page describes the workflow for creating a new dataset release for a specific species.

Dataset releases are species-specific and version-specific. This page makes frequent use of the tokens SPECIES and RELEASE for these values. For more information on token values, please see the page Tokens & Variables.

Sections

  • Data Locations: The set of files involved in a single release, as defined in the codebase.
  • Release Page: Notes on what parts of the release page are rendered from what files.
  • Deploying a New Release: Instructions for creating a new dataset release.
  • Pinned Releases: Notes on what site features may remain pinned to an older release's data files.



Data Locations

This section describes the data available through the individual release pages.


Release Version 2

All files are held in the Dataset Release bucket, under the specified release folder. For more information, see the V2 instance of the ReportType class in the dataset release file (code link).

Report Files

File Name Filepath
release_notes release_notes_v2.md
summary summary.md
methods methods.md
alignment_report alignment_report.html
gatk_report gatk_report.html
concordance_report concordance_report.html

Divergent Regions

File Name Filepath
divergent_regions_strain_bed_gz browser_tracks/{RELEASE}_{SPECIES}_divergent_regions_strain.bed.gz
divergent_regions_strain_bed browser_tracks/{RELEASE}_{SPECIES}_divergent_regions_strain.bed

Filters

File Name Filepath
soft_filter_vcf_gz variation/WI.{RELEASE}.soft-filter.vcf.gz
soft_filter_vcf_gz_tbi variation/WI.{RELEASE}.soft-filter.vcf.gz.tbi
soft_filter_isotype_vcf_gz variation/WI.{RELEASE}.soft-filter.isotype.vcf.gz
soft_filter_isotype_vcf_gz_tbi variation/WI.{RELEASE}.soft-filter.isotype.vcf.gz.tbi
hard_filter_vcf_gz variation/WI.{RELEASE}.hard-filter.vcf.gz
hard_filter_vcf_gz_tbi variation/WI.{RELEASE}.hard-filter.vcf.gz.tbi
hard_filter_isotype_vcf_gz variation/WI.{RELEASE}.hard-filter.isotype.vcf.gz
hard_filter_isotype_vcf_gz_tbi variation/WI.{RELEASE}.hard-filter.isotype.vcf.gz.tbi
impute_isotype_vcf_gz variation/WI.{RELEASE}.impute.isotype.vcf.gz
impute_isotype_vcf_gz_tbi variation/WI.{RELEASE}.impute.isotype.vcf.gz.tbi

Filter Trees

File Name Filepath
hard_filter_min4_tree tree/WI.{RELEASE}.hard-filter.min4.tree
hard_filter_min4_tree_pdf tree/WI.{RELEASE}.hard-filter.min4.tree.pdf
hard_filter_isotype_min4_tree tree/WI.{RELEASE}.hard-filter.isotype.min4.tree
hard_filter_isotype_min4_tree_pdf tree/WI.{RELEASE}.hard-filter.isotype.min4.tree.pdf

Haplotypes

File Name Filepath
haplotype_png haplotype/haplotype.png
haplotype_pdf haplotype/haplotype.pdf
sweep_pdf haplotype/sweep.pdf
sweep_summary_tsv haplotype/sweep_summary.tsv

Transposons

File Name Filepath
transposon_calls {RELEASE}_{SPECIES}_transposon_calls.bed
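
The filepath templates in the tables above use the tokens {RELEASE} and {SPECIES}. As a rough illustration of how they expand, and of how an upload could be checked against them, here is a minimal Python sketch. It copies only a handful of the templates; the function names (expand_paths, missing_files) and the example token values are hypothetical and are not taken from the codebase. The authoritative definitions remain in the ReportType class.

```python
# Illustrative subset of the V2 templates listed above (not the full set).
V2_TEMPLATES = {
    "release_notes": "release_notes_v2.md",
    "divergent_regions_strain_bed_gz": "browser_tracks/{RELEASE}_{SPECIES}_divergent_regions_strain.bed.gz",
    "hard_filter_vcf_gz": "variation/WI.{RELEASE}.hard-filter.vcf.gz",
    "hard_filter_vcf_gz_tbi": "variation/WI.{RELEASE}.hard-filter.vcf.gz.tbi",
    "transposon_calls": "{RELEASE}_{SPECIES}_transposon_calls.bed",
}


def expand_paths(templates, release, species):
    """Fill the RELEASE and SPECIES tokens in every template."""
    return {
        name: template.format(RELEASE=release, SPECIES=species)
        for name, template in templates.items()
    }


def missing_files(expected_paths, uploaded):
    """Return the names of expected files whose paths are absent from `uploaded`."""
    uploaded = set(uploaded)
    return sorted(name for name, path in expected_paths.items() if path not in uploaded)


# Example token values are assumptions, chosen only for illustration.
paths = expand_paths(V2_TEMPLATES, release="20250101", species="c_elegans")
print(paths["transposon_calls"])
print(missing_files(paths, ["release_notes_v2.md"]))
```

Passing the list of object names found in the release folder as `uploaded` gives a quick checklist of what is still missing before continuing with the deployment steps.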



Release Page

This section describes how the release data is used to render the release page tabs.


Release Notes

The "Release Notes" section is rendered from the Markdown file release_notes, and the "Release Summary" section is rendered from the Markdown file summary. For the locations of these files, see the section Report Files.

NOTE: In the Release Summary, the "Genome" value should correspond with the GENOME token for this release. For more information on tokens, please see the page Tokens & Variables.


Other Tabs

Methods

Rendered from the methods Markdown file. For the location of this file, see the section Report Files.

Alignment Summary

Rendered from the alignment_report HTML file. For the location of this file, see the section Report Files.

Variant Summary

Rendered from the gatk_report HTML file. For the location of this file, see the section Report Files.

Concordance

Rendered from the concordance_report HTML file. For the location of this file, see the section Report Files.

Haplotypes

Rendered from the haplotype_png and haplotype_pdf files. For the locations of these files, see the section Haplotypes.

Swept Haplotypes

Rendered from the sweep_pdf and sweep_summary_tsv files. For the locations of these files, see the section Haplotypes.

Species Tree

Rendered from the hard_filter_isotype_min4_tree_pdf file. For the location of this file, see the section Filter Trees.
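
The tab-to-file relationships described in this section can be summarized as a lookup table. The following Python sketch is only a reading aid: the names TAB_SOURCES and tabs_affected_by are hypothetical, and the site's actual rendering code may organize this differently.

```python
# Tab name -> keys of the files it is rendered from, per the notes above.
TAB_SOURCES = {
    "Release Notes": ["release_notes", "summary"],
    "Methods": ["methods"],
    "Alignment Summary": ["alignment_report"],
    "Variant Summary": ["gatk_report"],
    "Concordance": ["concordance_report"],
    "Haplotypes": ["haplotype_png", "haplotype_pdf"],
    "Swept Haplotypes": ["sweep_pdf", "sweep_summary_tsv"],
    "Species Tree": ["hard_filter_isotype_min4_tree_pdf"],
}


def tabs_affected_by(file_key):
    """List the tabs that depend on the file identified by `file_key`."""
    return [tab for tab, keys in TAB_SOURCES.items() if file_key in keys]


print(tabs_affected_by("sweep_pdf"))
```

A lookup like this makes it easy to tell which tab would be affected if a given file were missing from the release folder.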




Deploying a New Release

This section describes how to create and publish a new dataset release.

NOTE: As of March 2025, not all tasks are automated yet! Parts of this flow changed during the site-v2 development cycle, and a few steps must be performed manually. Further development might change this.


Pre-Requisites

To deploy a new release, you will need:

  • Admin access to the CaeNDR site
  • Access to the datastore back-end (GCP)

The datastore access is required to manually fill out a few fields & make some updates that have not yet been integrated into the automated new release flow.


Instructions

  1. Upload all relevant files to the new release bucket. For more information on required files, see the section Data Locations and/or consult the spreadsheet of required data.

  2. Log in to the site as an admin user and navigate to the Admin portal.

  3. Under the section "Content Updates", select "Update 'Download Data' Releases". (This label may be renamed in the future.)

  4. Click "Create Release", and fill out the form:

    Field | Value
    Dataset Release Version | The RELEASE value for this new release. See the page Tokens & Variables for details on the appropriate value. NOTE: Remember this value for later steps!
    Wormbase Version | The GENOME value for this new release. See the page Tokens & Variables for details on the appropriate value.
    Report Type | Select V2 (or otherwise the most recent version).
    Disabled; Hidden | If you wish to keep the release hidden from public users, you may select one of these.

    When you are finished, click "Save".

  5. In the GCP datastore back-end, navigate to "Datastore Studio", and select the database for this project.

  6. Query by the kind dataset_release, and locate the release you just created - you should be able to find it by the fields version and/or created_on. Here, you will have to add a few additional fields:

    Field | Value
    browser_tracks | The list of tracks to make available in the Genome Browser tool. Unless this list has changed, you can copy the value from the previous release OF THE CURRENT SPECIES.
    genome | The GENOME release value. See the page Tokens & Variables for details on the appropriate value.
    species | The SPECIES release value. See the page Tokens & Variables for details on the appropriate value.

    When you are finished, save the new entity.

  7. Query by the kind species, and locate the entity representing the species for the new release.

  8. Update the species entity's release_latest value to the RELEASE value of the new release, i.e. the value in the release's version field. If the species entity has a latest_release field, update it to the new release value as well.

    • The value release_latest is the one that is used across the site.
    • The value latest_release is a legacy value and can likely be dropped; it is included here for completeness.
  9. If applicable, update the species entity's other release_ fields as well:

    • If the Pairwise Indel Finder tool should switch over to the new release as well, update the release_pif value to the new version; otherwise, leave the value as-is. For more information, see the section Pairwise Indel Finder Release.
    • If the Strain Variant Annotation tool should switch over to the new release as well, update the release_sva field (and the legacy sva_ver field, if it exists) to the new version; otherwise, leave the value(s) as-is. For more information, see the section Strain Variant Annotation Release.
  10. When you are finished, save the species entity.
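
The manual datastore edits in steps 6 through 9 can be summarized in code. The sketch below uses plain Python dictionaries as stand-ins for the dataset_release and species entities; it does not call any Datastore client, and the function names and placeholder values are hypothetical. It is meant only to make the field updates explicit, not to replace the manual steps.

```python
def complete_release_entity(release, species, genome, browser_tracks):
    """Return a copy of a dataset_release entity with the step 6 fields added."""
    updated = dict(release)
    updated.update(species=species, genome=genome, browser_tracks=list(browser_tracks))
    return updated


def update_species_entity(species_entity, new_release, switch_pif=False, switch_sva=False):
    """Return a copy of a species entity updated per steps 8 and 9."""
    updated = dict(species_entity)
    updated["release_latest"] = new_release
    # Legacy field: only updated if it already exists on the entity.
    if "latest_release" in updated:
        updated["latest_release"] = new_release
    if switch_pif:
        updated["release_pif"] = new_release
    if switch_sva:
        updated["release_sva"] = new_release
        # Legacy field: only updated if it already exists on the entity.
        if "sva_ver" in updated:
            updated["sva_ver"] = new_release
    return updated


# Placeholder values, not real release versions.
before = {
    "release_latest": "OLD",
    "latest_release": "OLD",
    "release_pif": "OLD",
    "release_sva": "OLD",
}
after = update_species_entity(before, "NEW", switch_pif=True)
print(after)
```

In this example the Pairwise Indel Finder moves to the new release while the Strain Variant Annotation tool stays pinned to the old one, which mirrors the "otherwise, leave the value as-is" wording in step 9.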




Pinned Releases

A few of the site features may be pinned to older releases, i.e. they may continue to use data from a previous release until that data is ready for the most recent release.


Pairwise Indel Finder Release

The Pairwise Indel Finder tool uses the value release_pif. This is used to determine the BED and VCF files used in the display & operation of the tool, as well as the list of available strains.


Strain Variant Annotation Release

The Strain Variant Annotation tool uses the value release_sva. In older versions of the codebase, this field was called sva_ver.

This value may be used to determine which SVA_CSVGZ file to use when building the strain_variant_annotation SQL table, by using the token SVA in the filename template. For more on this variable, see the Strain Variant Annotation section of the Data Dependencies page.
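
As an illustration of how a pinned release might be looked up, the following Python sketch maps each tool to the species field described above. The tool keys, the function name pinned_release, and the fallback to the legacy sva_ver field when release_sva is unset are all assumptions made for this example; the codebase may resolve these values differently.

```python
# Tool -> (current field, legacy field or None), per the descriptions above.
# The tool keys are hypothetical labels chosen for this sketch.
TOOL_RELEASE_FIELDS = {
    "pairwise_indel_finder": ("release_pif", None),
    "strain_variant_annotation": ("release_sva", "sva_ver"),
}


def pinned_release(species_entity, tool):
    """Return the release a tool is pinned to, or None if no field is set."""
    field, legacy_field = TOOL_RELEASE_FIELDS[tool]
    if species_entity.get(field) is not None:
        return species_entity[field]
    # Assumption: fall back to the legacy field name if the current one is unset.
    if legacy_field is not None:
        return species_entity.get(legacy_field)
    return None


# Placeholder values, not real release versions.
species = {"release_latest": "NEW", "release_pif": "OLD", "sva_ver": "OLDER"}
print(pinned_release(species, "pairwise_indel_finder"))
print(pinned_release(species, "strain_variant_annotation"))
```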
