Dataset Releases - AndersenLab/CAENDR GitHub Wiki
This page describes the workflow for creating a new dataset release for a specific species.
Dataset releases are species-specific and version-specific. This page makes frequent use of the tokens SPECIES and RELEASE for these values. For more information on token values, please see the page Tokens & Variables.
- Data Locations: The set of files involved in a single release, as defined in the codebase.
- Release Page: Notes on what parts of the release page are rendered from what files.
- Deploying a New Release: Instructions for creating a new dataset release.
- Pinned Releases: Notes on what site features may remain pinned to an older release's data files.
This section describes the data available through the individual release pages.
All files are held in the Dataset Release bucket, under the specified release folder. For more information, see the V2 instance of the ReportType class in the dataset release file (code link).
| File Name | Filepath |
|---|---|
| `release_notes` | `release_notes_v2.md` |
| `summary` | `summary.md` |
| `methods` | `methods.md` |
| `alignment_report` | `alignment_report.html` |
| `gatk_report` | `gatk_report.html` |
| `concordance_report` | `concordance_report.html` |
| File Name | Filepath |
|---|---|
| `divergent_regions_strain_bed_gz` | `browser_tracks/{RELEASE}_{SPECIES}_divergent_regions_strain.bed.gz` |
| `divergent_regions_strain_bed` | `browser_tracks/{RELEASE}_{SPECIES}_divergent_regions_strain.bed` |
| File Name | Filepath |
|---|---|
| `soft_filter_vcf_gz` | `variation/WI.{RELEASE}.soft-filter.vcf.gz` |
| `soft_filter_vcf_gz_tbi` | `variation/WI.{RELEASE}.soft-filter.vcf.gz.tbi` |
| `soft_filter_isotype_vcf_gz` | `variation/WI.{RELEASE}.soft-filter.isotype.vcf.gz` |
| `soft_filter_isotype_vcf_gz_tbi` | `variation/WI.{RELEASE}.soft-filter.isotype.vcf.gz.tbi` |
| `hard_filter_vcf_gz` | `variation/WI.{RELEASE}.hard-filter.vcf.gz` |
| `hard_filter_vcf_gz_tbi` | `variation/WI.{RELEASE}.hard-filter.vcf.gz.tbi` |
| `hard_filter_isotype_vcf_gz` | `variation/WI.{RELEASE}.hard-filter.isotype.vcf.gz` |
| `hard_filter_isotype_vcf_gz_tbi` | `variation/WI.{RELEASE}.hard-filter.isotype.vcf.gz.tbi` |
| `impute_isotype_vcf_gz` | `variation/WI.{RELEASE}.impute.isotype.vcf.gz` |
| `impute_isotype_vcf_gz_tbi` | `variation/WI.{RELEASE}.impute.isotype.vcf.gz.tbi` |
| File Name | Filepath |
|---|---|
| `hard_filter_min4_tree` | `tree/WI.{RELEASE}.hard-filter.min4.tree` |
| `hard_filter_min4_tree_pdf` | `tree/WI.{RELEASE}.hard-filter.min4.tree.pdf` |
| `hard_filter_isotype_min4_tree` | `tree/WI.{RELEASE}.hard-filter.isotype.min4.tree` |
| `hard_filter_isotype_min4_tree_pdf` | `tree/WI.{RELEASE}.hard-filter.isotype.min4.tree.pdf` |
| File Name | Filepath |
|---|---|
| `haplotype_png` | `haplotype/haplotype.png` |
| `haplotype_pdf` | `haplotype/haplotype.pdf` |
| `sweep_pdf` | `haplotype/sweep.pdf` |
| `sweep_summary_tsv` | `haplotype/sweep_summary.tsv` |
| File Name | Filepath |
|---|---|
| `transposon_calls` | `{RELEASE}_{SPECIES}_transposon_calls.bed` |
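The filepath templates in the tables above can be resolved by substituting the RELEASE and SPECIES tokens. The following is a minimal sketch of that substitution, not the code from the codebase: the names `RELEASE_FILE_TEMPLATES`, `resolve_release_paths`, and `missing_files` are hypothetical, only a subset of the documented files is listed, and the token values used in the usage note are placeholders (see the page Tokens & Variables for real values). Paths are relative to the release folder.

```python
# Hypothetical subset of the filepath templates documented in the tables above.
# Keys are the documented file names; values are paths relative to the release folder.
RELEASE_FILE_TEMPLATES = {
    "release_notes": "release_notes_v2.md",
    "summary": "summary.md",
    "methods": "methods.md",
    "divergent_regions_strain_bed_gz": "browser_tracks/{RELEASE}_{SPECIES}_divergent_regions_strain.bed.gz",
    "hard_filter_isotype_vcf_gz": "variation/WI.{RELEASE}.hard-filter.isotype.vcf.gz",
    "hard_filter_isotype_min4_tree_pdf": "tree/WI.{RELEASE}.hard-filter.isotype.min4.tree.pdf",
    "haplotype_png": "haplotype/haplotype.png",
    "transposon_calls": "{RELEASE}_{SPECIES}_transposon_calls.bed",
}


def resolve_release_paths(release, species):
    """Fill the RELEASE and SPECIES tokens into every template."""
    return {
        name: template.format(RELEASE=release, SPECIES=species)
        for name, template in RELEASE_FILE_TEMPLATES.items()
    }


def missing_files(uploaded_paths, release, species):
    """Return the documented file names whose resolved path is not in uploaded_paths."""
    uploaded = set(uploaded_paths)
    return sorted(
        name
        for name, path in resolve_release_paths(release, species).items()
        if path not in uploaded
    )
```

For example, `resolve_release_paths("20250101", "c_elegans")["transposon_calls"]` gives `20250101_c_elegans_transposon_calls.bed` (both token values are placeholders). Passing a listing of the release folder to `missing_files` gives a quick check that nothing was left out before the release is created.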
This section describes how the release data is used to render the release page tabs.
The "Release Notes" section is rendered from the Markdown file release_notes, and the "Release Summary" section is rendered from the Markdown file summary. For the locations of these files, see the section Report Files.
NOTE: In the Release Summary, the "Genome" value should correspond with the GENOME token for this release. For more information on tokens, please see the page Tokens & Variables.
Rendered from the methods Markdown file.
For the location of this file, see the section Report Files.
Rendered from the alignment_report HTML file.
For the location of this file, see the section Report Files.
Rendered from the gatk_report HTML file.
For the location of this file, see the section Report Files.
Rendered from the concordance_report HTML file.
For the location of this file, see the section Report Files.
Rendered from the haplotype_png and haplotype_pdf files.
For the locations of these files, see the section Haplotypes.
Rendered from the sweep_pdf and sweep_summary_tsv files.
For the locations of these files, see the section Haplotypes.
Rendered from the hard_filter_isotype_min4_tree_pdf file.
For the location of this file, see the section Filter Trees.
This section describes how to create and publish a new dataset release.
NOTE: As of March 2025, not all tasks are automated yet! Parts of this flow changed during the site-v2 development cycle, and a few steps must be performed manually. Further development might change this.
To deploy a new release, you will need:
- Admin access to the CaeNDR site
- Access to the datastore back-end (GCP)
The datastore access is required to manually fill out a few fields & make some updates that have not yet been integrated into the automated new release flow.
- Upload all relevant files to the new release bucket. For more information on required files, see the section Data Locations and/or consult the spreadsheet of required data.
- Log in to the site as an admin user and navigate to the Admin portal.
- Under the section "Content Updates", select "Update 'Download Data' Releases". (This may change to a new name.)
- Click "Create Release", and fill out the form:

  | Field | Value |
  |---|---|
  | Dataset Release Version | The `RELEASE` value for this new release. See the page Tokens & Variables for details on the appropriate value. NOTE: Remember this value for later steps! |
  | Wormbase Version | The `GENOME` value for this new release. See the page Tokens & Variables for details on the appropriate value. |
  | Report Type | Select `V2` (or otherwise the most recent version). |
  | Disabled; Hidden | If you wish to keep the release hidden from public users, you may select one of these. |

  When you are finished, click "Save".
- In the GCP datastore back-end, navigate to "Datastore Studio", and select the database for this project.
- Query by the kind `dataset_release`, and locate the release you just created; you should be able to find it by the fields `version` and/or `created_on`. Here, you will have to add a few additional fields:

  | Field | Value |
  |---|---|
  | `browser_tracks` | The list of tracks to make available in the Genome Browser tool. Unless this list has changed, you can copy the value from the previous release OF THE CURRENT SPECIES. |
  | `genome` | The `GENOME` release value. See the page Tokens & Variables for details on the appropriate value. |
  | `species` | The `SPECIES` release value. See the page Tokens & Variables for details on the appropriate value. |

  When you are finished, save the new entity.
- Query by the kind `species`, and locate the entity representing the species for the new release.
- Update the species entity's `release_latest` value to the `RELEASE` value of the new release, i.e. the value in the release's `version` field. If the species entity has a `latest_release` field, update it to the new release value as well.
  - The value `release_latest` is the one that is used across the site.
  - The value `latest_release` is a legacy value and can likely be dropped; it is included here for the sake of completeness.
- If applicable, update the species entity's other `release_` fields as well:
  - If the Pairwise Indel Finder tool should switch over to the new release as well, update the `release_pif` value to the new version; otherwise, leave the value as-is. For more information, see the section Pairwise Indel Finder Release.
  - If the Strain Variant Annotation tool should switch over to the new release as well, update the `release_sva` field (and the legacy `sva_ver` field, if it exists) to the new version; otherwise, leave the value(s) as-is. For more information, see the section Strain Variant Annotation Release.
- When you are finished, save the species entity.
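The manual datastore edits in the steps above can be summarized as plain dictionary updates. This is only an illustrative sketch of the field logic, assuming each entity is represented as a Python dict; the function names `complete_release_entity` and `update_species_entity` are hypothetical, and in practice these edits are made by hand in Datastore Studio rather than through this code.

```python
def complete_release_entity(release_entity, previous_release, genome, species):
    """Return a copy of the new dataset_release entity with the manual fields added.

    previous_release must be the previous release of the SAME species, since
    browser_tracks is copied from it (assuming the track list has not changed).
    """
    updated = dict(release_entity)
    updated["browser_tracks"] = list(previous_release["browser_tracks"])
    updated["genome"] = genome
    updated["species"] = species
    return updated


def update_species_entity(species_entity, release, switch_pif=False, switch_sva=False):
    """Return a copy of the species entity pointing at the new release.

    release is the RELEASE value, i.e. the release entity's version field.
    switch_pif / switch_sva control whether the pinned tools move to it as well.
    """
    updated = dict(species_entity)
    updated["release_latest"] = release
    # Legacy field: only touched if the entity already has it.
    if "latest_release" in updated:
        updated["latest_release"] = release
    if switch_pif:
        updated["release_pif"] = release
    if switch_sva:
        updated["release_sva"] = release
        # Legacy name for release_sva: only touched if it already exists.
        if "sva_ver" in updated:
            updated["sva_ver"] = release
    return updated
```

With both flags left at `False`, only `release_latest` (and the legacy `latest_release`, if present) changes, so the Pairwise Indel Finder and Strain Variant Annotation tools stay on their existing releases.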
A few of the site features may be pinned to older releases, i.e. they may continue to use data from a previous release until that data is ready for the most recent release.
The Pairwise Indel Finder tool uses the value release_pif. This is used to determine the BED and VCF files used in the display & operation of the tool, as well as the list of available strains.
The Strain Variant Annotation tool uses the value release_sva. In older versions of the codebase, this field was called sva_ver.
This may be used to determine which SVA_CSVGZ file to use when building the strain_variant_annotation SQL table, by using the token SVA in the filename template. For more on this variable, see the Strain Variant Annotation section of the Data Dependencies page.