Dataset Releases - AndersenLab/CAENDR GitHub Wiki
This page describes the workflow for creating a new dataset release for a specific species.
Dataset releases are species-specific and version-specific. This page makes frequent use of the tokens `SPECIES` and `RELEASE` for these values. For more information on token values, please see the page Tokens & Variables.
- Data Locations: The set of files involved in a single release, as defined in the codebase.
- Release Page: Notes on what parts of the release page are rendered from what files.
- Deploying a New Release: Instructions for creating a new dataset release.
- Pinned Releases: Notes on what site features may remain pinned to an older release's data files.
This section describes the data available through the individual release pages.
All files are held in the Dataset Release bucket, under the specified release folder. For more information, see the `V2` instance of the `ReportType` class in the dataset release file (code link).
File Name | Filepath |
---|---|
`release_notes` | `release_notes_v2.md` |
`summary` | `summary.md` |
`methods` | `methods.md` |
`alignment_report` | `alignment_report.html` |
`gatk_report` | `gatk_report.html` |
`concordance_report` | `concordance_report.html` |
File Name | Filepath |
---|---|
`divergent_regions_strain_bed_gz` | `browser_tracks/{RELEASE}_{SPECIES}_divergent_regions_strain.bed.gz` |
`divergent_regions_strain_bed` | `browser_tracks/{RELEASE}_{SPECIES}_divergent_regions_strain.bed` |
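As an illustration of how the `{RELEASE}` and `{SPECIES}` tokens in these filepaths might be filled in, here is a minimal Python sketch. The helper `expand_filepath` and the token values are hypothetical and are not taken from the codebase; see the page Tokens & Variables for the real values.

```python
def expand_filepath(template: str, release: str, species: str) -> str:
    """Fill the {RELEASE} and {SPECIES} tokens in a filepath template."""
    return template.format(RELEASE=release, SPECIES=species)


# Templates copied from the table above.
BROWSER_TRACK_TEMPLATES = {
    "divergent_regions_strain_bed_gz":
        "browser_tracks/{RELEASE}_{SPECIES}_divergent_regions_strain.bed.gz",
    "divergent_regions_strain_bed":
        "browser_tracks/{RELEASE}_{SPECIES}_divergent_regions_strain.bed",
}

# Hypothetical token values, for illustration only.
paths = {
    name: expand_filepath(template, release="20250101", species="c_elegans")
    for name, template in BROWSER_TRACK_TEMPLATES.items()
}

for name, path in paths.items():
    print(f"{name}: {path}")
```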
File Name | Filepath |
---|---|
`soft_filter_vcf_gz` | `variation/WI.{RELEASE}.soft-filter.vcf.gz` |
`soft_filter_vcf_gz_tbi` | `variation/WI.{RELEASE}.soft-filter.vcf.gz.tbi` |
`soft_filter_isotype_vcf_gz` | `variation/WI.{RELEASE}.soft-filter.isotype.vcf.gz` |
`soft_filter_isotype_vcf_gz_tbi` | `variation/WI.{RELEASE}.soft-filter.isotype.vcf.gz.tbi` |
`hard_filter_vcf_gz` | `variation/WI.{RELEASE}.hard-filter.vcf.gz` |
`hard_filter_vcf_gz_tbi` | `variation/WI.{RELEASE}.hard-filter.vcf.gz.tbi` |
`hard_filter_isotype_vcf_gz` | `variation/WI.{RELEASE}.hard-filter.isotype.vcf.gz` |
`hard_filter_isotype_vcf_gz_tbi` | `variation/WI.{RELEASE}.hard-filter.isotype.vcf.gz.tbi` |
`impute_isotype_vcf_gz` | `variation/WI.{RELEASE}.impute.isotype.vcf.gz` |
`impute_isotype_vcf_gz_tbi` | `variation/WI.{RELEASE}.impute.isotype.vcf.gz.tbi` |
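Every VCF in this table is paired with a `.tbi` index at the same path. The following Python sketch rebuilds the ten entries from the five VCF variants, which can be handy when checking that no index was forgotten. The function name `vcf_paths` and the release value are hypothetical; the path patterns are the ones listed above.

```python
from typing import Dict

# The five VCF variants listed in the table above.
VCF_VARIANTS = [
    "soft-filter",
    "soft-filter.isotype",
    "hard-filter",
    "hard-filter.isotype",
    "impute.isotype",
]


def vcf_paths(release: str) -> Dict[str, str]:
    """Map each file name in the table to its path for the given release."""
    paths = {}
    for variant in VCF_VARIANTS:
        # e.g. "soft-filter.isotype" -> "soft_filter_isotype_vcf_gz"
        key = variant.replace("-", "_").replace(".", "_") + "_vcf_gz"
        vcf = f"variation/WI.{release}.{variant}.vcf.gz"
        paths[key] = vcf
        paths[key + "_tbi"] = vcf + ".tbi"  # index sits next to the VCF
    return paths


# Hypothetical release value, for illustration only.
for name, path in vcf_paths("20250101").items():
    print(f"{name}: {path}")
```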
File Name | Filepath |
---|---|
`hard_filter_min4_tree` | `tree/WI.{RELEASE}.hard-filter.min4.tree` |
`hard_filter_min4_tree_pdf` | `tree/WI.{RELEASE}.hard-filter.min4.tree.pdf` |
`hard_filter_isotype_min4_tree` | `tree/WI.{RELEASE}.hard-filter.isotype.min4.tree` |
`hard_filter_isotype_min4_tree_pdf` | `tree/WI.{RELEASE}.hard-filter.isotype.min4.tree.pdf` |
File Name | Filepath |
---|---|
`haplotype_png` | `haplotype/haplotype.png` |
`haplotype_pdf` | `haplotype/haplotype.pdf` |
`sweep_pdf` | `haplotype/sweep.pdf` |
`sweep_summary_tsv` | `haplotype/sweep_summary.tsv` |
File Name | Filepath |
---|---|
`transposon_calls` | `{RELEASE}_{SPECIES}_transposon_calls.bed` |
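One way to sanity-check a release folder against the tables in this section is to compare the expected filepaths with a listing of what is actually in the bucket. The sketch below is only a sketch: `missing_files` is a hypothetical helper, the listing is hard-coded, and it assumes object names are given relative to the release folder. How the listing is obtained (the GCP console, a client library, etc.) is left open.

```python
from typing import Iterable, List


def missing_files(expected: Iterable[str], present: Iterable[str]) -> List[str]:
    """Return the expected paths that do not appear in the bucket listing."""
    present_set = set(present)
    return sorted(path for path in expected if path not in present_set)


# Expected report files, copied from the first table in this section.
expected = [
    "release_notes_v2.md",
    "summary.md",
    "methods.md",
    "alignment_report.html",
    "gatk_report.html",
    "concordance_report.html",
]

# Hypothetical listing of a partially uploaded release folder.
present = [
    "release_notes_v2.md",
    "summary.md",
    "methods.md",
    "gatk_report.html",
]

print(missing_files(expected, present))
# ['alignment_report.html', 'concordance_report.html']
```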
This section describes how the release data is used to render the release page tabs.
The "Release Notes" section is rendered from the Markdown file `release_notes`, and the "Release Summary" section is rendered from the Markdown file `summary`. For the locations of these files, see the section Report Files.
NOTE: In the Release Summary, the "Genome" value should correspond with the `GENOME` token for this release. For more information on tokens, please see the page Tokens & Variables.
Rendered from the `methods` Markdown file.
For the location of this file, see the section Report Files.
Rendered from the `alignment_report` HTML file.
For the location of this file, see the section Report Files.
Rendered from the `gatk_report` HTML file.
For the location of this file, see the section Report Files.
Rendered from the `concordance_report` HTML file.
For the location of this file, see the section Report Files.
Rendered from the `haplotype_png` and `haplotype_pdf` files.
For the locations of these files, see the section Haplotypes.
Rendered from the `sweep_pdf` and `sweep_summary_tsv` files.
For the locations of these files, see the section Haplotypes.
Rendered from the `hard_filter_isotype_min4_tree_pdf` file.
For the location of this file, see the section Filter Trees.
This section describes how to create and publish a new dataset release.
NOTE: As of March 2025, not all tasks are automated yet! Parts of this flow changed during the site-v2 development cycle, and a few steps must be performed manually. Further development might change this.
To deploy a new release, you will need:
- Admin access to the CaeNDR site
- Access to the datastore back-end (GCP)
The datastore access is required to manually fill out a few fields & make some updates that have not yet been integrated into the automated new release flow.
- Upload all relevant files to the new release bucket. For more information on required files, see the section Data Locations and/or consult the spreadsheet of required data.
- Log in to the site as an admin user and navigate to the Admin portal.
- Under the section "Content Updates", select "Update 'Download Data' Releases". (This may change to a new name.)
- Click "Create Release", and fill out the form:

  Field | Value |
  ---|---|
  Dataset Release Version | The `RELEASE` value for this new release. See the page Tokens & Variables for details on the appropriate value. NOTE: Remember this value for later steps! |
  Wormbase Version | The `GENOME` value for this new release. See the page Tokens & Variables for details on the appropriate value. |
  Report Type | Select `V2` (or otherwise the most recent version). |
  Disabled; Hidden | If you wish to keep the release hidden from public users, you may select one of these. |

  When you are finished, click "Save".
- In the GCP datastore back-end, navigate to "Datastore Studio", and select the database for this project.
- Query by the kind `dataset_release`, and locate the release you just created; you should be able to find it by the fields `version` and/or `created_on`. Here, you will have to add a few additional fields:

  Field | Value |
  ---|---|
  `browser_tracks` | The list of tracks to make available in the Genome Browser tool. Unless this list has changed, you can copy the value from the previous release OF THE CURRENT SPECIES. |
  `genome` | The `GENOME` release value. See the page Tokens & Variables for details on the appropriate value. |
  `species` | The `SPECIES` release value. See the page Tokens & Variables for details on the appropriate value. |

  When you are finished, save the new entity.
- Query by the kind `species`, and locate the entity representing the species for the new release.
- Update the species entity's `release_latest` value to the `RELEASE` value of the new release, i.e. the value in the release's `version` field. If the species entity has a `latest_release` field, update it to the new release value as well.
  - The value `release_latest` is the one that is used across the site.
  - The value `latest_release` is a legacy value, and can likely be dropped, but I include it here for the sake of completeness.
- If applicable, update the species entity's other `release_` fields as well:
  - If the Pairwise Indel Finder tool should switch over to the new release as well, update the `release_pif` value to the new version; otherwise, leave the value as-is. For more information, see the section Pairwise Indel Finder Release.
  - If the Strain Variant Annotation tool should switch over to the new release as well, update the `release_sva` field (and the legacy `sva_ver` field, if it exists) to the new version; otherwise, leave the value(s) as-is. For more information, see the section Strain Variant Annotation Release.
- When you are finished, save the species entity.
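The manual datastore edits in the steps above can be summarised as two small transformations. The Python sketch below uses plain dictionaries as stand-ins for the `dataset_release` and `species` entities; it does not talk to GCP. The function names and sample values are hypothetical, and only the field names come from this page.

```python
from typing import Any, Dict


def fill_release_fields(release: Dict[str, Any], previous: Dict[str, Any],
                        genome: str, species: str) -> Dict[str, Any]:
    """Add the fields that the 'Create Release' form does not set."""
    updated = dict(release)
    # Copied from the previous release of the SAME species, assuming the
    # track list has not changed.
    updated["browser_tracks"] = list(previous["browser_tracks"])
    updated["genome"] = genome
    updated["species"] = species
    return updated


def point_species_at_release(species: Dict[str, Any], release: str,
                             update_pif: bool = False,
                             update_sva: bool = False) -> Dict[str, Any]:
    """Return a copy of the species entity updated for a new release."""
    updated = dict(species)
    updated["release_latest"] = release
    if "latest_release" in updated:  # legacy field, kept in sync if present
        updated["latest_release"] = release
    if update_pif:
        updated["release_pif"] = release
    if update_sva:
        updated["release_sva"] = release
        if "sva_ver" in updated:  # legacy name for release_sva
            updated["sva_ver"] = release
    return updated


# Hypothetical values, for illustration only.
old_species = {"release_latest": "20240101", "release_pif": "20240101",
               "release_sva": "20240101", "sva_ver": "20240101"}
new_species = point_species_at_release(old_species, "20250101", update_sva=True)
print(new_species)
```

In this example the Pairwise Indel Finder stays pinned to the older release, because `update_pif` is left at its default of `False`.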
A few of the site features may be pinned to older releases, i.e. they may continue to use data from a previous release until that data is ready for the most recent release.
The Pairwise Indel Finder tool uses the value `release_pif`. This is used to determine the BED and VCF files used in the display & operation of the tool, as well as the list of available strains.
The Strain Variant Annotation tool uses the value `release_sva`. In older versions of the codebase, this field was called `sva_ver`.
This may be used to determine which `SVA_CSVGZ` file to use when building the `strain_variant_annotation` SQL table, by using the token `SVA` in the filename template. For more on this variable, see the Strain Variant Annotation section of the Data Dependencies page.