Tokens & Variables - AndersenLab/CAENDR GitHub Wiki

Tokens

Tokens are placeholders for values that vary, and may be embedded in strings. They are typically used in filenames / filepaths, e.g. to distinguish between data files for different species or different releases.

For example, some filepaths may be given in the format:

... /{ SPECIES }/{ RELEASE }/sample_file_name-{ SPECIES }-{ RELEASE }.txt

If we wanted to use the version of this file associated with C. elegans from the (made-up) CaeNDR release on Jan 1, 2000, we would look for the file:

... /c_elegans/20000101/sample_file_name-c_elegans-20000101.txt
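The substitution above can be sketched in a few lines of Python. This is a minimal illustration rather than the CaeNDR implementation: the function name `fill_tokens` is hypothetical, and the brace syntax it matches (with optional spaces inside the braces) is assumed from the example on this page.

```python
import re

# Matches a token such as "{ SPECIES }" or "{RELEASE}". The exact delimiter
# syntax used by the CaeNDR code base is an assumption here.
TOKEN_PATTERN = re.compile(r"\{\s*([A-Z_]+)\s*\}")


def fill_tokens(template: str, values: dict) -> str:
    """Replace every token in `template` with its value from `values`."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value supplied for token {name!r}")
        return values[name]

    return TOKEN_PATTERN.sub(substitute, template)


template = "{ SPECIES }/{ RELEASE }/sample_file_name-{ SPECIES }-{ RELEASE }.txt"
path = fill_tokens(template, {"SPECIES": "c_elegans", "RELEASE": "20000101"})
print(path)  # c_elegans/20000101/sample_file_name-c_elegans-20000101.txt
```

Raising on a missing token, rather than leaving the placeholder in the string, is a design choice of this sketch: it surfaces a misconfigured path before any file lookup is attempted.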

Token List

| Token | Description | Values |
| --- | --- | --- |
| SPECIES | The relevant Caenorhabditis species for a given dataset or operation. The (unofficial) format is genus initial + underscore + species name; in our case, this means all valid values begin with c_. | c_elegans, c_briggsae, c_tropicalis |
| RELEASE | The CaeNDR release that a data file was released in. | New values are added when new dataset releases are created. "YYYYMMDD" format |
| SVA | The CaeNDR release to use for the Strain Variant Annotation table & tool. This has the same format (and possible values?) as the RELEASE token, but is kept separate because this tool may lag behind the most recent CaeNDR release, i.e. the SVA release and the "main" release may be different. | "YYYYMMDD" format, same as RELEASE |
| GENOME | The genome version attached to a given release, relevant for locating the correct FASTA file. I believe this is now a more generic version of the WB token, since CaeNDR expanded from using specifically WormBase data to also using lab-produced genomes. | Typically 1-to-1 with RELEASE values, but (a) it is possible for multiple CaeNDR releases to use the same genome, and (b) the genome value may be lab-internal or lab-specific, i.e. have some meaning within the lab that isn't captured by the CaeNDR release value alone. Could be anything: WS276 (from WormBase), Feb2020 (from Andersen Lab), etc. |
| STRAIN | An identifier for a specific strain. Mostly relevant for the Genome Browser tool's IGV Browser element, which needs to pull files for specific strains. | BRC20067, etc. |
| USER_ID | The unique internal ID of a CaeNDR user who submitted a given data file, e.g. a phenotype trait file. | GCP Datastore Key |
| PRJ | WormBase project number associated with a dataset or release. | Pulled from WormBase. May not be relevant going forward, with new CaeNDR release format. PRJNA13758, etc. |
| WB | WormBase version associated with a dataset or release. | Pulled from WormBase. May not be relevant going forward, with new CaeNDR release format. WS276, etc. |

A few notes:

  • Probably 85~90% of the time, you only need to care about the SPECIES and RELEASE tokens. All the others are pretty niche and context-specific; the SVA token, for example, is pretty much only relevant to the Strain Variant Annotation tool.
  • The SVA token may be referred to as RELEASE in contexts where the Strain Variant Annotation versioning scheme isn't relevant. (This can be a bit confusing, but it's also pretty infrequent that we actually care about this difference.)
  • The GENOME token value is relevant when creating a new dataset release, and will appear on the Dataset Release page. It's mostly used to associate lab-internal versioning with CaeNDR release versioning. It's possible for multiple releases to use the same GENOME value, if they're based on the same FASTA genome file.
  • The tokens PRJ and WB are holdovers from an older dataset versioning system, and don't appear to be very relevant going forward. It may be helpful to know what they meant, though, if dealing with old data.
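
The value formats listed for SPECIES and RELEASE can be checked before a token is substituted into a path. The following is a minimal sketch, not code from the CaeNDR repository; the function names are hypothetical, and the set of species simply mirrors the three values listed on this page.

```python
from datetime import datetime

# Species values listed in the token table; a new species would need adding.
VALID_SPECIES = {"c_elegans", "c_briggsae", "c_tropicalis"}


def is_valid_species(value: str) -> bool:
    """True if `value` is one of the known SPECIES token values."""
    return value in VALID_SPECIES


def is_valid_release(value: str) -> bool:
    """True if `value` is an 8-digit "YYYYMMDD" string naming a real date."""
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True
```

Since SVA shares the RELEASE format, the same `is_valid_release` check would apply to it under these assumptions.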



Site Environment Variables

This is NOT an exhaustive list of all environment variables required to use the site! Rather, this is an outline of most of the "custom" environment variables defined & used by the CaeNDR source code, specifically for running the site.

All environment variables here are directly read & used in the source code itself.


Module Configuration

| Variable | Description | Type |
| --- | --- | --- |
| MODULE_{NAME}_CONTAINER_NAME | The name of the Docker container to look for to access this module. | string |
| MODULE_{NAME}_CONTAINER_VERSION | The version tag to look for on the Docker container. | string |
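
As an illustration of the `MODULE_{NAME}_...` naming pattern, the sketch below looks up both variables for a module and joins them in the usual Docker `name:tag` form. The helper name, the module name `site`, and the example values are all hypothetical, and combining the two values this way is an assumption rather than something the CaeNDR code is known to do.

```python
import os


def module_container_ref(module_name: str, env=os.environ) -> str:
    """Build a "name:version" reference from a module's two variables."""
    prefix = f"MODULE_{module_name.upper()}"
    name = env[f"{prefix}_CONTAINER_NAME"]        # raises KeyError if unset
    version = env[f"{prefix}_CONTAINER_VERSION"]  # raises KeyError if unset
    return f"{name}:{version}"


# Hypothetical values, passed as a plain dict instead of the real environment.
example_env = {
    "MODULE_SITE_CONTAINER_NAME": "caendr-site",
    "MODULE_SITE_CONTAINER_VERSION": "v1.0.0",
}
print(module_container_ref("site", example_env))  # caendr-site:v1.0.0
```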

Note that the Image Thumbnail Generator module uses a slightly different format:

| Variable | Description | Type |
| --- | --- | --- |
| MODULE_IMG_THUMB_GEN_SOURCE_PATH | The path in GCP where images are held. | string |
| MODULE_IMG_THUMB_GEN_VERSION | The version tag to look for on the Docker container. (Same as above) | string |

Containerized Tool Configuration

See the tools section below.


GCP Bucket Names

See (buckets page?)

| Variable | Description | Type |
| --- | --- | --- |
| MODULE_SITE_BUCKET_PHOTOS_NAME | The "photos" bucket name. | string |
| MODULE_SITE_BUCKET_ASSETS_NAME | The "assets" bucket name. | string |
| MODULE_SITE_BUCKET_PRIVATE_NAME | The "private" bucket name. | string |
| MODULE_SITE_BUCKET_PUBLIC_NAME | The "public" bucket name. | string |
| MODULE_DB_OPERATIONS_BUCKET_NAME | The "db" bucket name. | string |
| MODULE_SITE_BUCKET_DATASET_RELEASE_NAME | ... | string |
| ETL_LOGS_BUCKET_NAME | The "logs" bucket where database operation logs are uploaded. Might be obsolete? | string |

Override:

| Variable | Description | Type |
| --- | --- | --- |
| MODULE_SITE_BUCKET_PUBLIC_NAME_OVERRIDE | ... | string |

Filepaths & Filenames

All(?) filepaths and filenames are handled as "tokenized strings", i.e. strings that may contain one or more "Token" values (see above).

BAM/BAI Files

| Variable | Description | File Type | Type |
| --- | --- | --- | --- |
| BAM_BAI_PREFIX | The location of the BAM and BAI files within the private bucket. | - | string (tokenized) |
| BAM_BAI_DOWNLOAD_SCRIPT_NAME | The filename to download the "download BAM/BAI files" script with; that is, the filename that the user will see when they generate this file & download it. | Bash (.sh) | string (tokenized) |

FASTA Files

| Variable | Description | Variable Type |
| --- | --- | --- |
| FASTA_FILENAME_TEMPLATE | Name of the FASTA file for a given species & release, NOT including the file extension. This is because the full file and the index file share the same name. | string (tokenized) |
| FASTA_EXTENSION_FILE | The filename extension for the full FASTA file. Typically .fa. | string |
| FASTA_EXTENSION_INDEX | The filename extension for the FASTA index file. Typically .fa.fai. Note: NOT appended to the plain file extension, so they can be different. | string |
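
The relationship between the three FASTA variables can be sketched as follows. This is an illustration under assumptions, not the CaeNDR implementation: the function name and the template value are hypothetical, and the token syntax is assumed from the example at the top of this page. The point it shows is that each extension is appended to the shared base name independently.

```python
import re

TOKEN_PATTERN = re.compile(r"\{\s*([A-Z_]+)\s*\}")  # assumed token syntax


def fasta_filenames(template, tokens, ext_file=".fa", ext_index=".fa.fai"):
    """Return (fasta_name, index_name) built from one shared base name."""
    base = TOKEN_PATTERN.sub(lambda m: tokens[m.group(1)], template)
    # Each extension is appended to the base name on its own; the index
    # extension is not stacked on top of the file extension.
    return base + ext_file, base + ext_index


fasta, index = fasta_filenames(
    "{ SPECIES }.{ RELEASE }.genome",  # hypothetical FASTA_FILENAME_TEMPLATE
    {"SPECIES": "c_elegans", "RELEASE": "20000101"},
)
print(fasta)  # c_elegans.20000101.genome.fa
print(index)  # c_elegans.20000101.genome.fa.fai
```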

SQL Table Source Files

Files used primarily to build the SQL database tables. These are all read as tokenized strings, in case the filenames need to change for species / release / etc, but as of March 2025, only the SVA_CSVGZ filename actually uses any tokens.

Filepaths:

| Variable | Description | Variable Type |
| --- | --- | --- |
| MODULE_DB_OPERATIONS_RELEASE_FILEPATH | The filepath in the DB_OPERATIONS bucket containing gene files below. | string (tokenized) |
| MODULE_DB_OPERATIONS_SVA_FILEPATH | The filepath in the DB_OPERATIONS bucket containing the SVA file below. | string (tokenized) |
| MODULE_DB_OPERATIONS_PHENOTYPE_FILEPATH | deprecated? | string (tokenized) |
| MODULE_DB_OPERATIONS_TRAITFILE_PUBLIC_FILEPATH | The path in the DB_OPERATIONS bucket containing the user-uploaded phenotype trait files. | string |

Filenames:

| Variable | Description | File Path | File Type | Type |
| --- | --- | --- | --- | --- |
| GENE_GFF_FILENAME | Name of the file used to build the "Wormbase Gene Summary" table. | MODULE_DB_OPERATIONS_RELEASE_FILEPATH | Zipped GFF (.gff3.gz) | string (tokenized) |
| GENE_GTF_FILENAME | Name of one file used to build the "Wormbase Gene" table. | MODULE_DB_OPERATIONS_RELEASE_FILEPATH | Zipped GTF (.gtf.gz) | string (tokenized) |
| GENE_IDS_FILENAME | Name of one file used to build the "Wormbase Gene" table. | MODULE_DB_OPERATIONS_RELEASE_FILEPATH | Zipped Text (.txt.gz) | string (tokenized) |
| SVA_CSVGZ_FILENAME | Name of the file used to build the "Strain Variant Annotation" table. | MODULE_DB_OPERATIONS_SVA_FILEPATH | Zipped CSV (.csv.gz) | string (tokenized) |

For more information on how these files are used, please consult SQL Database Source Files.
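
To make the filepath / filename pairing concrete, the sketch below maps each filename variable to the filepath variable it is listed against and joins the two. The helper name and all example values are hypothetical, joining with "/" is an assumption, and token substitution is left out as a separate step.

```python
import posixpath

# Which filepath variable each filename variable is resolved against,
# taken from the "File Path" column of the filenames table.
FILEPATH_VAR_FOR = {
    "GENE_GFF_FILENAME": "MODULE_DB_OPERATIONS_RELEASE_FILEPATH",
    "GENE_GTF_FILENAME": "MODULE_DB_OPERATIONS_RELEASE_FILEPATH",
    "GENE_IDS_FILENAME": "MODULE_DB_OPERATIONS_RELEASE_FILEPATH",
    "SVA_CSVGZ_FILENAME": "MODULE_DB_OPERATIONS_SVA_FILEPATH",
}


def source_file_path(filename_var: str, env: dict) -> str:
    """Join a source file's name with the directory it is stored under."""
    directory = env[FILEPATH_VAR_FOR[filename_var]]
    return posixpath.join(directory, env[filename_var])


# Hypothetical values, for illustration only.
example_env = {
    "MODULE_DB_OPERATIONS_SVA_FILEPATH": "sva/{ SVA }",
    "SVA_CSVGZ_FILENAME": "annotations.{ SPECIES }.csv.gz",
}
print(source_file_path("SVA_CSVGZ_FILENAME", example_env))
```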

Miscellaneous Files

Other files used to populate the site.

| Variable | Description | File Type | Type |
| --- | --- | --- | --- |
| EULA_FILE_NAME | The file containing the site's End-User License Agreement. | Markdown (.md) | string |

URLs

Project-Internal URLs

| Variable | Description | Type |
| --- | --- | --- |
| MODULE_SITE_HOST | The root URL that the CaeNDR site should be hosted on. | string |

Project-External URLs

| Variable | Description | Type |
| --- | --- | --- |
| MODULE_SITE_STRAIN_SUBMISSION_URL | URL for Google Sheet tracking user-submitted strains. | string |
| SENTRY_URL | The ingest URL with Sentry for tracking site bugs / errors. | string |

Misc

| Variable | Description | Type |
| --- | --- | --- |
| MODULE_SITE_PASSWORD_PROTECTED | Whether to request a password to access the site at all. Relevant for QA site. | boolean |
| MODULE_SITE_CART_COOKIE_NAME | Name for the cookie to use to store the user's cart, if they are requesting strains. | string |
| MODULE_SITE_CART_COOKIE_AGE_SECONDS | Timeout until the cart cookie expires, in seconds. | int |
| MODULE_SITE_PASSWORD_RESET_EXPIRATION_SECONDS | Timeout until a "password reset" link expires, in seconds. | int |
| USER_OWNED_ENTITY_CACHE_AGE_SECONDS | Timeout for the local cache of User entities, specifically when loading entities that are owned by users. Makes big queries where many objects have the same users much more efficient (e.g. pulling up all of a user's generated reports). | int |
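
Environment variables are always strings, so the boolean and int variables above need converting when they are read. The sketch below shows one common way to do that; the helper names are hypothetical, and the set of strings accepted as "true" is an assumption, since this page does not say how the CaeNDR code parses them.

```python
import os

# Strings treated as "true"; this particular set is an assumption.
TRUE_STRINGS = {"1", "true", "yes", "on"}


def env_bool(name: str, env=os.environ, default: bool = False) -> bool:
    """Read a boolean variable, falling back to `default` if it is unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_STRINGS


def env_int(name: str, env=os.environ) -> int:
    """Read an integer variable; raises KeyError if unset, ValueError if malformed."""
    return int(env[name])


# Hypothetical values, passed as a plain dict instead of the real environment.
example_env = {
    "MODULE_SITE_PASSWORD_PROTECTED": "True",
    "MODULE_SITE_CART_COOKIE_AGE_SECONDS": "3600",
}
print(env_bool("MODULE_SITE_PASSWORD_PROTECTED", example_env))      # True
print(env_int("MODULE_SITE_CART_COOKIE_AGE_SECONDS", example_env))  # 3600
```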



Containerized Tool Variables

General Tool Variables

These are variables that (mostly) exist for all three tools. The actual situation is a bit more complicated (because it always is), but as a rule of thumb, these are useful for all the tools.

As a quick refresher, the tools are:

| Short Name | Display Name (on CaeNDR site) | Tool Code |
| --- | --- | --- |
| Nemascan | Genetic Mapping | NEMASCAN (sometimes NEMASCAN_NXF) |
| Heritability | Heritability Calculator | HERITABILITY |
| Indel Primer, Indel Finder | Pairwise Indel Finder | INDEL_PRIMER |

| Variable | Description | Type | Notes |
| --- | --- | --- | --- |
| {TOOL_CODE}_CONTAINER_NAME | The name of the Docker Container for the relevant tool. | string | For historical reasons, the Nemascan variable is actually prefixed with NEMASCAN_NXF, instead of just NEMASCAN. |
| {TOOL_CODE}_TASK_QUEUE_NAME | The name of the GCP Cloud Task queue that handles job submissions for this tool. | string | |
| {TOOL_CODE}_EXAMPLE_FILE | The filepath / filename of the example data file for this tool, made available on the CaeNDR site as a sample / template for users to check against before uploading their data. Species-specific. | string (tokenized) | Not necessary for Indel Finder. |
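
The `{TOOL_CODE}_...` pattern, including the Nemascan exception, can be sketched as a small lookup helper. The function name is hypothetical. The sketch applies the NEMASCAN_NXF prefix only to the container name, because that is the only variable this page notes it for; whether other Nemascan variables share the prefix is not stated here.

```python
def tool_variable_name(tool_code: str, suffix: str) -> str:
    """Return the environment variable name for a tool and variable suffix."""
    # Historical exception noted for the Nemascan container name.
    if tool_code == "NEMASCAN" and suffix == "CONTAINER_NAME":
        return "NEMASCAN_NXF_CONTAINER_NAME"
    return f"{tool_code}_{suffix}"


print(tool_variable_name("HERITABILITY", "TASK_QUEUE_NAME"))  # HERITABILITY_TASK_QUEUE_NAME
print(tool_variable_name("NEMASCAN", "CONTAINER_NAME"))       # NEMASCAN_NXF_CONTAINER_NAME
```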

Nemascan

Variables specific to the Nemascan ("Genetic Mapping") tool.

| Variable | Description | Type |
| --- | --- | --- |
| NEMASCAN_SOURCE_GITHUB_ORG | The GitHub organization to pull the Nemascan code from. Used when pushing new versions of the image, in the nemascan-proxy module. | string |
| NEMASCAN_SOURCE_GITHUB_REPO | The GitHub repository in the above organization to pull the Nemascan code from. Used when pushing new versions of the image, in the nemascan-proxy module. | string |

The tag to publish with is pulled from the command line.

Heritability

Variables specific to the Heritability Calculator tool.

| Variable | Description | Type |
| --- | --- | --- |
| HERITABILITY_CONTAINER_VERSION | The Docker image version tag to use when pushing new versions of the Heritability tool. Used in the heritability_proxy module. | string |

Pairwise Indel Finder

Variables specific to the Pairwise Indel Finder (or "Indel Primer") tool.

| Variable | Description | Type |
| --- | --- | --- |
| INDEL_PRIMER_TOOL_PATH | The GCP path to pull Indel Finder static data files from (BED, VCF, index files, etc). Located in the private bucket. | string |
| INDEL_PRIMER_SOURCE_FILENAME | The naming schema for the BED and VCF files. Omits the file extension, since there are multiple files of different types using this same name. | string (tokenized) |



Google Cloud Platform Variables

Variables used to configure GCP access. See GCP docs for details.

GCP Project Configuration

| Variable | Type |
| --- | --- |
| GOOGLE_CLOUD_PROJECT_ID | string |
| GOOGLE_CLOUD_PROJECT_NUMBER | string |
| GOOGLE_CLOUD_REGION | string |
| GOOGLE_CLOUD_ZONE | string |
| GOOGLE_CLOUD_APP_LOCATION | string |

Google Datastore

| Variable | Type |
| --- | --- |
| GOOGLE_STORAGE_SERVICE_ACCOUNT_NAME | string |

Google SQL

| Variable | Type |
| --- | --- |
| GOOGLE_CLOUDSQL_SERVICE_ACCOUNT_NAME | string |

Google Analytics

| Variable | Type |
| --- | --- |
| GOOGLE_ANALYTICS_SERVICE_ACCOUNT_NAME | string |
| GOOGLE_ANALYTICS_PROPERTY_ID | string |

Google Sheets

| Variable | Type |
| --- | --- |
| GOOGLE_SHEETS_SERVICE_ACCOUNT_NAME | string |