Tokens & Variables - AndersenLab/CAENDR GitHub Wiki
Tokens are placeholders for changing values that may be used in strings. They are typically used in filenames / filepaths, to distinguish e.g. between data files for different species or different releases.
For example, some filepaths may be given in the format:
... /{ SPECIES }/{ RELEASE }/sample_file_name-{ SPECIES }-{ RELEASE }.txt
If we wanted to use the version of this file associated with C. elegans from the (made-up) CaeNDR release on Jan 1, 2000, we would look for the file:
... /c_elegans/20000101/sample_file_name-c_elegans-20000101.txt
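
The substitution above can be sketched in a few lines of Python. This is only an illustration, not the CaeNDR implementation: the function name `fill_tokens` and the regular expression are assumptions, and the pattern tolerates optional spaces inside the braces because the exact token syntax is not specified here.

```python
import re

# Hypothetical pattern: an upper-case token name in braces, with optional spaces.
TOKEN_PATTERN = re.compile(r"\{\s*([A-Z_]+)\s*\}")


def fill_tokens(template: str, values: dict) -> str:
    """Replace every {TOKEN} in `template` with its value from `values`."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            # Fail loudly rather than leave a token unfilled in a filepath.
            raise KeyError(f"No value provided for token {name!r}")
        return values[name]

    return TOKEN_PATTERN.sub(replace, template)


template = "{ SPECIES }/{ RELEASE }/sample_file_name-{ SPECIES }-{ RELEASE }.txt"
print(fill_tokens(template, {"SPECIES": "c_elegans", "RELEASE": "20000101"}))
# c_elegans/20000101/sample_file_name-c_elegans-20000101.txt
```
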
Token | Description | Values |
---|---|---|
`SPECIES` | The relevant Caenorhabditis species for a given dataset or operation. The (unofficial) format is genus initial + underscore + species name; in our case, this means all valid values begin with `c_`. | `c_elegans`, `c_briggsae`, `c_tropicalis` |
`RELEASE` | The CaeNDR release that a data file was released in. New values are added when new dataset releases are created. | "YYYYMMDD" format |
`SVA` | The CaeNDR release to use for the Strain Variant Annotation table & tool. This has the same format (and possible values?) as the `RELEASE` token, but is kept separate because this tool may lag behind the most recent CaeNDR release - i.e. the SVA release and the "main" release may be different. | "YYYYMMDD" format, same as `RELEASE` |
`GENOME` | An identifier attached to a given release, relevant for locating the correct FASTA file. I believe this is now a more generic version of the `WB` token, since CaeNDR expanded from using specifically WormBase data to also using lab-produced genomes. Typically 1-to-1 with `RELEASE` values, but (a) it is possible for multiple CaeNDR releases to use the same genome, and (b) the genome value may be lab-internal or lab-specific, i.e. have some meaning within the lab that isn't captured by the CaeNDR release value alone. | Could be anything: `WS276` (from WormBase), `Feb2020` (from Andersen Lab), etc. |
`STRAIN` | An identifier for a specific strain. Mostly relevant for the Genome Browser tool's IGV Browser element, which needs to pull files for specific strains. | `BRC20067`, etc. |
`USER_ID` | The unique internal ID of a CaeNDR user who submitted a given data file, e.g. a phenotype trait file. | GCP Datastore Key |
`PRJ` | WormBase project number associated with a dataset or release. Pulled from WormBase. May not be relevant going forward, with new CaeNDR release format. | `PRJNA13758`, etc. |
`WB` | WormBase version associated with a dataset or release. Pulled from WormBase. May not be relevant going forward, with new CaeNDR release format. | `WS276`, etc. |
A few notes:
- Probably 85~90% of the time, you only need to care about the `SPECIES` and `RELEASE` tokens. All the others are pretty niche and context-specific; the `SVA` token, for example, is pretty much only relevant to the Strain Variant Annotation tool.
- The `SVA` token may be referred to with `RELEASE`, if the Strain Variant Annotation versioning scheme isn't relevant from a particular perspective. (This can be a bit confusing, but also, it's pretty infrequent that we actually care about this difference.)
- The `GENOME` token value is relevant when creating a new dataset release, and will appear on the Dataset Release page. It's mostly used to associate lab-internal versioning with CaeNDR release versioning. It's possible for multiple releases to use the same `GENOME` value, if they're based on the same FASTA genome file.
- The tokens `PRJ` and `WB` are holdovers from an older dataset versioning system, and don't appear to be very relevant going forward. It may be helpful to know what they meant, though, if dealing with old data.
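
As a rough sketch of the value formats listed in the token table, the two most commonly used tokens could be checked like this. The helper names are hypothetical, and the species set simply mirrors the values listed above; the real site may validate differently (or not at all).

```python
import re
from datetime import datetime

# Species values as listed in the token table above.
KNOWN_SPECIES = {"c_elegans", "c_briggsae", "c_tropicalis"}


def is_valid_species(value: str) -> bool:
    """True if `value` is one of the listed SPECIES values."""
    return value in KNOWN_SPECIES


def is_valid_release(value: str) -> bool:
    """True if `value` is exactly eight digits that form a real calendar date."""
    if not re.fullmatch(r"[0-9]{8}", value):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        # Eight digits, but not a real date (e.g. February 30th).
        return False
    return True


print(is_valid_species("c_elegans"))   # True
print(is_valid_release("20000101"))    # True
print(is_valid_release("20000230"))    # False
```

Because `SVA` uses the same "YYYYMMDD" format as `RELEASE`, the same check would apply to it.
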
This is NOT an exhaustive list of all environment variables required to use the site! Rather, this is an outline of most of the "custom" environment variables defined & used by the CaeNDR source code, specifically for running the site.
All environment variables here are directly read & used in the source code itself.
Variable | Description | Type |
---|---|---|
`MODULE_{NAME}_CONTAINER_NAME` | The name of the Docker container to look for to access this module. | string |
`MODULE_{NAME}_CONTAINER_VERSION` | The version tag to look for on the Docker container. | string |
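
A minimal sketch of how these two variables might be read together for one module. The function name `module_image_ref`, the `name:version` join, and the example values are all assumptions for illustration; only the variable naming pattern comes from the table above.

```python
import os
from typing import Mapping


def module_image_ref(module_name: str, env: Mapping[str, str] = os.environ) -> str:
    """Look up MODULE_{NAME}_CONTAINER_NAME / _VERSION and join them as name:version."""
    prefix = f"MODULE_{module_name.upper()}"
    name = env[f"{prefix}_CONTAINER_NAME"]
    version = env[f"{prefix}_CONTAINER_VERSION"]
    return f"{name}:{version}"


# Hypothetical values, passed explicitly so the sketch runs without a real environment.
example_env = {
    "MODULE_SITE_CONTAINER_NAME": "caendr-site",
    "MODULE_SITE_CONTAINER_VERSION": "v1.0.0",
}
print(module_image_ref("site", example_env))  # caendr-site:v1.0.0
```
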
Note that the Image Thumbnail Generator module uses a slightly different format:
Variable | Description | Type |
---|---|---|
`MODULE_IMG_THUMB_GEN_SOURCE_PATH` | The path in GCP where images are held. | string |
`MODULE_IMG_THUMB_GEN_VERSION` | The version tag to look for on the Docker container. (Same as above) | string |
See the tools section below.
See (buckets page?)
Variable | Description | Type |
---|---|---|
`MODULE_SITE_BUCKET_PHOTOS_NAME` | The "photos" bucket name. | string |
`MODULE_SITE_BUCKET_ASSETS_NAME` | The "assets" bucket name. | string |
`MODULE_SITE_BUCKET_PRIVATE_NAME` | The "private" bucket name. | string |
`MODULE_SITE_BUCKET_PUBLIC_NAME` | The "public" bucket name. | string |
`MODULE_DB_OPERATIONS_BUCKET_NAME` | The "db" bucket name. | string |
`MODULE_SITE_BUCKET_DATASET_RELEASE_NAME` | ... | string |
`ETL_LOGS_BUCKET_NAME` | The "logs" bucket where database operation logs are uploaded. Might be obsolete? | string |
Override:
Variable | Description | Type |
---|---|---|
`MODULE_SITE_BUCKET_PUBLIC_NAME_OVERRIDE` | ... | string |
All(?) filepaths and filenames are handled as "tokenized strings", i.e. strings that may contain one or more "Token" values (see above).
Variable | Description | File Type | Type |
---|---|---|---|
`BAM_BAI_PREFIX` | The location of the BAM and BAI files within the private bucket. | - | string (tokenized) |
`BAM_BAI_DOWNLOAD_SCRIPT_NAME` | The filename to download the "download BAM/BAI files" script with; that is, the filename that the user will see when they generate this file & download it. | Bash (`.sh`) | string (tokenized) |
Variable | Description | Variable Type |
---|---|---|
`FASTA_FILENAME_TEMPLATE` | Name of the FASTA file for a given species & release, NOT including the file extension. This is because the full file and the index file share the same name. | string (tokenized) |
`FASTA_EXTENSION_FILE` | The filename extension for the full FASTA file. Typically `.fa`. | string |
`FASTA_EXTENSION_INDEX` | The filename extension for the FASTA index file. Typically `.fa.fai`. Note - NOT appended to the plain file extension, so they can be different. | string |
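
To illustrate how the three FASTA variables fit together, here is a hedged sketch. The template value and the function name `fasta_filenames` are made up for the example; the point it demonstrates is that both extensions are appended to the same extension-less base name, so the index extension is not stacked on top of the file extension.

```python
import re

# Hypothetical token syntax: upper-case name in braces, optional spaces.
TOKEN_PATTERN = re.compile(r"\{\s*([A-Z_]+)\s*\}")


def fasta_filenames(template, tokens, file_ext=".fa", index_ext=".fa.fai"):
    """Return (full FASTA filename, index filename) for a tokenized template."""
    base = TOKEN_PATTERN.sub(lambda m: tokens[m.group(1)], template)
    # Each extension is appended to the shared base name independently.
    return base + file_ext, base + index_ext


full, index = fasta_filenames(
    "{SPECIES}.{RELEASE}.genome",  # made-up stand-in for FASTA_FILENAME_TEMPLATE
    {"SPECIES": "c_elegans", "RELEASE": "20000101"},
)
print(full)   # c_elegans.20000101.genome.fa
print(index)  # c_elegans.20000101.genome.fa.fai
```
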
Files used primarily to build the SQL database tables. These are all read as tokenized strings, in case the filenames need to change for species / release / etc, but as of March 2025, only the SVA_CSVGZ filename actually uses any tokens.
Filepaths:
Variable | Description | Variable Type |
---|---|---|
`MODULE_DB_OPERATIONS_RELEASE_FILEPATH` | The filepath in the DB_OPERATIONS bucket containing gene files below. | string (tokenized) |
`MODULE_DB_OPERATIONS_SVA_FILEPATH` | The filepath in the DB_OPERATIONS bucket containing the SVA file below. | string (tokenized) |
`MODULE_DB_OPERATIONS_PHENOTYPE_FILEPATH` | deprecated? | string (tokenized) |
`MODULE_DB_OPERATIONS_TRAITFILE_PUBLIC_FILEPATH` | The path in the DB_OPERATIONS bucket containing the user-uploaded phenotype trait files. | string |
Filenames:
Variable | Description | File Path | File Type | Type |
---|---|---|---|---|
`GENE_GFF_FILENAME` | Name of the file used to build the "Wormbase Gene Summary" table. | `MODULE_DB_OPERATIONS_RELEASE_FILEPATH` | Zipped GFF (`.gff3.gz`) | string (tokenized) |
`GENE_GTF_FILENAME` | Name of one file used to build the "Wormbase Gene" table. | `MODULE_DB_OPERATIONS_RELEASE_FILEPATH` | Zipped GTF (`.gtf.gz`) | string (tokenized) |
`GENE_IDS_FILENAME` | Name of one file used to build the "Wormbase Gene" table. | `MODULE_DB_OPERATIONS_RELEASE_FILEPATH` | Zipped Text (`.txt.gz`) | string (tokenized) |
`SVA_CSVGZ_FILENAME` | Name of the file used to build the "Strain Variant Annotation" table. | `MODULE_DB_OPERATIONS_SVA_FILEPATH` | Zipped CSV (`.csv.gz`) | string (tokenized) |
For more information on how these files are used, please consult SQL Database Source Files.
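
The pairing between filename variables and filepath variables in the table above could be expressed as a simple lookup. This is a sketch only: the mapping restates the "File Path" column, while the function name and the example environment values are hypothetical. Tokens are left unfilled here; they would still need substituting before use.

```python
import posixpath

# Which filepath variable each filename variable lives under (from the table above).
FILENAME_TO_FILEPATH_VAR = {
    "GENE_GFF_FILENAME": "MODULE_DB_OPERATIONS_RELEASE_FILEPATH",
    "GENE_GTF_FILENAME": "MODULE_DB_OPERATIONS_RELEASE_FILEPATH",
    "GENE_IDS_FILENAME": "MODULE_DB_OPERATIONS_RELEASE_FILEPATH",
    "SVA_CSVGZ_FILENAME": "MODULE_DB_OPERATIONS_SVA_FILEPATH",
}


def source_file_path(filename_var: str, env: dict) -> str:
    """Join the configured directory and filename into one (still tokenized) path."""
    directory = env[FILENAME_TO_FILEPATH_VAR[filename_var]]
    # Bucket object paths use forward slashes, so posixpath is used on every OS.
    return posixpath.join(directory, env[filename_var])


# Made-up values purely for illustration.
example_env = {
    "MODULE_DB_OPERATIONS_SVA_FILEPATH": "sva/{SVA}",
    "SVA_CSVGZ_FILENAME": "sva-{SPECIES}-{SVA}.csv.gz",
}
print(source_file_path("SVA_CSVGZ_FILENAME", example_env))
# sva/{SVA}/sva-{SPECIES}-{SVA}.csv.gz
```
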
Other files used to populate the site.
Variable | Description | File Type | Type |
---|---|---|---|
`EULA_FILE_NAME` | The file containing the site's End-User License Agreement. | Markdown (`.md`) | string |
Variable | Description | Type |
---|---|---|
`MODULE_SITE_HOST` | The root URL that the CaeNDR site should be hosted on. | string |
Variable | Description | Type |
---|---|---|
`MODULE_SITE_STRAIN_SUBMISSION_URL` | URL for Google Sheet tracking user-submitted strains. | string |
`SENTRY_URL` | The ingest URL with Sentry for tracking site bugs / errors. | string |
Variable | Description | Type |
---|---|---|
`MODULE_SITE_PASSWORD_PROTECTED` | Whether to request a password to access the site at all. Relevant for QA site. | boolean |
`MODULE_SITE_CART_COOKIE_NAME` | Name for the cookie to use to store the user's cart, if they are requesting strains. | string |
`MODULE_SITE_CART_COOKIE_AGE_SECONDS` | Timeout until the cart cookie expires, in seconds. | int |
`MODULE_SITE_PASSWORD_RESET_EXPIRATION_SECONDS` | Timeout until a "password reset" link expires, in seconds. | int |
`USER_OWNED_ENTITY_CACHE_AGE_SECONDS` | Timeout for the local cache of User entities, specifically when loading entities that are owned by users. Makes big queries where many objects have the same users much more efficient (e.g. pulling up all of a user's generated reports). | int |
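
Environment variables always arrive as strings, so the `boolean` and `int` types above imply some parsing step. A minimal sketch of such parsing follows; the helper names and the set of strings treated as "true" are assumptions, not necessarily what the CaeNDR code accepts.

```python
import os
from typing import Mapping, Optional

# Assumed spellings of "true"; the real code may accept a different set.
TRUE_STRINGS = {"1", "true", "yes", "on"}


def env_bool(name: str, env: Mapping[str, str] = os.environ, default: bool = False) -> bool:
    """Read a boolean variable, falling back to `default` when it is unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_STRINGS


def env_int(name: str, env: Mapping[str, str] = os.environ, default: Optional[int] = None) -> int:
    """Read an integer variable; raise KeyError if it is unset and no default is given."""
    raw = env.get(name)
    if raw is None:
        if default is None:
            raise KeyError(name)
        return default
    return int(raw)


# Made-up values for illustration.
example_env = {
    "MODULE_SITE_PASSWORD_PROTECTED": "True",
    "MODULE_SITE_CART_COOKIE_AGE_SECONDS": "86400",
}
print(env_bool("MODULE_SITE_PASSWORD_PROTECTED", example_env))     # True
print(env_int("MODULE_SITE_CART_COOKIE_AGE_SECONDS", example_env))  # 86400
```
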
These are variables that (mostly) exist for all three tools. The actual situation is a bit more complicated (because it always is), but as a rule of thumb, these are useful for all the tools.
As a quick refresher, the tools are:
Short Name | Display Name (on CaeNDR site) | Tool Name code |
---|---|---|
Nemascan | Genetic Mapping | `NEMASCAN` (sometimes `NEMASCAN_NXF`) |
Heritability | Heritability Calculator | `HERITABILITY` |
Indel Primer, Indel Finder | Pairwise Indel Finder | `INDEL_PRIMER` |
Variable | Description | Type | Notes |
---|---|---|---|
`{TOOL_CODE}_CONTAINER_NAME` | The name of the Docker Container for the relevant tool. | string | For historical reasons, the Nemascan variable is actually prefixed with `NEMASCAN_NXF`, instead of just `NEMASCAN`. |
`{TOOL_CODE}_TASK_QUEUE_NAME` | The name of the GCP Cloud Task queue that handles job submissions for this tool. | string | |
`{TOOL_CODE}_EXAMPLE_FILE` | The filepath / filename of the example data file for this tool, made available on the CaeNDR site as a sample / template for users to check against before uploading their data. Species-specific. | string (tokenized) | Not necessary for Indel Finder. |
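
The naming pattern and its Nemascan exception can be sketched as follows. The function names are hypothetical, and the sketch assumes the `NEMASCAN_NXF` prefix applies only to the container-name variable, since that is the only row where the note appears.

```python
# Assumed: only the container-name variable uses the historical NXF prefix.
CONTAINER_PREFIX_OVERRIDES = {"NEMASCAN": "NEMASCAN_NXF"}


def container_name_var(tool_code: str) -> str:
    """Name of the environment variable holding a tool's Docker container name."""
    prefix = CONTAINER_PREFIX_OVERRIDES.get(tool_code, tool_code)
    return f"{prefix}_CONTAINER_NAME"


def task_queue_var(tool_code: str) -> str:
    """Name of the environment variable holding a tool's Cloud Tasks queue name."""
    return f"{tool_code}_TASK_QUEUE_NAME"


for code in ("NEMASCAN", "HERITABILITY", "INDEL_PRIMER"):
    print(container_name_var(code), task_queue_var(code))
# NEMASCAN_NXF_CONTAINER_NAME NEMASCAN_TASK_QUEUE_NAME
# HERITABILITY_CONTAINER_NAME HERITABILITY_TASK_QUEUE_NAME
# INDEL_PRIMER_CONTAINER_NAME INDEL_PRIMER_TASK_QUEUE_NAME
```
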
Variables specific to the Nemascan ("Genetic Mapping") tool.
Variable | Description | Type |
---|---|---|
`NEMASCAN_SOURCE_GITHUB_ORG` | The GitHub organization to pull the Nemascan code from. Used when pushing new versions of the image, in the nemascan-proxy module. | string |
`NEMASCAN_SOURCE_GITHUB_REPO` | The GitHub repository in the above organization to pull the Nemascan code from. Used when pushing new versions of the image, in the nemascan-proxy module. | string |
The tag to publish with is pulled from the command line.
Variables specific to the Heritability Calculator tool.
Variable | Description | Type |
---|---|---|
`HERITABILITY_CONTAINER_VERSION` | The Docker image version tag to use when pushing new versions of the Heritability tool. Used in the heritability_proxy module. | string |
Variables specific to the Pairwise Indel Finder (or "Indel Primer") tool.
Variable | Description | Type |
---|---|---|
`INDEL_PRIMER_TOOL_PATH` | The GCP path to pull Indel Finder static data files from (BED, VCF, index files, etc). Located in the private bucket. | string |
`INDEL_PRIMER_SOURCE_FILENAME` | The naming schema for the BED and VCF files. Omits the file extension, since there are multiple files of different types using this same name. | string (tokenized) |
Variables used to configure GCP access. See GCP docs for details.
Token | Type |
---|---|
`GOOGLE_CLOUD_PROJECT_ID` | string |
`GOOGLE_CLOUD_PROJECT_NUMBER` | string |
`GOOGLE_CLOUD_REGION` | string |
`GOOGLE_CLOUD_ZONE` | string |
`GOOGLE_CLOUD_APP_LOCATION` | string |
Token | Type |
---|---|
`GOOGLE_STORAGE_SERVICE_ACCOUNT_NAME` | string |
Token | Type |
---|---|
`GOOGLE_CLOUDSQL_SERVICE_ACCOUNT_NAME` | string |
Token | Type |
---|---|
`GOOGLE_ANALYTICS_SERVICE_ACCOUNT_NAME` | string |
`GOOGLE_ANALYTICS_PROPERTY_ID` | string |
Token | Type |
---|---|
`GOOGLE_SHEETS_SERVICE_ACCOUNT_NAME` | string |