Overview

This page outlines all the internal & external data dependencies for the CaeNDR site.

Name	Description	Bucket	Read More
Containerized Tool Data Files	Data files used by the CaeNDR tools, both on the site page and during tool operation.	multiple	link
SQL Database Source Files	Large data files used to build the SQL tables, which in turn allow much more efficient querying of the data.	`DB`	link
Dataset Release Files	Species-specific data files included in the periodic CaeNDR dataset releases.	`PUBLIC`	link
Strain Photos	Pictures of where specific strains were collected.	`PHOTOS`	link
Profile Photos	Pictures of staff, advisors, etc. on various "About Us" pages.	`PUBLIC`	link
BAM/BAI Files	Genome data files used in some tools & available for user download.	`PRIVATE`	link

Containerized Tool Data Files

Nemascan

Nemascan requires species data to be manually uploaded to the following cloud storage location so that the pipeline can access it:

${MODULE_SITE_BUCKET_PRIVATE_NAME}/NemaScan/input_data

SQL Database Source Files

This section describes the data sources required for each of the SQL database tables. For more information on these tables, please consult the page Managing the SQL Database.

Strains

The strain table is built from Google Sheets documents. The document IDs are specified with a set of GCP secret values, one for each species:

ANDERSEN_LAB_STRAIN_SHEET_{ SPECIES }

The sheet ID is the part of the URL that identifies the document. For example, the URL will look something like:

https://docs.google.com/spreadsheets/d/{ SHEET_ID }
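As a rough illustration of which part of the URL is the sheet ID, the sketch below pulls it out of a full Google Sheets URL. The helper name `extract_sheet_id` is hypothetical and not part of the CaeNDR codebase, and the character set allowed in an ID (letters, digits, hyphens, underscores) is an assumption.

```python
import re

# Matches the document ID that follows "/spreadsheets/d/" in a Google Sheets URL.
# Assumption: IDs contain only letters, digits, hyphens, and underscores.
_SHEET_ID_PATTERN = re.compile(
    r"^https://docs\.google\.com/spreadsheets/d/([A-Za-z0-9_-]+)"
)


def extract_sheet_id(url: str) -> str:
    """Return the document ID portion of a Google Sheets URL.

    Hypothetical helper for illustration only.
    """
    match = _SHEET_ID_PATTERN.match(url)
    if match is None:
        raise ValueError(f"Not a Google Sheets document URL: {url!r}")
    return match.group(1)
```

The returned value is what would be stored in the secret for the corresponding species; anything after the ID in the URL (such as `/edit`) is ignored.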

For more information on linking to these Google Sheets, see the section Lab Strain Data. For more information on how this table is used and how to (re)build it, please consult the page Managing the SQL Database. For more information on managing strains & isotypes across the site, please consult the page Strains & Isotypes.

Wormbase Genes

The `wormbase_gene` and `wormbase_gene_summary` tables are linked, and are built from the following files:

File	Table	Bucket	Path (Env Var)	Filename (Env Var)	File Type
Gene GFF	`wormbase_gene_summary`	`DB_OPERATIONS`	`MODULE_DB_OPERATIONS_RELEASE_FILEPATH`	`GENE_GFF_FILENAME`	Zipped GFF (`.gff3.gz`)
Gene GTF	`wormbase_gene`	`DB_OPERATIONS`	`MODULE_DB_OPERATIONS_RELEASE_FILEPATH`	`GENE_GTF_FILENAME`	Zipped GTF (`.gtf.gz`)
Gene IDs	`wormbase_gene`	`DB_OPERATIONS`	`MODULE_DB_OPERATIONS_RELEASE_FILEPATH`	`GENE_IDS_FILENAME`	Zipped Text (`.txt.gz`)

For more information on configuring these filenames, consult the page Tokens & Variables.
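To sanity-check the configuration before a rebuild, a minimal sketch like the one below can resolve the three file paths from the environment. The environment variable names come from the table above; the function name `gene_file_paths` is hypothetical, and joining the path and filename with a single `/` is an assumption about how the site combines them.

```python
import os
from typing import Dict, Mapping

# Env var names are taken from the table above.
RELEASE_PATH_VAR = "MODULE_DB_OPERATIONS_RELEASE_FILEPATH"
GENE_FILENAME_VARS = {
    "Gene GFF": "GENE_GFF_FILENAME",
    "Gene GTF": "GENE_GTF_FILENAME",
    "Gene IDs": "GENE_IDS_FILENAME",
}


def gene_file_paths(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return {file label: path inside the DB_OPERATIONS bucket}.

    Hypothetical helper: raises KeyError listing any unset variables.
    Assumption: path and filename are joined with a single "/".
    """
    required = (RELEASE_PATH_VAR, *GENE_FILENAME_VARS.values())
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")

    base = env[RELEASE_PATH_VAR].strip("/")
    return {
        label: f"{base}/{env[var]}"
        for label, var in GENE_FILENAME_VARS.items()
    }
```

Passing a plain dictionary instead of `os.environ` makes it easy to try the helper with placeholder values without touching the real environment.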

Strain Variant Annotation

The strain variant annotation data is built from a CSV file compressed with gzip:

File	Table	Bucket	Path (Env Var)	Filename (Env Var)	File Type
SVA CSV	`strain_variant_annotation`	`DB_OPERATIONS`	`MODULE_DB_OPERATIONS_SVA_FILEPATH`	`SVA_CSVGZ_FILENAME`	Zipped CSV (`.csv.gz`)

For more information on configuring this filename, consult the page Tokens & Variables.

Phenotype Database

The phenotype database table is built from two sources:

The trait_file datastore entities, recording trait files uploaded (a) by the Andersen Lab or (b) by public CaeNDR users. These can be found under GCP Datastore Studio by querying for the kind trait_file.
The individual CSV trait files in the database. These are contained in the DB_OPERATIONS bucket, in the filepath specified by the MODULE_DB_OPERATIONS_TRAITFILE_PUBLIC_FILEPATH environment variable.

For more information on configuring the trait file path, consult the page Tokens & Variables.

Dataset Release Files

To add a Dataset Release to the site through the Admin panel, you will first have to upload the release files to:

{ MODULE_SITE_BUCKET_PUBLIC_NAME }/dataset_release/{ SPECIES }/{ RELEASE }

Originally, the uploaded { RELEASE } folder was required to follow the file and directory structure described in the AndersenLab dry guide. This is believed to still be the case, but the most up-to-date version of the structure is in the code itself. For more information, please consult the section Data Locations on the page Dataset Releases, and/or the ReportType object V2 in the Dataset Release model file.

Strain Photos

Uploading Photos

Strain photos should be uploaded to the PHOTOS bucket under the appropriate species folder, and named using the format { STRAIN }.jpg. If there are multiple photos of one strain, subsequent photos may be named { STRAIN }_2.jpg, { STRAIN }_3.jpg, etc. The full filename template is:

Name	Bucket	Filepath
First Photo	`PHOTOS`	`{ SPECIES }/{ STRAIN }.jpg`
Subsequent Photos	`PHOTOS`	`{ SPECIES }/{ STRAIN }_N.jpg`

For example, if you want to upload two photos of the (fictional) C. elegans strain "ABC1234", then you would upload them as:

Name	Bucket	Filepath
Picture 1	`PHOTOS`	`c_elegans/ABC1234.jpg`
Picture 2	`PHOTOS`	`c_elegans/ABC1234_2.jpg`

For more information on the tokens & variables in these path names, see the Tokens & Variables page.
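The naming rule above can be sketched as a small function. The name `strain_photo_path` is hypothetical and only mirrors the filename template described on this page; it is not taken from the site code.

```python
def strain_photo_path(species: str, strain: str, index: int = 1) -> str:
    """Build the path of a strain photo inside the PHOTOS bucket.

    Hypothetical helper: the first photo has no numeric suffix,
    and later photos are suffixed _2, _3, and so on.
    """
    if index < 1:
        raise ValueError("Photo index must be 1 or greater")
    suffix = "" if index == 1 else f"_{index}"
    return f"{species}/{strain}{suffix}.jpg"
```

For the fictional strain from the example, `strain_photo_path("c_elegans", "ABC1234")` and `strain_photo_path("c_elegans", "ABC1234", 2)` reproduce the two filepaths in the table.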

Auto-Generated Thumbnail Photos

When a photo is uploaded to this bucket, the img_thumb_gen module will automatically create a thumbnail version of this photo, with the suffix .thumb added before the file extension. Using the above example, img_thumb_gen should create the following additional files for you:

Name	Bucket	Filepath
Picture 1 Thumbnail	`PHOTOS`	`c_elegans/ABC1234.thumb.jpg`
Picture 2 Thumbnail	`PHOTOS`	`c_elegans/ABC1234_2.thumb.jpg`
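To predict which thumbnail file to expect for a given upload, the naming rule can be sketched as follows. This only mirrors the filename convention described above; it is not the img_thumb_gen implementation, and the function name `thumbnail_path` is hypothetical.

```python
import posixpath


def thumbnail_path(photo_path: str) -> str:
    """Return the expected thumbnail path for an uploaded photo.

    Hypothetical helper: inserts ".thumb" before the file extension.
    posixpath is used because bucket paths always use forward slashes.
    """
    root, ext = posixpath.splitext(photo_path)
    return f"{root}.thumb{ext}"
```

A helper like this can be handy when checking that the thumbnail for each uploaded photo actually exists in the bucket.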

For more information on buckets, see __.

Profile Photos

These photos are currently contained in the "public" bucket, under the path profile/photos.

To add a new photo, upload an appropriate image file to this folder, and name it after the entity ID of the profile it belongs to. For more information on adding a new profile, see __.

BAM/BAI Files

BAM and BAI files are stored in:

File Type	Bucket	Path
BAM	`MODULE_SITE_BUCKET_PRIVATE_NAME`	`bam/{ SPECIES }/{ STRAIN }.bam`
BAI	`MODULE_SITE_BUCKET_PRIVATE_NAME`	`bam/{ SPECIES }/{ STRAIN }.bam.bai`

For more information on buckets, see __. For more information on the tokens & variables in these path names, see the Tokens & Variables page.
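As a quick illustration of the path templates in the table, the sketch below builds both paths for one strain. The function name `bam_paths` is hypothetical; the paths are relative to the bucket named by `MODULE_SITE_BUCKET_PRIVATE_NAME`.

```python
from typing import Tuple


def bam_paths(species: str, strain: str) -> Tuple[str, str]:
    """Return the (BAM, BAI) paths for a strain, relative to the private bucket.

    Hypothetical helper: the index file path is the BAM path with ".bai" appended.
    """
    bam = f"bam/{species}/{strain}.bam"
    return bam, f"{bam}.bai"
```

For the fictional strain used earlier, `bam_paths("c_elegans", "ABC1234")` gives `bam/c_elegans/ABC1234.bam` and `bam/c_elegans/ABC1234.bam.bai`.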

Data Dependencies - AndersenLab/CAENDR GitHub Wiki
