Data Dependencies - AndersenLab/CAENDR GitHub Wiki

Overview

This page outlines all the internal & external data dependencies for the CaeNDR site.

| Name | Description | Bucket | Read More |
| --- | --- | --- | --- |
| Containerized Tool Data Files | Data files used by the CaeNDR tools, both on the site page and during tool operation. | multiple | link |
| SQL Database Source Files | Large data files used to build the SQL tables, which in turn allow much more efficient querying of the data. | DB | link |
| Dataset Release Files | Species-specific data files included in the periodic CaeNDR dataset releases. | PUBLIC | link |
| Strain Photos | Pictures of where specific strains were collected. | PHOTOS | link |
| Profile Photos | Pictures of staff, advisors, etc. on the various "About Us" pages. | PUBLIC | link |
| BAM/BAI Files | Genome data files used in some tools & available for user download. | PRIVATE | link |



Containerized Tool Data Files

NemaScan

NemaScan requires species data to be manually uploaded to cloud storage, at the following location, to make it accessible to the pipeline:

${MODULE_SITE_BUCKET_PRIVATE_NAME}/NemaScan/input_data




SQL Database Source Files

This section describes the data sources required for each of the SQL database tables. For more information on these tables, please consult the page Managing the SQL Database.


Strains

The strain table is built from Google Sheets documents. The document IDs are specified with a set of GCP secret values, one for each species:

ANDERSEN_LAB_STRAIN_SHEET_{ SPECIES }

The document ID is the part of the URL that identifies the document. For example, the URL will look something like:

https://docs.google.com/spreadsheets/d/{ SHEET_ID }
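As a minimal sketch of how the secret name and the sheet URL relate, the helpers below build both strings from the templates shown above. The function names are hypothetical, and upper-casing the species token in the secret name is an assumption; check the actual secret names in GCP before relying on it.

```python
def strain_sheet_secret_name(species: str) -> str:
    """Build the secret name ANDERSEN_LAB_STRAIN_SHEET_{ SPECIES }.

    Assumption: the species token is upper-cased in the secret name.
    """
    return f"ANDERSEN_LAB_STRAIN_SHEET_{species.upper()}"


def strain_sheet_url(sheet_id: str) -> str:
    """Build the Google Sheets URL for a given document ID."""
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}"
```

Going the other way, the document ID to store in the secret is the path segment that follows `/d/` in the sheet's URL.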

For more information on linking to these Google Sheets, see the section Lab Strain Data. For more information on how this table is used and how to (re)build it, please consult the page Managing the SQL Database. For more information on managing strains & isotypes across the site, please consult the page Strains & Isotypes.


Wormbase Genes

The wormbase gene & wormbase gene summary tables are linked, and built from the following files:

| File | Table | Bucket | Path (Env Var) | Filename (Env Var) | File Type |
| --- | --- | --- | --- | --- | --- |
| Gene GFF | wormbase_gene_summary | DB_OPERATIONS | MODULE_DB_OPERATIONS_RELEASE_FILEPATH | GENE_GFF_FILENAME | Zipped GFF (.gff3.gz) |
| Gene GTF | wormbase_gene | DB_OPERATIONS | MODULE_DB_OPERATIONS_RELEASE_FILEPATH | GENE_GTF_FILENAME | Zipped GTF (.gtf.gz) |
| Gene IDs | wormbase_gene | DB_OPERATIONS | MODULE_DB_OPERATIONS_RELEASE_FILEPATH | GENE_IDS_FILENAME | Zipped Text (.txt.gz) |

For more information on configuring these filenames, consult the page Tokens & Variables.
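The sketch below illustrates how the path and filename variables for the gene files might combine into a location inside the DB_OPERATIONS bucket. It is not the site's actual loader: the function name is hypothetical, joining the two values with `/` is an assumption, and any tokens inside the configured values (see Tokens & Variables) are not substituted here.

```python
import os
from typing import Dict, Mapping

# Environment variable holding the shared path for all three gene files.
PATH_VAR = "MODULE_DB_OPERATIONS_RELEASE_FILEPATH"

# Environment variable holding the filename of each source file.
GENE_FILENAME_VARS = {
    "Gene GFF": "GENE_GFF_FILENAME",
    "Gene GTF": "GENE_GTF_FILENAME",
    "Gene IDs": "GENE_IDS_FILENAME",
}


def gene_source_paths(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the in-bucket path of each gene source file.

    Assumption: path and filename are joined with a single '/'.
    Raises KeyError listing any variables that are not set.
    """
    required = [PATH_VAR, *GENE_FILENAME_VARS.values()]
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    base = env[PATH_VAR].strip("/")
    return {label: f"{base}/{env[var]}" for label, var in GENE_FILENAME_VARS.items()}
```

Passing a plain dictionary instead of `os.environ` makes it easy to check a proposed configuration before deploying it.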


Strain Variant Annotation

The strain variant annotation data is built from a CSV file zipped with gzip:

| File | Table | Bucket | Path (Env Var) | Filename (Env Var) | File Type |
| --- | --- | --- | --- | --- | --- |
| SVA CSV | strain_variant_annotation | DB_OPERATIONS | MODULE_DB_OPERATIONS_SVA_FILEPATH | SVA_CSVGZ_FILENAME | Zipped CSV (.csv.gz) |

For more information on configuring this filename, consult the page Tokens & Variables.


Phenotype Database

The phenotype database table is built from two sources:

  • The trait_file datastore entities, recording trait files uploaded (a) by the Andersen Lab or (b) by public CaeNDR users. These can be found under GCP Datastore Studio by querying for the kind trait_file.

  • The individual CSV trait files in the database. These are contained in the DB_OPERATIONS bucket, in the filepath specified by the MODULE_DB_OPERATIONS_TRAITFILE_PUBLIC_FILEPATH environment variable.

For more information on configuring the trait file path, consult the page Tokens & Variables.




Dataset Release Files

To add a Dataset Release to the site through the Admin panel, you will first have to upload the release files to:

{ MODULE_SITE_BUCKET_PUBLIC_NAME }/dataset_release/{ SPECIES }/{ RELEASE }
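For illustration, the upload location above can be assembled as follows. The function name is hypothetical, and the bucket, species, and release values used with it are placeholders rather than real configuration.

```python
def dataset_release_path(bucket: str, species: str, release: str) -> str:
    """Build { BUCKET }/dataset_release/{ SPECIES }/{ RELEASE }.

    `bucket` stands in for the value of MODULE_SITE_BUCKET_PUBLIC_NAME.
    """
    return f"{bucket}/dataset_release/{species}/{release}"
```

For example, `dataset_release_path("my-public-bucket", "c_elegans", "RELEASE_ID")` gives the folder that the release files would be uploaded into for that (placeholder) bucket and release.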

Originally, the uploaded { RELEASE } folder was required to use the file and directory structure described in the AndersenLab dry guide. I believe this is still the case, but the code itself is the authoritative source for the current structure. For more information, please consult the section Data Locations on the page Dataset Releases, and/or the ReportType object V2 in the Dataset Release model file.




Strain Photos

Uploading Photos

Strain photos should be uploaded to the PHOTOS bucket under the appropriate species folder, and named using the format { STRAIN }.jpg. If there are multiple photos of one strain, subsequent photos may be named { STRAIN }_2.jpg, { STRAIN }_3.jpg, etc. The full filename template is:

| Name | Bucket | Filepath |
| --- | --- | --- |
| First Photo | PHOTOS | { SPECIES }/{ STRAIN }.jpg |
| Subsequent Photos | PHOTOS | { SPECIES }/{ STRAIN }_N.jpg |

For example, if you want to upload two photos of the (fictional) C. elegans strain "ABC1234", then you would upload them as:

| Name | Bucket | Filepath |
| --- | --- | --- |
| Picture 1 | PHOTOS | c_elegans/ABC1234.jpg |
| Picture 2 | PHOTOS | c_elegans/ABC1234_2.jpg |

For more information on the tokens & variables in these path names, see the Tokens & Variables page.
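The naming rule can be sketched as a small helper: the first photo has no numeric suffix, and later photos are numbered from 2. The function name is hypothetical; only the filename template comes from this page.

```python
def strain_photo_path(species: str, strain: str, n: int = 1) -> str:
    """Return the PHOTOS bucket filepath for the n-th photo of a strain.

    n = 1 gives { SPECIES }/{ STRAIN }.jpg; n >= 2 gives { SPECIES }/{ STRAIN }_N.jpg.
    """
    if n < 1:
        raise ValueError("photo index must be 1 or greater")
    suffix = "" if n == 1 else f"_{n}"
    return f"{species}/{strain}{suffix}.jpg"
```

With the fictional strain from the example, `strain_photo_path("c_elegans", "ABC1234")` and `strain_photo_path("c_elegans", "ABC1234", 2)` reproduce the two filepaths listed above.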


Auto-Generated Thumbnail Photos

When a photo is uploaded to this bucket, the img_thumb_gen module will automatically create a thumbnail version of this photo, with the suffix .thumb added before the file extension. Using the above example, img_thumb_gen should create the following additional files for you:

| Name | Bucket | Filepath |
| --- | --- | --- |
| Picture 1 Thumbnail | PHOTOS | c_elegans/ABC1234.thumb.jpg |
| Picture 2 Thumbnail | PHOTOS | c_elegans/ABC1234_2.thumb.jpg |

For more information on buckets, see __.
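To predict which thumbnail filepath to expect for a given upload, the naming convention can be mirrored as below. This only reproduces the `.thumb` naming rule described in this section; it is not the img_thumb_gen implementation, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def thumbnail_path(photo_path: str) -> str:
    """Insert '.thumb' before the file extension of a photo filepath."""
    path = PurePosixPath(photo_path)
    # stem is the filename without its final extension; suffix is that extension.
    return str(path.with_name(f"{path.stem}.thumb{path.suffix}"))
```

This can be handy when checking that the expected thumbnail objects exist in the bucket after an upload.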




Profile Photos

These photos are currently contained in the "public" bucket, under the path profile/photos.

To add a new photo, upload an appropriate image file to this folder, and name it after the entity ID of the profile it belongs to. For more information on adding a new profile, see __.




BAM/BAI Files

BAM and BAI files are stored in:

| File Type | Bucket | Path |
| --- | --- | --- |
| BAM | MODULE_SITE_BUCKET_PRIVATE_NAME | bam/{ SPECIES }/{ STRAIN }.bam |
| BAI | MODULE_SITE_BUCKET_PRIVATE_NAME | bam/{ SPECIES }/{ STRAIN }.bam.bai |

For more information on buckets, see __. For more information on the tokens & variables in these path names, see the Tokens & Variables page.
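Since the BAI path is the BAM path with `.bai` appended, the pair can be derived together. A minimal sketch, with a hypothetical function name, returning paths relative to the private bucket:

```python
from typing import Tuple


def bam_bai_paths(species: str, strain: str) -> Tuple[str, str]:
    """Return the (BAM, BAI) paths for a strain inside the private bucket."""
    bam = f"bam/{species}/{strain}.bam"
    # The index file sits next to the BAM and appends '.bai' to its full name.
    return bam, f"{bam}.bai"
```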
