Data Dependencies - AndersenLab/CAENDR GitHub Wiki
This page outlines all the internal & external data dependencies for the CaeNDR site.
Name | Description | Bucket | Read More |
---|---|---|---|
Containerized Tool Data Files | Data files used by the CaeNDR tools, both on the site page and during tool operation. | multiple | link |
SQL Database Source Files | Large data files used to build the SQL tables, which in turn allow much more efficient querying of the data. | DB | link |
Dataset Release Files | Species-specific data files included in the periodic CaeNDR dataset releases. | PUBLIC | link |
Strain Photos | Pictures of where specific strains were collected. | PHOTOS | link |
Profile Photos | Pictures of staff, advisors, etc. on various "About Us" pages. | PUBLIC | link |
BAM/BAI Files | Genome data files used in some tools & available for user download. | PRIVATE | link |
NemaScan requires species data to be manually uploaded to cloud storage to make it accessible to the pipeline:
${MODULE_SITE_BUCKET_PRIVATE_NAME}/NemaScan/input_data
This section describes the data sources required for each of the SQL database tables. For more information on these tables, please consult the page Managing the SQL Database.
The Google Sheets document IDs are specified with a set of GCP secret values, one for each species:
ANDERSEN_LAB_STRAIN_SHEET_{ SPECIES }
This is the part of the URL that identifies the document, e.g. the URL will be something like:
https://docs.google.com/spreadsheets/d/{ SHEET_ID }
For more information on linking to these Google Sheets, see the section Lab Strain Data. For more information on how this table is used and how to (re)build it, please consult the page Managing the SQL Database. For more information on managing strains & isotypes across the site, please consult the page Strains & Isotypes.
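As a small illustration of how the secret name and the sheet URL fit together, the sketch below fills in the two templates shown above. The function names are hypothetical, and the exact form of the `SPECIES` token in the secret name (for example, its capitalization) is an assumption that should be checked against the GCP secrets themselves.

```python
SHEET_URL_TEMPLATE = "https://docs.google.com/spreadsheets/d/{sheet_id}"


def strain_sheet_secret_name(species_token: str) -> str:
    """Return the name of the GCP secret holding a species' sheet ID.

    `species_token` is substituted verbatim; its exact format is an assumption.
    """
    return f"ANDERSEN_LAB_STRAIN_SHEET_{species_token}"


def strain_sheet_url(sheet_id: str) -> str:
    """Build the Google Sheets URL for a given document ID."""
    return SHEET_URL_TEMPLATE.format(sheet_id=sheet_id)
```

The value stored in the secret is only the document ID, so the full URL has to be assembled before the sheet can be opened in a browser.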
The wormbase gene & wormbase gene summary tables are linked, and built from the following files:
File | Table | Bucket | Path (Env Var) | Filename (Env Var) | File Type |
---|---|---|---|---|---|
Gene GFF | `wormbase_gene_summary` | `DB_OPERATIONS` | `MODULE_DB_OPERATIONS_RELEASE_FILEPATH` | `GENE_GFF_FILENAME` | Zipped GFF (`.gff3.gz`) |
Gene GTF | `wormbase_gene` | `DB_OPERATIONS` | `MODULE_DB_OPERATIONS_RELEASE_FILEPATH` | `GENE_GTF_FILENAME` | Zipped GTF (`.gtf.gz`) |
Gene IDs | `wormbase_gene` | `DB_OPERATIONS` | `MODULE_DB_OPERATIONS_RELEASE_FILEPATH` | `GENE_IDS_FILENAME` | Zipped Text (`.txt.gz`) |
For more information on configuring these filenames, consult the page Tokens & Variables.
The strain variant annotation data is built from a CSV file zipped with `gzip`:
File | Table | Bucket | Path (Env Var) | Filename (Env Var) | File Type |
---|---|---|---|---|---|
SVA CSV | `strain_variant_annotation` | `DB_OPERATIONS` | `MODULE_DB_OPERATIONS_SVA_FILEPATH` | `SVA_CSVGZ_FILENAME` | Zipped CSV (`.csv.gz`) |
For more information on configuring this filename, consult the page Tokens & Variables.
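To spot-check a `.csv.gz` file before uploading it, it can be read locally without unzipping it first. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and it assumes the file is UTF-8 encoded with a header row. It is not the loader the site itself uses.

```python
import csv
import gzip
from typing import Dict, Iterator


def iter_gzipped_csv(path: str) -> Iterator[Dict[str, str]]:
    """Yield each row of a gzip-compressed CSV as a dict keyed by the header row."""
    # "rt" opens the compressed stream in text mode; newline="" is what the
    # csv module expects so that it can handle line endings itself.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)
```

Because rows are yielded one at a time, the first few records of a large file can be inspected without loading the whole file into memory.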
The phenotype database table is built from two sources:
- The `trait_file` datastore entities, recording trait files uploaded (a) by the Andersen Lab or (b) by public CaeNDR users. These can be found under GCP Datastore Studio by querying for the kind `trait_file`.
- The individual CSV trait files in the database. These are contained in the `DB_OPERATIONS` bucket, in the filepath specified by the `MODULE_DB_OPERATIONS_TRAITFILE_PUBLIC_FILEPATH` environment variable.
For more information on configuring the trait file path, consult the page Tokens & Variables.
To add a Dataset Release to the site through the Admin panel, you will first have to upload the release files to:
{ MODULE_SITE_BUCKET_PUBLIC_NAME }/dataset_release/{ SPECIES }/{ RELEASE }
Originally, the uploaded `{ RELEASE }` folder was required to use the file and directory structure described in the AndersenLab dry guide. I believe this is still the case, but the most up-to-date version of the structure is in the code itself. For more information, please consult the section Data Locations on the page Dataset Releases, and/or the `ReportType` object `V2` in the Dataset Release model file.
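For reference, the upload location above can be assembled from its three parts as in the sketch below. The function name and the example values used with it are hypothetical. The real bucket name comes from the `MODULE_SITE_BUCKET_PUBLIC_NAME` variable, and the format of the release identifier is not specified on this page.

```python
def dataset_release_prefix(public_bucket: str, species: str, release: str) -> str:
    """Return the bucket-relative folder where a release's files are uploaded."""
    return f"{public_bucket}/dataset_release/{species}/{release}"
```

For instance, `dataset_release_prefix("example-public-bucket", "c_elegans", "RELEASE_ID")` gives `example-public-bucket/dataset_release/c_elegans/RELEASE_ID`.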
Strain photos should be uploaded to the `PHOTOS` bucket under the appropriate species folder, and named using the format `{ STRAIN }.jpg`. If there are multiple photos of one strain, subsequent photos may be named `{ STRAIN }_2.jpg`, `{ STRAIN }_3.jpg`, etc. The full filename template is:
Name | Bucket | Filepath |
---|---|---|
First Photo | `PHOTOS` | `{ SPECIES }/{ STRAIN }.jpg` |
Subsequent Photos | `PHOTOS` | `{ SPECIES }/{ STRAIN }_N.jpg` |
For example, if you want to upload two photos of the (fictional) C. elegans strain "ABC1234", then you would upload them as:
Name | Bucket | Filepath |
---|---|---|
Picture 1 | `PHOTOS` | `c_elegans/ABC1234.jpg` |
Picture 2 | `PHOTOS` | `c_elegans/ABC1234_2.jpg` |
For more information on the tokens & variables in these path names, see the Tokens & Variables page.
When a photo is uploaded to this bucket, the `img_thumb_gen` module will automatically create a thumbnail version of this photo, with the suffix `.thumb` added before the file extension. Using the above example, `img_thumb_gen` should create the following additional files for you:
Name | Bucket | Filepath |
---|---|---|
Picture 1 Thumbnail | `PHOTOS` | `c_elegans/ABC1234.thumb.jpg` |
Picture 2 Thumbnail | `PHOTOS` | `c_elegans/ABC1234_2.thumb.jpg` |
For more information on buckets, see __.
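The strain photo naming rules can be summarized in a few lines of code. The sketch below reproduces the naming convention from the tables above. It is not the `img_thumb_gen` implementation, and both function names are hypothetical.

```python
import posixpath


def strain_photo_path(species: str, strain: str, index: int = 1) -> str:
    """Return the bucket-relative path for the Nth photo of a strain (1-based)."""
    if index < 1:
        raise ValueError("index must be 1 or greater")
    # The first photo has no numeric suffix; later photos use _2, _3, ...
    suffix = "" if index == 1 else f"_{index}"
    return f"{species}/{strain}{suffix}.jpg"


def thumbnail_path(photo_path: str) -> str:
    """Return the expected thumbnail path: '.thumb' inserted before the extension."""
    root, ext = posixpath.splitext(photo_path)
    return f"{root}.thumb{ext}"
```

With the fictional strain from the example, `strain_photo_path("c_elegans", "ABC1234", 2)` gives `c_elegans/ABC1234_2.jpg`, and passing that result to `thumbnail_path` gives `c_elegans/ABC1234_2.thumb.jpg`.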
These photos are currently contained in the "public" bucket, under the path `profile/photos`.
To add a new photo, upload an appropriate image file to this folder, and make its name the same as the entity ID for the `profile` it's for. For more information on adding a new `profile`, see __.
BAM and BAI files are stored in:
File Type | Bucket | Path |
---|---|---|
BAM | `MODULE_SITE_BUCKET_PRIVATE_NAME` | `bam/{ SPECIES }/{ STRAIN }.bam` |
BAI | `MODULE_SITE_BUCKET_PRIVATE_NAME` | `bam/{ SPECIES }/{ STRAIN }.bam.bai` |
For more information on buckets, see __. For more information on the tokens & variables in these path names, see the Tokens & Variables page.
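Since the BAI path is always the BAM path with `.bai` appended, the two can be derived together. The sketch below is illustrative only. `AlignmentPaths` and `alignment_paths` are hypothetical names, and the paths are relative to the private bucket.

```python
from typing import NamedTuple


class AlignmentPaths(NamedTuple):
    bam: str
    bai: str


def alignment_paths(species: str, strain: str) -> AlignmentPaths:
    """Return the bucket-relative BAM path and its matching BAI index path."""
    bam = f"bam/{species}/{strain}.bam"
    # The index file sits next to the BAM file, with '.bai' appended.
    return AlignmentPaths(bam=bam, bai=f"{bam}.bai")
```

For the fictional strain used earlier, `alignment_paths("c_elegans", "ABC1234")` gives `bam/c_elegans/ABC1234.bam` and `bam/c_elegans/ABC1234.bam.bai`.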