Add a dataset - legumeinfo/ZZBrowse GitHub Wiki

How to add a dataset

The steps below do not necessarily have to be done in this order:

  1. Make sure that your annotations file exists on the LIS data store.

  2. Make sure your genetic marker files (in .gff3.gz and .gff3.gz.tbi format) and raw GWAS and/or QTL data files (in .tsv.gz format) exist on the LIS data store. ZZBrowse will use these to generate datasets from your raw data.

  3. To do: more thorough explanation of the data generation process ("Poor man's DSCensor").

  • Run the first part of datasets.R to scan the data store to create lists of GWAS and QTL data files and marker files.
  • The second part of datasets.R generates tables of traits and their ontology codes.
  • Back up, then delete any outdated ZZBrowse GWAS and QTL datasets from www/config/data.
  • Launch ZZBrowse (or restart shiny-server). On startup it detects that the datasets deleted in the previous bullet no longer exist, and regenerates them from the data store. This can take a long time, but only needs to be done once (per data update or disaster).
  • Also run combine-gwas-qtl.R for each species to generate the combined GWAS-QTL datasets.
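
The "back up, then delete" bullet above can be sketched in base R as follows. This is a minimal illustration, not part of ZZBrowse: the helper name backup_and_remove, the backup directory layout, and the file name pattern are all hypothetical, so adjust them to whatever your generated datasets are actually called.

```r
# Hypothetical helper (not part of ZZBrowse): copy matching files into a
# timestamped backup directory, then delete the originals only if every
# copy succeeded.
backup_and_remove <- function(data_dir, backup_root, pattern) {
  files <- list.files(data_dir, pattern = pattern, full.names = TRUE)
  files <- files[!dir.exists(files)]  # skip subdirectories
  if (length(files) == 0) return(invisible(character(0)))

  backup_dir <- file.path(backup_root,
                          format(Sys.time(), "backup-%Y%m%d-%H%M%S"))
  dir.create(backup_dir, recursive = TRUE, showWarnings = FALSE)

  copied <- file.copy(files, backup_dir, overwrite = FALSE)
  if (!all(copied)) stop("Backup incomplete; nothing was deleted.")

  file.remove(files)
  invisible(file.path(backup_dir, basename(files)))
}

# Example call (the pattern is an assumption about your file names):
# backup_and_remove("www/config/data", "backups", "GWAS|QTL")
```

Deleting only after all copies succeed means a failed backup leaves www/config/data untouched, so ZZBrowse will not start a long regeneration unexpectedly.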
  4. If your organism file does not already exist, create it in the organisms subdirectory.
    Line 1 - the organism display name
    Line 2 - its chromosome lengths, either numeric or in the form name:length
    Line 3 - forms of the organism name: Genus species,G.species,Gensp
    Line 4 - URL or local file path of the annotations file (from step 1)
    Line 5 - full chromosome name format, as in the annotations file. ZZBrowse will automatically create the short display format and matching regex from this.
    Line 6 - base URL for Services API genomic linkage queries
    Line 7 - tags for constructing annotations table: strand column name, forward strand code, reverse strand code, start-of-gene column name, end-of-gene column name, URL format for returning gene links, gene id column name (to plug into URL format), gene name column name, chromosome column name, gene description column name
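
As a sanity check before launching ZZBrowse, the seven-line layout above could be verified with a short R sketch like the one below. The function names are hypothetical and this is not how ZZBrowse itself reads the file. It also assumes that the entries on line 2 are comma-separated, which the description above only shows explicitly for line 3.

```r
# Hypothetical check of an organism file (not ZZBrowse's own parser).

# Line 2: plain lengths ("1000,2000") or name:length pairs
# ("Chr01:1000,Chr02:2000"). Comma separation is an assumption.
parse_chr_lengths <- function(line) {
  parts <- trimws(strsplit(line, ",", fixed = TRUE)[[1]])
  has_name <- grepl(":", parts, fixed = TRUE)
  lengths <- suppressWarnings(as.numeric(sub("^.*:", "", parts)))
  if (anyNA(lengths)) stop("Line 2 contains a non-numeric chromosome length.")
  names(lengths) <- ifelse(has_name, sub(":.*$", "", parts),
                           as.character(seq_along(parts)))
  lengths
}

check_organism_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  if (length(lines) < 7) stop("Expected 7 lines, found ", length(lines))
  list(
    display_name    = lines[1],
    chr_lengths     = parse_chr_lengths(lines[2]),
    name_forms      = trimws(strsplit(lines[3], ",", fixed = TRUE)[[1]]),
    annotations     = lines[4],
    chr_name_format = lines[5],
    services_url    = lines[6],
    annotation_tags = lines[7]  # kept raw; its delimiter is not assumed here
  )
}
```

Calling check_organism_file() on a new file and inspecting the returned list is a quick way to spot a missing line or a mistyped chromosome length.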

  5. In www/config/datasetProperties.csv, add a line for each of your new GWAS and/or QTL datasets.
    dataset = the dataset's display name.
    chrColumn = which column in the dataset contains the chromosome name. Note that this must begin with "chr" (case-insensitive).
    bpColumn = which column contains the SNP position (for GWAS data) or interval center position (for QTL data).
    traitCol = which column contains the trait or phenotype.
    yAxisColumn = which column contains the p-value (or other significance value or score).
    logP = whether to use -log10(yAxisColumn) in the charts (generally TRUE for p-values, FALSE for others).
    axisLim = whether to specify hard y-axis limits on the charts (always FALSE for our data).
    axisMin = hard bottom of y-axis (or 0 if axisLim = FALSE).
    axisMax = hard top of y-axis (or 1 if axisLim = FALSE).
    organism = the species to which the dataset refers.
    plotAll = whether all data are for the same trait (probably always FALSE for our data).
    supportInterval = whether to support interval data, as for QTL data. Set the remaining columns to something meaningful if supportInterval is TRUE:
    SIyAxisColumn = which column contains the significance value for interval data ("val" for those we generate on the fly).
    SIbpStart = which column contains the start position for interval data.
    SIbpEnd = which column contains the end position for interval data.
    SIaxisLimBool = whether to specify hard y-axis limits for interval data (always FALSE for our data).
    SIaxisMin = hard bottom of interval y-axis (or 0 if SIaxisLimBool = FALSE).
    SIaxisMax = hard top of interval y-axis (or 1 if SIaxisLimBool = FALSE).
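
To make the column list above concrete, here is a minimal R sketch that builds one GWAS row and one QTL row and appends them to a CSV file. The helper names, the dataset names, and the column names inside the example data ("chromosome", "position", "p_value", and so on) are hypothetical. The sketch also assumes that the columns in datasetProperties.csv appear in the order listed above, so check the header of your own file first.

```r
# Hypothetical helper: one row of datasetProperties.csv, using the defaults
# described above (no hard axis limits, plotAll = FALSE).
new_dataset_row <- function(dataset, organism, chrColumn, bpColumn, traitCol,
                            yAxisColumn, logP = TRUE, supportInterval = FALSE,
                            SIbpStart = "", SIbpEnd = "") {
  data.frame(
    dataset = dataset, chrColumn = chrColumn, bpColumn = bpColumn,
    traitCol = traitCol, yAxisColumn = yAxisColumn, logP = logP,
    axisLim = FALSE, axisMin = 0, axisMax = 1,
    organism = organism, plotAll = FALSE,
    supportInterval = supportInterval,
    SIyAxisColumn = "val", SIbpStart = SIbpStart, SIbpEnd = SIbpEnd,
    SIaxisLimBool = FALSE, SIaxisMin = 0, SIaxisMax = 1,
    stringsAsFactors = FALSE
  )
}

rows <- rbind(
  # GWAS: p-values, so plot -log10(p)
  new_dataset_row("Example GWAS", "Genus species", "chromosome", "position",
                  "trait", "p_value"),
  # QTL: interval data, so supportInterval = TRUE and SI columns are filled in
  new_dataset_row("Example QTL", "Genus species", "chromosome", "center",
                  "trait", "score", logP = FALSE, supportInterval = TRUE,
                  SIbpStart = "start", SIbpEnd = "end")
)

# Append rows to a CSV, writing the header only when the file is new.
append_rows <- function(rows, path) {
  exists <- file.exists(path)
  write.table(rows, path, sep = ",", row.names = FALSE,
              col.names = !exists, append = exists)
}

# append_rows(rows, "www/config/datasetProperties.csv")
```

Trying append_rows() on a copy of the file first makes it easy to confirm that the new lines have the same number of fields as the existing ones.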

  6. Tell ZZBrowse where to find your data:
    buildGWAS.R - add its lis.datastore.info
    buildQTL.R - add its lis.datastore.info
    server.R - add it to lis.datastore.gwas or lis.datastore.qtl

For GWAS data that live outside the data store: in buildGWAS.R, add any remote GWAS URLs, specify their column names, and do any special handling.

  7. Other notes

buildQTL.R needs no p-value column as it automatically generates a column of 0s and dynamically assigns the y-value for the QTL bars.

Combined GWAS-QTL datasets: use combine-gwas-qtl.R after generating the GWAS and QTL datasets.

Also to do: investigate eliminating legumeInfo.organisms (unused?) from server.R