Example Pipeline - RCChan5/BioLockJ GitHub Wiki

In our example analysis, we investigate the differences between the microbiome of 20 rural and 20 recently urbanized subjects from the Chinese province of Hunan. For more information on this dataset, please review the analysis Fodor Lab published in the Sep 2017 issue of the journal Microbiome: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0338-7

Step 1: Prepare BioLockJ Config File

The BioLockJ project Config chinaKrakenFullDB.properties lists 5 BioModules to run (lines 3-7) + 13 properties:

  #BioModule biolockj.module.implicit.RegisterNumReads
  #BioModule biolockj.module.classifier.wgs.KrakenClassifier
  #BioModule biolockj.module.report.taxa.NormalizeTaxaTables
  #BioModule biolockj.module.report.r.R_PlotPvalHistograms
  #BioModule biolockj.module.report.r.R_PlotOtus

In addition to the 5 listed BioModules, 4 additional implicit BioModules will also run:

Mod# Module Description
1 ImportMetadata Always run 1st (for all pipelines)
2 KrakenParser Always run after KrakenClassifier
3 AddMetadataToOtuTables Always run just before the 1st R module
4 CalculateStats Always run as the 1st R module.

Key properties:

Line# Property Description
08 cluster.jobHeader Each script will run on 1 node, 16 cores, and 128GB RAM for up to 30 minutes
10 pipeline.defaultProps Default config file defines most properties – in this case copperhead.properties
12 input.dirPaths Directory path containing 40 gzipped whole genome sequencing (WGS) fastq files
18 metadata.filePath Metadata file path: chinaMetadata.tsv

BioLockJ must associate sequence files in input.dirPaths with the correct metadata row. This is done by matching sequence file names to the 1st column in the metadata file. If the Sample ID is not found in your file names, the file names must be updated. Use the following properties to ignore a file prefix or suffix when matching the sample IDs.

  • input.suffixFw
  • input.suffixRv
  • input.trimPrefix
  • input.trimSuffix

Sample IDs from 1st column of the metadata file: 081A, 082A, 083A...etc.
Sequence file names: 081A_R1.fq.gz, 082A_R1.fq.gz, 083A_R1.fq.gz...etc.

The default Config file, copperhead.properties, has its own default Config file standard.properties which defines the property input.suffixFw=_R1. As a result, all characters starting with (and including) “_R1” are ignored when matching the file name to the metadata sample ID.

Step 2: Run BioLockJ Pipeline

  > biolockj ~/chinaKrakenFullDB.properties
  • Look in the BioLockJ pipeline output directory defined by $BLJ_PROJ for a new pipeline directory named after the property file + today’s date: ~/projects/chinaKrakenFullDB_2018Apr09
  • The 5 configured modules have run in order, with the addition of 2 implicit modules (1st and last) which are added to all pipelines automatically.
  • The biolockjComplete file indicates the pipeline ran successfully.

Step 3: Review Pipeline Summary

Step 4: Download R Reports

  • Run the blj_download command to get the command needed to download the analysis.

    > blj_download
    > rsync
    

Step 5: Analyze R Reports

  • Open downloadDir on your local filesystem to review the analysis. This directory contains:
Output Description
/temp Directory where R log files are saved if R script runs locally.
/tables Directory containing the OTU tables.
/local Directory where R script output is saved if R script runs locally and r.debug=Y.
*.RData The saved R sessions for R modules run if r.saveRData=Y.
chinaKrakenFullDB.log The pipeline Java log file.
MAIN_*.R Each R script for each module that generated reports has been updated to run on your local filesystem.
*.tsv files Spreadsheets containing p-value and R^2 statistics for each OTU in the taxonomy level.
*.pdf files P-value histograms, and bar-charts or scatterplots for each OTU in the taxonomy level.
  • Each R module generates a report for each report.taxonomyLevel configured:

Open chinaKrakenFullDB_Log10_genus.pdf

  • The report begins with the unadjusted P-Value Distributions:
  • Since r.numHistogramBreaks=20 so the 1st bar represents the p-values < 0.05. The ruralUrban attribute appears significant, as indicated by the high number p-values < 0.05.
  • For each OTU, a bar-chart or scatterplot is output with adjusted parametric and non-parametric p-values formatted using in the plot header.
  • The p-value format is defined by r.pValFormat.
  • The p-adjust method is defined by rStats.pAdjustMethod.
  • P-values that meet the r.pvalCutoff threshold are highlighted with r.colorHighlight.