Example Pipeline - RCChan5/BioLockJ GitHub Wiki

In our example analysis, we investigate the differences between the microbiome of 20 rural and 20 recently urbanized subjects from the Chinese province of Hunan. For more information on this dataset, please review the analysis Fodor Lab published in the Sep 2017 issue of the journal Microbiome: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0338-7

Step 1: Prepare BioLockJ Config File

The BioLockJ project Config chinaKrakenFullDB.properties lists 5 BioModules to run (lines 3-7) + 13 properties:

  #BioModule biolockj.module.implicit.RegisterNumReads
  #BioModule biolockj.module.classifier.wgs.KrakenClassifier
  #BioModule biolockj.module.report.taxa.NormalizeTaxaTables
  #BioModule biolockj.module.report.r.R_PlotPvalHistograms
  #BioModule biolockj.module.report.r.R_PlotOtus

In addition to the 5 listed BioModules, 4 additional implicit BioModules will also run:

Mod#	Module	Description
1	ImportMetadata	Always run 1st (for all pipelines)
2	KrakenParser	Always run after KrakenClassifier
3	AddMetadataToOtuTables	Always run just before the 1st R module
4	CalculateStats	Always run as the 1st R module.

Key properties:

Line#	Property	Description
08	cluster.jobHeader	Each script will run on 1 node, 16 cores, and 128GB RAM for up to 30 minutes
10	pipeline.defaultProps	Default config file defines most properties – in this case copperhead.properties
12	input.dirPaths	Directory path containing 40 gzipped whole genome sequencing (WGS) fastq files
18	metadata.filePath	Metadata file path: chinaMetadata.tsv

BioLockJ must associate sequence files in input.dirPaths with the correct metadata row. This is done by matching sequence file names to the 1st column in the metadata file. If the Sample ID is not found in your file names, the file names must be updated. Use the following properties to ignore a file prefix or suffix when matching the sample IDs.

input.suffixFw
input.suffixRv
input.trimPrefix
input.trimSuffix

Sample IDs from 1st column of the metadata file: 081A, 082A, 083A...etc.
Sequence file names: 081A_R1.fq.gz, 082A_R1.fq.gz, 083A_R1.fq.gz...etc.

The default Config file, copperhead.properties, has its own default Config file standard.properties which defines the property input.suffixFw=_R1. As a result, all characters starting with (and including) “_R1” are ignored when matching the file name to the metadata sample ID.

Step 2: Run BioLockJ Pipeline

  > biolockj ~/chinaKrakenFullDB.properties

Look in the BioLockJ pipeline output directory defined by $BLJ_PROJ for a new pipeline directory named after the property file + today’s date: ~/projects/chinaKrakenFullDB_2018Apr09
The 5 configured modules have run in order, with the addition of 2 implicit modules (1st and last) which are added to all pipelines automatically.
The biolockjComplete file indicates the pipeline ran successfully.

Step 3: Review Pipeline Summary

Run the blj_summary command to review the pipeline execution summary.
```
> blj_summary
```
Pipeline Summary

Step 4: Download R Reports

Run the blj_download command to get the command needed to download the analysis.
```
> blj_download
> rsync
```

Step 5: Analyze R Reports

Open downloadDir on your local filesystem to review the analysis. This directory contains:

Output	Description
/temp	Directory where R log files are saved if R script runs locally.
/tables	Directory containing the OTU tables.
/local	Directory where R script output is saved if R script runs locally and r.debug=Y.
*.RData	The saved R sessions for R modules run if r.saveRData=Y.
chinaKrakenFullDB.log	The pipeline Java log file.
MAIN_*.R	Each R script for each module that generated reports has been updated to run on your local filesystem.
*.tsv files	Spreadsheets containing p-value and R^2 statistics for each OTU in the taxonomy level.
*.pdf files	P-value histograms, and bar-charts or scatterplots for each OTU in the taxonomy level.

Each R module generates a report for each report.taxonomyLevel configured:

Open chinaKrakenFullDB_Log10_genus.pdf

The report begins with the unadjusted P-Value Distributions:
Since r.numHistogramBreaks=20 so the 1st bar represents the p-values < 0.05. The ruralUrban attribute appears significant, as indicated by the high number p-values < 0.05.
For each OTU, a bar-chart or scatterplot is output with adjusted parametric and non-parametric p-values formatted using in the plot header.
The p-value format is defined by r.pValFormat.
The p-adjust method is defined by rStats.pAdjustMethod.
P-values that meet the r.pvalCutoff threshold are highlighted with r.colorHighlight.