running geneshot - Golob-Minot/geneshot GitHub Wiki
To run geneshot
, you must be able to describe your input data, provide additional metadata, and select your desired analysis options.
The raw data consumed by geneshot
is paired-end whole-genome shotgun metagenomic sequencing datasets. To unpack that rather verbose description, the data:
- Originates from a mixture of microbes
- DNA has been extracted from that biological specimen
- Extracted DNA has been randomly fragmented in some way
- DNA fragments have been sequenced on an Illumina sequencer (or the like)
- Paired-end reads are available in GZIP-compressed FASTQ format, with one file for the 'forward' reads and another for the 'reverse' reads
- Optionally, the 'index' barcode reads may be available in a second pair of compressed FASTQ files
In order to process these datasets using geneshot
, you must make a manifest file which describes the location of every pair of FASTQ files for each biological specimen. This manifest file must be formatted as a CSV (comma-separated values) table, similar to the example below.
specimen | R1 | R2 | I1 | I2 | etc. |
---|---|---|---|---|---|
sample1 | <> | <> | <> | <> | <> |
sample2 | <> | <> | <> | <> | <> |
sample3 | <> | <> | <> | <> | <> |
sample4 | <> | <> | <> | <> | <> |
Each of the values in the R1
and R2
columns must be filled in with the PATH to the appropriate file. The I1
, I2
, and additional columns are optional and not required for geneshot
to run.
Note on CSV Files: When saving tables in CSV format using Microsoft Excel, it is very common for an invisible character called a carriage return to be silently inserted at the end of every line, which can break some of the downstream processing. To easily fix this, you can use the dos2unix
function available on most computers. Just run the command dos2unix MANIFEST.csv
(if your manifest is named 'MANIFEST.csv') and all of the potentially dangerous carriage returns will be removed.
Presumably, you have some additional information about each of the biological samples for which you have performed metagenomic sequencing. We tend to refer to these sample descriptors ('treatment' vs. 'control', etc.) as metadata. You can fill in this additional metadata as additional columns in the manifest described above (instead of 'etc.').
Every piece of information that you want to provide to geneshot
will be in the form of 'flags' which are added to the nextflow run Golob-Minot/geneshot
command. For example, to add a manifest file using the --manifest
flag, we would use the command nextflow run Golob-Minot/geneshot --manifest PATH_TO_MANIFEST.csv
.
In practice, you will likely end up specifying a large handful of flags with running geneshot
. For good record keeping, we recommend that you make a small BASH script (typically named run.sh
) in the folder used for your analysis which contains the command you use to run geneshot
. That way you don't have to remember all of the flags you used to re-run an analysis, you can just re-run run.sh
.
A small example run script might look like:
nextflow \
run \
Golob-Minot/geneshot \
--manifest microbiome_expt_1.csv \
--output output/ \
-w work/ \
-resume
You would then run that script from the command line with the command:
bash run.sh
You will notice that the nextflow run Golob-Minot/nextflow ...
command is written across multiple lines in the BASH script above. This is enabled by the escape character (\
), which instructs the BASH interpreter to ignore whatever character comes next. In this case, the escape character is escaping a newline, and so BASH treats the entire block of code as a single line. Note that if there were a space after the escape, this would fail. So if you are emulating the multi-line format seen above, make sure that you don't include any spaces after the \
.
- Output Folder: All output data is written to a folder specified by
--output
- Preprocessing: By default,
geneshot
performs a handful of preprocessing steps on the raw WGS data prior to genome assembly. These steps can be customized with a handful of different options, or turned off entirely. - Working Directory: All intermediate files generated by
geneshot
must be stored in a working directory which is specified with-work-dir
or-w
-
De novo Assembly: The steps that
geneshot
takes to perform de novo assembly, identify microbial genes, and annotate those genes can be customized with a variety of assembly options - Short Read Alignment: The steps that
geneshot
takes to align short WGS reads against the catalog of microbial genes and quantify abundance can be customized with a variety of alignment options - Grouping Genes:
geneshot
greatly enhances the efficiency of gene-level metagenomic analysis by grouping genes by co-abundance, and the used can customize the parameters used for exhaustive linkage clustering - Statistical Analysis: The user has the option to perform streamlined statistical analysis within the
geneshot
pipeline using the --formula flag
The following help message can be generated by running nextflow run Golob-Minot/geneshot --help
Usage:
nextflow run Golob-Minot/geneshot <ARGUMENTS>
Required Arguments:
--manifest CSV file listing samples (see below)
Options:
--output Folder to place analysis outputs (default ./results)
--output_prefix Text used as a prefix for summary HDF5 output files (default: geneshot)
--nopreprocess If specified, omit the preprocessing steps (removing adapters and human sequences)
--savereads If specified, save the preprocessed reads to the output folder (inside qc/)
-w Working directory. Defaults to `./work`
For preprocessing:
--hg_index_url URL for human genome index, defaults to current HG
--hg_index Cached copy of the bwa indexed human genome, TGZ format
--adapter_F Forward sequencing adapter sequence (to be removed)
--adapter_R Reverse sequencing adapter sequence (to be removed)
(Adapter sequences default to nextera adapters)
--min_hg_align_score Minimum alignment score for human genome (default 30)
For Assembly:
--gene_fasta (optional) Compressed FASTA with pre-generated catalog of microbial genes.
If provided, along with --assembly_folder then the entire de novo assembly process will be omitted.
--assembly_folder (optional) Folder containing de novo assemblies of every specimen.
If provided along with --gene_fasta, then the entire de novo assembly process will be omitted.
--cags_csv (optional) If a pre-generated --gene_fasta is provided, CAG groups must be provided
in "*.csv.gz" format from the geneshot run which generated that gene catalog.
--phred_offset for spades. Default 33.
--min_identity Amino acid identity cutoff used to combine similar genes (default: 90)
--min_coverage Length cutoff used to combine similar genes (default: 50) (linclust)
For Annotation:
--noannot If specified, disable annotation for taxonomy or function.
Individual annotations can also be disabled by, e.g., setting --eggnog_db false
--gene_shard_size How many genes to include in a shard for annotation. (default: 100000)
--taxonomic_dmnd Database used for taxonomic annotation (default: false)
(Data available at s3://fh-ctr-public-reference-data/tool_specific_data/geneshot/2020-01-15-geneshot/DB.refseq.tax.dmnd)
--tax_evalue Maximum E-value threshold for taxonomic annotation by DIAMOND alignment
--ncbi_taxdump Reference describing the NCBI Taxonomy
(default: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz)
--eggnog_dmnd One of two databases used for functional annotation with eggNOG (default: false)
(Data available at s3://fh-ctr-public-reference-data/tool_specific_data/geneshot/2020-06-17-eggNOG-v5.0/eggnog_proteins.dmnd)
--eggnog_db One of two databases used for functional annotation with eggNOG (default: false)
(Data available at s3://fh-ctr-public-reference-data/tool_specific_data/geneshot/2020-06-17-eggNOG-v5.0/eggnog.db)
--eggnog_evalue Maximum E-value threshold for eggNOG ortholog alignment (default: 0.00001)
For Alignment:
--dmnd_min_identity Amino acid identity cutoff used to align short reads (default: 90) (DIAMOND)
--dmnd_min_coverage Query coverage cutoff used to align short reads (default: 50) (DIAMOND)
--dmnd_top_pct Keep top X% of alignments for each short read (default: 1) (DIAMOND)
--dmnd_min_score Minimum score for short read alignment (default: 20) (DIAMOND)
--gencode Genetic code used for conceptual translation (default: 11) (DIAMOND)
--sd_mean_cutoff Ratio of standard deviation / mean depth of sequencing used to filter genes (default: 3.0) (FAMLI)
--famli_batchsize Number of alignments to deduplicate in batches (default: 10000000) (FAMLI)
--famli_folder Optional: Specify a folder containing previously-computed FAMLI outputs
For CAGs:
--distance_metric Distance metric used to group genes by co-abundance (default: cosine)
--distance_threshold Distance threshold used to group genes by co-abundance (default: 0.25)
--min_contig_size Only cluster genes which are found on contigs with at least this number of genes (default: 3; 0 to disable)
--min_contig_depth Minimum depth of sequencing per contig (default: 5; 0 to disable)
--min_specimens Only cluster genes which are detected by alignment in at least this number of specimens (default: 1)
For Statistical Analysis:
--formula Optional formula used to estimate associations with CAG relative abundance
--fdr_method FDR method used to calculate q-values for associations (default: 'fdr_bh')
--corncob_batches Number of parallel processes to use processing each formula
For Compositional Analysis:
--composition When included, metaPhlAn2 will be run on all specimens
Batchfile:
The manifest is a CSV with a header indicating which samples correspond to which files.
The file must contain a column `specimen`. This can be repeated.
Data is only accepted as paired reads.
Reads are specified by columns, `R1` and `R2`.
If index reads are provided, the column titles should be 'I1' and 'I2'