running geneshot - Golob-Minot/geneshot GitHub Wiki

To run geneshot, you must be able to describe your input data, provide additional metadata, and select your desired analysis options.

Input Data

The raw data consumed by geneshot is paired-end whole-genome shotgun metagenomic sequencing datasets. To unpack that rather verbose description, the data:

Originates from a mixture of microbes
DNA has been extracted from that biological specimen
Extracted DNA has been randomly fragmented in some way
DNA fragments have been sequenced on an Illumina sequencer (or the like)
Paired-end reads are available in GZIP-compressed FASTQ format, with one file for the 'forward' reads and another for the 'reverse' reads
Optionally, the 'index' barcode reads may be available in a second pair of compressed FASTQ files

In order to process these datasets using geneshot, you must make a manifest file which describes the location of every pair of FASTQ files for each biological specimen. This manifest file must be formatted as a CSV (comma-separated values) table, similar to the example below.

specimen	R1	R2	I1	I2	etc.
sample1	<>	<>	<>	<>	<>
sample2	<>	<>	<>	<>	<>
sample3	<>	<>	<>	<>	<>
sample4	<>	<>	<>	<>	<>

Each of the values in the R1 and R2 columns must be filled in with the PATH to the appropriate file. The I1, I2, and additional columns are optional and not required for geneshot to run.

Note on CSV Files: When saving tables in CSV format using Microsoft Excel, it is very common for an invisible character called a carriage return to be silently inserted at the end of every line, which can break some of the downstream processing. To easily fix this, you can use the dos2unix function available on most computers. Just run the command dos2unix MANIFEST.csv (if your manifest is named 'MANIFEST.csv') and all of the potentially dangerous carriage returns will be removed.

Metadata

Presumably, you have some additional information about each of the biological samples for which you have performed metagenomic sequencing. We tend to refer to these sample descriptors ('treatment' vs. 'control', etc.) as metadata. You can fill in this additional metadata as additional columns in the manifest described above (instead of 'etc.').

Running `geneshot` with additional options

Every piece of information that you want to provide to geneshot will be in the form of 'flags' which are added to the nextflow run Golob-Minot/geneshot command. For example, to add a manifest file using the --manifest flag, we would use the command nextflow run Golob-Minot/geneshot --manifest PATH_TO_MANIFEST.csv.

In practice, you will likely end up specifying a large handful of flags with running geneshot. For good record keeping, we recommend that you make a small BASH script (typically named run.sh) in the folder used for your analysis which contains the command you use to run geneshot. That way you don't have to remember all of the flags you used to re-run an analysis, you can just re-run run.sh.

A small example run script might look like:

nextflow \
    run \
    Golob-Minot/geneshot \
        --manifest microbiome_expt_1.csv \
        --output output/ \
        -w work/ \
        -resume

You would then run that script from the command line with the command:

bash run.sh

BASH Script Gottchas

You will notice that the nextflow run Golob-Minot/nextflow ... command is written across multiple lines in the BASH script above. This is enabled by the escape character (\), which instructs the BASH interpreter to ignore whatever character comes next. In this case, the escape character is escaping a newline, and so BASH treats the entire block of code as a single line. Note that if there were a space after the escape, this would fail. So if you are emulating the multi-line format seen above, make sure that you don't include any spaces after the \.

Analysis Options

Output Folder: All output data is written to a folder specified by --output
Preprocessing: By default, geneshot performs a handful of preprocessing steps on the raw WGS data prior to genome assembly. These steps can be customized with a handful of different options, or turned off entirely.
Working Directory: All intermediate files generated by geneshot must be stored in a working directory which is specified with -work-dir or -w
De novo Assembly: The steps that geneshot takes to perform de novo assembly, identify microbial genes, and annotate those genes can be customized with a variety of assembly options
Short Read Alignment: The steps that geneshot takes to align short WGS reads against the catalog of microbial genes and quantify abundance can be customized with a variety of alignment options
Grouping Genes: geneshot greatly enhances the efficiency of gene-level metagenomic analysis by grouping genes by co-abundance, and the used can customize the parameters used for exhaustive linkage clustering
Statistical Analysis: The user has the option to perform streamlined statistical analysis within the geneshot pipeline using the --formula flag

Full Options

The following help message can be generated by running nextflow run Golob-Minot/geneshot --help

    Usage:

    nextflow run Golob-Minot/geneshot <ARGUMENTS>
    
    Required Arguments:
      --manifest            CSV file listing samples (see below)

    Options:
      --output              Folder to place analysis outputs (default ./results)
      --output_prefix       Text used as a prefix for summary HDF5 output files (default: geneshot)
      --nopreprocess        If specified, omit the preprocessing steps (removing adapters and human sequences)
      --savereads           If specified, save the preprocessed reads to the output folder (inside qc/)
      -w                    Working directory. Defaults to `./work`

    For preprocessing:
      --hg_index_url        URL for human genome index, defaults to current HG
      --hg_index            Cached copy of the bwa indexed human genome, TGZ format
      --adapter_F           Forward sequencing adapter sequence (to be removed)
      --adapter_R           Reverse sequencing adapter sequence (to be removed)
                              (Adapter sequences default to nextera adapters)
      --min_hg_align_score  Minimum alignment score for human genome (default 30)

    For Assembly:
      --gene_fasta          (optional) Compressed FASTA with pre-generated catalog of microbial genes.
                            If provided, along with --assembly_folder then the entire de novo assembly process will be omitted.
      --assembly_folder     (optional) Folder containing de novo assemblies of every specimen.
                            If provided along with --gene_fasta, then the entire de novo assembly process will be omitted.
      --cags_csv            (optional) If a pre-generated --gene_fasta is provided, CAG groups must be provided
                            in "*.csv.gz" format from the geneshot run which generated that gene catalog.
      --phred_offset        for spades. Default 33.
      --min_identity        Amino acid identity cutoff used to combine similar genes (default: 90)
      --min_coverage        Length cutoff used to combine similar genes (default: 50) (linclust)

    For Annotation:
      --noannot             If specified, disable annotation for taxonomy or function.
                            Individual annotations can also be disabled by, e.g., setting --eggnog_db false
      --gene_shard_size     How many genes to include in a shard for annotation. (default: 100000)
      --taxonomic_dmnd      Database used for taxonomic annotation (default: false)
                            (Data available at s3://fh-ctr-public-reference-data/tool_specific_data/geneshot/2020-01-15-geneshot/DB.refseq.tax.dmnd)
      --tax_evalue          Maximum E-value threshold for taxonomic annotation by DIAMOND alignment
      --ncbi_taxdump        Reference describing the NCBI Taxonomy
                            (default: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz)
      --eggnog_dmnd         One of two databases used for functional annotation with eggNOG (default: false)
                            (Data available at s3://fh-ctr-public-reference-data/tool_specific_data/geneshot/2020-06-17-eggNOG-v5.0/eggnog_proteins.dmnd)
      --eggnog_db           One of two databases used for functional annotation with eggNOG (default: false)
                            (Data available at s3://fh-ctr-public-reference-data/tool_specific_data/geneshot/2020-06-17-eggNOG-v5.0/eggnog.db)
      --eggnog_evalue       Maximum E-value threshold for eggNOG ortholog alignment (default: 0.00001)
    
    For Alignment:
      --dmnd_min_identity   Amino acid identity cutoff used to align short reads (default: 90) (DIAMOND)
      --dmnd_min_coverage   Query coverage cutoff used to align short reads (default: 50) (DIAMOND)
      --dmnd_top_pct        Keep top X% of alignments for each short read (default: 1) (DIAMOND)
      --dmnd_min_score      Minimum score for short read alignment (default: 20) (DIAMOND)
      --gencode             Genetic code used for conceptual translation (default: 11) (DIAMOND)
      --sd_mean_cutoff      Ratio of standard deviation / mean depth of sequencing used to filter genes (default: 3.0) (FAMLI)
      --famli_batchsize     Number of alignments to deduplicate in batches (default: 10000000) (FAMLI)
      --famli_folder        Optional: Specify a folder containing previously-computed FAMLI outputs

    For CAGs:
      --distance_metric     Distance metric used to group genes by co-abundance (default: cosine)
      --distance_threshold  Distance threshold used to group genes by co-abundance (default: 0.25)
      --min_contig_size     Only cluster genes which are found on contigs with at least this number of genes (default: 3; 0 to disable)
      --min_contig_depth    Minimum depth of sequencing per contig (default: 5; 0 to disable)
      --min_specimens       Only cluster genes which are detected by alignment in at least this number of specimens (default: 1)
      
    For Statistical Analysis:
      --formula             Optional formula used to estimate associations with CAG relative abundance
      --fdr_method          FDR method used to calculate q-values for associations (default: 'fdr_bh')
      --corncob_batches     Number of parallel processes to use processing each formula

    For Compositional Analysis:
      --composition         When included, metaPhlAn2 will be run on all specimens

    Batchfile:
      The manifest is a CSV with a header indicating which samples correspond to which files.
      The file must contain a column `specimen`. This can be repeated. 
      Data is only accepted as paired reads.
      Reads are specified by columns, `R1` and `R2`.
      If index reads are provided, the column titles should be 'I1' and 'I2'