Summarizing genes by functional annotation - Golob-Minot/geneshot GitHub Wiki
For some projects it can be helpful to extract a simplified summary containing the proportion of gene copies from each specimen which have a given functional annotation. In the future these outputs may be included in the base geneshot output, but until that time we have provided a small utility to generate those summary files.
The entrypoint to perform this function is called gene_abund.nf, which can be
run as follows:
    nextflow run Golob-Minot/geneshot/gene_abund.nf <ARGUMENTS>
    
    Options:
      --results_hdf         Location for results.hdf5 generated by geneshot
      --details_hdf         Location for details.hdf5 generated by geneshot
      --genes_fasta         Location for input 'genes.fasta.gz'
      --output_folder       Location for output files
      --output_prefix       Prefix for output files
      --query               Query string to use to subset eggNOG gene descriptions
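When the utility is launched from a script rather than typed by hand, it can help to assemble the argument list programmatically. The sketch below only builds and prints the command using the options listed above; the helper name `build_gene_abund_command` and every path and prefix shown are hypothetical placeholders, not values from geneshot.

```python
import shlex


def build_gene_abund_command(results_hdf, details_hdf, genes_fasta,
                             output_folder, output_prefix, query):
    """Return the argument list for a gene_abund.nf run (hypothetical helper)."""
    return [
        "nextflow", "run", "Golob-Minot/geneshot/gene_abund.nf",
        "--results_hdf", results_hdf,
        "--details_hdf", details_hdf,
        "--genes_fasta", genes_fasta,
        "--output_folder", output_folder,
        "--output_prefix", output_prefix,
        "--query", query,
    ]


# All paths and the prefix below are placeholders for illustration only.
cmd = build_gene_abund_command(
    results_hdf="output/results.hdf5",
    details_hdf="output/details.hdf5",
    genes_fasta="output/genes.fasta.gz",
    output_folder="gene_abund_output",
    output_prefix="pectate_lyase",
    query="Pectate lyase",
)

# shlex.join quotes arguments containing spaces, such as the query string.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would start the run; keeping the arguments as a list avoids manual quoting of a multi-word query.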
The utility will extract all of the genes which contain the --query string as
part of the eggNOG description. For example, using --query "Pectate lyase" will
include the annotations Pectate lyase as well as Pectate lyase superfamily protein.
When multiple annotations match the query string, each will be reported independently
to the user.
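The matching behavior described above can be illustrated with a minimal sketch. It assumes a plain, case-sensitive substring match (the exact matching rule used by gene_abund.nf is not stated here), and the function name, data structure, and gene IDs are all hypothetical.

```python
from collections import defaultdict


def group_by_matching_annotation(descriptions, query):
    """Group gene IDs by eggNOG description for descriptions containing `query`.

    `descriptions` maps a gene ID to its eggNOG description (hypothetical
    structure). Assumes a case-sensitive substring match.
    """
    matches = defaultdict(list)
    for gene_id, description in descriptions.items():
        if query in description:
            # Each distinct description is kept as its own group.
            matches[description].append(gene_id)
    return dict(matches)


# Toy data for illustration only; the gene IDs are made up.
toy_descriptions = {
    "gene_1": "Pectate lyase",
    "gene_2": "Pectate lyase superfamily protein",
    "gene_3": "ABC transporter",
    "gene_4": "Pectate lyase",
}

grouped = group_by_matching_annotation(toy_descriptions, "Pectate lyase")
for annotation, gene_ids in grouped.items():
    print(annotation, gene_ids)
```

Both annotations containing the query are retained as separate groups, mirroring how each matching annotation is reported independently.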
The outputs generated by this utility are:
- $OUTPUT_PREFIX.genes.csv: A CSV table with the annotations for all genes which match the --query string, including the gene length, the CAG it is assigned to, and taxonomic annotation
- $OUTPUT_PREFIX.genes.fasta.gz: Amino acid sequences for all identified genes in FASTA format
- $OUTPUT_PREFIX.long.csv.gz: A long-format CSV table listing the abundance of every gene across every specimen, including the depth of sequencing, number of reads aligned, coverage, etc.
- $OUTPUT_PREFIX.manifest.csv: The manifest input by the user for this dataset (to provide all required specimen annotations)
- $OUTPUT_PREFIX.wide.csv.gz: A wide-format CSV with a single row per specimen and a single column per eggNOG annotation. The value in each cell is the proportion of gene copies from the specimen which were given that annotation. In this wide-format summary, all genes with the same eggNOG annotation are combined to provide a single estimate.
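The relationship between the long and wide tables can be sketched as follows. This is an illustration under stated assumptions, not the utility's actual code: it assumes that genes sharing an annotation are combined by summing their per-specimen proportions, and the field names (`specimen`, `gene`, `annotation`, `prop`) and all values are hypothetical toy data.

```python
from collections import defaultdict


def long_to_wide(long_rows):
    """Sum per-gene proportions into one value per (specimen, annotation).

    Assumes combining genes means adding their proportions; the field
    names used here are hypothetical, not the real CSV headers.
    """
    wide = defaultdict(lambda: defaultdict(float))
    for row in long_rows:
        wide[row["specimen"]][row["annotation"]] += row["prop"]
    # Convert nested defaultdicts to plain dicts: one row per specimen.
    return {specimen: dict(columns) for specimen, columns in wide.items()}


# Toy long-format rows for illustration only.
toy_long = [
    {"specimen": "S1", "gene": "gene_1", "annotation": "Pectate lyase", "prop": 0.25},
    {"specimen": "S1", "gene": "gene_4", "annotation": "Pectate lyase", "prop": 0.125},
    {"specimen": "S1", "gene": "gene_2",
     "annotation": "Pectate lyase superfamily protein", "prop": 0.5},
    {"specimen": "S2", "gene": "gene_1", "annotation": "Pectate lyase", "prop": 0.0625},
]

wide = long_to_wide(toy_long)
for specimen, columns in wide.items():
    print(specimen, columns)
```

In this toy case the two genes annotated as Pectate lyase in specimen S1 collapse into a single cell, while the superfamily annotation keeps its own column.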