Summarizing genes by functional annotation - Golob-Minot/geneshot GitHub Wiki
For some projects it can be helpful to extract a simplified summary containing the proportion of gene copies from each specimen which have a given functional annotation. In the future these outputs may be included in the base geneshot output, but until that time we have provided a small utility to generate those summary files.
The entrypoint for this utility is gene_abund.nf, which can be run as follows:
nextflow run Golob-Minot/geneshot/gene_abund.nf <ARGUMENTS>
Options:
--results_hdf Location for results.hdf5 generated by geneshot
--details_hdf Location for details.hdf5 generated by geneshot
--genes_fasta Location for input 'genes.fasta.gz'
--output_folder Location for output files
--output_prefix Prefix for output files
--query Query string to use to subset eggNOG gene descriptions
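As a rough illustration of how the options above fit together, the sketch below assembles a complete command line from Python. Only the flag names and the entrypoint come from this page; every path and the output prefix are hypothetical placeholders, and the `build_command` helper is invented for this example.

```python
import shlex

# Hypothetical locations: replace these with the paths from your own geneshot run.
arguments = {
    "results_hdf": "output/geneshot.results.hdf5",
    "details_hdf": "output/geneshot.details.hdf5",
    "genes_fasta": "output/genes.fasta.gz",
    "output_folder": "gene_abund_output",
    "output_prefix": "pectate_lyase",
    "query": "Pectate lyase",
}


def build_command(arguments):
    """Return the nextflow command as a list of tokens (illustrative helper)."""
    command = ["nextflow", "run", "Golob-Minot/geneshot/gene_abund.nf"]
    for name, value in arguments.items():
        command += [f"--{name}", value]
    return command


command = build_command(arguments)

# shlex.join quotes values containing spaces, so the printed line can be pasted into a shell.
print(shlex.join(command))
```

Building the command as a list of tokens keeps a multi-word query such as "Pectate lyase" as a single argument, whether the list is printed for a shell or passed to `subprocess.run`.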
The utility will extract all of the genes which contain the --query string as part of the eggNOG description. For example, using --query "Pectate lyase" will include the annotations "Pectate lyase" as well as "Pectate lyase superfamily protein".
When multiple annotations match the query string, each will be reported independently to the user.
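The matching behavior described above can be pictured as a plain substring test. This is a minimal sketch, not the utility's actual code: it assumes case-sensitive matching (this page does not say whether case matters), and the function name and the non-matching description are hypothetical.

```python
def matching_annotations(descriptions, query):
    """Return the distinct descriptions that contain the query string.

    Assumes a case-sensitive substring match, which may differ from
    what gene_abund.nf actually does.
    """
    return sorted({description for description in descriptions if query in description})


# Example eggNOG descriptions; the glycosyl hydrolase entry is a made-up non-match.
descriptions = [
    "Pectate lyase",
    "Pectate lyase superfamily protein",
    "Glycosyl hydrolase family 28",
    "Pectate lyase",
]

matches = matching_annotations(descriptions, "Pectate lyase")
for annotation in matches:
    print(annotation)
```

Each distinct matching description is kept separate, mirroring the statement that multiple matching annotations are reported independently.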
The outputs generated by this utility are:
- $OUTPUT_PREFIX.genes.csv: A CSV table with the annotations for all genes which match the --query string, including the gene length, the CAG it is assigned to, and taxonomic annotation
- $OUTPUT_PREFIX.genes.fasta.gz: Amino acid sequences for all identified genes in FASTA format
- $OUTPUT_PREFIX.long.csv.gz: A long-format CSV table listing the abundance of every gene across every specimen, including the depth of sequencing, number of reads aligned, coverage, etc.
- $OUTPUT_PREFIX.manifest.csv: The manifest input by the user for this dataset (to provide all required specimen annotations)
- $OUTPUT_PREFIX.wide.csv.gz: A wide-format CSV with a single row per specimen and a single column per eggNOG annotation. The value in each cell is the proportion of genome copies from the specimen which were given the same annotation. In this wide-format summary, all genes with the same eggNOG annotation are combined to provide a single estimate
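To make the relationship between the long and wide summaries concrete, here is a small sketch of how per-gene values might be combined into one value per specimen and annotation. It assumes that "combined" means summing the per-gene proportions, which this page does not state explicitly. The record layout, specimen names, gene names, and numbers are all hypothetical and do not reflect the real column names in the output files.

```python
from collections import defaultdict

# Hypothetical long-format records: (specimen, gene, eggNOG annotation, proportion).
long_rows = [
    ("specimenA", "gene1", "Pectate lyase", 0.002),
    ("specimenA", "gene2", "Pectate lyase", 0.001),
    ("specimenA", "gene3", "Pectate lyase superfamily protein", 0.004),
    ("specimenB", "gene1", "Pectate lyase", 0.003),
]


def to_wide(rows):
    """Sum proportions per specimen and annotation (assumed combination rule)."""
    wide = defaultdict(lambda: defaultdict(float))
    for specimen, _gene, annotation, proportion in rows:
        wide[specimen][annotation] += proportion
    # Convert nested defaultdicts to plain dicts: one row per specimen.
    return {specimen: dict(columns) for specimen, columns in wide.items()}


wide = to_wide(long_rows)
for specimen, columns in sorted(wide.items()):
    print(specimen, columns)
```

In this toy input, the two "Pectate lyase" genes in specimenA collapse into a single cell, while the superfamily annotation stays in its own column.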