Output Files - Golob-Minot/geneshot GitHub Wiki
Once you've run Geneshot, you will want to look at the results. Because Geneshot runs many different types of analyses, you may have to consult this documentation in order to find the element of the results which is most important for your application.
Many of the individual aspects of the results can be found in two formats:
- A compressed text file in the output directory specified by
--output
- A table in either the
${output_prefix}.results.hdf5
or${output_prefix}.details.hdf5
output files. These are both written to the same output directory using the file name prefix specified by--output_prefix
NOTE: The file format HDF5 is used to collect all of the results because of its flexible support for combining multiple tabular datasets into a single data object. We have found that the most convenient way to read and write this data format is with the Python/Pandas library.
Example:
import pandas as pd
manifest_df = pd.read_hdf("geneshot.results.hdf5", "/manifest")
If the --savereads
flag is used, all of the preprocessed WGS FASTQ files will be written to the qc/
folder.
The table used as input for the entire analysis will be saved in the results.hdf5
under /manifest
.
specimen | R1 | R2 | I1 | I2 | batch | label1 | label2 |
---|---|---|---|---|---|---|---|
Mock__11 | data/Mock__11.R1.fastq.gz | data/Mock__11.R2.fastq.gz | data/Mock__11.I1.fastq.gz | data/Mock__11.I2.fastq.gz | mock | 0 | 1 |
Mock__11 | data/Mock__11.R1.fastq.gz | data/Mock__11.R2.fastq.gz | data/Mock__11.I1.fastq.gz | data/Mock__11.I2.fastq.gz | mock | 0 | 1 |
Mock__12 | data/Mock__12.R1.fastq.gz | data/Mock__12.R2.fastq.gz | data/Mock__12.I1.fastq.gz | data/Mock__12.I2.fastq.gz | mock | 1 | 0 |
Mock__13 | data/Mock__13.R1.fastq.gz | data/Mock__13.R2.fastq.gz | data/Mock__13.I1.fastq.gz | data/Mock__13.I2.fastq.gz | mock | 1 | 1 |
Mock__14 | data/Mock__14.R1.fastq.gz | data/Mock__14.R2.fastq.gz | data/Mock__14.I1.fastq.gz | data/Mock__14.I2.fastq.gz | mock | 0 | 0 |
Each sample can be summarized overall with a number of metrics, including:
- The number of paired-end reads input per sample (2X the number of read pairs, to be explicit)
- The number of paired-end reads which successfully align against the gene catalog
- The estimated total number of protein coding genes in a given sample (using the
breakaway
algorithm) - The number of genes detected by assembly
- The number of genes detected by alignment
For example:
/summary/all
specimen | n_reads | aligned_reads | n_genes_aligned | error | estimand | estimate | interval_lower | interval_upper | reasonable | n_genes_assembled |
---|---|---|---|---|---|---|---|---|---|---|
Mock__11 | 8076 | None | 75 | 0.0473 | richness | 75.0022 | 75.0004 | 75.0114 | False | 63 |
Mock__12 | 876 | None | 7 | 2.2399e-24 | richness | 7.0 | 7.0 | nan | False | 7 |
Mock__13 | 2492 | None | 90 | 0.4269 | richness | 90.1764 | 90.0338 | 90.9194 | False | 116 |
Mock__14 | 8252 | None | 81 | 0.0411 | richness | 81.0017 | 81.0003 | 81.0086 | False | 71 |
Mock__15 | 8952 | None | 82 | 0.0473 | richness | 82.0022 | 82.0004 | 82.0114 | False | 70 |
All of the contigs, scaffolds, gene annotations, and logs produced during de novo assembly for each specimen will be found in the assembly/<specimen>/
folder.
A CSV with information on every protein-coding sequence assembled from every sample (operationally termed 'alleles'), including the contig, position, sequencing depth, and GC content will be written to assembly/allele.assembly.metrics.csv.gz
. This table will also be available in details.hdf5
under /abund/allele/assembly/<specimen>
with one table per specimen.
Example:
contig | flag | gc_cont | gene_name | len | multi | partial | rbs_motif | rbs_spacer | specimen | start | start_type | stop | strand | catalog_gene |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
k59_0 | 1 | 0.386 | Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_1 | 43660 | 20.7292 | 0 | GGAG/GAGG | 5-10bp | Mock__15 | 510 | ATG | 716 | -1 | gene_f33b2fec_69aa |
k59_0 | 1 | 0.326 | Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_2 | 43660 | 20.7292 | 0 | GGA/GAG/AGG | 5-10bp | Mock__15 | 1002 | ATG | 1412 | 1 | gene_4f04b95b_137aa |
k59_0 | 1 | 0.4320000000000001 | Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_3 | 43660 | 20.7292 | 0 | GGA/GAG/AGG | 5-10bp | Mock__15 | 1702 | ATG | 1995 | 1 | gene_c49af40d_98aa |
k59_0 | 1 | 0.513 | Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_4 | 43660 | 20.7292 | 0 | GGA/GAG/AGG | 5-10bp | Mock__15 | 2027 | ATG | 2488 | -1 | gene_305ad6e4_154aa |
k59_0 | 1 | 0.503 | Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_5 | 43660 | 20.7292 | 0 | GGAG/GAGG | 5-10bp | Mock__15 | 2485 | ATG | 2961 | -1 | gene_2550103b_159aa |
The DIAMOND reference database for all of the deduplicated gene sequences will be written to ref/genes.dmnd
. This file can be converted into FASTA by the user if needed.
A two column CSV describing which genes were grouped into which samples will be written to ref/CAGs.csv.gz
. Columns are CAG
and gene
, with one row per gene. It will also be available in results.hdf5
under /annot/gene/cag
.
Example:
CAG | gene |
---|---|
0 | Mock__1_NODE_7510_length_263_cov_1.826923_1 |
0 | Mock__3_NODE_6989_length_270_cov_1.767442_1 |
0 | Mock__1_NODE_8655_length_250_cov_2.646154_1 |
0 | Mock__2_NODE_6791_length_271_cov_1.759259_1 |
0 | Mock__3_NODE_211_length_709_cov_5.498471_1 |
A short summary of every CAG will be available in results.hdf5
under /annot/cag/all
, containing the:
- Size (number of genes) per CAG
- Prevalence: proportion of samples in which the CAG was observed with a relative abundance > 0
- Mean abundance: Average abundance across all samples
- Std. abundance: Standard deviation of the abundance values across all samples
Example:
CAG | mean_abundance | prevalence | size | std_abundance |
---|---|---|---|---|
0.0 | 0.7048885226249695 | 0.8 | 75.0 | 0.42067644000053406 |
1.0 | 0.014859299175441265 | 0.2 | 12.0 | 0.03322640433907509 |
2.0 | 0.2802521586418152 | 0.8 | 7.0 | 0.41875842213630676 |
Abundance of every CAG in every sample in feather format will be written to abund/CAG.abund.feather
. This table will be available in results.hdf5
under /abund/cag/wide
. Abundances are calculated as the sum of the gene-level sequencing depth divided by the sum of sequencing depth for all genes in that sample.
Example:
CAG | Mock__1 | Mock__2 | Mock__4 | Mock__3 |
---|---|---|---|---|
0.0 | 0.08 | 0.12 | 0.12 | 0.11 |
1.0 | 0.08 | 0.07 | 0.02 | 0.1 |
2.0 | 0.09 | 0.01 | 0.0 | 0.0 |
3.0 | 0.0 | 0.08 | 0.02 | 0.05 |
4.0 | 0.01 | 0.04 | 0.08 | 0.01 |
Abundance of every gene in every sample in will be available in results.hdf5
under /abund/gene/long/<specimen>
. Here is an example of the information found in that table:
coverage | depth | id | length | nreads | std | specimen |
---|---|---|---|---|---|---|
1.0 | 3.42 | Mock__2_NODE_12649_length_209_cov_2.454545_1 | 69 | 6 | 0.97 | Mock__1 |
1.0 | 2.48 | Mock__1_NODE_9620_length_240_cov_2.054054_1 | 79 | 4 | 0.85 | Mock__1 |
0.66 | 2.28 | Mock__4_NODE_3651_length_340_cov_1.985965_1 | 86 | 4 | 1.78 | Mock__1 |
0.64 | 1.28 | Mock__3_NODE_10136_length_235_cov_3.166667_1 | 78 | 2 | 0.96 | Mock__1 |
1.0 | 5.69 | Mock__1_NODE_6010_length_280_cov_3.377778_2 | 32 | 6 | 0.73 | Mock__1 |
The taxonomic assignment for each gene (DIAMOND output in taxonomic assignment mode) in TSV format will be written to annot/genes.tax.aln.gz
. It will also be available in results.hdf5
under /annot/gene/tax
.
Example:
gene | tax_id | evalue |
---|---|---|
Mock__13_NODE_19_length_684_cov_3.917329_1 | 1224 | 8.8e-21 |
Mock__13_NODE_80_length_217_cov_2.308642_1 | 562 | 1.3e-29 |
Mock__13_NODE_82_length_215_cov_4.750000_1 | 2608889 | 1.2e-10 |
Mock__13_NODE_16_length_757_cov_4.584046_2 | 543 | 3.2e-51 |
Mock__13_NODE_29_length_514_cov_2.932462_1 | 1236 | 4.8e-17 |
The functional assignment for each gene (eggNOG output) in TSV format will be written to annot/genes.emapper.annotations.gz
. It will also be available in results.hdf5
under /annot/gene/eggnog
.
Example:
query_name | seed_eggNOG_ortholog | seed_ortholog_evalue | seed_ortholog_score | best_tax_level | taxonomic scope | eggNOG OGs | COG Functional cat. | eggNOG free text desc. |
---|---|---|---|---|---|---|---|---|
Mock__12_NODE_1_length_5360_cov_15.598303_1 | 1453503.AU05_02695 | 6.3e-21 | 105.9 | Proteobacteria | Bacteria | 1P1ED@1224,2FFHF@1,347EX@2 | S | Bacteriophage protein K |
Mock__12_NODE_1_length_5360_cov_15.598303_2 | 1116472.MGMO_205c00050 | 1.3999999999999996e-78 | 298.9 | Gammaproteobacteria | Bacteria | 1NZPF@1224,1SRZ6@1236,28HVF@1,2Z81Q@2 | S | Bacteriophage scaffolding protein D |
Mock__12_NODE_1_length_5360_cov_15.598303_3 | 1116472.MGMO_205c00040 | 3.3e-13 | 79.7 | Gammaproteobacteria | Bacteria | 1NKF2@1224,1ST7F@1236,2EFW9@1,339NJ@2 | S | Microvirus J protein |
Mock__12_NODE_1_length_5360_cov_15.598303_4 | 1116472.MGMO_205c00030 | 5.399999999999998e-258 | 896.3 | Gammaproteobacteria | Bacteria | 1NRFQ@1224,1SKG7@1236,28IN7@1,2Z8NM@2 | S | Capsid protein (F protein) |
Mock__12_NODE_1_length_5360_cov_15.598303_5 | 1118153.MOY_16472 | 1.1e-95 | 355.9 | Gammaproteobacteria | Bacteria | 1R4RW@1224,1SM1Q@1236,28HGS@1,2Z7SJ@2 | S | Major spike protein (G protein) |
The taxonomy used for the taxonomic annotation is provided in the output HDF5 under /ref/taxonomy
, e.g.:
tax_id | name | parent | rank |
---|---|---|---|
1 | root | 1 | no rank |
2 | Bacteria | 131567 | superkingdom |
6 | Azorhizobium | 335928 | genus |
7 | Azorhizobium caulinodans | 6 | species |
9 | Buchnera aphidicola | 32199 | species |
For convenience, ordination is performed using the CAG abundance information to help provide
a quick measure of sample similarity to the user. This includes both PCA and t-SNE, and the
output tables can be found in results.hdf5
under /ordination/pca
and /ordination/tsne
. Examples of those outputs
are:
specimen | PC1 (94.4%) | PC2 (5.6%) | PC3 (4.8E-15%) |
---|---|---|---|
Mock__12 | 0.986596 | -0.097462 | -7.470813e-11 |
Mock__14 | -0.395635 | -0.052598 | 6.401242e-09 |
Mock__13 | 0.141401 | 0.257170 | -5.739458e-16 |
Mock__15 | -0.305487 | -0.055524 | -7.774022e-10 |
Mock__11 | -0.426874 | -0.051585 | -5.549131e-09 |
specimen | t-SNE 1 | t-SNE 2 |
---|---|---|
Mock__12 | -67.135109 | 26.804821 |
Mock__14 | -115.964615 | 10.169141 |
Mock__13 | -126.923698 | 62.394192 |
Mock__15 | -38.498631 | 69.708954 |
Mock__11 | -86.144646 | 93.732803 |
A CSV with the Corncob results for this dataset will be written to stats/corncob.results.csv
. It will also be available in results.hdf5
under /stats/cag/corncob
.
Example:
parameter | type | value | CAG |
---|---|---|---|
mu.(Intercept) | estimate | -1.5722784150839353 | 0 |
mu.label1 | estimate | -0.2687179385894458 | 0 |
mu.label2 | estimate | -0.17127772136300107 | 0 |
phi.(Intercept) | estimate | -6.573817086548077 | 0 |
mu.(Intercept) | std_error | 0.15771708448454666 | 0 |
Here's an example of how the output files can be structured. This is what you get when you analyze some of the test data, so depending on how many samples you have, etc., you will see different exact set of files, but in the same pattern.
output
├── [ 160] abund
│ ├── [ 664] CAG.abund.feather
│ ├── [ 130] CAG.readcounts.csv.gz
│ └── [ 224] details
│ ├── [2.9K] Mock__11.json.gz
│ ├── [ 404] Mock__12.json.gz
│ ├── [3.7K] Mock__13.json.gz
│ ├── [3.1K] Mock__14.json.gz
│ └── [3.1K] Mock__15.json.gz
├── [ 224] assembly
│ ├── [ 224] Mock__11
│ │ ├── [ 14K] Mock__11.contigs.fasta.gz
│ │ ├── [1.8K] Mock__11.csv.gz
│ │ ├── [ 10K] Mock__11.faa.gz
│ │ ├── [2.9K] Mock__11.gff.gz
│ │ └── [ 67K] Mock__11.megahit.log
│ ├── [ 224] Mock__12
│ │ ├── [1.8K] Mock__12.contigs.fasta.gz
│ │ ├── [ 399] Mock__12.csv.gz
│ │ ├── [1.3K] Mock__12.faa.gz
│ │ ├── [ 653] Mock__12.gff.gz
│ │ └── [ 51K] Mock__12.megahit.log
│ ├── [ 224] Mock__13
│ │ ├── [ 15K] Mock__13.contigs.fasta.gz
│ │ ├── [3.4K] Mock__13.csv.gz
│ │ ├── [ 11K] Mock__13.faa.gz
│ │ ├── [6.8K] Mock__13.gff.gz
│ │ └── [ 68K] Mock__13.megahit.log
│ ├── [ 224] Mock__14
│ │ ├── [ 16K] Mock__14.contigs.fasta.gz
│ │ ├── [2.1K] Mock__14.csv.gz
│ │ ├── [ 11K] Mock__14.faa.gz
│ │ ├── [3.4K] Mock__14.gff.gz
│ │ └── [ 67K] Mock__14.megahit.log
│ └── [ 224] Mock__15
│ ├── [ 16K] Mock__15.contigs.fasta.gz
│ ├── [2.0K] Mock__15.csv.gz
│ ├── [ 11K] Mock__15.faa.gz
│ ├── [3.2K] Mock__15.gff.gz
│ └── [ 67K] Mock__15.megahit.log
├── [316K] geneshot.details.hdf5
├── [372K] geneshot.results.hdf5
├── [ 96] qc
│ └── [ 86] readcounts.csv
├── [ 96] ref
│ └── [ 21K] genes.dmnd
└── [ 128] stats
├── [1.4K] corncob.results.csv
└── [ 104] geneshot.breakaway.csv.gz