FCS GX taxonomy report - ncbi/fcs GitHub Wiki

Taxonomy Report Output

The initial report from FCS-GX is provided in the file <basename of fasta file>.<tax-id provided>.taxonomy.rpt. For more FCS-GX details and quickstart instructions, please review the FCS-GX documentation.

The following table lists the column numbers (first column), the corresponding column headers (second column), and an example value for each column (third column):

1:      #seq-id          OU830638.1
2:      seq-len          6422716
3:      (xp,lc,co,n)-len 5104,9351,29940,0
4:      cvg-by-all       6034920
5:      sep1             |
6:      tax-name-1       Neonectria ditissima
7:      tax-id-1         78410
8:      div-1            fung:ascomycetes
9:      cvg-by-div-1     5994233
10:     cvg-by-tax-1     5528541
11:     score-1          10033
12:     sep2             |
13:     tax-id-2         1735992
14:     div-2            fung:ascomycetes
15:     cvg-by-div-2     5994233
16:     cvg-by-tax-2     5366714
17:     score-2          9852
18:     sep3             |
19:     tax-id-3         2940382
20:     div-3            fung:budding yeasts
21:     cvg-by-div-3     56273
22:     cvg-by-tax-3     31377
23:     score-3          420
24:     sep4             |
25:     tax-id-4         378046
26:     div-4            fung:budding yeasts
27:     cvg-by-div-4     56273
28:     cvg-by-tax-4     8406
29:     score-4          223
30:     sep5             |
31:     reserved         n/a
32:     result           primary-div
33:     div              fung:ascomycetes
34:     div_pct_cvg      93
  • Column 1: A seq-id (sequence ID). This can be in the following formats:

    • A whole sequence with a hit to a taxonomic division.

      #seq-id
      OU830638.1
      
    • A sequence split on runs of 10+Ns. The seq-id includes the start and end for each split range of the sequence formatted as ~start..end.

      #seq-id
      CH476754.1~1..212539
      CH476754.1~212640..216643
      CH476754.1~218504..255730
      
    • A sequence with hits to multiple taxonomic divisions, making it a putative chimeric sequence. The seq-id includes the start and end for each chimeric range of the sequence formatted as ~~start..end.

      #seq-id
      CR382124.1~~1164..1687942
      CR382124.1~~1694735..1696001
      
    • A split sequence that is also chimeric. The seq-id includes ~start..end~~substart..subend where the subranges are relative to the starting coordinate of the split sequence.

      #seq-id
      UYJD01000002.1~1709646..1813733~~5112..84751
      UYJD01000002.1~1709646..1813733~~100474..101416
      
  • Columns 2 and 3: The seq-len (sequence length) and the (xp,lc,co,n)-len (masked length). The seq-len is the length of the sequence (whole, split, or chimeric range). The masked length is a comma-separated tuple giving the length masked on each of four tracks: transposons (xp), low-complexity (lc), highly-conserved regions (co), and Ns (n).

  • Column 4: The cvg-by-all, which is the total alignment length found from all sequences in the FCS-GX database.

  • Columns 6-30: The alignment information for a maximum of four sets of tax-ids along with their divisions. Columns 5, 12, 18, 24, and 30 (sep1 to sep5) are separators that delimit the sets. FCS-GX prints the taxonomic name only for the first set (column 6). For all four sets it prints the tax-id, the division (derived from the “BLAST name” divisions in taxonomy), the total alignment length from all hits in that division (cvg-by-div), the total alignment length from the specified tax-id alone (cvg-by-tax), and a score. FCS-GX returns information for a maximum of two tax-ids from the same division.

  • Column 31: reserved column

  • Column 32: FCS-GX result. This result can be any one of the following:

    Result                  Description
    primary-div             sequence belongs to division of the input tax-id
    contaminant             sequence identified as a contaminant
    contaminant(synthetic)  one of the top four taxa belongs to the 'synthetic' division, and the score is close to nearest matching division
    contaminant(virus)      one of the top four taxa belongs to the 'virus' division, and the score is close to nearest matching division
    contaminant(repeat)     probably belongs to a contaminant division, but the sequence is highly repeat-specific
    contaminant(prok)       matches to multiple prokaryotes and suggests the sequence is prokaryote-specific
    contaminant(close-div)  strong and unambiguous hit from a closely-related division
    bogus                   inconclusive because the nearest matching taxon has high overlap with a different division
    repeat                  inconclusive because the sequence is highly repeat-specific
    low-coverage            inconclusive due to low coverage
    inconclusive            inconclusive for other reasons
  • Column 33: The taxonomic division assigned to the sequence by FCS-GX.

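  • Column 34: The div_pct_cvg, which is the percentage alignment coverage for the sequence in the assigned taxonomic division. In the examples on this page it matches cvg-by-div for the assigned division divided by seq-len (for instance, 5994233 / 6422716 is about 93%).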
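
The four seq-id forms described for column 1 can be taken apart mechanically. The following is a minimal sketch and not part of FCS-GX: parse_seq_id is a hypothetical helper name, and the sketch assumes that the underlying sequence identifier never contains a ~ character. It reads one seq-id per line and prints the base identifier, the split range, and the chimeric range, tab-separated, with - where a range is absent. Chimeric subranges are printed as found, so they remain relative to the start of the split range.

```shell
# Hypothetical helper: split a seq-id into base id, split range, chimeric range.
# Assumes the base sequence identifier itself contains no "~".
parse_seq_id() {
  awk '
    {
      id = $0
      split_rng = "-"
      chim_rng = "-"
      # "~~" introduces a chimeric range; handle it first because it contains "~".
      i = index(id, "~~")
      if (i > 0) {
        chim_rng = substr(id, i + 2)
        id = substr(id, 1, i - 1)
      }
      # Any remaining "~" introduces a split range.
      j = index(id, "~")
      if (j > 0) {
        split_rng = substr(id, j + 1)
        id = substr(id, 1, j - 1)
      }
      print id "\t" split_rng "\t" chim_rng
    }'
}
```

A possible use is `grep -v '^#' GCA_000006565.2.taxonomy.rpt | cut -f 1 | parse_seq_id`, which skips the header line before parsing the first column.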

Example Outputs

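The examples below show taxonomy.rpt output for three sequences from a butterfly assembly. Each record is wrapped across several lines here for display. The first sequence is classified as insect, as expected. The second sequence is classified as a bacterial contaminant. The third sequence is also classified as insect, although it has several weaker hits to bacteria.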

# column numbers
 1               2          3                 4              5     6                        7          8                       9             10              11      12
                                                                                            13         14                      15            16              17      18
                                                                                            19         20                      21            22              23      24
                                                                                            25         26                      27            28              29      30      31         
                                                                                                                                                                             32            
                                                                                                                                                                             33
                                                                                                                                                                             34

 #seq-id         seq-len    (xp,lc,co,n)-len  cvg-by-all     sep1  tax-name-1               tax-id-1   div-1                   cvg-by-div-1  cvg-by-tax-1    score-1 sep2
                                                                                            tax-id-2   div-2                   cvg-by-div-2  cvg-by-tax-2    score-2 sep3
                                                                                            tax-id-3   div-3                   cvg-by-div-3  cvg-by-tax-3    score-3 sep4
                                                                                            tax-id-4   div-4                   cvg-by-div-4  cvg-by-tax-4    score-4 sep5    reserved
                                                                                                                                                                             result
                                                                                                                                                                             div
                                                                                                                                                                             div_pct_cvg

# example sequence identified as insect (expected)
FARY01017106.1  14773       0,0,0,0          10677           |    Melitaea cinxia           113334     anml:insects            10376         9804            262      |  
                                                                                            171605     anml:insects            10376         9375            250      |  
                                                                                            2829486    fung:basidiomycetes     92            92              12       |  
                                                                                            29144      anml:fishes             86            86              11       |       n/a
                                                                                                                                                                              primary-div
                                                                                                                                                                              anml:insects
                                                                                                                                                                              70

# example Heliconius melpomene (a butterfly) sequence identified as an Enterobacter contaminant
FARY01000050.1  15785       0,0,0,0          15785           |    Enterobacter chengduensis 2494701    prok:g-proteobacteria   15785         15761           886       |  
                                                                                            1812935    prok:g-proteobacteria   15785         15723           885       |  
                                                                                                                                                                       |
                                                                                                                                                                       |      n/a          
                                                                                                                                                                              contaminant     
                                                                                                                                                                              prok:g-proteobacteria
                                                                                                                                                                              100

# conflicting results (this probably is a butterfly sequence for a chitinase, with bacteria homologs)
FARY01021243.1  2942       0,0,0,0           2297            |    Vanessa cardui            171605     anml:insects            2297          2107            112       |  
                                                                                            7111       anml:insects            2297          2062            110       |  
                                                                                            614        prok:g-proteobacteria   1683          1614            75        |  
                                                                                            2864872    prok:g-proteobacteria   1683          1668            74        |      n/a          
                                                                                                                                                                              primary-div
                                                                                                                                                                              anml:insects
                                                                                                                                                                              78
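
The wrapped layout above is hard to reproduce on a terminal, so it can help to print a single record vertically, one column per line, in the style of the column table at the top of this page. The following is a minimal sketch under stated assumptions: show_record is a hypothetical helper name, the report is assumed to be tab-separated with one record per line (as the awk commands on this page imply), the header line is assumed to start with #seq-id, and any other line starting with # is assumed to be a comment that can be skipped.

```shell
# Hypothetical helper: print one record as "column number, header, value" lines.
# Usage: show_record SEQ_ID < report.taxonomy.rpt
show_record() {
  awk -F'\t' -v want="$1" '
    /^#seq-id/ { split($0, hdr, "\t"); next }   # remember the header names
    /^#/       { next }                          # skip any other comment lines
    $1 == want {
      for (i = 1; i <= NF; i++)
        printf "%d:\t%s\t%s\n", i, hdr[i], $i
      exit                                       # stop after the first match
    }'
}
```

For example, `show_record FARY01000050.1 < GCA_000006565.2.taxonomy.rpt` would list all 34 columns of that record if the seq-id is present in the report. The match is on the full seq-id, so split or chimeric ranges must be requested with their ~ suffix.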

Interpreting Outputs

The following commands will help you parse and interpret the taxonomy.rpt output:

  1. Retrieve a list of sequences with at least one contaminant result (including chimeras):
     awk -v FS='\t' '$32 ~ /contaminant/ { print $1 }' GCA_000006565.2.taxonomy.rpt | cut -d '~' -f 1 | uniq
  2. Retrieve the FCS-GX output for all sequences with a mix of contaminant and primary-div ranges (putative chimeric sequences):
     grep primary-div GCA_000006565.2.taxonomy.rpt | cut -f 1 | cut -d '~' -f 1 | grep -F -f - GCA_000006565.2.taxonomy.rpt | grep contaminant | cut -f 1 | cut -d '~' -f 1 | grep -F -f - GCA_000006565.2.taxonomy.rpt
  3. Calculate the percentage of the total genome length classified as primary-div:
     awk -v FS='\t' '$32 == "primary-div" { pr_div_len += $2 }; { tot_len += $2 } END { print pr_div_len / tot_len * 100 }' GCA_000006565.2.taxonomy.rpt
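
The primary-div percentage above generalizes to a breakdown over every result value in column 32. The following is a minimal sketch in the same spirit, not an official FCS-GX command: summarize_results is a hypothetical helper name, and the sketch assumes a tab-separated report in which lines starting with # are headers or comments. For each result it prints the summed seq-len and its share of the total reported length as a percentage.

```shell
# Hypothetical helper: total seq-len (column 2) per result (column 32).
# Usage: summarize_results < report.taxonomy.rpt
summarize_results() {
  awk -F'\t' '
    /^#/ { next }                        # skip header and comment lines
    { len[$32] += $2; total += $2 }      # accumulate length per result
    END {
      if (total == 0) exit               # nothing to report, avoid division by zero
      for (r in len)
        printf "%s\t%.0f\t%.2f\n", r, len[r], 100 * len[r] / total
    }' | sort                            # awk array order is unspecified, so sort
}
```

Because split and chimeric ranges each appear as their own row, the totals here are sums over reported ranges. Any bases that fall outside the reported ranges are not counted.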