FCS GX troubleshooting - ncbi/fcs GitHub Wiki
Please check the GitHub Issues page to see whether similar issues have been reported.
Please check genome sequence formatting requirements to see whether an invalid FASTA could be the source of the error.
At what stage in the genome assembly process should I run FCS-GX?
We recommend running FCS-GX after the initial contig assembly stage. If you are planning to submit the genome to NCBI or another public archive, we also recommend running FCS-GX on the final assembly prior to submission. We also recommend re-screening any genome found to be heavily contaminated, as additional contaminants may be identified in a second FCS-GX run.
Can FCS-GX run on sequencing reads?
FCS-GX is developed to operate on assembled genomes and is not intended to identify contaminants in sequencing reads.
GX is running slow. What should I do?
The time it takes to complete the initial download of the FCS-GX database is variable depending on method but should complete in under an hour. If the database is loaded into RAM, contamination screening for most genomes should complete on the order of minutes, not hours. See the below table for estimated run times for select model organisms using 48 cores:
organism | accession | genome size | run time |
---|---|---|---|
Escherichia coli | GCF_000005845.2 | 4.6 Mb | 10s |
Drosophila melanogaster | GCF_000001215.4 | 143.7 Mb | 34s |
Homo sapiens | GCF_009914755.1 | 3.1 Gb | 6m57s |
Triticum aestivum | GCF_018294505.1 | 14.6 Gb | 17m23s |
If the database is not loaded into RAM, screening will take longer. The most common reason for FCS-GX running slowly is inadequate host memory. A host with 512 GiB of shared memory is required to hold the database and accessory files. Running on a host without enough RAM results in extremely long run times (as much as a 10000x difference in performance).
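Whether a host meets the memory requirement can be checked before a run is launched. The following is a minimal sketch, assuming a Linux host: it parses `/proc/meminfo`-style text and compares `MemTotal` with the 512 GiB figure quoted above. The function names and the `REQUIRED_GIB` constant are illustrative and are not part of FCS-GX.

```python
# Illustrative helper, not part of FCS-GX: compare host RAM with the
# 512 GiB figure quoted above. Assumes the Linux /proc/meminfo format.
REQUIRED_GIB = 512


def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal in GiB from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in KiB
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


def has_enough_memory(meminfo_text: str, required_gib: float = REQUIRED_GIB) -> bool:
    """True if the reported total memory is at least required_gib."""
    return mem_total_gib(meminfo_text) >= required_gib
```

On a real host, pass in the contents of `/proc/meminfo`, for example `has_enough_memory(open("/proc/meminfo").read())`. `MemTotal` is usually slightly lower than the installed RAM, so a near-miss on a nominally 512 GiB machine does not necessarily mean the host is unsuitable.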
The taxonomy assignment in the action/taxonomy report doesn't make sense...
Due to limitations in the taxonomic representation of some groups, the assigned species may differ from the reported top organism at the species level or even at the genus level. See Technical Information for additional details on database construction and taxonomic assignment.
I believe FCS-GX is reporting false positive contamination in my genome
Please report any concerns with false positive contamination results on the GitHub Issues page.
What files are important for debugging purposes?
The validate_fasta.txt file can reveal invalid FASTA issues for input genomes. The fcs_adaptor.log can reveal other issues with the FCS-adaptor run. When submitting a GitHub Issue, include fcs_adaptor.log and run in --debug mode.
Technical Information
fcs.py is a Python script that runs Docker images wrapping C++ and Python executables. The GX aligner is one such wrapped C++ binary that is used to process the query genome over multiple passes. The query is searched for repeats (for eukaryote genomes) and aligned to the broad genomic database. The results are filtered and further processed using taxonomic information. The resulting output is then classified for contaminants using a Python script.
Method
The GX cross-species aligner constructs initial alignments based on hashed k-mer (h-mer) matches where k-mers are modified to allow matches on similar sequences as you would expect to find between related species. The h-mer matches are extended through several rounds of refinement into cross-species alignments, and filtered and grouped by taxonomy. For each sequence, GX reports alignment information (coverage, GX score) for up to four species (identified by NCBI tax-ids), reporting a maximum of two species from the same division. Divisions are derived from NCBI BLAST divisions, with some aggregation (e.g., diptera→insect). FCS-GX then processes the alignment results to identify if they appear to originate from the division of the declared organism (primary-div), are likely contaminants, or fall into other categories.
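The reporting limits described above (up to four species per sequence, at most two from the same division) can be sketched as a simple capped selection. This is only an illustration: ranking candidates by a single score is an assumption, the `Hit` type and `select_reported` function are hypothetical names, and the actual GX selection logic may differ.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    tax_id: int     # NCBI tax-id of the matched species
    division: str   # GX division label
    score: float    # stand-in for the GX score


def select_reported(hits: Iterable[Hit], max_total: int = 4,
                    max_per_division: int = 2) -> List[Hit]:
    """Keep the best-scoring hits, capped overall and per division."""
    per_division = Counter()
    selected = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if per_division[hit.division] >= max_per_division:
            continue  # this division already has its two slots filled
        selected.append(hit)
        per_division[hit.division] += 1
        if len(selected) == max_total:
            break
    return selected
```

With three candidates from one division and three more from other divisions, the third candidate from the crowded division is skipped so that lower-scoring hits from other divisions can still be reported.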
The h-mers start with 56-mers and are modified to drop every third base (similar to discontiguous megablast); collapse purines (A and G) and pyrimidines (C and T), since transition-type changes are more biologically common than transversions; and use a minword approach to make the h-mers orientation-independent. The result is a 38-bit h-mer that is more likely to provide a cross-species match than a traditional 19-base k-mer. The h-mers for the screening database are generated with a 20 bp stride for eukaryotes and 10 bp for prokaryotes. The h-mers are mapped back to the original database sequences, allowing for the construction and refinement of alignments between the query sequences and their hits in the screening database.
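The three transformations can be illustrated with a short sketch. It is not the GX implementation (which is C++): which base of each triplet is dropped, the bit assignment (purine = 0, pyrimidine = 1), and the interpretation of "minword" as the smaller of the two strand encodings are all assumptions made for illustration. Dropping 18 of the 56 bases leaves 38, and one bit per remaining base yields a 38-bit word.

```python
# Illustrative sketch only; the mask, bit layout, and minword rule below are
# assumptions and may not match the actual GX implementation.
K = 56
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_PURINES = frozenset("AG")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def encode(kmer: str) -> int:
    """Drop every third base, then pack purine=0 / pyrimidine=1 into an int."""
    word = 0
    for i, base in enumerate(kmer):
        if i % 3 == 2:  # assumption: the third base of each triplet is dropped
            continue
        word = (word << 1) | (0 if base in _PURINES else 1)
    return word


def hmer(kmer: str) -> int:
    """Orientation-independent word: the smaller of the two strand encodings."""
    if len(kmer) != K or set(kmer) - set("ACGT"):
        raise ValueError("expected a 56-base uppercase A/C/G/T string")
    return min(encode(kmer), encode(reverse_complement(kmer)))
```

Under these assumptions, a transition (for example A to G) at any position, or any substitution at a dropped position, leaves the h-mer unchanged, which is what lets diverged sequences from related species still produce matching seeds.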
GX includes additional logic to identify simple repeats with either short or long periodicity, as well as high-copy sequences like transposons, both of which can generate false positives. Alignments are seeded with the query-genome sequence, excluding low-complexity repeats and transposons.
FCS-GX screens against a large database of genomic sequences. Whole genome shotgun (WGS) genomes are included, but sequences shorter than 10 kb for eukaryotes or 1 kb for prokaryotes are omitted, as are certain sequences determined to be contaminants. Ideally, the screening database will be sufficiently large and diverse such that all sequences will generate a hit to either their expected division or a contaminant division. In reality, this is impossible to achieve and certain organisms from poorly represented parts of the taxonomic tree are difficult to identify (e.g., crustaceans and microsporidians). The alignment approach favors hits in coding regions and can align as low as 65-75% identity, which is needed for identifying contamination from novel species of bacteria, fungi, and other common contaminants.
FCS-GX Database and Classification
The FCS-GX database is built from the following:
- representative RefSeq prokaryotes
- representative RefSeq eukaryotes, excluding some closely related assemblies in well represented divisions based on a Jaccard distance approach
- RefSeq viruses
- RefSeq plasmids
- additional GenBank fungi, nematodes, protists, algae, and bacteria (MAGs)
The FCS-GX classification system uses eight larger taxonomic “kingdoms”: animals (Metazoa), plants (Viridiplantae), Fungi, protists (other Eukaryota), Bacteria, Archaea, Viruses, and Synthetic. Each kingdom is further divided into one to 21 taxonomic divisions based on BLAST name groupings (e.g., human taxid 9606 = BLAST name primates = gx division anml:primates) assigned by NCBI Taxonomy, enabling the detection of some types of contaminants below the kingdom level. See here for a listing of the FCS-GX taxonomic divisions.
Known Issues
We are continually reviewing and enhancing our code and database to optimize performance, accuracy, and user experience. However, users may occasionally experience the following known issues:
- FCS-GX may occasionally produce false positive hits, or fail to report some true positive hits. This can be attributed to several reasons, including:
  - The presence of contaminants in the FCS-GX database
  - A low representation of certain organisms in the database (e.g., sponges)
  - Novel contaminants that differ substantially from any known sequence included in the FCS-GX database
  - Repetitive sequences, which can sometimes cause false positives
  - Recent cases of horizontal gene transfer (HGT)
  - Highly conserved sequences like rDNA or mtDNA that are also prone to being contaminants in the FCS-GX database
- On on-prem machines, database downloads from the NCBI FTP site can take from 40 minutes up to several hours depending on bandwidth. Downloading from the S3 bucket is another option for on-prem machines. In the cloud, downloading from S3 using 's5cmd' is much faster.
- Some sequences reported in the final report as "EXCLUDE," "FIX," or "TRIM" may be more complex mixtures of primary and contaminant sequences. Longer sequences (e.g., over 50 kb) that are reported as low coverage (e.g., below 20%) may warrant additional review. Checking the details for those sequence(s) in the taxonomy.rpt file can be useful.
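The review heuristic in the last item above (long sequences with low reported coverage) can be sketched as a filter. The `ReportRow` fields are hypothetical names, not the actual report column headers, so mapping the report columns onto them is left to the reader; the 50 kb and 20% thresholds are the example values from the text, not fixed cutoffs.

```python
from typing import Iterable, List, NamedTuple

REVIEW_ACTIONS = {"EXCLUDE", "FIX", "TRIM"}


class ReportRow(NamedTuple):
    # Hypothetical field names; map the real report columns onto these.
    seq_id: str
    seq_len: int         # sequence length in bp
    action: str          # reported action, e.g. "EXCLUDE"
    coverage_pct: float  # reported coverage, in percent


def flag_for_review(rows: Iterable[ReportRow], min_len: int = 50_000,
                    max_coverage_pct: float = 20.0) -> List[str]:
    """Return IDs of long, low-coverage sequences that carry an action."""
    return [
        r.seq_id
        for r in rows
        if r.action in REVIEW_ACTIONS
        and r.seq_len > min_len
        and r.coverage_pct < max_coverage_pct
    ]
```

Sequences returned by such a filter are candidates for a closer look in the taxonomy.rpt file, not confirmed mixtures.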
Useful GX subcommands
- Retrieve sequences from the database:
The sequences used to build the GX database are listed in the file all.seq_info.tsv.gz within the gxdb folder. From there, you can select the sequences of your choice and then generate the FASTA files using the following GX subcommand.
- For Docker:
docker run --rm -it -v $PWD:/host -v /$LOCAL_DB/gxdb/:/db ncbi/fcs-gx gx get-fasta --input=/host/3col.txt --output=/host/3col_fa.out
- For Singularity:
singularity exec --bind /$LOCAL_DB/gxdb/:/db --bind $PWD:/host fcs-gx.sif gx get-fasta --gx-db=/db/all.gxi --input=/host/3col.txt --output=/host/3col_fa.out
The input is a tab delimited 3 column file in the following format, along with the header:
cat 3col.txt
##[["GX locs",1,1]]
NC_060925.1	.	.
To get the FASTA for a specific set of coordinates, format your input file with the start and end coordinates in the second and third columns, respectively:
##[["GX locs",1,1]]
NC_060925.1	1	200
- Verify functionality by using the small 'test-only' database:
python3 ./fcs.py screen genome --fasta ./fcsgx_test.fa.gz --out-dir ./gx_out/ --gx-db "$GXDB_LOC/test-only" --tax-id 6973
- For Docker: