FCS_GX_README - ncbi/fcs Wiki

FCS-GX

Foreign contamination screening (FCS)-GX is a genome-level contamination detecting tool based on a cross-species alignment component called the GX aligner. FCS-GX is one module within a larger NCBI foreign contamination screening program suite.

To assign contaminants, FCS-GX requires an input assembly in FASTA format and a numeric NCBI taxonomic identifier (tax-id) corresponding to the source genome. The GX aligner then matches sequences to a large database of NCBI genomes through modified k-mer seeds, and resolves matches to alignments.

The GX aligner operates in two passes. In the first pass, GX retains identifiers from taxonomic groups with the highest scoring alignments and filters out lower scoring alignments. During this phase, GX also performs masking for both low-complexity and high-copy repeats. In the second pass, sequence alignments are refined and extended to produce final coverage and scoring metrics.

FCS-GX detects contaminant sequences when their taxonomic assignment is different from the user provided tax-id. A contamination summary report provides the counts and total sizes of contaminant regions. Results are also provided at the sequence-level for the inspection of FCS-GX assignments.

Prerequisites

  1. FCS-GX is available as a Docker or Singularity image. Please ensure either Docker or Singularity is installed on your terminal, and use Python 3.7 or higher to run FCS-GX.
  2. A FASTA file.
  3. The tax-id of the organism.
  4. Technical requirements: a host with sufficient shared memory to hold the database and accessory files (approximately 460 GiB). Execution can utilize up to 48 CPU cores. A Google Cloud Platform (GCP) host such as n2-highmem-64 (64 vCPUs, 512GB) is sufficient for execution. Optionally, sufficient disk space to save a local copy of the database files (approximately 460 GB) should avoid repeated downloads from NCBI's FTP site.

Quickstart

This section contains step-by-step instructions on how to run FCS-GX using Docker. If you are using Singularity, please ensure you follow the additional relevant Singularity instructions included in some of the steps.

  1. Log into your terminal and ensure either Docker or Singularity is installed.
    Note: The current Singularity image is made using version 3.4.0.

  2. Retrieve the run_fcsgenome script and an example FASTA file:

    curl -o run_fcsgenome.py https://github.com/ncbi/fcs/raw/main/dist/run_fcsgenome.py -L   
    curl -o fcsgenome_test.fa.gz https://github.com/ncbi/fcs/raw/main/examples/fcsgenome_test.fa.gz -L
    

    For Singularity, retrieve this additional image file:

    curl -o fcs-genome.sif https://ftp.ncbi.nlm.nih.gov/pub/murphyte/FCS/FCS-genome/singularity/gx-develop-latest.sif -L
    
  3. Create a shared memory space. Many use cases will require elevated privileges and assistance from system administrators. There are several administrative decisions to make at this point. The first is whether to create the shared memory in a permanent or temporary location, and the second is whether to use or avoid /dev/shm. Thirdly, you may also avoid using shared memory altogether:

    • Option 1: To create permanent shared memory at /dev/shm, set an environment variable to the location:

      SHM_LOC=/dev/shm
      

      Edit /etc/fstab:

      sudo vi /etc/fstab
      

      Paste the following line at the end of the fstab file:

      none	/dev/shm	tmpfs	defaults,size=460G	0	0
      

      Save then remount:

      sudo mount -o remount "$SHM_LOC"
      
    • Option 2: Alternatively, you may create a temporary shared memory. This example also illustrates avoiding /dev/shm. Use this option for Singularity:

      SHM_LOC=/mnt/ram0
      sudo mkdir -p "$SHM_LOC"  
      sudo mount -t tmpfs -o size=460g tmpfs "$SHM_LOC"
      
    • Option 3: As an alternative to using shared memory, you may set the path to:

      SHM_LOC=<disk path>
      

      This method still requires a high-memory server but compensates by memory-mapping the database at the beginning of the run and thereby caching it to memory on the go. While it may take a little extra time, it doesn’t require sudo permissions.

  4. Create a directory in your shared memory space. This is where the working copy of the FCS-GX database (GX database) will be stored. Use the path generated in the previous step:

    mkdir -p "${SHM_LOC}/gxdb"
    
  5. Create a directory to store a local backup copy of the FCS-GX database. The complete all (~500GB) and testing test-only (~5GB) database files will be automatically downloaded once from the NCBI website to this location, so please ensure you have enough free storage space:

    mkdir -p ./gxdb
    
  6. Create a directory to store the output of the program:

    mkdir -p ./gx_out
    
  7. Parameterize and run the run_fcsgenome.py script.
    The shared memory database and the local backup database paths are specified using --gx-db and --gx-db-disk, respectively. The script checks the shared memory path for the database and, if needed, loads it from the local backup database path.

    • For Docker

      • Verify functionality by using a small test-only database:
        python3 ./run_fcsgenome.py --fasta ./fcsgenome_test.fa.gz --out-dir ./gx_out/ --gx-db "${SHM_LOC}/gxdb/test-only" --gx-db-disk ./gxdb --split-fasta --tax-id 6973
        
      • For normal runs, use the complete all database:
        python3 ./run_fcsgenome.py --fasta ./fcsgenome_test.fa.gz --out-dir ./gx_out/ --gx-db "${SHM_LOC}/gxdb/all" --gx-db-disk ./gxdb --split-fasta --tax-id 6973
        
    • For Singularity:

      • Verify functionality by using a small test-only database:
        python3 ./run_fcsgenome.py --fasta ./fcsgenome_test.fa.gz --out-dir ./gx_out/ --gx-db "${SHM_LOC}/gxdb/test-only" --gx-db-disk ./gxdb --split-fasta --tax-id 6973 --container-engine=singularity --image=fcs-genome.sif
        
      • For normal runs, use the complete all database:
        python3 ./run_fcsgenome.py --fasta ./fcsgenome_test.fa.gz --out-dir ./gx_out/ --gx-db "${SHM_LOC}/gxdb/all" --gx-db-disk ./gxdb --split-fasta --tax-id 6973 --container-engine=singularity --image=fcs-genome.sif
        

    Note: Downloading and copying to shared memory may require extensive waits due to the large file sizes. These downloads and copies may be monitored with:

    ls -hlt ./gxdb  
    ls -hlt ${SHM_LOC}/gxdb
    

    Note: If you are running this program on a dedicated machine such as a cloud VM, you may not require a backup of the database stored on disk. To skip making a local copy, simply omit the --gx-db-disk parameter.
    Note: There is an option to provide a manifest file which lists the paths to multiple fasta files using the option --fasta <filename>.mft

  8. Inspect the output file:

    head -n 5 gx_out/fcsgenome_test.6973.taxonomy.rpt
    

    Output content should match the example output found in the GitHub repository:
    https://github.com/ncbi/fcs/raw/main/examples/fcsgenome_test.6973.taxonomy.rpt
    and
    https://github.com/ncbi/fcs/raw/main/examples//fcsgenome_test.6973.fcs_genome_report.txt

  9. Apply run_fcsgenome.py to additional genomes you wish to screen. Once you have finished screening, you may remove the database from shared memory:

    rm -rf "${SHM_LOC}/"
    

Usage Examples

You can run FCS-GX on an assembly by providing the FASTA file and a tax-id. Tax-id for an organism can be retrieved from the NCBI Taxonomy website.
https://www.ncbi.nlm.nih.gov/taxonomy

python3 ./run_fcsgenome.py --fasta ./GCA_000006565.2_TGA4_genomic.fna.gz --out-dir ./gx_out/ --gx-db "${SHM_LOC}/gxdb/all" --gx-db-disk ./gxdb --split-fasta --tax-id 508771 

You may check if the complete database is downloaded successfully to ${SHM_LOC}/gxdb:

ls "${SHM_LOC}/gxdb"
all.blast_div.tsv.gz
all.gxi
all.gxs
all.meta.jsonl
all.taxa.tsv

A successful FCS-GX run will print the parameters of the run, sequence masking progress, and a contamination summary report:

fasta:                    /sample-volume/GCA_000006565.2_TGA4_genomic.fna.gz
tax-id:                   508771
blast-div:                'apicomplexans'
gx-div:                   'alveolates'
split-fasta:              True
output:                   /output-volume/GCA_000006565.2_TGA4_genomic.508771.taxonomy.rpt
Collected masking stats:  0.0631311 Gbp; 2.52918s; 24.9611 Mbp/s. Baseline: 1.13985

Processed 2511 queries, 65.5212Mbp in 7.62548s. (8.5924Mbp/s)
Read 508397 lines, kept 485569 lines, tossed 22828 lines
primary-divs: {'alveolates'} (99%)
well-represented divs: {'alveolates', 'nematodes', 'prok|high GC Gram+', 'primates'}
Aggregate coverage: 100%
-------------------------------------------------------------------------------


Contamination summary:
----------------------
#div                        count  len
TOTAL                       314    225952 
-----                       -----  -----
primates                    239    146361 
nematodes                   34     55025  
prok|high GC Gram+          37     20087  
rodents                     1      1460   
mammals                     1      1437   
prok|CFB group bacteria     1      1047   
ascomycetes                 1      535    

The output directory will contain the following files:

  • < base>.taxonomy.rpt
  • < base>.fcs_genome_report.txt

Please review the < base>.fcs_genome_report.txt output first. For more details on the < base>.taxonomy.rpt, please review the Taxonomy Report Output documentation.

Outputs

This section discusses the FCS-GX report output, several output examples, and how to interpret these outputs.

FCS-GX Report Output

A final report of recommended actions from FCS-GX is provided in the file < basename of fasta file>.< tax-id provided>.fcs_genome_report.txt.

The following table illustrates column numbers (first column) with corresponding column headers (second column):

1:      seq_id        AABL01002868.1 
2:      start_pos     1 
3:      end_pos       3228 
4:      length        3228 
5:      action        EXCLUDE 
6:      div           prok|g-proteobacteria 
7:      agg_cont_cov  100 
8:      top_tax_name  Methylococcus capsulatus str. Bath 
  • Column 1: A seq-id (sequence ID) for a whole sequence, as found in the input FASTA.
  • Columns 2 and 3: Start and end coordinates for the identified contamination. If only a portion of the sequence is identified as contaminant, these values indicate the range that should be removed.
  • Column 4: Length of the entire sequence in Column 1. Only a portion may be identified as contaminant, according to the start_pos and end_pos columns.
  • Column 5: The recommended action. Action values are as follows:
    • EXCLUDE: Remove the entire sequence.
    • TRIM: Remove the sequence at the beginning or end of the sequence. GenBank generally requires that sequences do not start or end with Ns, so the recommended course of action is to trim off contaminant sequences.
    • FIX: If a contaminant range is found in the middle of a sequence, it should either be hardmasked (converted to Ns) or split into two new sequences if it suggests misassembly.
    • REVIEW: Some cases are suggestive of contamination, but should be examined further before taking action.
  • Column 6: The taxonomic division assigned to the contaminant sequence by FCS-GX.
  • Column 7: The percentage alignment coverage for the contaminant in the range indicated by the start_pos and end_pos columns. If the range is composed of multiple contigs separated by gaps, these gaps are ignored for computing coverage. Low coverage values often indicate contamination with novel organisms (e.g., novel genera of bacteria). FCS-GX is tuned for high specificity, even in cases of low reported coverage values. See Known Issues for more details.
  • Column 8: The taxonomic name of the top contaminant organism identified by FCS-GX; the actual contaminant may be from a related species.

Example Outputs

The output below demonstrates some examples from the fcs_genome_report.txt output:

#seq_id      start_pos   end_pos   length   action    div               agg_cont_cov   top_tax_name 

DS028646.1   1           43771     43771    EXCLUDE   prok|firmicutes   97             Paenibacillus yonginensis 

DS028170.1   1           5917      932874   TRIM      prok|firmicutes   17             Paenibacillus yonginensis 

DS028172.1   288312      304838    924226   FIX       prok|firmicutes   16             Paenibacillus physcomitrellae 

Interpreting Outputs

The following steps will help you parse/interpret the fcs_genome_report.txt output:

Retrieve accessions marked as EXCLUDE:
grep -w EXCLUDE GCA_000222875.2.fcs_genome_report.txt

Retrieve the total number of accessions marked as EXCLUDE:
grep -w EXCLUDE GCA_000222875.2.fcs_genome_report.txt | cut -f 1 | sort -u | wc -l

Retrieve the total number of basepairs marked as EXCLUDE:
grep -w EXCLUDE GCA_000222875.2.fcs_genome_report.txt | awk '{sum+=$3-$2+1}END{print sum}'

You can also replace "EXCLUDE" in the above commands with FIX, REVIEW, or TRIM to get corresponding values for these actions, respectively.

Technical Information

FCS-GX is a Python script that runs Docker images wrapping C++ and Python executables. The GX aligner is one such wrapped C++ binary that is used to process the query genome over multiple passes. The query is searched for repeats and aligned to the broad genomic database. The results are filtered and further processed using taxonomic information. The resulting output is then classified for contaminants using a Python script.

Method

FCS-GX’s cross-species aligner GX constructs initial alignments based on hashed k-mer (h-mer) matches where k-mers are modified to allow matches on similar sequences as you would expect to find between related species. The h-mer matches are extended through several rounds of refinement into cross-species alignments, and filtered and grouped by taxonomy. For each sequence, GX reports alignment information (coverage, GX score) for up to four species (identified by NCBI tax-ids), reporting a maximum of two species from the same division. Divisions are derived from NCBI BLAST divisions, with some aggregation (e.g., diptera→insect). FCS-GX then processes the alignment results to identify if they appear to originate from the division of the declared organism (primary-div), are likely contaminants, or fall into other categories.

The h-mers start with 56-mers and are modified to drop every third base (similar to the process for the discontiguous megablast); collapse purines (A and G) and pyrimidines (C and T) since transition-type changes are more biologically common than transversions; and use a minword approach to make the h-mers orientation-independent. The result is a 38-bit h-mer that is more likely to provide a cross-species match than a traditional 19 base k-mer. The h-mers for the screening database are generated with a 20 bp stride for eukaryotes and 10 bp for prokaryotes. The h-mers are mapped back to the original database sequences allowing for the construction and refinement of alignments between the query sequences and their hits in the screening database.

GX includes additional logic to identify simple repeats with either short or long periodicity, as well as high-copy sequences like transposons, both of which can generate false positives. Alignments are seeded with the query-genome sequence, excluding low-complexity repeats and transposons.

FCS-GX screens against a large database of genomic sequences. Whole genome shotgun (WGS) genomes are included, but sequences shorter than 10 kb for eukaryotes or 1 kb for prokaryotes are omitted, as are certain sequences determined to be contaminants. Ideally, the screening database will be sufficiently large and diverse such that all sequences will generate a hit to either their expected division or a contaminant division. In reality, this is impossible to achieve and certain organisms from poorly represented parts of the taxonomic tree are difficult to identify (e.g., crustaceans and microsporidians). The alignment approach favors hits in coding regions and can align as low as 65-75% identity, which is needed for identifying contamination from novel species of bacteria, fungi, and other common contaminants.

FCS-GX Database

The FCS-GX database is built from the following:

  1. representative RefSeq prokaryotes
  2. representative RefSeq eukaryotes, excluding some assemblies from vertebrate groups with dense taxonomic coverage
  3. RefSeq viruses
  4. RefSeq plasmids
  5. additional GenBank fungi, nematodes, protists, and algae (~500 genomes)

Known Issues

We are continually reviewing and enhancing our code and database to optimize performance, accuracy, and user experience. However, users may occassionally experience the following known issues:

  1. FCS-GX may occassionally produce false positive hits, or fail to report some true positive hits. This can be attributed to several reasons, including:
    • The presence of contaminants in the FCS-GX database
    • A low representation of certain organisms in the database (e.g., sponges)
    • Novel contaminants that differ substantially from any known sequence included in the FCS-GX database
    • High-copy contaminants that may be masked (e.g., viruses)
    • Repetitive sequences
    • Known integrants (e.g., viruses or Wolbachia)
    • Highly conserved sequences like rDNA or mtDNA that are also prone to be contaminants in the FCS-GX database
  2. Database downloads from the NCBI FTP site can take from 40 minutes up to several hours depending on bandwidth. When running FCS-GX locally, this is mitigated by saving a local copy of the files. For running in the cloud, we are currently exploring additional options for cloud storage to reduce download time to approximately 10 minutes, but this option may be subject to egress charges.
  3. Several steps are slow and lack efficient progress reports. These includes the following:
    • After "(OK):download completed." appears, run_fcsgenome will copy the database files to the local path, if not already present. This may take 30-60 minutes, depending on system characteristics. You may monitor progress in a second shell, and check the presence and size of the download files.
    • After the docker run command ("'docker', 'run',...") appears, the GX aligner begins running, but currently lacks a progress indicator. You may monitor progress in a second shell, and check the presence and size of the "* .taxonomy.rpt.tmp" file in the output directory.
⚠️ **GitHub.com Fallback** ⚠️