
FCS-adaptor

FCS-adaptor (Foreign Contamination Screening, adaptor module) is a VecScreen-based program that detects adaptor contamination in genomic sequences. It is one module within the larger NCBI FCS program suite.

System Prerequisites

FCS-adaptor is distributed as a Docker image, so Docker must be installed on your computer. If you are running FCS-adaptor on the Google Cloud Platform (GCP), any general-purpose host should be sufficient, including a Cloud Shell virtual machine (VM).
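One way to confirm the prerequisites before starting is a small preflight check. This is only a sketch: the helper name `require_cmd` is hypothetical, and it only tests whether a command is on the PATH, not whether the Docker daemon is running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a command is available on the PATH.
require_cmd() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Docker runs the FCS-adaptor image; curl fetches the run script in the Quickstart.
require_cmd docker || echo "Install Docker before running FCS-adaptor."
require_cmd curl || echo "Install curl, or download the files another way."
```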

Quickstart

  1. Open a terminal and ensure Docker is installed.

  2. Create a working directory such as the following:

    mkdir fcsx
    cd fcsx
    
  3. Get a copy of the run script:

    curl -L https://github.com/ncbi/fcs/raw/main/dist/run_fcsadaptor.sh -o run_fcsadaptor.sh
    
  4. Make run_fcsadaptor.sh executable:

    chmod 755 run_fcsadaptor.sh
    
  5. Create input and output directories:

    mkdir inputdir outputdir
    
  6. Place your FASTA file inside the inputdir directory. You may also use this example FASTA file:

    curl -L https://github.com/ncbi/fcs/raw/main/examples/fcsadaptor_prok_test.fa.gz -o ./inputdir/fcsadaptor_prok_test.fa.gz
    
  7. From the fcsx directory, run the following command. Use --prok for prokaryotic input sequences or --euk for eukaryotic input sequences:

    ./run_fcsadaptor.sh --fasta-input ./inputdir/fcsadaptor_prok_test.fa.gz --output-dir ./outputdir --prok
    
  8. Look inside outputdir for the output of the adaptor screen.

Output

After the program has run successfully, you will see the following files in the outputdir:

av_screen.log
cleaned_fasta/validated.fna_0.cleaned_fa
combined.calls
fcs.log
fcs_report.txt
logs.jsonl
pipeline_args.yaml
skipped_trims.jsonl

The output calls are placed in fcs_report.txt (tab delimited) and combined.calls (JSON formatted). The partially cleaned FASTA file is in cleaned_fasta/validated.fna_0.cleaned_fa.
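A quick sanity check on the cleaned FASTA is to compare its record count with the input, since excluded sequences are dropped. The sketch below uses a hypothetical helper, `count_fasta_records`, and small made-up FASTA files so that it runs on its own; in practice you would point it at your input file and at cleaned_fasta/validated.fna_0.cleaned_fa.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count FASTA records (lines starting with ">")
# in a plain or gzip-compressed file.
count_fasta_records() {
    local fasta="$1"
    if [[ "$fasta" == *.gz ]]; then
        gzip -cd -- "$fasta" | grep -c '^>'
    else
        grep -c '^>' -- "$fasta"
    fi
}

# Self-contained demonstration with made-up data.
tmpdir="$(mktemp -d)"
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$tmpdir/input.fa"
gzip -c "$tmpdir/input.fa" > "$tmpdir/input.fa.gz"
printf '>seq1\nACGT\n' > "$tmpdir/cleaned.fa"

echo "input records:   $(count_fasta_records "$tmpdir/input.fa.gz")"
echo "cleaned records: $(count_fasta_records "$tmpdir/cleaned.fa")"
rm -rf "$tmpdir"
```

With real data the call might look like `count_fasta_records ./outputdir/cleaned_fasta/validated.fna_0.cleaned_fa`.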

The fcs_report.txt lists the following:

  • accessions of sequences with adaptor hits
  • lengths of these sequences
  • action required (ACTION_TRIM or ACTION_EXCLUDE)
  • ranges of sequences detected as adaptor sequences
  • adaptor source type

An example of the fcs_report.txt output:

#accession              length  action          range           name
CAJDEL010000529.1       56      ACTION_EXCLUDE                  CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00731.1:Illumina Nextera PCR primer i5 index N505 (Oligonucleotide sequence copyright 2007-2012 Illumina, Inc. All rights reserved.)

CAJDEL010000001.1       449092  ACTION_TRIM     288380..288430  CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00731.1:Illumina Nextera PCR primer i5 index N505 (Oligonucleotide sequence copyright 2007-2012 Illumina, Inc. All rights reserved.)
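Because the report is tab-delimited, it can be summarized with standard tools. The sketch below assumes the column order shown in the example above (action in column 3, range in column 4) and a header line starting with `#`. The function names `summarize_actions` and `list_trims` are hypothetical, and the sample file is rebuilt from the two example rows so that the block runs on its own.

```shell
#!/usr/bin/env bash
# Count report rows per action (column 3), skipping the "#" header and blank lines.
summarize_actions() {
    awk -F'\t' '
        /^#/ || NF == 0 { next }
        { count[$3]++ }
        END { for (a in count) printf "%s\t%d\n", a, count[a] }
    ' "$1" | sort
}

# Print accession and range for every ACTION_TRIM row.
list_trims() {
    awk -F'\t' '$3 == "ACTION_TRIM" { print $1 "\t" $4 }' "$1"
}

# Rebuild the two example rows as a sample report.
report="$(mktemp)"
name='CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00731.1:Illumina Nextera PCR primer i5 index N505 (Oligonucleotide sequence copyright 2007-2012 Illumina, Inc. All rights reserved.)'
printf '#accession\tlength\taction\trange\tname\n' > "$report"
printf '%s\t%s\t%s\t%s\t%s\n' 'CAJDEL010000529.1' '56' 'ACTION_EXCLUDE' '' "$name" >> "$report"
printf '%s\t%s\t%s\t%s\t%s\n' 'CAJDEL010000001.1' '449092' 'ACTION_TRIM' '288380..288430' "$name" >> "$report"

summarize_actions "$report"
list_trims "$report"
rm -f "$report"
```

For a real run, pass ./outputdir/fcs_report.txt instead of the sample file.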

Example Run

An example of a program run:

$ ./run_fcsadaptor.sh --fasta-input ./inputdir/fcsadaptor_prok_test.fa.gz --output-dir outputdir --prok
...
[step all_cleaned_fasta] start
[step all_cleaned_fasta] completed success
[workflow ] completed success
Output will be placed in: /output-volume
Executing vecscreen
run_av_screen_x
run_av_screen_x
$

The output from this example (fcs_report.txt) should match this file:
https://github.com/ncbi/fcs/raw/main/examples/fcsadaptor_prok_test_output.txt
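The comparison can be scripted as a byte-for-byte check. The function name `check_report` and the local file name `expected.txt` are hypothetical. The download step is shown only as a comment so that the sketch runs without network access.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a report against an expected file byte for byte.
check_report() {
    local actual="$1" expected="$2"
    if cmp -s "$expected" "$actual"; then
        echo "MATCH"
    else
        echo "DIFFERENT"
        return 1
    fi
}

# Intended use after the example run (not executed here):
#   curl -L https://github.com/ncbi/fcs/raw/main/examples/fcsadaptor_prok_test_output.txt -o expected.txt
#   check_report ./outputdir/fcs_report.txt expected.txt
```

If the check reports DIFFERENT, running `diff` on the two files shows which rows differ.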

Rules for Action Assignment

  • If adaptors are found at the beginning or end of the sequence, the matching span is reported as "ACTION_TRIM" and is removed in the new validated sequence output.
  • If adaptors are found within 100 bp of either end of the sequence, the span to trim is extended to the end of the contig. If additional adaptors are found within 100 bp of the proposed trim range, the trim span is transitively extended to cover the additional hits. These spans are reported as "ACTION_TRIM" and are removed in the new validated sequence output.
  • If adaptors are found more than 100 bp from both ends of the sequence, the matching span is reported as "ACTION_TRIM", but the internal span is not removed in the new validated sequence output.
  • If adaptors are found more than 100 bp from both ends of the sequence but 50 bp or less from each other, the spans are joined and reported as "ACTION_TRIM", but the joined internal span is not removed in the new validated sequence output.
  • If more than 75% of the sequence matches the adaptors, the whole sequence is reported as "ACTION_EXCLUDE" and is removed.
  • If fewer than 200 bp of the sequence remain unmatched to the adaptors, the whole sequence is reported as "ACTION_EXCLUDE" and is removed.
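The rules above can be illustrated with a simplified sketch. It is not the FCS-adaptor implementation and rests on several assumptions: spans are 1-based, inclusive, and non-overlapping; "within 100 bp" is read as a distance of 100 bp or less; the exclusion thresholds are applied to the raw matched spans. The transitive extension and the 50 bp joining rule are deliberately left out. The function name `assign_action` and the labels `terminal` and `internal` are hypothetical and do not appear in the tool's output.

```shell
#!/usr/bin/env bash
# Usage: assign_action SEQ_LENGTH START..END [START..END ...]
# Simplified illustration of the action rules, not the tool's own logic.
assign_action() {
    local length="$1"; shift
    printf '%s\n' "$@" | awk -F'[.][.]' -v len="$length" '
        BEGIN { margin = 100 }             # assumed: "within 100 bp" means <= 100
        NF < 2 { next }                    # ignore empty or malformed spans
        {
            n++; s[n] = $1 + 0; e[n] = $2 + 0
            matched += e[n] - s[n] + 1     # spans assumed non-overlapping
        }
        END {
            if (n == 0) exit
            # Whole-sequence exclusion: >75% matched, or <200 bp left unmatched.
            if (matched > 0.75 * len || len - matched < 200) {
                print "ACTION_EXCLUDE"
                exit
            }
            for (i = 1; i <= n; i++) {
                kind = "internal"          # reported, but not removed
                if (s[i] - 1 <= margin)   { s[i] = 1;       kind = "terminal" }
                if (len - e[i] <= margin) { e[i] = len + 0; kind = "terminal" }
                printf "ACTION_TRIM %d..%d %s\n", s[i], e[i], kind
            }
        }
    '
}

assign_action 449092 288380..288430   # the internal hit from the example report
assign_action 56 1..40                # hypothetical span on a 56 bp sequence
assign_action 1000 30..80 500..520    # hypothetical: one near-end hit, one internal
```

In this sketch, any hit on the 56 bp sequence leads to exclusion, because fewer than 200 bp can remain unmatched regardless of the span.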