FAQ - ncbi/pgap GitHub Wiki

Does Docker need to be installed on my machine?

pgap.py uses Docker to install the PGAP software. Once installed, PGAP can be run with either Docker or Docker-compatible software such as Apptainer or Podman; see the --docker and --container-path parameters for additional information.

What are the runtime resource requirements?

8 cores, 16 GB of RAM, and 100 GB of storage are a good place to start.
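As a rough illustration, the suggested starting point can be compared against a machine's resources before launching a run. This is a minimal sketch and not part of PGAP: the function names are hypothetical, the thresholds are treated as decimal gigabytes (an assumption), and the RAM query relies on `os.sysconf` names that are available on Linux but not on every platform.

```python
import os
import shutil

# Suggested starting point from the FAQ answer above.
MIN_CORES = 8
MIN_RAM_GB = 16
MIN_DISK_GB = 100


def resource_shortfalls(cores, ram_gb, free_disk_gb):
    """Return a list of messages, one per resource below the suggested minimum."""
    shortfalls = []
    if cores < MIN_CORES:
        shortfalls.append(f"cores: {cores} < {MIN_CORES}")
    if ram_gb < MIN_RAM_GB:
        shortfalls.append(f"RAM: {ram_gb:.1f} GB < {MIN_RAM_GB} GB")
    if free_disk_gb < MIN_DISK_GB:
        shortfalls.append(f"free disk: {free_disk_gb:.1f} GB < {MIN_DISK_GB} GB")
    return shortfalls


def local_resources(path="."):
    """Query this machine (hypothetical helper; RAM query assumes Linux sysconf names)."""
    cores = os.cpu_count() or 0
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    free_disk_gb = shutil.disk_usage(path).free / 1e9
    return cores, ram_gb, free_disk_gb
```

Calling `resource_shortfalls(*local_resources())` returns an empty list when the machine meets all three suggested values.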

Can I run PGAP under macOS or Windows?

Yes.

Can I run PGAP from an airgapped computer?

Yes; specifying the --no-internet flag will remove the container's network interface.

Will PGAP work on any CPU?

PGAP requires an x86-64 CPU that supports SSE4.2. This includes most processors released after 2008 (see https://en.wikipedia.org/wiki/SSE4#SSE4.2).
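On Linux, one way to check for SSE4.2 is to look for the `sse4_2` entry in the `flags` lines of `/proc/cpuinfo`. The sketch below assumes that file layout; the function name is hypothetical and the check is not something PGAP itself provides.

```python
def supports_sse42(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists sse4_2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "sse4_2" in value.split():
            return True
    return False


if __name__ == "__main__":
    # Linux only: /proc/cpuinfo does not exist on macOS or Windows.
    try:
        with open("/proc/cpuinfo") as handle:
            print("SSE4.2 supported:", supports_sse42(handle.read()))
    except FileNotFoundError:
        print("/proc/cpuinfo not found; check the CPU model manually.")
```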

Can I run PGAP in distributed compute clusters (UGE/SGE, SLURM, Biowulf)?

Yes. While we are unable to provide support for your specific cluster, PGAP can be used in batch scheduler environments.

Do you provide PGAP as Singularity/Apptainer images?

Not at the present time; please create a GitHub issue if this functionality is desired.

How long does PGAP take to run?

A 0.58 Mb Mycoplasmoides genitalium genome takes about 5 minutes, and a 5.7 Mb Escherichia coli genome takes about 25 minutes.

Can PGAP assign a taxonomic classification to my genome automatically?

Yes. Use the --taxcheck and --auto-correct-tax flags so that the assembly is assigned to an organism before PGAP runs. With --taxcheck, ANI identifies the best-matching assembly in GenBank that is of well-defined origin. The scientific name you provided on input will be overridden by the scientific name determined by ANI, resulting in a more accurate annotation. The scientific name in the final results is the ANI-chosen name.

Can I run PGAP on a metagenomic sample?

No. PGAP runs on a single genome at a time. It uses the genus of the organism provided on input by the user to determine sets of proteins to align to the genome for gene prediction. The user is therefore required to associate a genus- or species-level organism name with the input FASTA.

What if PGAP fails in the validation of the input FASTA?

An input genome will fail validation if it contains vector or adaptor contamination, or is smaller or larger than expected for the species. For organisms for which no size range is defined, the minimum and maximum size allowed for the input genome are 15 Kb and 100 Mb respectively. You can choose to ignore the validation errors by setting the flag --ignore-all-errors in pgap.py. Keep in mind that the annotation obtained with the --ignore-all-errors flag may not comply with GenBank's standards of quality.
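To see whether a genome falls inside the default size range before submitting it, the total sequence length can be summed directly from the FASTA text. This is a simplified sketch of the size check only, not PGAP's validator: it assumes 1 Kb = 1,000 bp and 1 Mb = 1,000,000 bp, the function names are hypothetical, and it does not look for vector or adaptor contamination or apply species-specific ranges.

```python
MIN_BP = 15_000        # 15 Kb, assuming 1 Kb = 1,000 bp
MAX_BP = 100_000_000   # 100 Mb, assuming 1 Mb = 1,000,000 bp


def genome_length(fasta_text):
    """Sum sequence characters across all records, skipping '>' header lines."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )


def within_default_size_range(total_bp):
    """True if the length lies within the default 15 Kb to 100 Mb range."""
    return MIN_BP <= total_bp <= MAX_BP
```

For a file on disk, `within_default_size_range(genome_length(open(path).read()))` gives a quick answer, where `path` is the input FASTA.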

What information is reported to NCBI when I turn on the report usage flag (-r or --report-usage-true)?

When telemetry is enabled, we collect:

  1. Run start and end times
  2. An identifier, randomly generated for each run
  3. IP address
  4. PGAP version
  5. Host OS

I need help diagnosing a failure. What files do I need to provide?

Please run PGAP with the --debug flag, open an issue, and attach an archive (e.g. zip or tarball) of the log files matching debug/tmp-outdir/*/*.log
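Collecting the logs into a tarball can be scripted. The sketch below is one possible way to do it with the Python standard library; `archive_logs` and its `run_dir` parameter are hypothetical names, and `run_dir` is assumed to be whichever directory contains the `debug` folder from your run.

```python
import glob
import os
import tarfile


def archive_logs(run_dir, archive_path):
    """Bundle debug/tmp-outdir/*/*.log under run_dir into a gzip-compressed tarball."""
    pattern = os.path.join(run_dir, "debug", "tmp-outdir", "*", "*.log")
    log_files = sorted(glob.glob(pattern))
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in log_files:
            # Store paths relative to run_dir so the archive keeps the debug/ layout.
            tar.add(path, arcname=os.path.relpath(path, run_dir))
    return log_files
```

The function returns the list of files it archived, so an empty list signals that no logs matched and the run directory or the --debug flag should be double-checked.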