Types of input data - BelenJM/supeRbaits GitHub Wiki

Different types of files can be used in supeRbaits:

Required input files

1. Genomic reference

Argument name = database

This input file contains the genomic information available for the species of interest, which serves as the reference for the designed baits. This reference database should be in FASTA format. See more information about the FASTA format here. Each FASTA entry consists of at least two lines: a header line that starts with '>' followed by a string (the name of that chromosome, contig or piece of DNA sequence), and one or more lines containing the genomic sequence (e.g. 'ATTTCAGGGTATGG'). Henceforth, each individual entry in a database file is called a 'sequence'.
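To make the FASTA layout concrete, the following is a minimal Python sketch that splits FASTA text into named sequences. It is not part of supeRbaits; the function name `read_fasta` and the example sequences are made up for illustration, and the sketch assumes the whole header line after '>' is the sequence name.

```python
import io


def read_fasta(handle):
    """Return a list of (name, sequence) pairs from FASTA-formatted text.

    Hypothetical helper for illustration only; not a supeRbaits function.
    """
    records = []
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Sequence lines belonging to the current entry are concatenated.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Two entries; the first one has its sequence wrapped over two lines.
example = io.StringIO(">chr1\nATTTCAGGGTATGG\nCCGTA\n>contig_2\nGGGTTTAAAC\n")
sequences = read_fasta(example)
for seq_name, seq in sequences:
    print(seq_name, len(seq))
```

Each pair returned here corresponds to one 'sequence' in the sense used on this page.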

Note that the package includes a function, standardize_lengths, that makes sure that the sequences are properly organised in the FASTA file.

Optional input files

2. Exclusion areas

Argument name = exclusions

This type of input file is only needed if you want to exclude certain areas of your genomic database so that no baits are generated from them (using the argument exclusions). The input file consists of the first three columns of a BED file: the first column holds the chromosome/contig name (the same names used in the database), and the second and third columns hold the bp positions where the exclusion region starts and ends. Each row contains a separate exclusion region. The file does not require column headers, and the columns should be separated by tabs.
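As an illustration of this layout, here is a small Python sketch that parses a headerless, tab-separated three-column file into (name, start, end) tuples. The helper `read_intervals` and the example rows are hypothetical and not taken from supeRbaits; the sketch only checks that each start does not exceed its end, and makes no assumption about how the package interprets the coordinates.

```python
import csv
import io


def read_intervals(handle):
    """Parse rows of name<TAB>start<TAB>end (no header) into tuples.

    Hypothetical helper for illustration only; not a supeRbaits function.
    """
    intervals = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        # Only the first three columns are used, as in the description above.
        name, start, end = row[0], int(row[1]), int(row[2])
        if start > end:
            raise ValueError(f"start is after end for {name}: {start} > {end}")
        intervals.append((name, start, end))
    return intervals


# Three exclusion regions, two of them on the same sequence.
exclusions_text = "chr1\t100\t250\nchr1\t1000\t1200\ncontig_2\t5\t40\n"
exclusions = read_intervals(io.StringIO(exclusions_text))
for region in exclusions:
    print(region)
```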

3. Inclusion areas

Argument name = regions

This type of input file lists regions of the genomic database that you particularly want your baits to cover, and is used with the argument regions. A region file is structured in a similar fashion to the exclusions file, where for each gene you have one or more intervals of base pairs you are interested in. The input file consists of the first three columns of a BED file, i.e. Chromosome_name \t start_bp \t end_bp \n, where each row contains a single region of interest.
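Because the first column is expected to use the same names as the database, a quick consistency check before running the analysis can be useful. The Python sketch below is one way to do that; the function `unknown_names` and all example values are assumptions made for illustration, not supeRbaits behaviour.

```python
def unknown_names(intervals, database_names):
    """Return the sorted interval names that do not appear in the database.

    Hypothetical helper for illustration only; not a supeRbaits function.
    """
    known = set(database_names)
    return sorted({name for name, _, _ in intervals if name not in known})


# Names as they would appear after '>' in the FASTA database (made up).
database_names = ["chr1", "chr2", "contig_2"]

# Regions of interest as (name, start_bp, end_bp) rows (made up).
regions = [("chr1", 500, 900), ("chr3", 10, 80), ("contig_2", 1, 120)]

missing = unknown_names(regions, database_names)
print(missing)
```

An empty result means every region refers to a sequence that exists in the database.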

4. Restriction areas

Argument name = restrict

This type of input is a vector of chromosome names OR position numbers to which the analysis should be restricted. This argument allows supeRbaits to design baits only for specific genes, specified either by name or by position in the database.
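The idea of selecting by name or by position can be sketched in Python as follows. This is an illustration only, not the package's implementation: the function `restrict_sequences` is hypothetical, and the sketch assumes positions are 1-based (the usual R convention) and that the selection is returned in database order.

```python
def restrict_sequences(names, restrict):
    """Select sequence names either by name or by 1-based position.

    Hypothetical helper; 1-based positions are an assumption, not a
    documented supeRbaits rule.
    """
    if all(isinstance(item, str) for item in restrict):
        unknown = [item for item in restrict if item not in names]
        if unknown:
            raise ValueError(f"names not found in database: {unknown}")
        wanted = set(restrict)
        # Keep the order in which the sequences appear in the database.
        return [name for name in names if name in wanted]
    if all(isinstance(item, int) for item in restrict):
        bad = [i for i in restrict if i < 1 or i > len(names)]
        if bad:
            raise ValueError(f"positions out of range: {bad}")
        return [names[i - 1] for i in sorted(set(restrict))]
    # Mixing names and positions is rejected: it must be one OR the other.
    raise TypeError("restrict must contain only names or only positions")


database_names = ["chr1", "chr2", "contig_2"]
by_name = restrict_sequences(database_names, ["contig_2", "chr1"])
by_position = restrict_sequences(database_names, [3, 1])
print(by_name, by_position)
```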