Tutorial - BelenJM/supeRbaits GitHub Wiki

Introduction

This is the tutorial describing how to work with supeRbaits and create your bait sets. We have included an example dataset from a "real" example to follow the tutorial. You can download the datasets from the Zenodo repository. More information about the datasets and where we collected the data from can be found below. To install supeRbaits, follow the instructions here.

Getting help

Once you have installed supeRbaits, opened R and loaded the package following the instructions here, you can get access to some of the usage documentation of supeRbaits through this command in R:

help("supeRbaits")

More comprehensive documentation is available here in this GitHub wiki. Further information and the background/guidelines to design the baits can be found in the pre-print.

If you have further questions about the use of supeRbaits or you want to report a bug, please post those in Issues via GitHub. This way more users will benefit from the answers and discussion posted there.

Preparing the input files

supeRbaits needs the input files in a specific format. You can check which is the format here. You can see more examples of the formats by looking at the files that we will be using for the tutorial, which you can find it in the Zenodo repository.

Dataset for the tutorial

For this tutorial, we will be working with part of one of the examples of datasets described in the pre-print. You can download the files from the Zenodo repository. This tutorial dataset consists of (i) (a short part of a) reference genome, (ii) two files containing SNPs and genes, and (iii) another file containing positions where we have repeated areas within the reference genome that we want to exclude. The file of the reference genome included in the tutorial has been extracted from the genome assembly of the Atlantic salmon, which can be downloaded from NCBI, i.e. for the purpose of practicing with a real dataset but also save computational time when running the examples, we have scaled down the genome to work with a sub-selection of it. One way to do this is how we have done it, selecting only a few chromosomes. For this tutorial, we have selected four chromosomes with shorter lengths, named 'NC_027308.1' (ssa9), 'NC_027323.1' (ssa24), 'NC_027324.1' (ssa25) and 'NC_027325.1' (ssa26); see details of lengths of each chromosome/scaffold here. To select the chromosomes, we used samtools faidx (for each chromosome, adding it to the new fasta file by using '>>') through the following command in a UNIX terminal (note: not in R!):

samtools faidx GCF_000233375.1_ICSASG_v2_genomic.fa NC_027323.1 >> ICSASG_v2_selected.fa

For the tutorial, we have also sub-selected the regions ('candidateGenes_tutorial.txt') and the SNPs ('candidateSNPs_tutorial.txt') for the chromosomes that we are interested in designing baits on, as well as the areas that we would like to exclude from the bait area ('repeatMasker_tutorial.txt'), respectively.

For the purpose of the tutorial, download the files and place them in a folder called "example_data" at your preferred location in your laptop, e.g. PATH='home/user/Documents/'. Remember to indicate your path in R so it finds the folder with the files that you have just stored. You can do that by using the setwd command in R:

setwd("your/own/path/to/folder") # to be changed with your own path ;)

Giving a first try

do_baits() is the main function of supeRbaits. This function has a set of different arguments that the user can utilize to create the set of baits. To start off, let's exemplify how to generate a sample of 1000 baits of 100-bp long from the reference.

With the n option, you can specify the overall number of total baits you would like to generate in your bait set, e.g. 1000. With the size option, we set up the length (in base pairs or bp) of the baits, e.g. 100 bp. Your reference (the input to database) is usually a reference genome or transcriptome, or any other type of genomic information in FASTA format. Every two rows in the reference make reference to a contig or chromosome, or in supeRbaits' terms, a sequence (the first line is usually the name of the sequence/contig/chromosome, that is introduced by '>', and the second line is the DNA sequence itself). The database included within the tutorial dataset, i.e. the sub-selected reference genome of the Atlantic salmon ('ICSASG_v2_tutorial.fa'), consists on 4 chromosomes, i.e. 4 sequences (in supeRbaits terms). With the n option, the baits will be distributed around the sequences proportionally to the sequences’ size.

To simply get started, we can type the following command:

do_baits(n=1000, size= 100, database = "example_data/ICSASG_v2_tutorial.fa")

In the example of command above, the 4 chromosomes of the database "ICSASG_v2_tutorial.fa" will thus be supeRbaits' sequences over which supeRbaits will distribute the 1000 baits, each with a length of 100 base pairs, that we want to create.

You might have noticed that supeRbaits prints out some messages while you are waiting for your baits to be generated. The messages are intended to let the user know what is going on behind supeRbaits, that all is going fine and that there might be a good time to go for a cup of coffee meanwhile supeRbaits is busy. Examples of these messages are 'M: Compiling the sequences' lengths. This operation can take some time.', 'M: Finding bait positions for each sequence.', 'M: Retrieving bait base pairs. This operation can take some time.'. If there is an actual error, supeRbaits will notify it too and stop running, so the user can fix what's wrong and run supeRbaits again.

If we want to define the minimum number of baits to be placed per each of those sequences we can use the min.per.seq argument. The default is 1. This argument (together with n.per.seq) is interesting to use if our sequences have different lengths and the user does not want that supeRbaits distributes baits around the sequences proportionally to the sequences’ size. To use this argument, we'll type:

do_baits(n=1000, size=100, min.per.seq=200, database = "example_data/ICSASG_v2_tutorial.fa")

You can also specify a number of baits to generate per sequence (n.per.seq). You cannot use n and n.per.seq simultaneously. When possible, the n.per.seq argument will generate the same number of baits to all sequences. This is done by:

# if we want to generate 1000 baits, we'll assign 250 in each sequence:
do_baits(n.per.seq=250, size=100, database = "example_data/ICSASG_v2_tutorial.fa")

Note that if you plan to design more than 100000 baits, we recommend that you divide the sequences in various sets for memory reasons (otherwise, supeRbaits could complain (or more importantly, your computer may crash!).

So far we have not stored the output of supeRbaits in an R object, as we have been simply been focused on how we get supeRbaits to design baits. For the following part of the tutorial, we will store the output in R objects so we can explore the output and play with different functionalities of the package, through a series of examples.

First look at the output

supeRbaits provides an output with different pieces of information. To visualize the output, you can first store the output from supeRbaits under an R object, so it is easier to manipulate it and explore its different parts: baits, excluded_baits and input.summary.

For instance, let's generate 100 baits per sequence, 100-base pairs long, in our database. We will store the output under the object name 'hello_baits'. We do this by using the following command:

hello_baits <- do_baits(n.per.seq=100, size=100, database = "example_data/ICSASG_v2_tutorial.fa")

The output consists on different parts, which can be explored by typing:

hello_baits$baits
hello_baits$excluded_baits
hello_baits$input.summary

The part '$baits' contains the characteristics of each bait per sequence/chromosome/scaffold. Let's have a look at the output of the first 3 baits of the NC_027308.1 sequence:

hello_baits$baits$NC_027308.1[1:3,]

The column bait_type refers to the type of bait (e.g. if it is a 'random', 'target' or 'region' bait); if no targets/regions areas have been indicated, all the baits will appear under the category 'random', as in the hello_baits example. Other columns in '$baits' are: the start and end position of the bait (in base pairs; i.e. bait_start and bait_stop); the length of the bait (bait_seq_size); the counts of each nucleotide type (no_A; no_T; no_G; no_C), the unknown bases (no_UNK), the counts of AT (no_AT) or GC (no_GC) and the percentage of GC content (pGC).

The '$excluded_baits' part of the output will contain sequences that have been filtered out as part of the filtering criteria in the command. For instance, here supeRbaits will store sequences that have lower or higher content of GC bases than we have specified to supeRbaits under the gc argument. See the following section for more information. In the hello_baits example, this part of the output is empty, as we did not specify to supeRbaits what we wanted the GC filter (the gc argument was not used) so all baits were stored under '$baits'.

Under '$input.summary' it is possible to recall the information from the run and datasets (i.e. a summary of the sequence/chromosome/scaffold lengths, the different input files, and the size of desired bait). In our hello_baits example, only '$input.summary$chr.lengths' and '$input.summary$size' contain data; '$input.summary$exclusions', '$input.summary$targets' and '$input.summary$regions' remain empty because we did not specify any inputs for exclusions, targets or regions so there weren't baits generated under those categories.

Filtering for GC content

The '$excluded_baits' part of the output will contain, for example, sequences that have lower or higher content of GC bases than what we want our baits to have. We can specify the GC contents under the 'gc' argument within the do_baits() function in supeRbaits. For instance, if we want to generate baits with a GC content between 35% and 55%, we would indicate this in supeRbaits by:

baits_ex1 <- do_baits(n=1000, size=100, database = "example_data/ICSASG_v2_tutorial.fa", gc = c(0.35, 0.55))

The baits that have a lower or higher content of GC bases than the range specified in the gc argument will be filtered out by supeRbaits. The filtered baits can be found under '$excluded_baits'. It is useful to have them stored at this stage so one can see, if interested, the proportion of potential baits that passed the GC criteria vs. the total amount. The information stored in '$excluded_baits' follows the same structure as for '$baits'.

Generating the bait set excluding certain areas

Areas of exclusion in a supeRbaits context are areas of the genome that we typically want to avoid when designing baits, for several reasons - one of them could be because they are repeated areas. We can use datasets such as the ones generated by RepeatMasker or similar softwares to make an exclusion dataset that we can feed to supeRbaits to exclude those areas for the baits. For our example, we used a repeat-region dataset from the Atlantic salmon genome to generate a BED file that we could use for supeRbaits so we can use it like this:

baits_ex2 <- do_baits(n=1000, size=100, database = "example_data/ICSASG_v2_tutorial.fa",
                         exclusions = "example_data/repeatMasker_tutorial.txt")

Note that the baits that will fall within the exclusions' areas won't be stored under '$excluded_baits'.

Generating the bait set including areas or points of interest

If we have specific targets where we want to generate baits on, e.g. SNPs, we can feed this information to supeRbaits to generate baits in those areas. It is needed to specify in what proportion from the total number of baits we want baits to fall within those targets, by using the argument targets.prop (see the next section for more details into this argument). For example, if we want to generate 25% of the total baits (n = 1000) within targets, we'll use supeRbaits like this:

baits_ex3 <- do_baits(n=1000, size = 100, database = "example_data/ICSASG_v2_tutorial.fa", 
                          targets = "example_data/candidateSNPs_tutorial.txt",
                          targets.prop = 0.25,
                          exclusions = "example_data/repeatMasker_tutorial.txt")

We can also do similar as above but for region areas, e.g. genic areas. In this case, we want to generate a bait panel of a 1000 baits, with 25% of baits falling in region areas using the regions.prop argument. Same principle:

baits_ex4 <- do_baits(n=1000, size=100, database = "example_data/ICSASG_v2_tutorial.fa", 
                          regions= "example_data/candidateGenes_tutorial.txt",
                          regions.prop = 0.25,
                          exclusions = "example_data/repeatMasker_tutorial.txt")

Note that baits will never go beyond the interval of a region (see the section below for more information on this).

Finally, it is possible to combine all the information together. Let's generate 1000 baits, from which we want 25% to fall within the region areas and 25% within targets:

baits_ex5 <- do_baits(n=1000, size=100, database = "example_data/ICSASG_v2_tutorial.fa", 
                         regions= "example_data/candidateGenes_tutorial.txt", regions.prop = 0.25,
                         targets = "example_data/candidateSNPs_tutorial.txt", targets.prop = 0.25,
                         exclusions = "example_data/repeatMasker_tutorial.txt")

An important note is that, without specifying targets.prop or regions.prop, baits will all fall under random areas even if we specify a file under targets or regions arguments.

Playing with the proportion of baits and tiling

As mentioned in the section above, you can define the proportion of baits from the total baits (n) that you want to be designed in regions/targets by using the argument regions.prop (for regions) and targets.prop (for targets). If the percentage of each type of regions and targets of interest does not amount to 100%, supeRbaits will fill the remaining baits with randomly-placed baits along the genome (always avoiding the exclusion zones). For some of the following exercises, we will use the seed argument, which is a random seed, so users can reproduce the exact results.

For instance, let's design 50% of baits within targets:

baits_ex6 <- do_baits(n=1000, size = 100, database = "example_data/ICSASG_v2_tutorial.fa", 
                          targets = "example_data/candidateSNPs_tutorial.txt",
                          targets.prop = 0.50,
                          exclusions = "example_data/repeatMasker_tutorial.txt", seed=1258)

To check the number of bait types, we can type the command below, where it should reflect the proportion that we indicated in the targets.prop argument (around 50% of random and target baits). Let's have a look at the different sequences:

type_bait_ex6 <- sapply(baits_ex6$baits, `[[`, 'bait_type')
table(type_bait_ex6$NC_027308.1) # random: 247; target: 246
table(type_bait_ex6$NC_027323.1) # random: 88; target: 88
table(type_bait_ex6$NC_027324.1) # random: 92; target: 91
table(type_bait_ex6$NC_027325.1) # random: 74; target: 74

And we can see that it is actually the case. In the example above, we notice that some sequences have more baits than others - why is this? It is because supeRbaits will distribute baits around the sequences proportionally to the sequences’ size; that's why we observe that supeRbaits is assigning proportionally more baits to the largest sequence, i.e. NC_027308.1, for both random and target baits. If we wanted a minimum number of baits per sequence we could have added the min.per.seq argument, or used the n.per.seq instead.

We can also do similar as above but for region areas, e.g. genic areas. In this case, we want to generate a bait panel of a 1000 baits exclusively in the regions' area, so we'll indicate that we want 100% of the baits in regions (_regions.prop_= 1). Same principle:

baits_ex7 <- do_baits(n=1000, size=100, database = "example_data/ICSASG_v2_tutorial.fa", 
                          regions= "example_data/candidateGenes_tutorial.txt",
                          regions.prop = 1,
                          exclusions = "example_data/repeatMasker_tutorial.txt", seed = 12354)

As we did for baits_ex6, let's check the number of bait types in each chromosome:

type_bait_ex7 <- sapply(baits_ex7$baits, `[[`, 'bait_type')
table(type_bait_ex7$NC_027308.1) # region: 455
table(type_bait_ex7$NC_027323.1) # random: 83; region: 88
table(type_bait_ex7$NC_027324.1) # random: 55; region: 132
table(type_bait_ex7$NC_027325.1) # region: 187

Why has supeRbaits, in the baits_ex7 example, designed some baits in random areas where we indicated to exclusively design baits under regions? This is not a bug: if supeRbaits does not have enough 'space' within a region of a sequence, it will automatically assign the remaining number of baits that should go in that sequence to random baits. So that is why in the baits_ex7 output, in chromosomes NC_027323.1 and NC_027324.1 supeRbaits also assigned baits in random areas. The total sum of all baits should be _n_=1000, which it is. The take-home message is that supeRbaits will shift percentages to random if/when needed, i.e. if targets/regions are too small to fit the desired number of baits at the requested bait sizes.

Tiling allows you to include more coverage on a specific region, as you can design more than 1 bait in one specific area of interest. For example, in targets, which are typically one base-pair or nucleotide that contain a Single Nucleotide Polymorphisms (SNP). You can include tiling using the argument targets.tiling (for the targets) or regions.tiling (for the regions).

As an example, if we want to generate a set of 50 baits in a target area with 3x tiling, we can use:

baits_ex8 <- do_baits(n=50, size=100, database = "example_data/ICSASG_v2_tutorial.fa", 
                         targets = "example_data/candidateSNPs_tutorial.txt", targets.prop=1, targets.tiling=3, 
                         exclusions = "example_data/repeatMasker_tutorial.txt")

If we inspect the output of baits_ex8$baits for e.g. NC_027323.1,

baits_ex8$baits$NC_027323.1

we can see that different sets of 3 baits (corresponding to the 3x tiling that we have indicated as an argument) have been generated for this sequence, which are targeting different targets of interest in the 'candidateSNPs_tutorial.txt' file. Within each sets of 3 baits, the starting positions differ between each bait, even for the same target, where the target of interest is placed somewhere along the bait interval.

Similarly, for the regions:

baits_ex9 <- do_baits(n=50, size=100,  database = "example_data/ICSASG_v2_tutorial.fa", 
                         regions= "example_data/candidateGenes_tutorial.txt", regions.prop=1, regions.tiling=3,
                         exclusions = "example_data/repeatMasker_tutorial.txt")

If we inspect the output of baits_ex9$baits, this time, for e.g. NC_027325.1,

baits_ex9$baits$NC_027325.1

different baits have been generated for this area (out of the 50 baits that we indicated for all the chromosomes), targeting different region areas in 'candidateGenes_tutorial.txt'. As mentioned above, note that the baits have different starting positions within the intervals and never expand the limits of the intervals. So what would happen if the size of a region is not long enough to locate all the tiling we indicate supeRbaits? For instance, given a region of a certain length (e.g. 210 bp) where we are interested in designing 120-bp baits, with 4x tiling: what would happen to the last 10 bp? (this was actually one of the questions from one of the reviewers during the review process of the article - very good point!). In that case, the region-baits will be put entirely inside the region. If the bait size does not fit the region (e.g. if it goes beyond the last 10 bp), that region-bait will be discarded by supeRbaits. For regions that are smaller than the bait size, we recommend setting a target-bait instead. Example: having a region between bp 50 and 100, with baits of size 20, would produce up to 32 region baits (first one would be 50-69 and last would be 81-100). From these, the user can select the baits that fit what it is needed.

Blasting your baits

The function named blast_baits() allows to blast your designed baits and check how many hits they have against the database you used to create them (the default option), or another one that you specify. In order to use blast_baits(), you first need to install the BLAST+ software in your laptop or system (here we have indicated the instructions on how to do it). The BLAST+ software contains stand-alone executables to implement different functionalities. Refer to BLAST and BLAST+ documentation for more information.

Once BLAST+ is correctly installed and the path is located, you are ready to run blast_baits(). The input of this function is the output of do_baits() and you have the option to interact with blastn by the argument blastn_args . The following parameters are already set and should not be changed: -db ([-db database_name]), -query [-query input_file], -out ([-out output_file]) and -outfmt ([-outfmt format]), but you can change the rest, which are:

blastn [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] 
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
    [-negative_taxidlist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] 
    [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-perc_identity float_value] [-qcov_hsp_perc float_value]
    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy]
    [-min_raw_gapped_score int_value] [-template_type type]
    [-template_length int_value] [-dust DUST_options]
    [-filtering_db filtering_database]
    [-window_masker_taxid window_masker_taxid]
    [-window_masker_db window_masker_db] [-soft_masking soft_masking]
    [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
    [-best_hit_score_edge float_value] [-subject_besthit]
    [-window_size int_value] [-off_diagonal_range int_value]
    [-use_index boolean] [-index_name string] [-lcase_masking]
    [-query_loc range] [-strand strand] [-parse_deflines] 
    [-show_gis] [-num_descriptions int_value] [-num_alignments int_value]
    [-line_length line_length] [-html] [-sorthits sort_hits]
    [-sorthsps sort_hsps] [-max_target_seqs num_sequences]
    [-num_threads int_value] [-mt_mode int_value] [-remote] [-version]

For instance, the user could run blast_baits() using default parameters with the line of code below, using the output from baits_ex9:

baits_ex9_blasted <- blast_baits(baits_ex9)

In the example above, the baits generated in baits_ex9 have now been blasted against the same database that it was used to generate them. Using the same reference is the default option in blast_baits(); if you want to blast the baits against other database you should specify it to supeRbaits using the argument db in blast_baits().

Note that supeRbaits is merely an interface for BLAST+, there may appear some messages or warnings when running this function that are related to warnings from BLAST+ itself; examples of messages could be in relation to e.g. the building of the database in BLAST+ (e.g. "Building a new DB, current time: xxxxx"). Examples of warnings are written as 'Warning: [blastn]' following some piece of information (e.g. 'Warning: [blastn] lcl|Query_6587 6587: Warning: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options'). Despite the warnings, supeRbaits will run unless there is an actual error in supeRbaits itself. To check the warnings, you should refer to them at BLAST+ documentation, so the user can correctly interpret the results. supeRbaits relies on the user to know how to use BLAST+ and to understand the BLAST terminology (you can read more about BLAST here).

The output of blast_baits() adds an extra column to the output of do_baits() (column 'n_matches'). This column refers to the number of regions where each generated bait blasts to the provided reference genome. This new column will allow the user to decide which baits they want to keep based on the BLAST results, and this can be easily done using R. For instance, if the user would like to keep only baits in NC_027308.1 that blast 1 time to the reference database this command could be used:

baits_ex9_keep <- baits_ex9_blasted$baits$NC_027308.1[baits_ex9_blasted$baits$NC_027308.1$n_matches == 1,]

or by using:

subset(baits_ex9_blasted$baits$NC_027308.1, n_matches==1)