Output data - BelenJM/supeRbaits GitHub Wiki

supeRbaits returns an output made up of three parts: baits, excluded_baits and input.summary. To explore it, first store the output from supeRbaits in an R object, which makes it easier to manipulate and to inspect each part separately.

For instance, let's generate 100 baits per sequence, each 100 base pairs long, from our database. We will store the output in an object named 'hello_baits', using the following command:

hello_baits <- do_baits(n.per.seq = 100, size = 100, database = "example_data/ICSASG_v2_tutorial.fa")

The output consists of different parts, which can be explored by typing:

hello_baits$baits
hello_baits$excluded_baits
hello_baits$input.summary

The part '$baits' contains the characteristics of each bait per sequence/chromosome/scaffold. Let's have a look at the output of the first 3 baits of the NC_027308.1 sequence:

hello_baits$baits$NC_027308.1[1:3,]
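
Because '$baits' is indexed by sequence name (as in hello_baits$baits$NC_027308.1 above), it can be handled like a named list with one table per sequence. The sketch below shows how to list the sequences and count the baits in each one. It does not call supeRbaits: 'mock_baits' is a hand-made stand-in with made-up values, and 'scaffold_2' is a hypothetical sequence name used only for illustration.

```r
# Hypothetical stand-in for hello_baits$baits: a named list with one
# data frame per sequence. All values are made up for illustration.
mock_baits <- list(
  NC_027308.1 = data.frame(
    bait_type  = c("random", "random", "random"),
    bait_start = c(101, 560, 1203),
    bait_stop  = c(200, 659, 1302)
  ),
  scaffold_2 = data.frame(
    bait_type  = c("random", "random"),
    bait_start = c(45, 980),
    bait_stop  = c(144, 1079)
  )
)

# Names of the sequences that have a table of baits
seq_names <- names(mock_baits)

# Number of baits (rows) in each per-sequence table
baits_per_seq <- vapply(mock_baits, nrow, integer(1))

print(seq_names)
print(baits_per_seq)
```

With a real run, replacing 'mock_baits' by hello_baits$baits should give the same kind of overview, assuming the output keeps the one-table-per-sequence layout shown above.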

The column 'bait_type' gives the type of bait ('random', 'target' or 'region'). If no target or region areas have been indicated, all baits appear under the category 'random', as in the hello_baits example. The other columns in '$baits' are: the start and end position of the bait in base pairs (bait_start and bait_stop); the length of the bait (bait_seq_size); the counts of each nucleotide (no_A, no_T, no_G, no_C); the count of unknown bases (no_UNK); the counts of AT (no_AT) and GC (no_GC) bases; and the percentage of GC content (pGC).
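
The columns described above can be summarised with base R once the per-sequence tables are stacked into one. The following is a minimal sketch on a hand-made stand-in object ('mock_baits', made-up values, only a subset of the columns). It assumes that pGC is expressed on a 0 to 100 scale; the 40 to 60 window and the sequence name 'scaffold_2' are arbitrary choices for illustration.

```r
# Hypothetical stand-in for hello_baits$baits (made-up values).
# Column names follow the ones described above; pGC is ASSUMED to be
# on a 0-100 scale.
mock_baits <- list(
  NC_027308.1 = data.frame(
    bait_type     = c("random", "random", "random"),
    bait_start    = c(101, 560, 1203),
    bait_stop     = c(200, 659, 1302),
    bait_seq_size = c(100, 100, 100),
    no_AT         = c(58, 45, 62),
    no_GC         = c(42, 55, 38),
    pGC           = c(42, 55, 38)
  ),
  scaffold_2 = data.frame(
    bait_type     = c("random", "random"),
    bait_start    = c(45, 980),
    bait_stop     = c(144, 1079),
    bait_seq_size = c(100, 100),
    no_AT         = c(39, 53),
    no_GC         = c(61, 47),
    pGC           = c(61, 47)
  )
)

# Stack the per-sequence tables into one, keeping the sequence name
all_baits <- do.call(rbind, lapply(names(mock_baits), function(nm) {
  data.frame(sequence = nm, mock_baits[[nm]])
}))

# Distribution of GC content across all baits
print(summary(all_baits$pGC))

# Baits whose GC content falls inside an illustrative 40-60 window
gc_ok <- all_baits[all_baits$pGC >= 40 & all_baits$pGC <= 60, ]
print(gc_ok[, c("sequence", "bait_start", "bait_stop", "pGC")])
```

This kind of after-the-fact subsetting is separate from the gc argument of do_baits(), which applies the filter during the run.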

The '$excluded_baits' part of the output contains the bait sequences that were filtered out by the filtering criteria given in the command. For instance, this is where supeRbaits stores baits whose GC content is lower or higher than the range specified with the gc argument. In the hello_baits example this part of the output is empty: the gc argument was not used, so no GC filter was applied and all baits were stored under '$baits'. See this section of the Tutorial to learn more about how to set up a GC filter.

Under '$input.summary' you can find information about the run and the datasets (i.e. a summary of the sequence/chromosome/scaffold lengths, the different input files, and the desired bait size). When an exclusion, target or region file is specified, the information on those areas is also included, under '$input.summary$exclusions', '$input.summary$targets' and '$input.summary$regions'. Go to this section of the Tutorial to read about how to set up exclusion areas where you don't want your baits to be, and to this other section to learn how to generate baits in target and region areas.

If you run the blast_baits() function, an extra column ('n_matches') is added to the supeRbaits output under '$baits'. The new column shows the number of times each bait matched the reference indicated in blast_baits() (the default reference is the one used in the do_baits() run that produced the input to blast_baits()). Based on this column, users can decide which baits to keep, depending on their filtering criteria or research question.
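
As an illustration of that last step, the sketch below filters a table of baits on 'n_matches'. It uses a hand-made stand-in table ('mock_blasted', made-up values) rather than real blast_baits() output, and the rule of keeping baits with exactly one match is only one possible criterion, chosen here as an assumption.

```r
# Hypothetical stand-in for one per-sequence table under $baits after
# running blast_baits(). All values are made up for illustration.
mock_blasted <- data.frame(
  bait_start = c(101, 560, 1203, 2050),
  bait_stop  = c(200, 659, 1302, 2149),
  pGC        = c(42, 55, 38, 50),
  n_matches  = c(1, 3, 1, 0)
)

# How many baits fall in each match count
match_table <- table(mock_blasted$n_matches)
print(match_table)

# Keep baits that matched the reference exactly once
# (illustrative criterion; adjust to your own research question)
unique_baits <- mock_blasted[mock_blasted$n_matches == 1, ]
print(unique_baits)
```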