2. Tutorial (Basic Settings) - AmirUCR/allegro GitHub Wiki

This tutorial assumes that you have downloaded ALLEGRO and installed its dependencies, and that python src/main.py --soundcheck produces a success message and exits. To conduct an experiment using the default settings in your config.yaml file, simply execute the following:

python src/main.py

ALLEGRO will output the smallest gRNA library to target every record/gene in the 50 input files and place your library under data/output/ALLEGRO_EXAMPLE_RUN/ALLEGRO_EXAMPLE_RUN_library.txt.

Configuring ALLEGRO (Basic Settings)

There are two ways to configure ALLEGRO: via command line arguments, or by directly modifying config.yaml. Any argument you specify on the command line overrides its counterpart in config.yaml, and any argument you leave unspecified falls back to the value in the config. If you simply run python src/main.py, ALLEGRO takes every argument from the config.
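
The precedence rule above can be sketched in a few lines of Python. This is a minimal illustration, not ALLEGRO's actual code: the function name resolve_settings is hypothetical, the dictionary values are made up for the example, and None is assumed to stand for "not given on the command line".

```python
def resolve_settings(config, cli_args):
    """Overlay explicitly given command-line values on top of config values."""
    resolved = dict(config)  # start from everything in config.yaml
    for key, value in cli_args.items():
        # Assumption: None marks an argument that was not passed on the CLI.
        if value is not None:
            resolved[key] = value
    return resolved


# Illustrative values only; the keys are parameter names from this tutorial.
config_values = {
    "experiment_name": "ALLEGRO_EXAMPLE_RUN",
    "track": "track_e",
    "multiplicity": 1,
}
cli_values = {"experiment_name": None, "track": "track_a", "multiplicity": None}

settings = resolve_settings(config_values, cli_values)
print(settings)
```

Here only track was given on the command line, so it replaces the config value while experiment_name and multiplicity keep the values from the config.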

We will continue exploring the basic capabilities of ALLEGRO using the provided example input. First we will open config.yaml and modify a few parameters. Go ahead and modify the value of experiment_name (-n) as you wish. We will leave path settings as they are for now, but here's a brief description:

  1. input_directory or -id
  • By default points to 'data/input/example_input'. This is where your input fasta files live. The directory must contain at least one fasta file with at least one fasta record, which ALLEGRO reads as input.
  1. input_species_path or -isp
  • By default points to 'data/input/fifty_example_input_species.csv'. You can create your own input species CSV file and point ALLEGRO to it. Note that ALLEGRO requires this CSV file to have at least two columns, one of which must be named 'species_name'. You may name the second column whatever you wish; however, its values must correspond to file names that exist under the directory specified by the input_directory parameter above. For example, if your input_species.csv looks like the following:

    species_name filename
    test_fasta my_test_fasta.fna

    and you have specified input_directory: 'data/input/my_test_directory/', then your ALLEGRO file structure must look like:

    ├── data
    │   ├── input
    │   │   ├── my_test_directory
    │   │   │   ├── my_test_fasta.fna
    

    You may also refer to the provided data/input/fourdbs_input_species.csv to inspect the file we used for our experiments. Notice how you may have as many columns as you want, but ALLEGRO will only use 'species_name' and the other column(s) specified in the config file.

  1. input_species_path_column or -ispc
  • This tells ALLEGRO the name of the second required column in the input_species_path CSV file as described above. In the example above, we have two columns: 'species_name', and 'filename'. Therefore, the value for this parameter would be 'filename'.
  1. track or -t, and multiplicity or -m
  • The value for track may be 'track_a' or 'track_e' (or simply 'a' or 'e'). By specifying track: 'track_a' (and multiplicity: 1), you require ALLEGRO to generate a guide RNA library that includes guides targeting anywhere in each input fasta file at least once. Increasing the multiplicity parameter increases the required number of guides per input fasta. By specifying track: 'track_e', you require ALLEGRO to target each gene/record in each input fasta file at least once, increased by the multiplicity.

    In the most trivial example, using track: 'track_a' and multiplicity: 1 on the following my_test_fasta.fna input

    >gene1
    AAAAAAAAAAAAAAAAAAAATGG|TTTTTTTTTTTTTTTTTTTTTGG
    >gene2
    ACACACACACACACACACACTGG|CCCCCCCCCCCCCCCCCCCCTGG
    

    yields a single guide as output, whereas using track: 'track_e' and multiplicity: 1 yields 2 guides, one to target gene1 and one to target gene2. Using track: 'track_e' and multiplicity: 2 yields 4 guides with 2 guides per gene. Using a higher multiplicity in this example causes ALLEGRO to warn you that not enough guides exist and to exit gracefully. Note that you may mark the boundary between an intron and an exon with the pipe | character. Guides that are split by (or span) this delimiter are ignored by ALLEGRO.

  1. filter_by_gc or -gc
  • Dictates whether ALLEGRO excludes guides whose GC content falls outside the range specified by gc_min and gc_max. This is a boolean True/False value. It is not a string ('False' with the quotation marks is not valid), and it must be capitalized (false is not valid). gc_max and gc_min are floating point values. For example, if gc_max: 0.7, a guide with a GC content of 0.71 is excluded while a guide with a GC content of 0.7 is included.
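
Two of the rules in the list above can be sketched together in Python: guides that span the | delimiter are ignored, and with filter_by_gc enabled only guides inside the gc_min to gc_max range are kept. This is an illustrative sketch, not ALLEGRO's implementation. The function names are hypothetical, only the forward strand is scanned, a guide is assumed to be 20 nt followed by an NGG PAM, GC content is computed over the guide without its PAM, the lower bound is assumed to be inclusive like the upper one, and the gc_min and gc_max values are made up for the example.

```python
def candidate_guides(sequence, guide_len=20):
    """Return forward-strand guides that are followed by an NGG PAM."""
    guides = []
    # Splitting on '|' guarantees that no window spans the delimiter.
    for segment in sequence.upper().split("|"):
        for start in range(len(segment) - guide_len - 2):
            pam = segment[start + guide_len : start + guide_len + 3]
            if pam[1:] == "GG":  # the first PAM base may be any nucleotide
                guides.append(segment[start : start + guide_len])
    return guides


def gc_content(guide):
    """Fraction of G and C bases in the guide."""
    return sum(base in "GC" for base in guide) / len(guide)


def passes_gc_filter(guide, gc_min, gc_max):
    """Keep guides whose GC content lies inside [gc_min, gc_max]."""
    return gc_min <= gc_content(guide) <= gc_max


# Sequences built to mirror the my_test_fasta.fna example above.
gene1 = "A" * 20 + "TGG" + "|" + "T" * 20 + "TGG"
gene2 = "AC" * 10 + "TGG" + "|" + "C" * 20 + "TGG"

all_guides = candidate_guides(gene1) + candidate_guides(gene2)
kept = [g for g in all_guides if passes_gc_filter(g, gc_min=0.3, gc_max=0.7)]
print(len(all_guides), kept)
```

With these made-up bounds, only the alternating AC guide (GC content 0.5) survives the filter; the all-A and all-T guides fall below gc_min and the all-C guide exceeds gc_max.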

Manually Excluding Certain Guides

When you navigate to data/input, you will see an empty text file called _the_blocklist_.txt. Place any guides you want ALLEGRO to ignore in its calculation inside this file, one per line and without their NGG PAM.
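
The effect of the blocklist can be pictured with the following Python sketch. It is an assumption-laden illustration rather than ALLEGRO's own code: parse_blocklist and drop_blocked are hypothetical names, and ignoring blank lines and letter case is a choice made for this example only.

```python
def parse_blocklist(text):
    """Turn blocklist file contents into a set of guide sequences."""
    # Assumption: one guide per line, blank lines ignored, case-insensitive.
    return {line.strip().upper() for line in text.splitlines() if line.strip()}


def drop_blocked(guides, blocklist):
    """Return the guides that do not appear in the blocklist."""
    return [guide for guide in guides if guide.upper() not in blocklist]


# Example file contents: two guides listed without their NGG PAM.
blocklist_text = "AAAAAAAAAAAAAAAAAAAA\n\nCCCCCCCCCCCCCCCCCCCC\n"
blocklist = parse_blocklist(blocklist_text)

guides = ["AAAAAAAAAAAAAAAAAAAA", "ACACACACACACACACACAC", "CCCCCCCCCCCCCCCCCCCC"]
remaining = drop_blocked(guides, blocklist)
print(remaining)
```

In practice the text would come from the file itself, for instance via pathlib.Path('data/input/_the_blocklist_.txt').read_text().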