Design Doc: Detect Restriction Modification Systems in Bacteria (TScottLUC/rmsystemsproject)
Overview
REBASE (The Restriction Enzyme Database) provides a BLAST tool that lets researchers see which restriction modification (RM) systems may be present in their genomes. However, it currently accepts only one sequence at a time, with a maximum length of 1,000,000 base pairs. We will develop a solution that allows batch runs of genomes (via multi-FASTA files/assemblies) against REBASE and produces a report of the results. To implement this solution, we will use two Python scripts to automate the process.
The first script will be a “setup” script. It will retrieve sequence files from REBASE via FTP and use them to create a local BLAST database. It will also retrieve the data files that contain information about each enzyme in the database. REBASE updates these files regularly and distributes them each month, so we recommend updating the local database and data files whenever a new distribution becomes available.
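A minimal sketch of the setup steps, assuming anonymous FTP access and that the downloaded sequence file is already valid FASTA input for `makeblastdb`. The function names are hypothetical, and the host, remote directory, file names, and database type (`nucl` or `prot`) are left as parameters because this doc does not specify them.

```python
import subprocess
from ftplib import FTP
from pathlib import Path


def download_files(host, remote_dir, filenames, dest_dir):
    """Download each named file from an FTP directory into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login()  # anonymous login (assumption)
        ftp.cwd(remote_dir)
        for name in filenames:
            with open(dest / name, "wb") as handle:
                ftp.retrbinary(f"RETR {name}", handle.write)
    return [dest / name for name in filenames]


def makeblastdb_command(fasta_path, db_name, dbtype):
    """Build the BLAST+ makeblastdb command line for a local database."""
    return [
        "makeblastdb",
        "-in", str(fasta_path),
        "-dbtype", dbtype,  # "nucl" or "prot"; not specified in this doc
        "-out", str(db_name),
    ]


def build_database(fasta_path, db_name, dbtype):
    """Run makeblastdb; requires BLAST+ to be installed and on PATH."""
    subprocess.run(makeblastdb_command(fasta_path, db_name, dbtype), check=True)
```

Keeping the command construction separate from the call that runs it makes the setup script easy to check without BLAST+ installed. Rerunning `download_files` and `build_database` after each monthly REBASE distribution would refresh the local copy.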
The second script will run the analysis. For each multi-FASTA file provided (most likely all in the same directory), it will BLAST the file against the local database and keep only the top hit for each genome. That top hit will then be looked up in the REBASE data files. An output file will be created containing the genome name, contig of the top hit, name of the system, type of the system, and closest sequence for each genome run. Once the setup script has been run, this script can be used to screen any number of genomes.
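The top-hit step could be sketched as follows. It assumes BLAST is run with tabular output (`-outfmt 6`, whose default columns are listed in `BLAST_COLUMNS`) and that "top hit" means the highest bit score across all contigs of a genome; this doc specifies neither. The function names are hypothetical, and the BLAST program (for example `blastn` or `blastx`) is a parameter because the doc does not name one.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def blast_command(program, query_fasta, db_name):
    """Build a BLAST+ command that prints tabular hits to stdout."""
    return [program, "-query", str(query_fasta), "-db", str(db_name), "-outfmt", "6"]


def top_hit(tabular_text):
    """Return the hit with the highest bit score, or None if there are no hits.

    qseqid is the contig of the genome that matched; sseqid is the
    closest database sequence.
    """
    best = None
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, line.split("\t")))
        hit["bitscore"] = float(hit["bitscore"])
        if best is None or hit["bitscore"] > best["bitscore"]:
            best = hit
    return best
```

Because every contig in a multi-FASTA file is a separate BLAST query, scanning all rows for one genome and keeping the best one yields both the contig (`qseqid`) and the closest sequence (`sseqid`) in a single pass.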
Context
Restriction modification (RM) systems in bacteria provide defense against foreign DNA (plasmids, bacteriophages, etc.). Dr. Putonti and her lab have 1,500 genomes that they are examining for phages and plasmids. To complement this analysis, they would like to screen the bacterial genomes for RM systems as a possible explanation for the phages and plasmids found in the initial analysis (or the lack thereof). The current way to screen for RM systems is through REBASE (The Restriction Enzyme Database), a “dynamic, curated database of restriction enzymes and related proteins.” You can BLAST a sequence against REBASE at http://tools.neb.com/blast/. The results show which RM systems are present in the sequence, including their type, length, specificity, and other information. However, this site was designed to BLAST just one genome at a time. Dr. Putonti wants to batch run hundreds of genomes at once instead of entering each genome individually. Furthermore, REBASE limits each submitted sequence to 1,000,000 bp, which is a further hurdle when entire genomes need to be queried. Therefore, a solution is needed that allows large batch runs of genomes against REBASE and provides a report of the results for all genomes screened.
Goals and Non-Goals
Goals
- Local BLAST must match the web results
- Must be able to batch run 100s of genomes at a time
- Length of sequences should not matter
Results needed for each sequence in a batch run:
- Best result for each genome
- What RM system?
- RM system type
- Closest sequence
- What contig of assembly?
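The report fields listed above could be captured in a small record type and written out as a tab-separated file. This is only a sketch: the class, field, and function names are hypothetical, and the tab-separated layout is an assumption, since this doc does not fix an output format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ReportRow:
    genome: str            # genome (input file) name
    contig: str            # contig of the assembly containing the top hit
    system_name: str       # which RM system
    system_type: str       # RM system type
    closest_sequence: str  # closest REBASE sequence


def write_report(rows, handle):
    """Write a header line and one tab-separated line per genome."""
    writer = csv.DictWriter(
        handle,
        fieldnames=[f.name for f in fields(ReportRow)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


def report_text(rows):
    """Return the report as a string (convenient for previews and checks)."""
    buffer = io.StringIO()
    write_report(rows, buffer)
    return buffer.getvalue()
```

For a real run, `write_report` would be passed an open file handle, for example one created with `open(path, "w", newline="")`.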
If time permits: Statistical analysis
- Hypothesis: Strains with a recognizable RM system would not have plasmids or prophages
- For each strain, record yes/no for an RM system and yes/no for plasmids and prophages
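The yes/no data described above forms a 2x2 contingency table. This doc does not name a statistical test; one common choice for such a table is Fisher's exact test, sketched here with only the standard library. The function names are hypothetical, and the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def contingency_table(records):
    """Count (has_rm, has_mobile) boolean pairs into a 2x2 table.

    Returns (a, b, c, d):
      a = RM yes, plasmid/prophage yes    b = RM yes, plasmid/prophage no
      c = RM no,  plasmid/prophage yes    d = RM no,  plasmid/prophage no
    """
    a = b = c = d = 0
    for has_rm, has_mobile in records:
        if has_rm and has_mobile:
            a += 1
        elif has_rm:
            b += 1
        elif has_mobile:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with the row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # The small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
```

A small p-value would indicate that RM system presence and plasmid/prophage presence are not independent in the screened strains; the direction of the association still has to be read from the table itself.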
Non-Goals
- Reporting more than one top hit per genome
- Statistical analysis (may not be completed)
- Reporting any results other than those listed above
Proposed Solution
Main Pipeline:
Statistical Analysis:
Milestones
Week of... | Thomas | Matt | Anne | Deadlines |
---|---|---|---|---|
Mar 16 | Start RebaseBlast.py script, make pipeline & statistical analysis flowcharts, explore FTP (sequence and data files) | Create local database, research background information | Prepare slides, research background information | Repo Check 1 |
Mar 23 | Set up BLAST script for multi-FASTA files, test examples in terminal | Finalize local database creation, test examples in terminal | Finish presentation slides, test examples in terminal | Initial Presentation |
Mar 30 | Prepare for progress presentation, retrieve and parse data files, test small dataset | Research output options in REBASE, prepare for presentation, test small dataset, automate FTP for database script | Start app note draft, test small dataset in terminal, prepare for progress presentation | 5 Min Progress Presentation |
Apr 6 | Test running both scripts, help with app note as needed | Test running both scripts, help with app note as needed | Retrieve genomes with plasmids/phages from Dr. Putonti, run both scripts together, continue app note draft | Repo Check 2 |
Apr 13 | Work on app note, start statistical analysis (plasmids), troubleshoot errors | Work on app note, start statistical analysis if time (plasmids), troubleshoot errors | Work on app note, start statistical analysis if time (plasmids), troubleshoot errors | 5 Min Progress Presentation, Rough Draft App Note |
Apr 20 | Statistical analysis if time (phages), finish final app note | Statistical analysis if time (phages), finish final app note | Statistical analysis if time (phages), finish final app note | Repo Check 3, Final App Note |
Authors: Thomas Scott, Anne Jankowski, Matt Loffredo