Deep MSA and Statistical Coupling Analysis

1. Compile a database of homologous sequences

All of the scripts I used are available in /ifs/scratch/home/mm6732/. Update the file paths in each script to match your own setup before running it.

Install HMMER, which provides phmmer
phmmer compares a query protein sequence against a database of protein sequences. The database used here is stored on the server at /ifs/data/glab/uniref90/uniref90.fasta
Run phmmer -o output.txt query_protein.fasta /path/to/database. This takes a FASTA file containing the query amino acid sequence and writes a HMMER text report of ranked homologs to output.txt.
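If you want to post-process the hits in Python, one option is to also ask phmmer for its tabular output with --tblout (a standard HMMER option) and parse that instead of the human-readable report. The sketch below is not one of the scripts in /ifs/scratch/home/mm6732/. It assumes the standard --tblout column order (target name first, full-sequence E-value fifth, bit score sixth) and that target names look like UniRef90_<accession>. The function names, the E-value cutoff, and the sample rows are made up for illustration.

```python
def parse_tblout(lines, max_evalue=1e-5):
    """Return (target_name, e_value, bit_score) for hits at or below max_evalue.

    Assumes the whitespace-separated layout written by `phmmer --tblout`:
    18 fixed columns followed by a free-text description.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header/footer
        fields = line.split(maxsplit=18)
        if len(fields) < 18:
            continue  # not a complete hit row
        target = fields[0]
        evalue = float(fields[4])  # full-sequence E-value
        score = float(fields[5])   # full-sequence bit score
        if evalue <= max_evalue:
            hits.append((target, evalue, score))
    return hits


def uniref_accession(target_name):
    """Strip a leading 'UniRef90_' prefix, if present."""
    prefix = "UniRef90_"
    if target_name.startswith(prefix):
        return target_name[len(prefix):]
    return target_name


# Made-up example rows (placeholder accessions, not real phmmer output).
SAMPLE = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "UniRef90_ACC0001 - query1 - 1.2e-200 665.1 0.0 1.4e-200 664.9 0.0 1.0 1 0 0 1 1 1 1 hypothetical protein A",
    "UniRef90_ACC0002 - query1 - 0.5 12.3 0.1 0.7 11.9 0.1 1.2 1 0 0 1 1 0 0 hypothetical protein B",
]

hits = parse_tblout(SAMPLE)
accessions = [uniref_accession(target) for target, _, _ in hits]
```

With a real file you would pass an open file handle, for example parse_tblout(open("hits.tbl")), where hits.tbl is whatever name you gave to --tblout.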

The phmmer output contains only accession numbers, not EC numbers or sequences, so you will need to map each accession number to its EC number.
Start by making a copy of the phmmer output text file and adding a column of EC numbers, obtained by mapping the accession numbers to the UniRef database. [accession_to_ec.py]
Then filter the original phmmer output text file to keep only the accession numbers that map to the EC number of your protein. [filter_phmmer_ec.py]
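The two steps above can be sketched as follows. This is an illustration of the idea only, not the contents of accession_to_ec.py or filter_phmmer_ec.py. It assumes a hypothetical tab-separated mapping table with one accession per row and its EC numbers in the second column, separated by semicolons. The accessions are placeholders, and 2.7.1.11 (6-phosphofructokinase) is used as the example target EC number.

```python
import csv
import io


def load_ec_map(handle):
    """Read a tab-separated 'accession<TAB>EC[;EC...]' table into a dict of sets.

    The table layout is an assumption made for this sketch.
    """
    ec_map = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip comments and rows without an EC column
        accession, ec_field = row[0], row[1]
        ec_numbers = {ec.strip() for ec in ec_field.split(";") if ec.strip()}
        ec_map.setdefault(accession, set()).update(ec_numbers)
    return ec_map


def filter_by_ec(accessions, ec_map, target_ec):
    """Keep accessions annotated with target_ec, preserving input order."""
    return [acc for acc in accessions if target_ec in ec_map.get(acc, set())]


# Placeholder mapping table; ACC0004 has no EC annotation at all.
MAPPING_TSV = (
    "ACC0001\t2.7.1.11\n"
    "ACC0002\t4.2.1.11\n"
    "ACC0003\t2.7.1.11; 2.7.1.90\n"
    "ACC0004\t\n"
)

ec_map = load_ec_map(io.StringIO(MAPPING_TSV))
ranked_hits = ["ACC0001", "ACC0002", "ACC0003", "ACC0004", "ACC0005"]
kept = filter_by_ec(ranked_hits, ec_map, "2.7.1.11")
```

Hits with no EC annotation, or with no entry in the table, are dropped here. Whether to drop or keep unannotated homologs is a judgment call that affects how deep the final alignment is.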

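Install MAFFT
Command-line instructions for running MAFFT, including the options for its different alignment algorithms, are available on the MAFFT website. L-INS-i tends to be faster than E-INS-i.
Example using --localpair (the L-INS-i pairwise mode) and all 128 server threads: mafft --thread 128 --localpair pfk.fasta > pfk.aln
Note that the MAFFT documentation defines L-INS-i as --localpair --maxiterate 1000. With --localpair alone, as in the command above, the iterative refinement step is skipped.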
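Before handing the alignment to pySCA, it can be worth a quick check that every row has the same length and that no sequence is mostly gaps. The sketch below reads an aligned FASTA file (MAFFT's default output format) using only the standard library. The function names and the tiny example alignment are made up for illustration, and this check is not part of the pySCA workflow itself.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) tuples from FASTA-formatted lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def alignment_summary(records):
    """Return (n_sequences, alignment_width, {header: gap_fraction}).

    Raises ValueError if the rows are empty or differ in length,
    which would mean the file is not a valid alignment.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError(f"expected one alignment width, found {sorted(lengths)}")
    width = lengths.pop()
    gap_fractions = {header: seq.count("-") / width for header, seq in records}
    return len(records), width, gap_fractions


# Toy alignment, wrapped over two lines per record as MAFFT output often is.
EXAMPLE = """\
>seq1
MK-L
V
>seq2
M--L
V
""".splitlines()

n_seqs, width, gaps = alignment_summary(read_fasta(EXAMPLE))
```

For the alignment produced above you would call read_fasta(open("pfk.aln")) instead of using the toy example.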

Follow the installation, processing, and calculation steps on the pySCA website: https://ranganathanlab.gitlab.io/pySCA/install/
My code for visualizing and bootstrapping SCA data is on the server at share/PFK_Project/melody/pySCA/data