How to Create and Use Your Own Peptide Database in pepFunk - northomics/pepFunk GitHub Wiki

By Isaac Kuk and Caitlin Simopoulos

pepFunk allows you to analyze a peptide sample for its functional groups in one of the annotation types supported (KEGG and soon to be COG, eggNOG). This short tutorial provides instructions and example code which demonstrates how to create your own peptide-to-annotation database for use in pepFunk.

A database for pepFunk includes the following: a complete list of peptides in a given environment, the annotation terms (whichever type) which correspond to each peptide, and the frequency which a peptide will belong to an annotation term.

To compile this data, one will require a list of the proteins in the environment of interest,the peptides which correspond to each protein when digested, and the UniRef90 terms which correspond to each protein. This can then be matched to a database of UniRef90 matched with the annotation of choice, which can be found here under UniProtKB:ID mapping.

The example code below will follow a process to create a eggNOG-peptide dataset for pepFunk.

1. Using Diamond to match your protein database to the UniRef90 database

First, you'll need to match your proteins to the UniRef90 database. Here is an example command and also the command that was used to generate the pepFunk KEGG-peptide database.

diamond blastp --db uniref90.dmnd -q yourproteindb.fa -o diamondout.m8 –sensitive -e 0.1 –top 5 -f 6 qseqid qlen sseqid slen evalue length mident

2. In silico tryptic protein digestion

You will also need to create a file of peptides with their corresponding protein. Feel free to use your own in silico digestion tool. We have a tool available on the Northomics GitHub.

NOTE: These steps can be very slow with large databases. We recommend you split your files and run in parallel if possible.

3. Extracting relevant terms from UniProtKB using Bash

Once you have prepared your matched peptide and protein files, you're ready to start annotating. From the large UniProtKB annotation file only select the UniRef90 terms which correspond to an eggNOG term. This following process is done with Bash, a Unix shell. (This can be done with any annotation type.)

## this is to get the UniRef90 and eggNOG annotations of proteins
grep 'eggNOG\|UniRef90' idmapping.UniprotKB2all.dat > eggNOGUniRef90.idmapping.dat  #get the lines which contain UniRef90 and eggNOG terms

The file will be very large, so in order to make it easier to work with in RStudio, you can split the dataset into smaller fractions.

wc -l eggNOGUniRef90.idmapping.dat #count the number of lines in the file
split -l 1000000 eggNOGUniRef90.idmapping.dat UR2eggNOG #split the dataset into 1M-line files with names UR2eggNOGaa etc.

4. Selecting for useful data using R

In RStudio, you will need to load the tidyverse package, which contains commands used in the processing of data.

## this is to select for the proteins which have an eggNOG annotation
install.packages(tidyverse)
library(tidyverse)

Next, filter each fraction to only include the UniRef90 terms which have corresponding eggNOG terms. For each fraction, import the dataset into the R environment under "File" > "Import Dataset" > "From text (base)".

# repeat for each UR2eggNOG dataset
# choose "File", "Import Dataset"
mydata1 <- UR2eggNOGaa #rename dataset
mydata1a <- mydata1 %>% rename(ID = V1,
                   AnType = V2,
                   AnID = V3) #rename columns
mydata1b <- mydata1a %>% 
  pivot_wider(names_from = AnType,
              values_from = AnID) #matches the eggNOG terms to the corresponding UniRef90 terms, using the UniProtKB ID
mydata1c <- mydata1b %>% filter(eggNOG != "NULL") #get rid of rows w/out both eggNOG and UniRef90 annotations

Combine all the processed fractions into one dataset. It is important to note that sometimes a fraction may contain "list" type data in a column, and this must be converted to "character" type data to be combined. This is done using the unnest() command, as shown below.

mydata1d <- unnest(mydata1c, UniRef90) #converts a dataset where the UniRef90 column was of the "list" class to a "character" class
allmydata <- bind_rows(mydata1d, mydata2c, ...) #this will combine all the processed dataframes into one by columns

With the processed UniRef90-to-eggNOG dataset, it is now possible to combine all the data. Create a peptide-protein-UniRef90-eggNOG dataset, so that each peptide is matched with all the eggNOG terms which it may correspond with.

Pep2Prot <- select(Pep2Prot, -X) #get rid of extra column in imported dataset
Prot2UR <- select(Prot2UR, -X) #same
Pep2Prot2UR <- left_join(Pep2Prot, Prot2UR) #joins the datasets by the values of the "Protein" column. Values must be of same type in both datasets
Pep2Prot2UR2eggNOG <- left_join(Pep2Prot2UR, allmydata) #joins datasets by "UniRef90", the common column name
P2P2UR2Ea <- Pep2Prot2UR2eggNOG %>% filter(eggNOG != "NULL") #filters out any peptides without an eggNOG annotation
P2P2UR2Eb <- select(P2P2UR2Ea, -ID) #get rid of ID column, which is no longer necessary
P2P2UR2Ec <- unnest(P2P2UR2Eb, eggNOG) #some peptides have >1 eggNOG terms, which are in a list. This expands them to give each eggNOG term its own row

Summarize the frequency with which a peptide belongs to any eggNOG term by adding up the times which it corresponds to a given eggNOG term.

#counting frequency of peptide occurrence in eggNOG terms
count_Pep2eggNOG <- P2P2UR2Ec %>% count(Peptide, eggNOG) #counts the number of times each peptide occurs in any eggNOG annotation

Finally, save your dataset as a .csv file.

write.csv(count_Pep2eggNOG, file = "MyPepFunkDataset.csv") #prints the dataset as a .csv file

Your database is ready for use in pepFunk!