P2Rank custom feature

What you are going to learn here:

how to employ a custom feature in p2rank without changing p2rank source code

Prerequisites

You should have p2rank 2.3 installed and the relevant [datasets] downloaded (as described in the setup guide). All commands must be run from the directory created in the setup guide. You should also know how to train and evaluate a model without conservation. Finally, you need to be able to compute your feature at the per-residue level, i.e. to have a value for every residue of the molecule.

Introduction

Adding a new feature to p2rank can be done by implementing one of the interfaces. Alternatively, values for a new feature can be loaded from CSV files using csv_file_atom_feature.

The main idea of csv_file_atom_feature is to allow a user to quickly test a new feature with p2rank. It should not be used as a long-term solution. Once a feature proves to be effective, it should be integrated into p2rank for better performance and re-usability.

File Format

csv_file_atom_feature reads values from a valid CSV file, in which a value must be specified for each atom or residue. For example:

"chain","ins. code","seq. code","conservation"
"E",,"157","0.43317"
"E",,"156","0.4811"
"E",,"159","0.0"
"E",,"158","0.41996"
...

"pdb serial","conservation"
"1","0.43317"
"2","0.4811"
"3","0.0"
"4","0.41996"
...

In the first example, values are specified per residue; in the second example, values are specified per atom. The columns pdb serial, chain, ins. code and seq. code are used to identify the atom or residue. The remaining columns contain the superset of features to be tested. As you will see below, you can use either a single column as the feature to test, or multiple columns if your feature consists of multiple dimensions. The names of these columns do not matter as they are not used further in the pipeline. If the name of a column starts with #, the column is ignored.
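As a minimal sketch of producing a per-residue file in the layout shown above, the following Python snippet writes the same four rows. The helper names (quote, write_residue_csv) are hypothetical and not part of p2rank. The quoting convention (non-empty values quoted, an empty insertion code left bare) simply copies the example; whether p2rank accepts other valid CSV quoting styles is not confirmed here.

```python
import io

# Header copied from the per-residue example above.
HEADER = ["chain", "ins. code", "seq. code", "conservation"]


def quote(value):
    """Quote non-empty values; leave empty ones bare, as in the example."""
    text = "" if value is None else str(value)
    if not text:
        return ""
    return '"' + text.replace('"', '""') + '"'


def write_residue_csv(stream, rows, header=HEADER):
    """Write the header and one line per residue to a text stream."""
    stream.write(",".join(quote(name) for name in header) + "\n")
    for row in rows:
        stream.write(",".join(quote(value) for value in row) + "\n")


# (chain, insertion code, sequence code, feature value)
rows = [
    ("E", "", 157, 0.43317),
    ("E", "", 156, 0.4811),
    ("E", "", 159, 0.0),
    ("E", "", 158, 0.41996),
]

buffer = io.StringIO()
write_residue_csv(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

To write to disk instead, pass an open text file in place of the StringIO buffer.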

Using the custom features

To use csv_file_atom_feature, you must specify the names of the feature columns and provide the directories where p2rank should look for the CSV files. A single CSV file corresponds to a single structure, and the files can be located in different directories.

A new model can be trained using the following command:

traineval -t .\datasets\chen11.ds -e .\datasets\joined.ds -features 'csv_file_atom_feature' -csv_file_feature_directories ',.\custom-feature,'

Keep in mind, however, that this overrides all features set in the configuration file, so you may need to list every feature you want to use in the -features option. Alternatively, you can add csv_file_atom_feature directly to the configuration file.

The -csv_file_feature_directories option specifies the directories to search for CSV files. The argument is a comma-separated list of paths.

In the example above, the .\custom-feature directory contains the CSV features for both datasets. It should also be possible to specify two directories, one per dataset.
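The following Python sketch assembles the arguments for the two-directory case, using only the options that appear in the command above. The helper name build_traineval_args and the two sub-directory paths are hypothetical. The leading and trailing commas mirror the original example; whether p2rank requires them is an assumption, not something confirmed here.

```python
def build_traineval_args(train_ds, eval_ds, feature_dirs):
    """Return the traineval arguments for the given datasets and CSV directories."""
    # Comma-separated list, wrapped in commas exactly as in the example command.
    directories = "," + ",".join(feature_dirs) + ","
    return [
        "traineval",
        "-t", train_ds,
        "-e", eval_ds,
        "-features", "csv_file_atom_feature",
        "-csv_file_feature_directories", directories,
    ]


# Hypothetical layout: one feature directory per dataset.
args = build_traineval_args(
    ".\\datasets\\chen11.ds",
    ".\\datasets\\joined.ds",
    [".\\custom-feature\\chen11", ".\\custom-feature\\joined"],
)
print(" ".join(args))
```

The resulting list still has to be passed to the p2rank launcher from the setup guide, for example as its command-line arguments.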

Feature directory

When executed, csv_file_atom_feature tries to load feature values for each protein processed by p2rank. Each protein has an associated structure file (as of now, only PDB files are supported), and csv_file_atom_feature uses the name of this file to locate the CSV file(s) with the relevant features.

For example, if the protein is loaded from the file 148lE.pdb, csv_file_atom_feature searches all given directories for files named 148lE.pdb.csv and loads the features from them.
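Before training, it can help to check that every structure has a matching CSV file. The sketch below mirrors the naming rule described above (structure file name plus .csv) in plain Python. It does not reproduce p2rank's actual lookup code: the function names are hypothetical, skipping empty directory entries is an assumption, and 1abcA.pdb is a made-up file name used only to show a miss.

```python
import tempfile
from pathlib import Path


def expected_csv_name(structure_file):
    """Return the CSV name for a structure file, e.g. 148lE.pdb -> 148lE.pdb.csv."""
    return Path(structure_file).name + ".csv"


def find_feature_csvs(structure_file, directories):
    """List existing CSV files for the structure across the given directories."""
    name = expected_csv_name(structure_file)
    # Skip empty entries such as those produced by leading or trailing commas.
    candidates = (Path(directory) / name for directory in directories if directory)
    return [path for path in candidates if path.is_file()]


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    feature_dir = Path(tmp) / "custom-feature"
    feature_dir.mkdir()
    (feature_dir / "148lE.pdb.csv").write_text(
        '"pdb serial","conservation"\n"1","0.43317"\n'
    )

    found = find_feature_csvs("148lE.pdb", ["", str(feature_dir), ""])
    found_names = [path.name for path in found]
    missing = find_feature_csvs("1abcA.pdb", [str(feature_dir)])

print(found_names)
print(missing)
```

An empty result for a structure means that no CSV file with the expected name exists in any of the listed directories.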

Example

Help wanted!