Structure Alignments (Structure) - singa-bio/singa GitHub Wiki
Within the structure package, SiNGA provides algorithms to perform alignments of macromolecular structures.
At the moment, the Fit3D algorithm for substructure search (see doi:10.1089/cmb.2014.0263) is implemented. It aligns structural motifs (implemented as StructuralMotif) against single target structures or batches of them.
To understand the usage of Fit3D one first has to consider the biological relevance of structural motifs:
Small subsets of amino acids/nucleotides are often responsible for ligand binding, ion or cofactor fixation or intrinsic structure stabilization. For an in-depth review of well-studied structural motifs refer to doi:10.1089/cmb.2014.0263. Besides this, template motifs can be used to represent protein superfamilies or infer evolutionary relationships.
Due to the rapid increase of structure data available in the Protein Data Bank, there is a vast data space to screen for known structural motifs in order to predict protein function, select appropriate templates or discover new binding sites. The problem of structural motif alignment is depicted in the following figure:
SiNGA features easy guidance through the whole process of structural motif screening:
- extraction of the query structural motif and
- screening in one or multiple target structures.
The following sections address different scenarios, which you might encounter during your research.
A very popular and well-studied example of a structural motif is the catalytic triad of serine proteases (the upper left in the grid of the motif examples figure). This catalytic site consists of three residues that can even be spread across multiple protein chains (we call that inter-molecular), as is the case for 4CHA. For that particular protein, residues 57 and 102 of chain B and residue 195 of chain C comprise the motif. This is also a good example to show that the residues of a structural motif are not necessarily neighbors at the protein sequence level and are thus sometimes hard to detect by multiple sequence alignment methods.
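The distinction between motifs within one chain and motifs spread across chains can be illustrated with a minimal sketch. It assumes identifiers in the simple "chain-residue" form used on this page (e.g. B-57); the class MotifChainCheck and its methods are hypothetical helpers written for illustration and are not part of SiNGA.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative helper, not part of SiNGA: classifies a motif by the number of chains it spans.
class MotifChainCheck {

    // Extracts the chain identifier from a simple "chain-residue" string such as "B-57".
    static String chainOf(String leafIdentifier) {
        int separator = leafIdentifier.indexOf('-');
        if (separator <= 0) {
            throw new IllegalArgumentException("expected <chain>-<residue>, got: " + leafIdentifier);
        }
        return leafIdentifier.substring(0, separator);
    }

    // Returns "inter" if the residues lie on more than one chain, otherwise "intra".
    static String classify(List<String> leafIdentifiers) {
        Set<String> chains = new LinkedHashSet<>();
        for (String identifier : leafIdentifiers) {
            chains.add(chainOf(identifier));
        }
        return chains.size() > 1 ? "inter" : "intra";
    }

    public static void main(String[] args) {
        // The catalytic triad of 4CHA spans chains B and C.
        System.out.println(classify(Arrays.asList("B-57", "B-102", "C-195")));
    }
}
```

For the 4CHA triad the sketch reports an inter-molecular motif, because the three residues are located on chains B and C.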
Within SiNGA, whole macromolecular structures are represented by Structure objects. Models, chains and structural motifs are organized in LeafSubstructureContainers that bundle multiple LeafSubstructures. To run a Fit3D search of the catalytic triad against another member of the same protein family, you have to parse the query structure and define the motif of interest:
Structure query = StructureParser.pdb()
.pdbIdentifier("4CHA")
.parse();
StructuralMotif queryMotif = StructuralMotif.fromLeafIdentifiers(query, LeafIdentifiers.of("B-57","B-102","C-195"));
Now you can easily run a search in an arbitrary target structure, e.g. 1KLT:
Structure target = StructureParser.pdb()
.pdbIdentifier("1KLT")
.parse();
Fit3D fit3d = Fit3DBuilder.create()
.query(queryMotif)
.target(target.getFirstModel()) // only use the first model
.run();
You can also perform this search against the single chain A of the target:
Structure target = StructureParser.pdb()
.pdbIdentifier("1KLT")
.chainIdentifier("A")
.parse();
Fit3D fit3d = Fit3DBuilder.create()
.query(queryMotif)
.target(target.getFirstChain()) // use only the first chain
.run();
In either case you can now retrieve and analyze the results of the search:
for (Fit3DMatch fit3DMatch : fit3d.getMatches()) {
    System.out.println(fit3DMatch.getRmsd());
}
That's it! You have successfully performed a Fit3D search against a single target. To analyze and use the produced results, see section "Analyze and use results".
Suppose you want to screen for an RNA structural motif: the binding site of the purine riboswitch family (Rfam-ID RF00167). It is well known that this binding site consists of residues 22, 47, 51 and 74, where uracil 74 is responsible for substrate specificity and recognizes adenine ligands. If this uracil is substituted with a cytosine, the substrate specificity shifts towards guanine.
Load a query structure that contains the structural motif of interest, e.g. chain A of 2EES:
Structure motifContainingStructure = StructureParser.pdb()
.pdbIdentifier("2EES")
.chainIdentifier("A")
.parse();
Extract the structural motif:
StructuralMotif queryMotif = StructuralMotif.fromLeafIdentifiers(motifContainingStructure, LeafIdentifiers.of("A-22", "A-47", "A-51", "A-74"));
Define a position-specific exchange of cytosine 74 to uracil:
queryMotif.addExchangeableFamily(LeafIdentifier.fromSimpleString("A-74"), NucleotideFamily.URIDINE);
To run a Fit3D search against multiple targets, you can specify them in a text file as PDB-IDs or paths to PDB files, one per line, e.g. targets.txt looks like:
1y26
1y27
2b57
2ees
...
or
/home/user/1y26.pdb
/home/user/1y27.pdb
/home/user/2b57.pdb
/home/user/2ees.pdb
...
Read the content of this file as List<String>, e.g. by using the Files-API:
List<String> identifiersFromFile = Files.readAllLines(Paths.get("/home/user/targets.txt"));
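Target lists that are edited by hand often contain blank lines or stray whitespace. The following is a minimal sketch of how such a list could be cleaned before it is handed to the parser. The class TargetListReader and its methods are hypothetical helpers (not part of SiNGA), and the check for PDB-IDs is only a heuristic that assumes classic four-character identifiers starting with a digit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of SiNGA.
class TargetListReader {

    // Trims every entry and drops blank lines, keeping the original order.
    static List<String> cleanEntries(List<String> rawLines) {
        List<String> entries = new ArrayList<>();
        for (String line : rawLines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    // Heuristic (assumption): a classic PDB-ID has four characters and starts with a digit.
    static boolean looksLikePdbIdentifier(String entry) {
        return entry.matches("[0-9][A-Za-z0-9]{3}");
    }

    // Reads the target file and returns the cleaned entries.
    static List<String> readTargets(Path file) {
        try {
            return cleanEntries(Files.readAllLines(file));
        } catch (IOException e) {
            throw new UncheckedIOException("could not read target list " + file, e);
        }
    }
}
```

The heuristic can be used to verify that a file does not mix PDB-IDs and file paths before the list is passed on.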
Using the targets you can create a MultiParser object:
StructureParser.MultiParser targets = StructureParser.pdb()
.pdbIdentifiers(identifiersFromFile)
.everything();
Now you are ready to set up a Fit3D search and obtain the matches found:
Fit3D fit3d = Fit3DBuilder.create()
.query(queryMotif)
.targets(targets)
.maximalParallelism()
.run();
That's it! You have successfully performed a Fit3D search against multiple targets in parallel. To analyze and use the produced results, see section "Analyze and use results".
You can write the aligned matches in PDB format for visual inspection (e.g. with PyMOL):
fit3d.writeMatches(Paths.get("/tmp/matches"));
The result for the purine riboswitch family could look like this:
All matches are stored as Fit3DMatch objects. The following table shows the elements contained in a Fit3DMatch object:
element | description |
---|---|
substructureSuperimposition | returns the superimposition of the match and the query |
rmsd | returns the root-mean-square deviation (RMSD) of the match |
pvalue | returns the p-value of the match, using the model of Fofanov et al. 2008 |
candidateMotif | returns the matched motif |
matchType | returns the type of match, either inter (across multiple protein chains) or intra (within one protein chain) |
uniProtIdentifiers | returns the corresponding UniProt identifiers of the match |
pfamIdentifiers | returns the corresponding Pfam identifiers of the match |
ecNumbers | returns the corresponding EC numbers of the match |
alignedSequence | returns the sequence of the matched motif |
structureTitle | returns the title of the found structure |
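A common way to work with these values is to keep only matches below a p-value threshold and rank them by RMSD. The sketch below illustrates this with a hypothetical stand-in class MatchSummary that merely holds values as they might be read from a match; it is not the Fit3DMatch class of SiNGA, and the numbers in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative helper, not part of SiNGA.
class MatchRanking {

    // Hypothetical stand-in for the values of a single match.
    static final class MatchSummary {
        final String label;
        final double rmsd;
        final double pvalue;

        MatchSummary(String label, double rmsd, double pvalue) {
            this.label = label;
            this.rmsd = rmsd;
            this.pvalue = pvalue;
        }
    }

    // Keeps matches with a p-value of at most maxPvalue and sorts them by ascending RMSD.
    static List<MatchSummary> significantByRmsd(List<MatchSummary> matches, double maxPvalue) {
        List<MatchSummary> selected = new ArrayList<>();
        for (MatchSummary match : matches) {
            if (match.pvalue <= maxPvalue) {
                selected.add(match);
            }
        }
        selected.sort(Comparator.comparingDouble(match -> match.rmsd));
        return selected;
    }

    public static void main(String[] args) {
        // Made-up values, for illustration only.
        List<MatchSummary> matches = Arrays.asList(
                new MatchSummary("first", 1.8, 0.20),
                new MatchSummary("second", 0.9, 0.01),
                new MatchSummary("third", 0.4, 0.03));
        for (MatchSummary match : significantByRmsd(matches, 0.05)) {
            System.out.println(match.label + " " + match.rmsd);
        }
    }
}
```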
Fit3D depends primarily on two parameters:
- the maximal allowed root-mean-square deviation (RMSD) for a candidate to be considered a match, which defaults to 2.5 Å, and
- the distance tolerance that is used when extracting microenvironments (see doi:10.1089/cmb.2014.0263), which defaults to 1.0 Å.
It is important to understand that the RMSD is strongly intertwined with the atoms used for the representation of the query motif and with its size. Specifically, the fewer atoms are used to represent the motif (e.g. only alpha carbons), the lower the RMSD tends to be, even if the agreement between candidate and query is not good. How to "scale" the RMSD is still an open problem; for more details about the behavior of the RMSD value, see the paper of Stark et al. 2003.
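The dependence of the RMSD on the chosen atoms can be made concrete with a small calculation. The sketch below computes the RMSD of two coordinate sets that are assumed to be already superimposed and paired atom by atom; it does not reproduce the superimposition step of Fit3D, and the class RmsdSketch is a hypothetical helper that is not part of SiNGA.

```java
// Illustrative helper, not part of SiNGA.
class RmsdSketch {

    // RMSD of paired 3D coordinates: the square root of the mean squared distance.
    // Both arrays must contain the same number of atoms in the same order.
    static double rmsd(double[][] reference, double[][] candidate) {
        if (reference.length == 0 || reference.length != candidate.length) {
            throw new IllegalArgumentException("coordinate sets must be non-empty and of equal size");
        }
        double sumOfSquaredDistances = 0.0;
        for (int atom = 0; atom < reference.length; atom++) {
            for (int axis = 0; axis < 3; axis++) {
                double difference = reference[atom][axis] - candidate[atom][axis];
                sumOfSquaredDistances += difference * difference;
            }
        }
        return Math.sqrt(sumOfSquaredDistances / reference.length);
    }

    public static void main(String[] args) {
        double[][] query = {{0.0, 0.0, 0.0}, {1.0, 0.0, 0.0}};
        double[][] candidate = {{0.0, 0.0, 0.0}, {1.0, 0.0, 2.0}};
        // Both atoms: one atom deviates by 2 Å.
        System.out.println(rmsd(query, candidate));
        // Only the first atom: the deviating atom is no longer part of the representation.
        System.out.println(rmsd(new double[][]{query[0]}, new double[][]{candidate[0]}));
    }
}
```

In this toy case, restricting the representation to the first atom hides the 2 Å deviation of the second atom completely, which is the effect described above.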
The parameters can be adjusted with the Fit3DBuilder.
You can use SiNGA's StructuralEntityFilter.Atoms and AtomRepresentationSchemeType together with Fit3D to represent the query structural motif according to your specific needs. The StructuralEntityFilter.Atoms filters can even be combined as you like, thanks to the Predicate-API of Java 8. By default, Fit3D uses all non-hydrogen atoms for the alignments.
For example, if you want to find a nucleotide motif represented by the phosphorus atoms only, you should call:
Fit3D fit3d = Fit3DBuilder.create()
.query(nucleotideMotif)
.target(nucleotideTarget.getAllChains().get(0)) // use the first chain
.atomFilter(StructuralEntityFilter.AtomFilter.isPhosphorus())
.run();
Or if you are interested in the side chain accuracy of the alignment of a protein motif, you can use:
Fit3D fit3d = Fit3DBuilder.create()
.query(proteinMotif)
.target(proteinTarget.getAllChains().get(0)) // use the first chain
.representationScheme(RepresentationSchemeType.SIDECHAIN_CENTROID)
.run();
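The combination of atom filters mentioned above relies on the default methods and, or and negate of java.util.function.Predicate. The sketch below shows the principle on a hypothetical SimpleAtom class; neither this class nor the predicate constants are part of SiNGA, they only illustrate how such filters can be composed.

```java
import java.util.function.Predicate;

// Illustrative helper, not part of SiNGA.
class AtomFilterSketch {

    // Hypothetical minimal atom with an atom name (e.g. "CA") and an element symbol (e.g. "C").
    static final class SimpleAtom {
        final String name;
        final String element;

        SimpleAtom(String name, String element) {
            this.name = name;
            this.element = element;
        }
    }

    static final Predicate<SimpleAtom> IS_HYDROGEN = atom -> atom.element.equals("H");
    static final Predicate<SimpleAtom> IS_PHOSPHORUS = atom -> atom.element.equals("P");
    static final Predicate<SimpleAtom> IS_CARBON = atom -> atom.element.equals("C");
    static final Predicate<SimpleAtom> IS_NAMED_CA = atom -> atom.name.equals("CA");

    // and(): the name CA alone is ambiguous, because calcium ions carry the same atom name.
    static final Predicate<SimpleAtom> IS_ALPHA_CARBON = IS_NAMED_CA.and(IS_CARBON);
    // negate(): every atom that is not a hydrogen.
    static final Predicate<SimpleAtom> NON_HYDROGEN = IS_HYDROGEN.negate();
    // or(): accepts alpha carbons of proteins as well as phosphorus atoms of nucleotides.
    static final Predicate<SimpleAtom> ALPHA_CARBON_OR_PHOSPHORUS = IS_ALPHA_CARBON.or(IS_PHOSPHORUS);

    public static void main(String[] args) {
        SimpleAtom alphaCarbon = new SimpleAtom("CA", "C");
        SimpleAtom calcium = new SimpleAtom("CA", "CA");
        System.out.println(IS_ALPHA_CARBON.test(alphaCarbon));
        System.out.println(IS_ALPHA_CARBON.test(calcium));
    }
}
```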