Structure parsing (Structure) - singa-bio/singa GitHub Wiki
Within the structure package, SiNGA provides convenience methods to parse macromolecular structures from different platforms and files.
You can retrieve single or multiple structures with the StructureParser class. To parse structures online, choose between two options: pdb() and mmtf(). With either option, structures are fetched from the Protein Data Bank by their identifiers. To parse a structure from the local file system, use the local() method.
Structure structure = StructureParser.pdb()
.pdbIdentifier("5w0y")
.parse();
To parse a single PDB structure (analogously, use mmtf() to parse structures in MMTF format).
List<Structure> structures = StructureParser.pdb()
.pdbIdentifiers(Arrays.asList("5F3P", "5G5T", "5J6Q", "5MAT"))
.parse();
To parse multiple PDB structures.
Additionally, structures can be parsed from local files or from a local copy of the PDB by using local(). Files can be provided as File or Path objects, or as a String holding the file location. Parsing from an InputStream is also possible. Gzip-compressed files are unpacked on the fly.
StructureParser.local()
.fileLocation("your_folder/myPDBFile.pdb")
.parse();
To parse a structure from the provided file.
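The on-the-fly unpacking mentioned above can be illustrated with plain JDK classes. The following is a minimal sketch, not SiNGA's implementation: the class name GzipAwareInput and its methods are hypothetical, and it assumes that compressed input is recognized by the two gzip magic bytes (0x1f 0x8b) at the start of the stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only (not part of SiNGA).
class GzipAwareInput {

    /** Reads the whole stream as text, decompressing it first if it starts with the gzip magic bytes. */
    static String readAll(InputStream raw) {
        try (BufferedInputStream buffered = new BufferedInputStream(raw)) {
            // Peek at the first two bytes, then rewind so no content is lost.
            buffered.mark(2);
            int first = buffered.read();
            int second = buffered.read();
            buffered.reset();
            boolean gzipped = first == 0x1f && second == 0x8b;
            InputStream in = gzipped ? new GZIPInputStream(buffered) : buffered;
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses a string in memory; only used to demonstrate readAll. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Calling GzipAwareInput.readAll on a plain stream and on its gzip-compressed counterpart returns the same text, which is the behavior a caller would expect from transparent decompression.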
LocalPDB localPdb = new LocalPDB("/srv/pdb");
List<Structure> structures = StructureParser.local()
.localPDB(localPdb, Arrays.asList("5k2c", "5k2d", "5iu4", "5iu7"))
.parse();
To use a local PDB installation as a source for structures.
The local PDB must be organized in the standard PDB folder structure.
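As an illustration of such a folder structure, the sketch below resolves the file for an identifier in the "divided" layout used by the wwPDB archive, where entries are grouped by the two middle characters of their identifier. It is an assumption that this is the layout meant here; the class DividedPdbLayout and its method are hypothetical and not part of SiNGA.

```java
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of SiNGA).
class DividedPdbLayout {

    /**
     * Resolves root/data/structures/divided/pdb/xx/pdbXXXX.ent.gz,
     * where xx are the second and third characters of the identifier.
     */
    static Path resolve(Path root, String pdbIdentifier) {
        if (pdbIdentifier == null || pdbIdentifier.length() != 4) {
            throw new IllegalArgumentException("expected a four-character PDB identifier");
        }
        // File and folder names in the archive are lower case.
        String id = pdbIdentifier.toLowerCase(Locale.ROOT);
        String bucket = id.substring(1, 3);
        return root.resolve("data").resolve("structures").resolve("divided").resolve("pdb")
                .resolve(bucket).resolve("pdb" + id + ".ent.gz");
    }
}
```

Under this assumption, the identifier 5k2c with the root /srv/pdb maps to /srv/pdb/data/structures/divided/pdb/k2/pdb5k2c.ent.gz.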
With ChainLists it is possible to parse only specific chains listed in a text file.
To parse the chains from a file, each line must use the following format:
PDBID | separator | CHAINID
For example, with ":" as the separator:
1c0a:A
Use the .chainList(Path, String) method as follows:
List<Structure> structures = StructureParser.local()
.localPDB(localPdb)
.chainList(chainListPath, ":")
.parse();
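To make the chain list format concrete, the following sketch splits lines such as 1c0a:A into a PDB identifier and a chain identifier. It only illustrates the format described above and is not SiNGA's parsing code. The class ChainListSketch and its nested ChainEntry type are hypothetical, and skipping blank lines is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of SiNGA).
class ChainListSketch {

    /** One line of a chain list: a PDB identifier and a chain identifier. */
    static final class ChainEntry {
        final String pdbIdentifier;
        final String chainIdentifier;

        ChainEntry(String pdbIdentifier, String chainIdentifier) {
            this.pdbIdentifier = pdbIdentifier;
            this.chainIdentifier = chainIdentifier;
        }
    }

    /** Parses lines of the form PDBID + separator + CHAINID; blank lines are skipped. */
    static List<ChainEntry> parse(List<String> lines, String separator) {
        List<ChainEntry> entries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Quote the separator so characters such as "|" are not treated as regex syntax.
            String[] parts = trimmed.split(Pattern.quote(separator), -1);
            if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
                throw new IllegalArgumentException("malformed chain list line: " + line);
            }
            entries.add(new ChainEntry(parts[0], parts[1]));
        }
        return entries;
    }
}
```

Quoting the separator matters because String.split interprets its argument as a regular expression, so a separator such as "|" would otherwise split between every character.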
It is also possible to parse only specific domains using the annotations from the Pfam database. Furthermore, you can parse the chains containing these domains.
List<List<LeafSubstructure<?>>> domains = PfamParser.create()
.version(PfamParser.PfamVersion.V31)
.pfamIdentifier("PF17480")
.all()
.domains();
To parse LeafSubstructures from the corresponding protein family, grouped by their chain of origin.
The MultiParser parses multiple structures from a single source. Following the iterator pattern, the structures can be parsed and processed one by one. Each specified structure is parsed lazily, so you can stop as soon as a certain condition is met without parsing the remaining structures.
StructureParser.MultiParser multiParser = StructureParser.pdb()
.pdbIdentifiers(identifiers)
.everything();
while (multiParser.hasNext()) {
    // each structure is parsed only when next() is called
    Structure next = multiParser.next();
}
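The lazy behavior described above can be sketched with a plain Java Iterator. This is not the MultiParser implementation: LazyParsingIterator is a hypothetical class, and the parsing step is passed in as a function so the example runs without SiNGA.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical class, for illustration only (not part of SiNGA).
class LazyParsingIterator<T> implements Iterator<T> {

    private final Iterator<String> identifiers;
    private final Function<String, T> parser;
    private int parsedCount = 0;

    LazyParsingIterator(List<String> identifiers, Function<String, T> parser) {
        this.identifiers = identifiers.iterator();
        this.parser = parser;
    }

    @Override
    public boolean hasNext() {
        // Checking for remaining identifiers does not trigger any parsing.
        return identifiers.hasNext();
    }

    @Override
    public T next() {
        // The (potentially expensive) parsing happens only here, one identifier at a time.
        String identifier = identifiers.next();
        parsedCount++;
        return parser.apply(identifier);
    }

    /** Number of identifiers that have actually been parsed so far. */
    int parsedCount() {
        return parsedCount;
    }
}
```

If a loop over four identifiers breaks after the second result, parsedCount() returns 2, so the remaining two identifiers were never parsed.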
Structures can be parsed in a reduced form that omits parts not needed for further analysis, which speeds up both parsing and subsequent processing.
Structure structure = StructureParser.pdb()
.identifier("1PQS")
.model(2)
.parse();
To get only model 2 of a multi-model NMR structure.
Structure structure = StructureParser.pdb()
.identifier("1BRR")
.chain("A")
.parse();
To get only chain A of a homomeric structure.
Structure structure = StructureParser.pdb()
.identifier("2N5E")
.model(3)
.chain("B")
.parse();
Reducers can also be combined.
There are additional options that can be applied, using the following setup:
StructureParserOptions structureParserOptions = StructureParserOptions.withSettings(GET_TITLE_FROM_PDB, OMIT_HYDROGENS);
The following table shows possible options and their function:
Option | Description |
---|---|
CREATE_EDGES | Create edges between leaf substructures (default). |
OMIT_EDGES | Omit edges between leaf substructures. |
GET_LIGAND_INFORMATION | Parse additional information for ligands from the PDB (default). |
OMIT_LIGAND_INFORMATION | Omit parsing of ligand information. |
GET_HETERO_ATOMS | Parse atoms annotated as hetero atoms (default). |
OMIT_HETERO_ATOMS | Omit parsing of hetero atoms. |
GET_HYDROGEN_CONNECTIONS | Connect hydrogens to leaves. |
OMIT_HYDROGENS_CONNECTIONS | Omit hydrogen connections to leaves (default). |
GET_HYDROGENS | Parse hydrogen atoms. |
OMIT_HYDROGENS | Omit hydrogen atoms (default). |
GET_TITLE_FROM_FILENAME | Use the file name as the title of the structure. |
GET_TITLE_FROM_PDB | Use the PDB file to infer the PDB title (default). |
GET_IDENTIFIER_FROM_FILENAME | Parse and use the first valid PDB identifier from the file name. |
GET_IDENTIFIER_FROM_PDB | Use the PDB file to infer the PDB identifier (default). |
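To illustrate what taking the first valid PDB identifier from a file name can look like, here is a minimal sketch. It assumes the classic four-character identifier form (a digit from 1 to 9 followed by three alphanumeric characters). The exact rule SiNGA applies for GET_IDENTIFIER_FROM_FILENAME is not stated here, and the class PdbIdentifierFromFileName is hypothetical.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of SiNGA).
class PdbIdentifierFromFileName {

    // Assumed identifier form: one digit 1-9 followed by three alphanumeric characters.
    private static final Pattern PDB_ID = Pattern.compile("[1-9][A-Za-z0-9]{3}");

    /** Returns the first match in the file name in lower case, or empty if there is none. */
    static Optional<String> firstIdentifier(String fileName) {
        Matcher matcher = PDB_ID.matcher(fileName);
        if (matcher.find()) {
            return Optional.of(matcher.group().toLowerCase(Locale.ROOT));
        }
        return Optional.empty();
    }
}
```

A simple pattern like this picks up any matching four-character run, so file names that contain other digit sequences before the identifier would need a stricter rule.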
The class StructureWriter can be used to write Structure objects in PDB format. It provides the methods writeStructure(OakStructure, Path), writeLeafSubstructureContainer(LeafSubstructureContainer, Path), writeLeafSubstructures(List<LeafSubstructure>, Path), writeToXYZ(AtomContainer, Path) and writeWithConsecutiveNumbering(OakStructure, Path). Each method takes the container object you want to export and a Path where the resulting file will be stored. The method writeWithConsecutiveNumbering renumbers all identifiers starting from 1. The method writeToXYZ writes XYZ format instead of PDB format.
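For readers unfamiliar with the XYZ format, the sketch below shows its general layout: the number of atoms on the first line, a free-text comment on the second line, then one line per atom with the element symbol and the three coordinates. It is not the code behind writeToXYZ. The class XyzSketch and its Atom type are hypothetical, and three decimal places are an arbitrary choice.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of SiNGA).
class XyzSketch {

    /** Minimal atom record: element symbol and Cartesian coordinates. */
    static final class Atom {
        final String element;
        final double x;
        final double y;
        final double z;

        Atom(String element, double x, double y, double z) {
            this.element = element;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Formats atoms as XYZ text: atom count, comment line, then one line per atom. */
    static String format(String comment, List<Atom> atoms) {
        StringBuilder builder = new StringBuilder();
        builder.append(atoms.size()).append('\n');
        builder.append(comment).append('\n');
        for (Atom atom : atoms) {
            // Locale.ROOT guarantees a decimal point regardless of the system locale.
            builder.append(String.format(Locale.ROOT, "%s %.3f %.3f %.3f",
                    atom.element, atom.x, atom.y, atom.z)).append('\n');
        }
        return builder.toString();
    }
}
```

The returned string can then be written to a file, for example with java.nio.file.Files.writeString.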
By default, SiNGA also parses some information that cannot be retrieved from the PDB file itself. Ligands, modified amino acids and modified nucleotides are enriched with data from .cif files. Modified leaf substructures can be recognized by calling the isModified() method. Ligands are connected with bonds using this information.