Creating mmCIF files - informatics-isi-edu/protein-database GitHub Wiki

Requirements for extracting catalog data into mmCIF (to be used as input to a separate PDB system)

  • Data must be exported entry-wise, i.e., only data belonging to a particular entry (as denoted by entry.id) is exported
  • Only tables and columns in the JSON schema are exported
  • Some tables need to be appended from the mmCIF file uploaded by the user in step 2:
    • atom_site
    • ihm_starting_model_coord
    • ihm_sphere_obj_site
    • ihm_gaussian_obj_site
    • ihm_gaussian_obj_ensemble
    • pdbx_poly_seq_scheme
    • pdbx_nonpoly_scheme

Note: Not all entries will have all of the above tables.
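
The first two requirements can be sketched in Python as follows. This is only an illustration: the function name select_entry_data, the in-memory dictionary shapes, and the assumption that every catalog row carries a structure_id column identifying its entry are hypothetical and not taken from the actual export tool.

```python
def select_entry_data(catalog, schema, structure_id):
    """Return {table: [row dicts]} for one entry, limited to the schema.

    catalog: {table_name: [row dicts]} as read from the catalog (assumed shape)
    schema:  {table_name: [column names]} taken from the JSON schema (assumed shape)
    """
    selected = {}
    for table, columns in schema.items():
        rows = [
            {col: row.get(col) for col in columns}   # keep schema columns only
            for row in catalog.get(table, [])
            if row.get("structure_id") == structure_id  # entry-wise export
        ]
        if rows:  # tables without rows for this entry are left out
            selected[table] = rows
    return selected
```

Tables that come from the uploaded mmCIF file (atom_site and the others listed above) would be appended in a separate step that is not shown here.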

mmCIF format specifications

  • PDBx/mmCIF syntax
  • CIF 1.1 syntax specifications
  • Spaces or tabs can be used to separate column values in a row
  • If a table has an optional column that is populated in some rows but not in others, use . to denote the missing values
  • Single or double quotes can be used for textual column values that contain spaces
  • Multi-line text is enclosed within ; delimiters (see the entity_poly example below)
    • The ; has to be at the beginning of the line.
    • The text does not have to start right after the first ;
    • In the entity_poly example below, the first ;...; block is the value of _entity_poly.pdbx_seq_one_letter_code and the second ;...; block is the value of _entity_poly.pdbx_seq_one_letter_code_can
-- valid
;multi-line text
;

-- valid (preferred)
; multi-line text
;

-- valid (an empty line can be used anywhere in the file)
; multi-line text
;

"next column value"

-- valid
; multi-line text

;

-- valid (but not recommended)
; 
multi-line text
;

  • If the text contains single quotes, enclose it within double quotes, and vice versa. If the text contains both single and double quotes, enclose it within ; like multi-line text.
  • # starts a comment line and can be used as a separator line between tables
  • When vocab tables are used, the corresponding values should be used to populate the mmCIF tables (see entity.type in the example below)
  • If a table returns zero rows for a particular structure_id, the table need not be included in the mmCIF file, i.e., no empty tables
  • The structure_id column in each table need not be included in the mmCIF file
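
The quoting rules above can be sketched as a small Python helper. The name format_value is hypothetical, and the sketch covers only the rules listed on this page; other CIF 1.1 edge cases (for example values that begin with _ or #, or text containing a line that starts with ;) are not handled.

```python
def format_value(value):
    """Format one column value for an mmCIF row using the rules listed above."""
    # Assumption: an empty string is treated like a missing value.
    if value is None or value == "":
        return "."
    text = str(value)
    has_single = "'" in text
    has_double = '"' in text
    if "\n" in text or (has_single and has_double):
        # Text block: the leading newline puts the opening ';' at the start
        # of a line; the closing ';' is on a line of its own.
        return "\n;" + text + "\n;"
    if has_single:
        return '"' + text + '"'
    if has_double or " " in text or "\t" in text:
        return "'" + text + "'"
    return text


print(format_value("CALCIUM ION"))   # 'CALCIUM ION'
print(format_value(None))            # .
```

After a text block, the caller would start the next column value on a new line, as in the examples above.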

Default format for tables in mmCIF

  data_structure_id (use value of structure_id)

  loop_
  _table_name.column_name_1
  _table_name.column_name_2
  ...
  ...
  ...
  _table_name.column_name_n
  Row_1_column_value_1      Row_1_column_value_2 .........     Row_1_column_value_n
  ....
  ....
  ....
  Row_m_column_value_1      Row_m_column_value_2 .........     Row_m_column_value_n

Examples

loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
1 polymer man "C1q subunits A, C, and B" 45697.594   1
2 non-polymer man  N-ACETYL-D-GLUCOSAMINE    221.208   1
3 non-polymer syn 'CALCIUM ION'  .   1
#
loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.nstd_linkage
_entity_poly.nstd_monomer
_entity_poly.pdbx_seq_one_letter_code
_entity_poly.pdbx_seq_one_letter_code_can
1    'polypeptide(L)'    no    no 
;KDQPRPAFSAIRRNPPMGGNVVIFDTVITNQEEPYQNHSGRFVCTVPGYYYFTFQVLSQWEICLSIVSSSRGQVRRSLGF
CDTTNKGLFQVVSGGMVLQLQQGDQVWVEKDPKKGHIYQGSEADSVFSGFLIFPSAGSGKQKFQSVFTVTRQTHQPPAPN
SLIRFNAVLTNPQGDYDTSTGKFTCKVPGLYYFVYHASHTANLCVLLYRSGVKVVTFCGHTSKTNQVNSGGVLLRLQVGE
EVWLAVNDYYDMVGIQGSDSVFSGFLLFPDGSAKATQKIAFSATRTINVPLRRDQTIRFDHVITNMNNNYEPRSGKFTCK
VPGLYYFTYHASSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVFLQATDKNSLLGMEGANSIFS
GFLLFPDMEA
;
;KDQPRPAFSAIRRNPPMGGNVVIFDTVITNQEEPYQNHSGRFVCTVPGYYYFTFQVLSQWEICLSIVSSSRGQVRRSLGF
CDTTNKGLFQVVSGGMVLQLQQGDQVWVEKDPKKGHIYQGSEADSVFSGFLIFPSAGSGKQKFQSVFTVTRQTHQPPAPN
SLIRFNAVLTNPQGDYDTSTGKFTCKVPGLYYFVYHASHTANLCVLLYRSGVKVVTFCGHTSKTNQVNSGGVLLRLQVGE
EVWLAVNDYYDMVGIQGSDSVFSGFLLFPDGSAKATQKIAFSATRTINVPLRRDQTIRFDHVITNMNNNYEPRSGKFTCK
VPGLYYFTYHASSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVFLQATDKNSLLGMEGANSIFS
GFLLFPDMEA
;
#
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
_entity_poly_seq.hetero
1 1   LYS n
1 2   ASP n
1 3   GLN n
1 4   PRO n
1 5   ARG n
1 6   PRO n
1 7   ALA n
1 8   PHE n
1 9   SER n
1 10  ALA n
1 11  ILE n
1 12  ARG n
1 13  ARG n
1 14  ASN n
1 15  PRO n
#
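
In the entity_poly example above, each sequence is wrapped into lines of 80 characters inside its ;...; block. A small Python sketch that produces this layout is given below; the function name sequence_text_block is hypothetical, and the default width of 80 is inferred from the example rather than from a stated requirement.

```python
def sequence_text_block(sequence, width=80):
    """Wrap a one-letter sequence into a ';'-delimited multi-line value."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    # The opening ';' starts the first line; the closing ';' is on its own line.
    return ";" + "\n".join(lines) + "\n;"
```

The returned text has to be written starting at the beginning of a line so that the opening ; is in the first column.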