Creating mmCIF files - informatics-isi-edu/pdb-ihm GitHub Wiki

Requirements for extracting catalog data into mmCIF (to be used as an input to a separate pdb system)

Data must be exported entry-wise, i.e., only data belonging to a particular entry (as denoted by entry.id) is exported
Only tables and columns in the json schema need to be exported
Some tables need to be appended from the mmCIF file uploaded by the user in step 2:
- atom_site
- ihm_starting_model_coord
- ihm_sphere_obj_site
- ihm_gaussian_obj_site
- ihm_gaussian_obj_ensemble
- pdbx_poly_seq_scheme
- pdbx_nonpoly_scheme

Note: Not all entries will have all of the above tables.

mmCIF format specifications

PDBx/mmCIF syntax
CIF 1.1 syntax specifications
Spaces or tabs can be used to separate column values in a row
If an optional column has values in some rows but not in others, then . can be used to denote the missing values
Single or double quotes can be used for textual column values that contain spaces
Multi-line texts are enclosed within ; (see example for entity_poly table below)
- ; has to be at the beginning of the line.
- The text does not have to start right after the first ;
- In the example below (entity_poly table), the first ;...; block is the value of _entity_poly.pdbx_seq_one_letter_code and the second ;...; block is the value of _entity_poly.pdbx_seq_one_letter_code_can

-- valid
;multi-line text
;

-- valid (preferred)
; multi-line text
;

-- valid (empty line can be used anywhere in the file)
; multi-line text
;

"next column value"

-- valid
; multi-line text

;

-- valid (but not recommended)
; 
multi-line text
;

If the text contains single quotes, it can be enclosed within double quotes, and vice versa. If the text contains both single and double quotes, it is enclosed within ; like multi-line text.
# begins a comment line and can be used to add separator lines between tables
When vocab tables are used, the corresponding values should be used to populate the mmCIF tables (see entity.type in the example below)
If a table returns zero rows for a particular structure_id, then the table should be left out of the mmCIF file, i.e., no empty tables
The structure_id column in each table need not be included in the mmCIF file
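
The quoting rules above can be sketched as a small Python helper. This is a minimal sketch under stated assumptions, not the project's exporter: the function name format_cif_value is hypothetical, the check for leading special characters is an assumption based on the CIF 1.1 syntax, and text that itself contains a line starting with ; is not handled. The semicolon form places the text immediately after the opening ;, as in the entity_poly example.

```python
import re

# Assumption (from CIF 1.1): an unquoted value may not contain whitespace
# or begin with one of these special characters.
_NEEDS_QUOTES = re.compile(r"\s|^[_#$\[\];]")


def format_cif_value(value):
    """Format one column value following the quoting rules listed above."""
    if value is None or value == "":
        return "."  # missing value in an optional column
    text = str(value)
    has_single = "'" in text
    has_double = '"' in text
    if "\n" in text or (has_single and has_double):
        # Multi-line text, or text with both quote styles: use a ; field.
        # The leading newline guarantees that ; starts at the beginning of a line.
        return "\n;" + text + "\n;"
    if has_single:
        return '"' + text + '"'  # single quotes inside, so wrap in double quotes
    if has_double:
        return "'" + text + "'"  # double quotes inside, so wrap in single quotes
    if _NEEDS_QUOTES.search(text):
        return "'" + text + "'"  # contains whitespace or starts with a special character
    return text
```

A row can then be assembled with `" ".join(format_cif_value(v) for v in row)`, since spaces are valid separators between column values.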

Default format for tables in mmCIF

  data_structure_id (use value of structure_id)

  loop_
  _table_name.column_name_1
  _table_name.column_name_2
  ...
  ...
  ...
  _table_name.column_name_n
  Row_1_column_value_1      Row_1_column_value_2 .........     Row_1_column_value_n
  ....
  ....
  ....
  Row_m_column_value_1      Row_m_column_value_2 .........     Row_m_column_value_n
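
The default table format, together with the rules on empty tables and the structure_id column, can be sketched in Python as follows. This is an illustration only: the function names, the representation of rows as dictionaries, and the sample structure_id values are assumptions, and the value formatting is deliberately simplified (it does not handle embedded quotes or multi-line text).

```python
def format_value(value):
    """Simplified formatting: '.' for missing values, single quotes around text with whitespace."""
    if value is None or value == "":
        return "."
    text = str(value)
    return "'" + text + "'" if any(c.isspace() for c in text) else text


def table_to_loop(table_name, columns, rows, structure_id):
    """Return the loop_ block for one table of one entry, or '' if the entry has no rows."""
    # The structure_id column is not written to the mmCIF file.
    out_columns = [c for c in columns if c != "structure_id"]
    # Entry-wise export: keep only the rows of the requested entry.
    selected = [r for r in rows if r.get("structure_id") == structure_id]
    if not selected:
        return ""  # no empty tables
    lines = ["loop_"]
    lines += [f"_{table_name}.{c}" for c in out_columns]
    for r in selected:
        lines.append(" ".join(format_value(r.get(c)) for c in out_columns))
    lines.append("#")  # comment line separating this table from the next
    return "\n".join(lines) + "\n"


def export_entry(structure_id, tables):
    """tables maps a table name to a (columns, rows) pair; returns the mmCIF text for one entry."""
    parts = [f"data_{structure_id}\n", "#\n"]
    for name, (columns, rows) in tables.items():
        parts.append(table_to_loop(name, columns, rows, structure_id))
    return "".join(parts)
```

For example, calling `export_entry("D_1", {"entity": (columns, rows)})` with hypothetical catalog rows produces a `data_D_1` header followed by one `loop_` block for each table that has rows for that entry.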

Examples

loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
1 polymer man "C1q subunits A, C, and B" 45697.594   1
2 non-polymer man  N-ACETYL-D-GLUCOSAMINE    221.208   1
3 non-polymer syn 'CALCIUM ION'  .   1
#
loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.nstd_linkage
_entity_poly.nstd_monomer
_entity_poly.pdbx_seq_one_letter_code
_entity_poly.pdbx_seq_one_letter_code_can
1    'polypeptide(L)'    no    no 
;KDQPRPAFSAIRRNPPMGGNVVIFDTVITNQEEPYQNHSGRFVCTVPGYYYFTFQVLSQWEICLSIVSSSRGQVRRSLGF
CDTTNKGLFQVVSGGMVLQLQQGDQVWVEKDPKKGHIYQGSEADSVFSGFLIFPSAGSGKQKFQSVFTVTRQTHQPPAPN
SLIRFNAVLTNPQGDYDTSTGKFTCKVPGLYYFVYHASHTANLCVLLYRSGVKVVTFCGHTSKTNQVNSGGVLLRLQVGE
EVWLAVNDYYDMVGIQGSDSVFSGFLLFPDGSAKATQKIAFSATRTINVPLRRDQTIRFDHVITNMNNNYEPRSGKFTCK
VPGLYYFTYHASSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVFLQATDKNSLLGMEGANSIFS
GFLLFPDMEA
;
;KDQPRPAFSAIRRNPPMGGNVVIFDTVITNQEEPYQNHSGRFVCTVPGYYYFTFQVLSQWEICLSIVSSSRGQVRRSLGF
CDTTNKGLFQVVSGGMVLQLQQGDQVWVEKDPKKGHIYQGSEADSVFSGFLIFPSAGSGKQKFQSVFTVTRQTHQPPAPN
SLIRFNAVLTNPQGDYDTSTGKFTCKVPGLYYFVYHASHTANLCVLLYRSGVKVVTFCGHTSKTNQVNSGGVLLRLQVGE
EVWLAVNDYYDMVGIQGSDSVFSGFLLFPDGSAKATQKIAFSATRTINVPLRRDQTIRFDHVITNMNNNYEPRSGKFTCK
VPGLYYFTYHASSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVFLQATDKNSLLGMEGANSIFS
GFLLFPDMEA
;
#
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
_entity_poly_seq.hetero
1 1   LYS n
1 2   ASP n
1 3   GLN n
1 4   PRO n
1 5   ARG n
1 6   PRO n
1 7   ALA n
1 8   PHE n
1 9   SER n
1 10  ALA n
1 11  ILE n
1 12  ARG n
1 13  ARG n
1 14  ASN n
1 15  PRO n
#