IEDB Database Karma Modeling Notes - SciKnowEngine/SciKnowGraph GitHub Wiki

IEDB Data

This page is concerned with working with data from the IEDB system as a suitable target for the overall project.

James Overton specified a query that we can run on the server. We then edited it somewhat, settling on the following columns:

  • reference
  • host
  • in vivo process 1
    • adjuvant
    • route
    • dose_schedule
  • iv1 immunogen
  • iv1 immunogen-containing object
  • antibody
  • antigen
  • assay
  • results
```sql
-- Given our initial modeling work on Figs. 3 and 4 from Richardson 1998,
-- these are approximately the columns we want.
-- Use the assay_type to select from the `bcell` table.

SELECT

  -- article
  article.pubmed_id,
  article.reference_id,

  -- host
  bcell.h_organism_id AS host_taxon_id,
  bcell.h_age AS host_age,

  -- in vivo process 1
  bcell.iv1_process_type,
  bcell.iv1_adjuvants,
  bcell.iv1_route,
  bcell.iv1_dose_schedule,

  -- iv1 immunogen
  iv1_immunogen.object_type AS iv1_immunogen_type,
  iv1_immunogen.object_sub_type AS iv1_immunogen_sub_type,
  iv1_immunogen.object_description AS iv1_immunogen_description,
  iv1_immunogen.organism_id AS iv1_immunogen_organism_id,

  -- iv1 immunogen containing object
  iv1_container.object_type AS iv1_container_type,
  iv1_container.object_sub_type AS iv1_container_sub_type,
  iv1_container.object_description AS iv1_container_description,

  -- antibody
  bcell.ab_type,
  bcell.ab_materials_assayed,
  bcell.ab_immunoglobulin_domain,

  -- antigen
  antigen.object_type AS antigen_type,
  antigen.object_sub_type AS antigen_sub_type,
  antigen.object_description AS antigen_description,
  antigen.organism_id AS antigen_organism_id,

  -- assay
  assay_type.obi_id AS assay_type,
  assay_type.assay_type AS assay_type_name,

  -- results
  bcell.as_num_subjects AS num_subjects,
  bcell.as_num_responded AS num_responded,
  bcell.as_response_frequency AS response_frequency,
  bcell.as_location AS location,
  bcell.as_char_value AS char_value,
  bcell.as_num_value AS num_value,
  bcell.as_inequality AS inequality,
  bcell.as_type_id AS type_id

FROM bcell
LEFT JOIN assay_type ON bcell.as_type_id = assay_type.assay_type_id
LEFT JOIN object AS iv1_immunogen ON bcell.iv1_imm_object_id = iv1_immunogen.object_id
LEFT JOIN object AS iv1_container ON bcell.iv1_con_object_id = iv1_container.object_id
LEFT JOIN object AS antigen ON bcell.ant_object_id = antigen.object_id
LEFT JOIN article ON bcell.reference_id = article.reference_id
WHERE assay_type.assay_type = 'in vivo assay';
```

This returns all data from the relevant epitope pages pertaining to an assay type, based on a high-level KEfED design pattern.

Data

[IEDB Download Page][Schema][sql dump]

Data table we use for Karma Modeling: data_from_james_query_whole_db_invivoassay.csv

Karma Model, iteration 1

Screenshot [png]

The model file for this is here:

We are here concerned with reconstructing the structure of the data as an Investigation linked to a Study Design Execution, which in turn links to an Experiment. The Experiment then uses has_participant links to Data Item instances, which are linked to Value Specification instances that provide the actual data elements. The additional KEfED class provenance_context ties together the data records from each row to create a fully KEfED-enabled data table.

This model is legitimately linked data, is based on OBI, and uses data from IEDB 'as is'.
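The entity chain described above can be sketched as a minimal set of triples in plain Python. Note that the URI patterns and property names here are illustrative placeholders (based on the has_participant and parameterizes links mentioned in these notes), not the exact OBI IRIs used in the actual Karma model:

```python
# Dependency-free sketch of the entity chain: Investigation ->
# Study Design Execution -> Experiment -> Data Item -> Value
# Specification, tied to a provenance_context per row.
# NOTE: URI patterns and property names are illustrative placeholders.

LOD = "http://www.iedb.org/lod/"

def triples_for_row(bcell_id, variable="iv1_adjuvant"):
    """Return the skeleton triples generated for one data-table row."""
    investigation = LOD + "investigation/1"
    execution = LOD + "study_design_execution/1"
    experiment = LOD + "experiment/" + bcell_id
    data_item = LOD + "data_item/%s/%s" % (variable, bcell_id)
    value_spec = LOD + "value_specification/%s/%s" % (variable, bcell_id)
    context = LOD + "provenance_context/" + bcell_id
    return {
        (investigation, "has_part", execution),
        (execution, "has_part", experiment),
        (experiment, "has_participant", data_item),
        (data_item, "has_value_specification", value_spec),
        (value_spec, "parameterizes", context),
    }
```

Each row of the data table thus contributes one such motif per variable, all sharing the same provenance_context node.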

Some observations:

  1. The core idea seems to work quite well. We still need to build the whole model for the example data and then run SPARQL queries over the data to validate it. This should be one of the evaluations of this work and can serve as the basis for then 'exploding' the data by going deeper into the experimental design of each experiment.
  2. The models are large, causing Karma to slow down significantly. Chatting with Dipsy about this, she suggests that we break the model up into separate pieces.
    • One problem with this is that the model shown uses randomly generated URIs created with Python's uuid module. These would not translate between models, so we'd need to figure out a better way of assigning anonymous URIs. This shouldn't be too difficult to fix.
  3. The modeling process is quite complicated, involving several connected motifs of Data Item/Value Specification/value all linked to the provenance_context for each data element.
    • Again following a conversation with Dipsy, this time about possibly scripting the generation of these motifs to automate model building, she points out that there is a section in the model file containing the history of the commands executed to build the model. She suggests that we might be able to hack that part of the file to load new elements.
    • If we do this, we need to load the new model in Karma, publish it and then save it since all the R2RML data will be regenerated from the execution of the new commands. This is worth investigating.
  4. The Karma tool does quite a nice job of working around the issue of opaque OBO Foundry identifiers: when you mouse over nodes in the interface, the label and comments appear in the tool. This makes it perfectly doable to create models here.
  5. This data could legitimately be packaged as nanopublications or some semantic equivalent.
  6. This could make a nice, compelling story for a paper (see https://w3id.org/semsci/ as a possible venue).

Karma Model, iteration 2

Fixing the issues with UUIDs:

  1. If we are encoding a URI for an experimental variable, we use the pattern: http://www.iedb.org/lod/data_item/<variable_name>/<bcell_id>
  2. If we are encoding a URI for a text value that we want to act as a unique URI, we use a hashing mechanism.

For each independent variable, perform the following additional steps:

  1. add a new column through a PythonTransformation for a data item URI
    • return 'http://www.iedb.org/lod/data_item/<variable>/' + getValue("bcell_id")
  2. add a new column through a PythonTransformation for a value specification URI
    • return 'http://www.iedb.org/lod/value_specification/<variable>/' + getValue("bcell_id")
  3. add a new column through a PythonTransformation for a value specification value
    • Use the following code:
    import hashlib
    v = getValue("Values")
    if v == 'NULL':
        return v
    stem = 'http://www.iedb.org/value_specification/iv1_adjuvant/'
    hash_object = hashlib.md5(v.encode())
    return stem + hash_object.hexdigest()
  4. assign semantic types to the two URI columns
  5. assign an object property to the value column
  6. link the data_item to the value_specification objects
  7. link the value_specification to the provenance_context via parameterizes
  8. link the experiment object to the data_item object.
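The three PythonTransformation steps above can be sketched outside Karma as plain Python. In Karma, getValue("col") reads the current row; here it is simulated with a dictionary, and the variable name iv1_adjuvant follows the patterns used in these notes:

```python
# Stand-alone sketch of the three URI-minting transformations.
# Assumes a row dict standing in for Karma's getValue() accessor.
import hashlib

def data_item_uri(row, variable):
    return 'http://www.iedb.org/lod/data_item/%s/%s' % (variable, row["bcell_id"])

def value_specification_uri(row, variable):
    return 'http://www.iedb.org/lod/value_specification/%s/%s' % (variable, row["bcell_id"])

def value_uri(row, variable):
    # Hash free-text values so each distinct string gets a stable URI.
    v = row["Values"]
    if v == 'NULL':
        return v
    stem = 'http://www.iedb.org/value_specification/%s/' % variable
    return stem + hashlib.md5(v.encode()).hexdigest()

row = {"bcell_id": "189371", "Values": "alum"}
# data_item_uri(row, "iv1_adjuvant")
# -> 'http://www.iedb.org/lod/data_item/iv1_adjuvant/189371'
```

Because md5 is deterministic, the same text value always maps to the same URI, so repeated values across rows resolve to a single node in the graph.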

This translates to 8 commands in the Karma interface: 3 × SubmitPythonTransformationCommand plus five others. We don't know exactly which commands these are as we execute them, but since we will need to repeat all 8 for each new column in the data table, we should attempt to write a scripting interface to speed this process up.
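As a first step towards such a scripting interface, the SubmitPythonTransformationCommand JSON could be generated per variable with a small helper. This is a hypothetical sketch; it simply fills the `<INPUT_COLUMN_NAME>`, `<NEW_NAME_FOR_COLUMN>`, and transformation-code slots of the command template shown in the examples that follow:

```python
# Hypothetical generator for one SubmitPythonTransformationCommand
# block, mirroring the fields of the command-history examples.
import json

def python_transformation_command(input_column, new_column, code):
    return {
        "commandName": "SubmitPythonTransformationCommand",
        "model": "new",
        "inputParameters": [
            {"name": "hNodeId", "type": "hNodeId",
             "value": [{"columnName": input_column}]},
            {"name": "worksheetId", "type": "worksheetId", "value": "W"},
            {"name": "selectionName", "type": "other", "value": "DEFAULT_TEST"},
            {"name": "newColumnName", "type": "other", "value": new_column},
            {"name": "transformationCode", "type": "other", "value": code},
            {"name": "errorDefaultValue", "type": "other", "value": ""},
            {"name": "isJSONOutput", "type": "other", "value": "false"},
            {"name": "inputColumns", "type": "hNodeIdList",
             "value": json.dumps([{"value": [{"columnName": "bcell_id"}]}])},
            {"name": "outputColumns", "type": "hNodeIdList",
             "value": json.dumps([{"value": [{"columnName": new_column}]}])},
        ],
        "tags": ["Transformation"],
    }

cmd = python_transformation_command(
    "iv1_adjuvants",
    "iv1_adjuvant_data_item_uri",
    "return 'http://www.iedb.org/lod/data_item/iv1_adjuvant/' + getValue(\"bcell_id\")",
)
```

json.dumps(cmd) then yields a command block that could, per Dipsy's suggestion, be spliced into the command-history section of the model file for Karma to replay.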

Here are two examples of the form of these commands for adding python transformations for the URI columns.

Link to latest Karma model: data_from_james_query_whole_db_invivoassay.csv-model_19.ttl.txt

SubmitPythonTransformationCommand for a new data_item column with URI:

```json
{
    "commandName": "SubmitPythonTransformationCommand",
    "model": "new",
    "inputParameters": [
        {
            "name": "hNodeId",
            "type": "hNodeId",
            "value": [{"columnName": "<INPUT_COLUMN_NAME>"}]
        },
        {
            "name": "worksheetId",
            "type": "worksheetId",
            "value": "W"
        },
        {
            "name": "selectionName",
            "type": "other",
            "value": "DEFAULT_TEST"
        },
        {
            "name": "newColumnName",
            "type": "other",
            "value": "<NEW_NAME_FOR_COLUMN>"
        },
        {
            "name": "transformationCode",
            "type": "other",
            "value": "return 'http://www.iedb.org/lod/data_item/<VARIABLE_NAME>/' + getValue(\"bcell_id\")"
        },
        {
            "name": "errorDefaultValue",
            "type": "other",
            "value": ""
        },
        {
            "name": "isJSONOutput",
            "type": "other",
            "value": "false"
        },
        {
            "name": "inputColumns",
            "type": "hNodeIdList",
            "value": "[{\"value\":[{\"columnName\":\"bcell_id\"}]}]"
        },
        {
            "name": "outputColumns",
            "type": "hNodeIdList",
            "value": "[{\"value\":[{\"columnName\":\"<NEW_NAME_FOR_COLUMN>\"}]}]"
        }
    ],
    "tags": ["Transformation"]
}
```
    

SubmitPythonTransformationCommand for a new value_specification column with data_value (note the use of hashing to encode a string value as a URI):

```json
{
    "commandName": "SubmitPythonTransformationCommand",
    "model": "new",
    "inputParameters": [
        {
            "name": "hNodeId",
            "type": "hNodeId",
            "value": [{"columnName": "<INPUT_COLUMN_NAME>"}]
        },
        {
            "name": "worksheetId",
            "type": "worksheetId",
            "value": "W"
        },
        {
            "name": "selectionName",
            "type": "other",
            "value": "DEFAULT_TEST"
        },
        {
            "name": "newColumnName",
            "type": "other",
            "value": "<NEW_NAME_FOR_COLUMN>"
        },
        {
            "name": "transformationCode",
            "type": "other",
            "value": "import hashlib\nv = getValue(\"Values\")\nif v == 'NULL':\n    return v\nstem = 'http://www.iedb.org/value_specification/<VARIABLE_NAME>/'\nhash_object = hashlib.md5(v.encode())\nreturn stem + hash_object.hexdigest()"
        },
        {
            "name": "errorDefaultValue",
            "type": "other",
            "value": ""
        },
        {
            "name": "isJSONOutput",
            "type": "other",
            "value": "false"
        },
        {
            "name": "inputColumns",
            "type": "hNodeIdList",
            "value": "[{\"value\":[{\"columnName\":\"bcell_id\"}]}]"
        },
        {
            "name": "outputColumns",
            "type": "hNodeIdList",
            "value": "[{\"value\":[{\"columnName\":\"<NEW_NAME_FOR_COLUMN>\"}]}]"
        }
    ],
    "tags": ["Transformation"]
}
```
    

Karma Model, iteration 3

It would be a significant amount of work to write scripts to edit the model files directly. Perhaps this is infrastructure that we could invest in at some point, but not now.

At present our latest Karma model covers four independent variables plus a simple measurement of the presence or absence of an effect.

Variables are encoded as data item and value specification entities, linked to experiment and provenance_context instances to form a data table for that particular experiment. This could be thought of as a mapping from a database table (or view / query) to a generated graph model.

This is the core of the initial challenge with IEDB.
