# IEDB Database Karma Modeling Notes
These notes concern working with data from the IEDB system as a suitable target for the overall project.
James Overton specified a query that we can run on the server. We then edited it somewhat, selecting the following columns:
- reference
- host
- in vivo process 1
- adjuvant
- route
- dose_schedule
- iv1 immunogen
- iv1 immunogen-containing object
- antibody
- antigen
- assay
- results
```sql
-- Given our initial modeling work on Figs. 3 + 4 from Richardson 1998,
-- use the assay_type to select from the `bcell` table.
-- These are approximately the columns we want:
SELECT
-- article
article.pubmed_id,
article.reference_id,
-- host
bcell.h_organism_id AS host_taxon_id,
bcell.h_age AS host_age,
-- in vivo process 1
bcell.iv1_process_type,
bcell.iv1_adjuvants,
bcell.iv1_route,
bcell.iv1_dose_schedule,
-- iv1 immunogen
iv1_immunogen.object_type AS iv1_immunogen_type,
iv1_immunogen.object_sub_type AS iv1_immunogen_sub_type,
iv1_immunogen.object_description AS iv1_immunogen_description,
iv1_immunogen.organism_id AS iv1_immunogen_organism_id,
-- iv1 immunogen containing object
iv1_container.object_type AS iv1_container_type,
iv1_container.object_sub_type AS iv1_container_sub_type,
iv1_container.object_description AS iv1_container_description,
-- antibody
bcell.ab_type,
bcell.ab_materials_assayed,
bcell.ab_immunoglobulin_domain,
-- antigen
antigen.object_type AS antigen_type,
antigen.object_sub_type AS antigen_sub_type,
antigen.object_description AS antigen_description,
antigen.organism_id AS antigen_organism_id,
-- assay
assay_type.obi_id AS assay_type,
assay_type.assay_type AS assay_type_name,
-- results
bcell.as_num_subjects AS num_subjects,
bcell.as_num_responded AS num_responded,
bcell.as_response_frequency AS response_frequency,
bcell.as_location AS location,
bcell.as_char_value AS char_value,
bcell.as_num_value AS num_value,
bcell.as_inequality AS inequality,
bcell.as_type_id AS type_id
FROM bcell
LEFT JOIN assay_type ON bcell.as_type_id = assay_type.assay_type_id
LEFT JOIN object AS iv1_immunogen ON bcell.iv1_imm_object_id = iv1_immunogen.object_id
LEFT JOIN object AS iv1_container ON bcell.iv1_con_object_id = iv1_container.object_id
LEFT JOIN object AS antigen ON bcell.ant_object_id = antigen.object_id
LEFT JOIN article ON bcell.reference_id = article.reference_id
WHERE assay_type.assay_type = 'in vivo assay';
```
This returns all the data from the relevant epitope pages pertaining to an assay type, based on a high-level KEfED design pattern.
Related links: [IEDB Download Page], [Schema], [sql dump]
Data table we use for Karma Modeling: data_from_james_query_whole_db_invivoassay.csv
The model file for this is here:
We are here concerned with reconstructing the structure of data for an `Investigation` linked to a `Study Design Execution`, which then links to an `Experiment`. This `Experiment` then uses `has_participant` links to `Data Item` instances, which are in turn linked to `Value Specification` instances that provide the actual data elements. The additional KEfED class `provenance_context` then ties together the data records from each row to create a full KEfED-enabled data table.
This model is legitimately linked data: it is based on OBI and uses data from IEDB 'as is'.
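To make that chain concrete, here is a minimal sketch of one row's worth of structure in Python with rdflib. The `kefed:` namespace and its class and property names are placeholders standing in for the actual OBI/RO/KEfED terms, and the instance URIs follow the iedb.org/lod pattern described later in these notes:

```python
from rdflib import Graph, Literal, Namespace, RDF

# Placeholder namespace standing in for the actual OBI/RO/KEfED terms.
KEFED = Namespace("http://example.org/kefed/")
LOD = Namespace("http://www.iedb.org/lod/")

g = Graph()

investigation = LOD["investigation/12345"]
sde = LOD["study_design_execution/12345"]
experiment = LOD["experiment/12345"]
data_item = LOD["data_item/iv1_adjuvant/12345"]
value_spec = LOD["value_specification/iv1_adjuvant/12345"]
prov_ctx = LOD["provenance_context/12345"]

# Investigation -> Study Design Execution -> Experiment
g.add((investigation, RDF.type, KEFED.Investigation))
g.add((investigation, KEFED.has_study_design_execution, sde))
g.add((sde, RDF.type, KEFED.StudyDesignExecution))
g.add((sde, KEFED.has_experiment, experiment))
g.add((experiment, RDF.type, KEFED.Experiment))

# Experiment -> Data Item -> Value Specification -> value
g.add((experiment, KEFED.has_participant, data_item))
g.add((data_item, RDF.type, KEFED.DataItem))
g.add((data_item, KEFED.has_value_specification, value_spec))
g.add((value_spec, RDF.type, KEFED.ValueSpecification))
g.add((value_spec, KEFED.has_specified_value, Literal("alum")))

# The provenance_context ties together all records from one data row.
g.add((value_spec, KEFED.parameterizes, prov_ctx))

print(g.serialize(format="turtle"))
```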
Some observations:
- The core idea seems to work quite well. We still need to build the whole model for the example data and then run SPARQL queries over the data to validate it (see the query sketch after this list). This should be one of the evaluations of this work, and it can serve as the basis for then 'exploding' the data by going deeper into the experimental design of each experiment.
- The models are large, causing Karma to slow down significantly. Chatting with Dipsy about this, she suggested that we break the model up into separate pieces.
- One problem with this is that the model shown uses randomly generated URIs produced with Python's `uuid` module. These would not translate between models, so we'd need to figure out a better way of assigning anonymous URIs. This shouldn't be too difficult to fix.
- The modeling process is quite complicated, involving several connected motifs of `Data Item`/`Value Specification`/`value`, all linked to the `provenance_context` for each data element.
- Again following a conversation with Dipsy, about possibly scripting the generation of these motifs to automate model building, she points out that there is a section in the model file that contains the history of the commands executed to build the model. She says that we might be able to hack that part of the file to load new elements.
- If we do this, we need to load the new model in Karma, publish it, and then save it, since all the R2RML data will be regenerated from the execution of the new commands. This is worth investigating.
- The Karma tool does quite a nice job of working around the opacity of OBO Foundry identifiers: when you mouse over nodes in the interface, the label and comments appear in the tool. This makes it perfectly doable to create models here.
- This data could legitimately be packaged as nanopublications or some semantic equivalent.
- This could be a nice compelling story for a paper (see https://w3id.org/semsci/ as a possible venue)
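As a first cut at that SPARQL validation, a query along the following lines could check that each value specification is reachable from an experiment via a data item and is tied back to a provenance context. This is a sketch using the same hypothetical `kefed:` namespace as the snippet above, not the final OBI-based vocabulary, and the input file name is illustrative:

```python
from rdflib import Graph

g = Graph()
# RDF published from Karma; the file name is illustrative.
g.parse("data_from_james_query_whole_db_invivoassay.ttl", format="turtle")

# Every value specification should be reachable from an experiment via a
# data item, and should be tied back to a provenance context.
QUERY = """
PREFIX kefed: <http://example.org/kefed/>

SELECT ?experiment ?data_item ?value_spec ?value ?ctx
WHERE {
  ?experiment kefed:has_participant ?data_item .
  ?data_item  kefed:has_value_specification ?value_spec .
  ?value_spec kefed:has_specified_value ?value ;
              kefed:parameterizes ?ctx .
}
"""

for row in g.query(QUERY):
    print(row.experiment, row.data_item, row.value, row.ctx)
```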
Fixing the issues with UUIDs:
- If we are encoding a URI for an experimental variable, we use: `http://www.iedb.org/lod/data_item/<variable_name>/<bcell_id>`
- If we are encoding a URI for a text value that we want to be unique, we use a hashing mechanism.
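Both rules can be captured in one helper. A minimal sketch (the function names are illustrative; the URI stems mirror the patterns used in the steps below):

```python
import hashlib

def data_item_uri(variable_name: str, bcell_id: str) -> str:
    """Deterministic URI for an experimental variable, keyed on bcell_id."""
    return "http://www.iedb.org/lod/data_item/%s/%s" % (variable_name, bcell_id)

def hashed_value_uri(variable_name: str, text_value: str) -> str:
    """Deterministic URI for a text value: identical strings always hash
    to the same md5 digest, so they always mint the same URI."""
    digest = hashlib.md5(text_value.encode()).hexdigest()
    return "http://www.iedb.org/value_specification/%s/%s" % (variable_name, digest)

print(data_item_uri("iv1_adjuvant", "12345"))
print(hashed_value_uri("iv1_adjuvant", "alum(aluminum hydroxide gel)"))
```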
For each independent variable, perform the following additional steps:
- add a new column through a PythonTransformation for a `data item` URI: `return 'http://www.iedb.org/lod/data_item/<variable>/' + getValue("bcell_id")`
- add a new column through a PythonTransformation for a `value specification` URI: `return 'http://www.iedb.org/lod/value_specification/<variable>/' + getValue("bcell_id")`
- add a new column through a PythonTransformation for a `value specification` value, using the following code:
```python
import hashlib
v = getValue("Values")
if v == 'NULL':
    return v
stem = 'http://www.iedb.org/value_specification/iv1_adjuvant/'
hash_object = hashlib.md5(v.encode())
return stem + hash_object.hexdigest()
```
- assign semantic types to the two URI columns
- assign an object property to the value column
- link the `data_item` to the `value_specification` objects
- link the `value_specification` to the `provenance_context` via `parameterizes`
- link the `experiment` object to the `data_item` object.
This translates to 8 commands in the Karma interface: 3 × SubmitPythonTransformationCommand plus 5 others. We don't know exactly which commands these are as we execute them, but since we will need to repeat all 8 for each new column in the data table, we should attempt to write a scripting interface to speed this process up (see the generation sketch after the two examples below).
Here are two examples of the form of these commands, adding Python transformations for the URI columns.
Link to latest Karma model: data_from_james_query_whole_db_invivoassay.csv-model_19.ttl.txt
SubmitPythonTransformationCommand for a new data_item column with URI:
```json
{
  "commandName": "SubmitPythonTransformationCommand",
  "model": "new",
  "inputParameters": [
    {
      "name": "hNodeId",
      "type": "hNodeId",
      "value": [{"columnName": "<INPUT_COLUMN_NAME>"}]
    },
    {
      "name": "worksheetId",
      "type": "worksheetId",
      "value": "W"
    },
    {
      "name": "selectionName",
      "type": "other",
      "value": "DEFAULT_TEST"
    },
    {
      "name": "newColumnName",
      "type": "other",
      "value": "<NEW_NAME_FOR_COLUMN>"
    },
    {
      "name": "transformationCode",
      "type": "other",
      "value": "return 'http://www.iedb.org/lod/data_item/<VARIABLE_NAME>/' + getValue(\"bcell_id\")"
    },
    {
      "name": "errorDefaultValue",
      "type": "other",
      "value": ""
    },
    {
      "name": "isJSONOutput",
      "type": "other",
      "value": "false"
    },
    {
      "name": "inputColumns",
      "type": "hNodeIdList",
      "value": "[{\"value\":[{\"columnName\":\"bcell_id\"}]}]"
    },
    {
      "name": "outputColumns",
      "type": "hNodeIdList",
      "value": "[{\"value\":[{\"columnName\":\"<NEW_NAME_FOR_COLUMN>\"}]}]"
    }
  ],
  "tags": ["Transformation"]
}
```
SubmitPythonTransformationCommand for a new value_specification column with data_value (note the use of hashing to encode a string value as a URI):
```json
{
  "commandName": "SubmitPythonTransformationCommand",
  "model": "new",
  "inputParameters": [
    {
      "name": "hNodeId",
      "type": "hNodeId",
      "value": [{"columnName": "<INPUT_COLUMN_NAME>"}]
    },
    {
      "name": "worksheetId",
      "type": "worksheetId",
      "value": "W"
    },
    {
      "name": "selectionName",
      "type": "other",
      "value": "DEFAULT_TEST"
    },
    {
      "name": "newColumnName",
      "type": "other",
      "value": "<NEW_NAME_FOR_COLUMN>"
    },
    {
      "name": "transformationCode",
      "type": "other",
      "value": "import hashlib\nv = getValue(\"Values\")\nif v == 'NULL':\n    return v\nstem = 'http://www.iedb.org/value_specification/<VARIABLE_NAME>/'\nhash_object = hashlib.md5(v.encode())\nreturn stem + hash_object.hexdigest()"
    },
    {
      "name": "errorDefaultValue",
      "type": "other",
      "value": ""
    },
    {
      "name": "isJSONOutput",
      "type": "other",
      "value": "false"
    },
    {
      "name": "inputColumns",
      "type": "hNodeIdList",
      "value": "[{\"value\":[{\"columnName\":\"bcell_id\"}]}]"
    },
    {
      "name": "outputColumns",
      "type": "hNodeIdList",
      "value": "[{\"value\":[{\"columnName\":\"<NEW_NAME_FOR_COLUMN>\"}]}]"
    }
  ],
  "tags": ["Transformation"]
}
```
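Since the same command structure repeats for every variable, a first step toward that scripting interface could be to generate these JSON entries from a template. A minimal sketch follows; the function name, the variable list, and the derived column names are illustrative, and it only produces the command JSON: wiring the output back into a model file is the harder problem noted below.

```python
import json

def make_transformation_command(input_column, new_column, transformation_code):
    """Build one SubmitPythonTransformationCommand entry, mirroring the
    structure of the two examples above."""
    def param(name, ptype, value):
        return {"name": name, "type": ptype, "value": value}

    return {
        "commandName": "SubmitPythonTransformationCommand",
        "model": "new",
        "inputParameters": [
            param("hNodeId", "hNodeId", [{"columnName": input_column}]),
            param("worksheetId", "worksheetId", "W"),
            param("selectionName", "other", "DEFAULT_TEST"),
            param("newColumnName", "other", new_column),
            param("transformationCode", "other", transformation_code),
            param("errorDefaultValue", "other", ""),
            param("isJSONOutput", "other", "false"),
            param("inputColumns", "hNodeIdList",
                  json.dumps([{"value": [{"columnName": "bcell_id"}]}])),
            param("outputColumns", "hNodeIdList",
                  json.dumps([{"value": [{"columnName": new_column}]}])),
        ],
        "tags": ["Transformation"],
    }

# The four independent variables from the current model (illustrative list).
variables = ["iv1_adjuvants", "iv1_route", "iv1_dose_schedule", "iv1_process_type"]

commands = []
for var in variables:
    code = ("return 'http://www.iedb.org/lod/data_item/%s/' "
            "+ getValue(\"bcell_id\")" % var)
    commands.append(make_transformation_command(var, var + "_data_item_uri", code))

print(json.dumps(commands, indent=2))
```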
It would be a significant amount of work to write scripts to edit the model files directly. Perhaps this is infrastructure that we could invest in at some point, but not now.
At present, our latest Karma model covers four independent variables plus a simple measurement of the presence or absence of an effect.
Variables are encoded as `data item` and `value specification` entities, linked to `experiment` and `provenance_context` instances to form a data table for that particular experiment. This can be thought of as a mapping from a database table (or view/query) to a generated graph model.
This is the core of the initial challenge with IEDB.