resourceindex - ConstantB/ontop-spatial GitHub Wiki


This page contains information about our experimentation with the Semantic Index technique and the data from the Resource Index. It will be updated with more data about this experimentation as it becomes available. Future directions for this experimentation include the use of alternative DBMSs, e.g., PostgreSQL and MySQL; experimentation with redundancy elimination in the data; and experimentation with advanced RDBMS optimizations such as table partitioning.

Benchmark of Semantic Index vs. Reformulation only techniques

The objective of the tests was to benchmark the execution time of query reformulations based on (a) the semantic index, (b) traditional DNF query reformulation, and (c) CNF query reformulation. We tested on a DB2 database running on a dedicated Linux machine. This experimentation complements the information presented in (1).

The machine

We used a DB2 9.7 server hosted on a Linux virtual machine with 4x 2.67 GHz Intel Xeon processors and 8 GB of RAM. The DB2 instance was tuned for performance using the DB2 Automatic Configuration Advisor; the resulting configuration can be found at the end of this page. Note that this configuration is not yet optimal, and further work should be done to ensure that resources, especially memory, are allocated so as to maximize query performance.

The Data and Indexes

As described in (1), for this experimentation we used part of the data from the Resource Index. More information about this application can be found in (1) and at the Resource Index website.

Our local configuration included two main tables that we used for queries. The first is the OBR_CT_ANNOTATION table, where the annotations for the Clinical Trials resource are stored. This is the original data generated by the Resource Index, not expanded, and it is used as the ABox for the CNF and DNF query reformulations. We use two columns from this table, element_id and concept_id: the first is the id of a document, the second a concept used to annotate that document. We defined two reversible indexes over this table, i.e., (concept_id, element_id) and (element_id); the first index is also used for clustering the table. Construction of these indexes took 46m44.918s and 30m38.989s, respectively.
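
As a minimal sketch of what these index definitions could look like in DB2 (the index names are ours, and other storage options may have been used in the experiment):

CREATE INDEX IDX_OBR_CT_CONC_ELEM ON OBR_CT_ANNOTATION (concept_id, element_id) CLUSTER ALLOW REVERSE SCANS;
CREATE INDEX IDX_OBR_CT_ELEM ON OBR_CT_ANNOTATION (element_id) ALLOW REVERSE SCANS;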

The second table is "CT_ANN", where we store the annotations using the semantic index in two columns, ELEMENT_ID and IDX. This table is constructed from the data in OBR_CT_ANNOTATION combined with the semantic index information. For example, if OBR_CT_ANNOTATION contains the tuple (25, melanoma) and the index of melanoma is 67, then we add the tuple (25, 67). We use this table to execute the range queries generated by the semantic index technique. We defined two reversible indexes over this table, i.e., (idx, element_id) and (element_id); the first index is also used for clustering the table. Construction of these indexes took 41m45.997s and 28m58.048s, respectively. The semantic index data is generated using the ontology data from the Resource Index and our implementation of the semantic index method (see (1)).
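
For illustration, a single-concept query answered with the semantic index boils down to a range scan over CT_ANN. Continuing the example above, if melanoma and its subconcepts were assigned the interval [67, 89] (the upper bound is invented for the example), the query for melanoma would look roughly like:

SELECT DISTINCT element_id FROM CT_ANN WHERE idx BETWEEN 67 AND 89;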

There is a total of X tuples in each of these tables. The size of OBR_CT_ANNOTATION is X. The size of CT_ANN is X. The sizes of the indexes of OBR_CT_ANNOTATION are X and Y. The sizes of the indexes of CT_ANN are X and Y.

Data Expansion performance

Our tests with expanded data were very limited and we do not describe them here; instead we refer the reader to (1) and (6) for an overview. We also note that in the Resource Index website service, which is based on data expansion, all our queries appear to return within 0 to 3 seconds.

The Queries

We tested three queries. Each of them selects all the distinct resources in the Clinical Trials resource that are semantically tagged with the concepts in the query. Note that all these queries can be executed directly on the Resource Index website (the data returned by the website might differ slightly, since the online data is updated frequently).

We tested three forms of reformulation (illustrative SQL shapes are sketched right after this list):

  1. Semantic Index (SI): These are reformulations generated by a semantic-index-based reformulator; see (1) for a full description of these queries.
  2. DNF reformulation: This is a perfect reformulation in the classic sense. The perfect reformulation takes the TBox into account and generates a disjunction of conjunctions (a union of conjunctive queries). In our version we replaced the union of queries with a single query whose WHERE clause contains the corresponding disjunction, to make the queries as efficient as possible. The size of the reformulation is exponential in the size of the query and the TBox. See (2) and (3) for examples of techniques that generate these queries.
  3. CNF reformulation: This is a succinct way to express a UCQ; the query is smaller because it uses nested disjunctions. This kind of query is generated by techniques such as the one presented in (4).
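
To make the three shapes concrete, the following is a minimal sketch for a hypothetical two-atom query q(x) :- A(x) ^ B(x), where the rewriting of A is {a1, a2}, the rewriting of B is {b1}, and the semantic index assigns one range per concept. All range bounds, concept identifiers, and literal types below are invented for illustration; the actual values depend on the ontology, the semantic index assignment, and the schema.

-- Semantic Index (SI): one range condition per atom over CT_ANN
-- (a concept assigned several ranges gets a disjunction of range conditions,
--  cf. the "ranges involved" figures in the query sections below)
SELECT DISTINCT t1.element_id
FROM CT_ANN t1, CT_ANN t2
WHERE t1.element_id = t2.element_id
AND t1.idx BETWEEN 10 AND 15   -- hypothetical range for A
AND t2.idx BETWEEN 40 AND 40;  -- hypothetical range for B

-- DNF reformulation over OBR_CT_ANNOTATION: the per-atom disjunctions are
-- multiplied out, so the WHERE clause enumerates every combination
SELECT DISTINCT t1.element_id
FROM OBR_CT_ANNOTATION t1, OBR_CT_ANNOTATION t2
WHERE t1.element_id = t2.element_id
AND (   (t1.concept_id = 'a1' AND t2.concept_id = 'b1')
     OR (t1.concept_id = 'a2' AND t2.concept_id = 'b1') );

-- CNF reformulation over OBR_CT_ANNOTATION: one nested disjunction per atom
SELECT DISTINCT t1.element_id
FROM OBR_CT_ANNOTATION t1, OBR_CT_ANNOTATION t2
WHERE t1.element_id = t2.element_id
AND t1.concept_id IN ('a1', 'a2')
AND t2.concept_id IN ('b1');
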
The raw numbers for the execution of these tests can be found here: https://spreadsheets.google.com/spreadsheet/pub?hl=en&key=0AkzBFnHPAuKedE5zMl9mWVUzVGFZblFCVTFTcl82elE&hl=en&gid=1

The logs for the executions of the tests can be found in [attachment:log-warm.txt.zip] and [attachment:log-cold.txt.zip].

Query 1: Enzyme_Gene

The query is q(x) :- Enzyme_Gene(x). Total distinct records: 8454

Concepts involved:

  • Enzyme_Gene from the NCI Thesaurus
  • SI ranges involved: 1
  • SI mean execution time (cold): 3.848s (STD: 0.323s)
  • SI mean execution time (warm): 0.154s (STD: 0.061s)
  • DNF disjunctions involved: 934
  • DNF mean execution time (cold): 4.871s (STD: 0.258s)
  • DNF mean execution time (warm): 0.956s (STD: 0.100s)
  • CNF reformulation: not applicable, since there is only one atom in the query.
  • DAGMan call: GET RAND SQL RESOURCE_INDEX.CT_ANN http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Enzyme_Gene

Query 2: Melanoma and Breast_Cancer

The query is q(x) :- melanoma(x) ^ Breast_Cancer(x). Total distinct records: 72

Concepts involved:

  • melanoma from the Human Disease Ontology
  • Breast_Cancer from the Human Phenotype Ontology
  • SI ranges involved: 5 + 1
  • SI mean execution time (cold): 7.172s (STD: 0.189s)
  • SI mean execution time (warm): 3.544s (STD: 0.242s)
  • DNF disjunctions involved: 150
  • DNF mean execution time (cold): 4.039s (STD: 0.183s)
  • DNF mean execution time (warm): 0.369s (STD: 0.013s)
  • CNF disjunctions involved: 75 + 2
  • CNF mean execution time (cold): 3.601s (STD: 0.438s)
  • CNF mean execution time (warm): 0.202s (STD: 0.022s)
  • DAGMan call: GET RAND SQL CT_ANN http://purl.org/obo/owl/DOID#DOID_1909 http://purl.org/obo/owl/HP#HP_0003002

Query 3: DNA_Repair_Gene, Antigen_Gene and Cancer_Gene

The query is q(x) :- DNA_Repair_Gene(x) ^ Antigen_Gene(x) ^ Cancer_Gene(x). Total distinct records: 2

Concepts involved:

  • DNA_Repair_Gene from the NCI Thesaurus
  • Antigen_Gene from the NCI Thesaurus
  • Cancer_Gene from the NCI Thesaurus
  • SI ranges involved: 1 + 1 + 1
  • SI mean execution time (cold): 3.582s (STD: 0.166s)
  • SI mean execution time (warm): 0.082s (STD: 0.010s)
  • DNF disjunctions involved: 467874
  • DNF mean execution time (cold): could not be executed
  • DNF mean execution time (warm): could not be executed
  • CNF disjunctions involved: 99 + 34 + 139
  • CNF mean execution time (cold): 4.267s (STD: 0.436s)
  • CNF mean execution time (warm): 0.711s (STD: 0.057s)
  • DAGMan call: GET RAND SQL CT_ANN http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#DNA_Repair_Gene http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Antigen_Gene http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Cancer_Gene

Conclusions

The semantic index is in general faster than reformulation-only techniques, and it can be as fast as expansion-based query answering (see (1) and (6)). DNF-based reformulation is not scalable due to the exponential blow-up in the size of the queries; this holds even in languages simpler than DL-Lite without mandatory participation. The CNF alternative for rewritings is much more scalable than DNF; however, the size of the query is still considerably larger than that of the queries produced by a semantic index, and the scalability of CNF could become a problem for more complex queries.

We still need to study the query plans of Query 2 to understand why the performance of the SI in this query is not better than for the rest of the queries.

References

(1) M. Rodriguez and D. Calvanese. Dependencies: Making Ontology Based Data Access Work in Practice. In Proc. of the 5th Alberto Mendelzon International Workshop on Foundations of Data Management (AMW'11), Chile, 2011.

(2) D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Automated Reasoning, 39(3):385–429, 2007.

(3) H. Pérez-Urbina, B. Motik, and I. Horrocks. Tractable query answering and rewriting under description logic constraints. J. of Applied Logic, 8(2):186–209, 2010.

(4) R. Rosati and A. Almatelli. Improving query answering over DL-Lite ontologies. In Proc. of KR 2010, 2010.

(5) R. Kontchakov, C. Lutz, D. Toman, F. Wolter, and M. Zakharyaschev. The combined approach to query answering in DL-Lite. In Proc. of KR 2010, 2010.

(6) P. LePendu, N. Noy, C. Jonquet, P. Alexander, N. Shah, and M. Musen. Optimize first, buy later: Analyzing metrics to ramp-up very large knowledge bases. In Proc. of ISWC 2010, volume 6496 of LNCS, pages 486–501. Springer, 2010.

Appendix A: Database Preparation

The following steps are required to configure DB2, move the data from MySQL to DB2, create the semantic index data, load it into DB2, and create the required indexes:

  • Use the IBM Data Movement Tool (IBMDataMovementTool) to generate the transformation scripts. Do not transfer the data using the GUI.
  • Go to the scripts folder (.../migr/) and modify the file db2tables.sql, changing OBS_CONCEPT.FULL_ID from LONG VARCHAR to VARCHAR(246).
  • Use 'nohup ./unload &' to get the data (approx. 2 hrs).
  • Use 'nohup ./db2gen.sh &' to create the DB2 DB and load the data.
  • Increase the size of each log file to 1 GB (LOGFILSIZ 262144 pages of 4 KB each) and the number of primary log files to 80 with the following commands, then restart the database with db2stop and db2start:
UPDATE DATABASE CONFIGURATION USING LOGPRIMARY 80 IMMEDIATE ;
UPDATE DATABASE CONFIGURATION USING LOGFILSIZ 262144 IMMEDIATE ;
  • Create a new 1 GB bufferpool and associate it with the Resource Index tablespace:
CREATE BUFFERPOOL BIGBUFFERPOOL IMMEDIATE  SIZE 262144 AUTOMATIC NUMBLOCKPAGES 136363 BLOCKSIZE 32 PAGESIZE 4 K ;
ALTER TABLESPACE USERSPACE1 BUFFERPOOL BIGBUFFERPOOL;
  • To prepare the semantic index data we use the following commands
CREATE TABLE SEM_SKELETON (CONCEPT_ID INTEGER, PARENT_ID INTEGER);
INSERT INTO SEM_SKELETON SELECT DISTINCT CONCEPT_ID, PARENT_CONCEPT_ID FROM OBS_RELATION WHERE level = 1; 
and then execute DAGMAN with the c command to obtain the CSV files with the index data.
  • Create the CT_ANN table using the commands
DROP TABLE "RESOURCE_INDEX"."CT_ANN";
CREATE TABLE "RESOURCE_INDEX"."CT_ANN" (element_id INTEGER, idx INTEGER);
ALTER TABLE "RESOURCE_INDEX"."CT_ANN" ACTIVATE NOT LOGGED INITIALLY;
INSERT INTO "RESOURCE_INDEX"."CT_ANN" (element_id, idx) SELECT ct.element_id, sem.idx FROM OBR_CT_ANNOTATION ct JOIN SEM_INDEXES sem ON ct.concept_id = sem.concept_id;

โš ๏ธ **GitHub.com Fallback** โš ๏ธ