refactoring protege owl scalability - NCIEVS/nci-protege5 GitHub Wiki

Refactoring Protege OWL for Scalability

(proposal for next major phase of NCI Thesaurus editing software)

Executive Summary

The next phase of the 2024/2025 work involves major changes to the protege software that NCI uses to maintain NCI Thesaurus. This work will essentially move away from holding the entire ontology in memory on the clients, relying instead on a database running behind the protege server. The protege server today provides the concurrent editing control that enables the recording of changesets, maintains various histories of edits, and provides snapshots of the ontology for clients to load on their local machines.

The main goal of this work is to make the client much thinner by using a centralized database for the ontology, fetching data only as needed to support the editing workflows of the modelers. The protege server will continue to act as a middle tier that coordinates between the modelers, captures histories for downstream processing, and ensures the updates are consistent.

This will provide the next level of scalability, and support the estimated rate of future growth of the terminology.

Two very critical requirements of this work are:

  • The user interface will remain exactly the same, other than minor changes stemming from the simplified approach, e.g. how login and/or syncing of updates occurs. It is imperative not to affect any current workflows.
  • The OWL input/output format will remain the same, to ensure no impact on downstream processes

The only thing modelers should notice is that the application is snappier: there should be no waiting when opening a project or logging in, and saves should be more immediate.

Another secondary but important long-term goal is to make the internal components more maintainable and understandable.

Introduction

The purpose of this next phase of work is a major refactoring of the protege components and the plugins in use by NCI, in order to make the application a thinner client that does not keep the entire ontology in memory. This is motivated by the size of NCI Thesaurus, which is now over 3.5 million axioms. Protege was designed to edit and manage multiple ontologies at the same time. At NCI we have only one, a very large one; we may manage others, but these are treated as separate projects, and we only ever edit one at a time. Protege was also designed with a powerful and flexible plugin facility, which allows the development of features that extend the core ontology editing functions.

All of these plugins and the protege components make use of the OWL API, and it's this component that will need to change to move away from an in-memory model. We begin with an overview of the overall architecture and how NCI makes use of it today. We'll then enumerate the many components in the system, specifying how each will change to support the new model.

Overview

Though Protege is intended for standalone ontology editing, at NCI we've developed a client/server plugin that implements a concurrent editing environment. Multiple modelers can edit the same ontology and the protege server maintains a linear version ordering. We've also developed several plugins that run on the client that support the complex workflows and curation process at NCI.

Protege Server

When users log in, they are authenticated with the protege server, which in turn authenticates them using NCI's LDAP servers. Users are presented with a list of projects they can edit; this list depends on which user logs in and the permissions they've been assigned. A project represents an ontology. A separate admin tool is used to build projects, beginning with an OWL file as input. The ontology is loaded as OWL into memory and serialized to a binary blob that is stored on the protege server. This binary provides the base, so-called snapshot, that all the modelers work with. The first time a modeler opens a project, this snapshot is loaded from the server onto the client, and edits can begin. As changes are made they are sent as changesets to the server, which enforces a linear order on the versions. Each client periodically polls the server to see if there are other changes to fetch; if so, they must be fetched locally before new changes can be committed. This doesn't prevent all conflicts, as the semantics of the description logic might create others. Periodically a user with manager privileges will use a revision history plugin tab to review the lists of changes and either accept or reject them.
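The commit protocol described above can be sketched as follows. This is an illustrative model, not the actual protege server code; class and method names are assumptions. The key point is that the server enforces a linear revision order: a client whose base revision is stale must fetch the missing changesets before its commit is accepted.

```python
# Sketch of linear changeset ordering (illustrative, not the real server API).

class ChangesetServer:
    def __init__(self):
        self.changesets = []  # changesets in linear revision order

    @property
    def head(self):
        # The current head revision number is simply the count of changesets.
        return len(self.changesets)

    def commit(self, base_revision, changeset):
        """Accept a changeset only if the client is at the head revision."""
        if base_revision != self.head:
            return None  # stale client: must fetch newer changesets first
        self.changesets.append(changeset)
        return self.head

    def fetch_since(self, revision):
        """Return the changesets a polling client has not yet seen."""
        return self.changesets[revision:]
```

A client that polls, finds it is behind, fetches with `fetch_since`, replays the changes locally, and then retries `commit` models the workflow the paragraph above describes.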

After the changes are reviewed and accepted, a process called squash takes the changesets, applies them to the baseline ontology, and creates a new snapshot. Checksums are used on the client machines to determine, at login time, whether the server snapshot was updated; if so, the new snapshot is fetched.
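The login-time check can be sketched as a simple checksum comparison. The use of SHA-256 and the function names here are assumptions for illustration; the real implementation may use a different digest or protocol.

```python
# Sketch of the snapshot-freshness check at login (illustrative only).
import hashlib

def checksum(snapshot_bytes):
    """Digest of a snapshot blob; SHA-256 is an assumption."""
    return hashlib.sha256(snapshot_bytes).hexdigest()

def needs_refetch(local_snapshot, server_checksum):
    # Refetch when there is no local snapshot yet, or when a squash on the
    # server produced a new baseline with a different checksum.
    if local_snapshot is None:
        return True
    return checksum(local_snapshot) != server_checksum
```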

The configuration information created when a new project is made is used to configure the various plugins in the client, so when a project is loaded the way the plugins work may change based on the user. For example, modelers can retire classes or merge them, but the classes are then retreed in the pre-retire or pre-merge branches so that they can be reviewed by managers, who will then retree them as retired. Managers are also allowed to use certain tabs in edit mode, whereas normal modelers can only use those tabs in read-only mode. So a project represents both an ontology and a set of roles and privileges that different users will have on the project.

So when a user logs in and loads a project, the snapshot of the ontology is loaded into memory and then the changesets from the server are loaded. The main idea in this new work is to get rid of the snapshot entirely, loading only the parts of the ontology that are needed from a database, which sits on the server behind the protege server. As edits are committed and changesets written on the server, the server will also update the database. The database we chose is Virtuoso, an RDF triple store. This fits best because OWL itself, as a language, is encoded in RDF triples. However, as noted, there are a number of issues with the OWL API, chiefly its size: OWL encompasses OWL-Lite, OWL-DL, and OWL-Full, while in Thesaurus we make use of only a small subset of OWL-DL.
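Demand loading, the replacement for the full snapshot, can be sketched as a small cache in front of the store. Here `store` stands in for the Virtuoso endpoint; in practice each miss would be a SPARQL query such as `SELECT ?p ?o WHERE { <classIRI> ?p ?o }`. The class name and endpoint shape are assumptions, not the actual design.

```python
# Sketch of demand loading: fetch only the triples for the class being
# edited, instead of holding the whole ontology in memory (illustrative).

class LazyOntologyClient:
    def __init__(self, store):
        self.store = store    # stands in for Virtuoso: subject IRI -> [(p, o)]
        self.cache = {}       # triples fetched so far
        self.fetches = 0      # round trips made, to show caching works

    def triples_for(self, class_iri):
        if class_iri not in self.cache:
            self.fetches += 1  # one round trip per uncached class
            self.cache[class_iri] = self.store.get(class_iri, [])
        return self.cache[class_iri]
```

Repeated navigation to the same class hits the cache, so the client only pays a round trip the first time a class is opened.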

EditTab

This plugin is the chief workhorse for editing Thesaurus. As the user navigates the taxonomy using the tree widget, the selected classes are loaded into EditTab. It has tabs for editing the tables of complex properties, i.e. properties that have properties on them, called qualifiers. These tables are built dynamically from the project configuration. Another tab is used for editing all the other simple single-valued properties, excluding some that are designated as read-only.
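A complex property row, as the tables above present it, can be pictured as a value plus qualifier name/value pairs. The specific property and qualifier names below are examples for illustration, not taken from the project configuration.

```python
# Sketch of a complex (qualified) property: a property value together with
# qualifiers, i.e. properties on the property (names are illustrative).
from dataclasses import dataclass, field

@dataclass
class QualifiedProperty:
    name: str                 # the complex property, e.g. a synonym property
    value: str                # the property value itself
    qualifiers: dict = field(default_factory=dict)  # qualifier name -> value

row = QualifiedProperty("FULL_SYN", "Melanoma",
                        {"term-type": "PT", "term-source": "NCI"})
```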

EditTab also contains a tab for more complex edits involving two classes. These include splits, merges, cloning, and a generic dual editing panel. Probably the most complicated workflow process is retirement. When a class is selected for retiring, its dependencies are first presented to the user for repair. These include the subclasses of the class as well as the classes that point to the retired class through role relationships. These all need to be repaired and/or retargeted. After this is complete the class can be retired, or in the case of a regular modeler, pre-retired until a manager reviews the work.
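The retirement pre-check above amounts to gathering two dependency lists before the retire can proceed. The triple layout and predicate names below are a simplification for illustration, not the actual data model.

```python
# Sketch of the retirement dependency check: find the subclasses of the
# class and the classes that reference it through role relationships,
# so they can be repaired or retargeted first (illustrative only).

SUBCLASS_OF = "rdfs:subClassOf"

def retirement_dependencies(triples, class_iri):
    """Return (subclasses, referrers) that must be repaired before retiring."""
    subclasses = [s for (s, p, o) in triples
                  if p == SUBCLASS_OF and o == class_iri]
    referrers = [s for (s, p, o) in triples
                 if p != SUBCLASS_OF and o == class_iri and s != class_iri]
    return subclasses, referrers
```

Only when both lists have been worked off can the class move to retired (or pre-retired, for a regular modeler).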

In addition to these complex editing processes, EditTab also supports a report writer tab for generating reports, and a batch loading and editing tool that takes inputs in the form of files containing new classes or modifications, and performs all the operations in a batch processing mode.

Because it does so much, EditTab is the plugin with the largest contact with the underlying protege-owl component and the OWL API. It also makes use of the client/server plugin.

Lucene Query

EVS History

Revision History

Sparql Query

Curator

Proposed Work

A change in the software this large requires a two-pronged approach, one bottom-up and the other top-down. We will first build some prototypes that exercise the new internal components, validating that they can support all the use cases in a performant way. We'll then refactor the protege components and plugins from the top down to make use of these new components. Though we will abandon the OWL API, we should be able to reuse many of its objects, as they are already in use by the plugins and they already have methods to construct them from RDF triples as well as methods to serialize to RDF.

Protege core

The core foundation so far appears to be the component requiring the least amount of change. It's responsible for the base framework, the notion of a workflow tab, and the plugin architecture. It also manages the connection dialogs, file input and output, and persistence of user preferences and layouts of the UI. NCI controls which tabs and views are usable or read-only using the login and configuration. We've made changes to core in order to support this. As part of this refactoring we will make these changes more generic and permanent.

Protege OWL and Client/Server

These are the components that require the most rework. The client/server plugin is currently part of the protege-owl repository. This component contains numerous UI widgets, views, and panels, all of which make use of a protege-owl model layer that itself sits on top of the OWL API. However, the separation of concerns is not as clean as it ought to be, which makes reworking it more delicate. This is the main reason why we propose working from the bottom up as well as the top down. The essential approach will be to build a new client/server and protege-owl model based on direct use of the RDF triple store. Reads will go directly to the database, and writes will be passed through the protege server tier in order to capture changesets and EVS history.
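The proposed read/write split can be sketched as follows. All class and method names here are assumptions for illustration: the point is only that reads bypass the server while writes go through it, so changesets and history are captured before the database changes.

```python
# Sketch of the proposed tiering: reads go straight to the triple store,
# writes pass through the server so history is captured (illustrative).

class ServerTier:
    def __init__(self, triple_store):
        self.triple_store = triple_store
        self.history = []              # captured changesets / EVS history

    def apply(self, changeset):
        # Record the changeset first, then update the database.
        self.history.append(changeset)
        for (s, p, o) in changeset:
            self.triple_store.setdefault(s, []).append((p, o))

class ThinClient:
    def __init__(self, triple_store, server):
        self.triple_store = triple_store  # reads go direct to the database
        self.server = server              # writes go through the server tier

    def read(self, subject):
        return self.triple_store.get(subject, [])

    def write(self, changeset):
        self.server.apply(changeset)
```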