etl to elasticSearch - kwantu/platformconfiguration GitHub Wiki

back to main ...

ETL Engine

Setting up the ETL engine is as follows:

  1. Stop the etl process cron jobs from running.
  2. Make sure that the etl_type code for the job(s) you want to set up are defined in the etl_types table.
  3. Make sure that all the apps that you need are listed in the etl_appcodes table. Here we define an appcode for each app to use - by default it is the first 8 characters of the application uuid. But you can change it here if you want to. It has to be unique, and changing it after creating indexes will require a recreation of all the indexes.
  4. Select the apps that are linked to this etl_type in the etl_types_appid_mapping table. This might not apply to all etl_types, so it is managed through a table. There is a service in "/db/kwantu-resource/etl/etl_module.xqm" called etl:populateEtlAppidMappingForAType that will add all the apps adopted in the community config to the selected etl type. (We need to update this service to populate the et_appcodes if the app is not yet present there.
  5. Use the service to populate the elt_type for all the communities.
  6. Now insert the etl_type in the main etl_control service so that it knows which index to create and which external service to call to process the results of the search. The main structure here is a single search per etl_type that returns changed records, and then we process those records one by one. "/db/kwantu-resource/etl/etl_module.xqm" (later we must make this more modular).

Service to do the setup (To be improved)

You can use the service '://exist/rest/db/kwantu-resource/etl/populate_etlcontrol_for_all_communities.xql' to do the insert. (For now you need to set the following parameters in the service)

    $etlType := '<etl_type>',
    $pendingApproved := '<pending/approved/all>',
    $isActive := '<T/F>'
  1. Once the records are inserted, you can turn the scheduler services on. They execute the services in /db/kwantu-resource/etl/runBuild/*.
  2. Track the progress of these services in the etl_control table. There is a record for each etl_type and community. The status will start as notStarted. This will trigger the index function to be called. Once the index has been generated, then the next time that the service is called, it will identify a community-etl_code combination that has not yet been updated, lock it for the processor in question, and process till completion. The result column is locked for the processor, and once the processor starts with that record it will process it till completion.

ETL process and configuration

Service to track global database changes: etl_type: "all_dbs"

There is a service managed by the etl_type = "all_dbs" that prioritizes the trigger of updates. The process involves the following.

  1. The global changes feed is extracted by a service in exist-db.
  2. This returns a list of all communities where some change has taken place since the previous feed.
  3. For each of these communities it updates the etl Control Table, setting serverBookmark and serverSince fields for the latest value for that community.
  4. The etl Service then prioritizes those communities where changes have taken place, and processes them on a oldest first basis.
  5. Once it starts processing a specific community, it processes all the changes for that community to completion before moving on to a next community.
  6. See the Scheduler section

To create a new etl type

The process to create a etl process involves the following:

  1. Stop the etl process cron jobs from running.
  2. Create the etl_type code in the etl_types table.
  3. Make sure that all the apps that you need are listed in the etl_appcodes table. Here we define an appcode for each app to use. It has to be unique, and changing it after creating indexes will require a recreation of all the indexes.
  4. Select the apps that are linked to this etl_type in the etl_types_appid_mapping table. This might not apply to all etl_types, so it is managed through a table. There is a service in "/db/kwantu-resource/etl/etl_module.xqm" called etl:populateEtlAppidMappingForAType that will add all the apps to the selected etl type
  5. Use the service to populate the elt_type for all the communities. ** We need to create a service that will limit communities to only those that have the app installed - this is still to do. ** For now use the service '/db/kwantu-resource/etl/populate_etlcontrol_for_all_communities.xql' to do the insert.
  6. Now insert the etl_type in the main etl_control service so that it knows which index to create and which external service to call to process the results of the search. The main structure here is a single search per etl_type that returns changed records, and then we process those records one by one. "/db/kwantu-resource/etl/etl_module.xqm" (later we must make this more modular). 6a. Define the index to be created in etl:createIndexJson($etl_type as xs:string).
        case 'mainOracle' return
            map {
             "index": 
                map { "fields": array {( map { "type":"asc" },
                                         map {"meta-data.applicationId":"asc"},
                                         map {"complete": "asc"},
                                         map {"id": "asc"}
                                )}
                      },
             "type" : "json",
             "ddoc": "etl-exportToOracle-index",
             "name": "etl-exportToOracle-index"
            }

6b. Define the search query in etl:mangoQuery(...). Make sure that you use the use_index statement and that the index name is defined in the service that defines the search query.

        case 'elasticSpInstance' return
            map {
                 "selector": map {
                                  "type":"workflowInstanceSubProcess"
                 },
                 "use_index": "etl-elastic-type"
                }

ETL to Elasticsearch configuration

For each type of process to run, there is an etl_type that manages the process. This maps to the etl service in exist-db that knows how to deal with it.

The services

1. "elasticSpInstance" that processes all work flow instance changes to Elastic search

The process creates the following elasticSearch indexes that tracks each workflow instance:

2. "elasticWorkerObject" that processes all worker object instances changes to Elastic search

The process creates the following elasticSearch indexes that tracks each workflow instance.

3. "elasticRoles" that processes all the user assignment actions to Elastic search

The process creates the following elasticSearch indexes that tracks each workflow instance.

etl_type elt_type_function is_active notes
deleteProject deleteProject T EPWP Specific service
elasticRoles elasticRoles T To be finalised
elasticSDO elasticSDO T
elasticSpInstance elasticSpInstance T
elasticWorkerObject elasticWorkerObject T
enterpriseDev enterpriseDev T EPWP Specific service
mainOracle mainOracle T EPWP Specific service
participant participant T EPWP Specific service
pendingPart pendingPart T EPWP Specific service
processError processError T Generic service for error proceses
⚠️ **GitHub.com Fallback** ⚠️