# Scheduling
#### Authorship
|Version|Date|Modified by|Summary of changes|
|-------|----|-----------|------------------|
|0.1 | 2017-06-12 | Paul Wille | Initial version|
Within our Open Data Network, one important task is to schedule the importing processes that are defined as ETL processes. This requires a standalone system component that keeps track of all sources that are still active and whose recurring importing tasks need to be triggered.
<<<< TODO >>>>>
- schedule types (interval, cron-like, specific weekdays, etc.; see the sketch below)
- dynamically added jobs per source
<<<< TODO >>>>>
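To make the schedule types listed above concrete, here is a minimal sketch of how they could be expressed as standard cron expressions, which is the format both Kubernetes CronJobs and sidekiq-scheduler accept. The constant and key names are purely illustrative and not part of any actual configuration format of ours.

```ruby
# Illustrative only: how the schedule types listed above map to standard
# cron expressions (minute hour day-of-month month day-of-week).
SCHEDULE_EXAMPLES = {
  interval:  '*/30 * * * *', # every 30 minutes
  cron_like: '0 3 * * *',    # every day at 03:00
  weekdays:  '0 6 * * 1-5',  # Monday to Friday at 06:00
}
```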
The major design decision for this component, besides its functionality and its interfaces to other system components, was where it should reside.
We found three possible solutions that would suit our system:
1. A scheduler component within our administration/configuration/metadata system, tied to the sources registered there
2. A scheduler component within the component that handles the deployment of importer tasks
3. A scheduler within the importer framework
We decided on a mix of 1) and 2). Metadata and configuration management take place in the relational metadata system, where authors of importers register and configure their sources and importers. That is why we decided that the settings for importing schedules and intervals should also live there. On the other hand, it is desirable to keep the scheduling itself as close as possible to the component that handles the deployment and distribution of importing tasks.
At the time we had to decide how to design the importer, we had already agreed on using Kubernetes as the system that manages the deployment and distribution of importer tasks. Kubernetes comes with a scheduler of its own that also provides an API. We therefore decided that the interval settings are made within the configuration of data sources, while the scheduling itself is handled by Kubernetes: the management system calls the Kubernetes scheduler API with the settings made by the user, and Kubernetes takes care of the scheduling itself.
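To illustrate the chosen approach, the following is a minimal sketch, written in Ruby since the management system is a Ruby on Rails application, of what registering a per-source schedule with the Kubernetes API could look like. The `Source` struct, the namespace, the importer image, and the use of the kubeclient gem are assumptions for illustration, not our actual implementation; kubeclient generates its `create_*` methods from API discovery, so the exact call depends on the cluster's batch API version.

```ruby
require 'kubeclient'  # assumption: using the kubeclient gem to talk to the Kubernetes API

# Stand-in for whatever the metadata system stores about a registered source.
Source = Struct.new(:id, :slug, :cron_schedule)
source = Source.new(42, 'air-quality-vienna', '0 3 * * *')

# CronJob manifest derived from the user's schedule settings.
manifest = {
  apiVersion: 'batch/v1',                      # 'batch/v1beta1' on older clusters
  kind: 'CronJob',
  metadata: { name: "importer-#{source.slug}", namespace: 'importers' },
  spec: {
    schedule: source.cron_schedule,            # the cron expression configured by the author
    jobTemplate: {
      spec: {
        template: {
          spec: {
            containers: [{
              name:  'importer',
              image: 'registry.example.org/importer:latest',
              args:  ['--source-id', source.id.to_s]
            }],
            restartPolicy: 'OnFailure'
          }
        }
      }
    }
  }
}

# kubeclient derives its methods from API discovery, so for the batch group
# this becomes create_cron_job; rescheduling would use update_cron_job or
# patch_cron_job, and deactivating a source would delete the CronJob.
client = Kubeclient::Client.new(
  'https://kubernetes.default/apis/batch', 'v1',
  auth_options: { bearer_token: ENV['K8S_TOKEN'] }
)
client.create_cron_job(Kubeclient::Resource.new(manifest))
```

With this shape, the management system only transfers configuration once per change, and all actual triggering stays inside Kubernetes.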
A standalone version of 2) would have required our own implementation of a scheduler component. While it could handle the scheduling, it would still need to call the deploying component with the information about what to deploy and probably further information, e.g. which rule triggered the process. So Kubernetes would not have been free of adjustments either, since it still performs the deployment in the end. While communication between the management system and Kubernetes cannot be avoided, in our chosen approach this means transferring only configuration information, not task information every time a task is triggered by the scheduler. We nevertheless had a very concrete plan for building a scheduler this way: it would have used Sidekiq as an asynchronous background processor and the sidekiq-scheduler extension to schedule the jobs. Sidekiq-scheduler would have had the advantage of managing schedules dynamically from within the Ruby on Rails instance.
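For completeness, a minimal sketch of how that discarded design could have looked using sidekiq-scheduler's documented dynamic-schedule mode. The worker class and the `ImporterDeployer` call are hypothetical placeholders for the call into the deploying component.

```ruby
require 'sidekiq'
require 'sidekiq-scheduler'

# Hypothetical worker: it does not run the import itself but asks the
# deploying component to start an importer task for the given source.
class TriggerImportJob
  include Sidekiq::Worker

  def perform(source_id)
    ImporterDeployer.deploy(source_id)  # placeholder for the call towards Kubernetes
  end
end

# sidekiq-scheduler's dynamic mode (the scheduler must run with `dynamic: true`):
# whenever an author saves schedule settings in the management UI, the Rails
# app registers or updates the corresponding schedule entry.
Sidekiq.set_schedule(
  'import-source-42',
  { 'cron' => '0 3 * * *', 'class' => 'TriggerImportJob', 'args' => [42] }
)
```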
As mentioned in 3), there was also the possibility of having a scheduler within the importing framework. Our importer framework makes use of Spring Batch (<<<<>>>>>), and with Spring Cloud Dataflow the Spring framework also provides a component that could be used to launch and schedule our importing Spring tasks. This, however, would not make too much sense in our case, as the Spring Batch tasks, while written with the Spring framework, are themselves automatically coordinated and deployed by Kubernetes. It is therefore not guaranteed that a Spring instance is running at all times. We could enforce that by keeping a Spring Cloud Dataflow process running, but this would not allow using Kubernetes for orchestrating and automating the deployment of importers.