# Scheduling
#### Authorship
|Version|Date|Modified by|Summary of changes|
|-------|----|-----------|------------------|
|0.1 | 2017-06-12 | Paul Wille | Initial version|
Within our Open Data Network, one important task is to schedule the importing processes that are defined as ETL processes. This requires a standalone system component that keeps track of all sources that are still active and whose recurring importing tasks need to be triggered.
<<<< TODO >>>>>
- schedule types (interval, cron-like, specific weekdays, etc.; see the sketch below)
- dynamically added jobs per source
<<<< TODO >>>>>
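To make the schedule types listed above concrete, here is a minimal sketch of how they could be expressed as standard cron expressions, which is the format both Kubernetes CronJobs and sidekiq-scheduler accept. The constant and key names are purely illustrative and not part of any actual configuration format of ours.

```ruby
# Illustrative only: how the schedule types listed above map to standard
# cron expressions (minute hour day-of-month month day-of-week).
SCHEDULE_EXAMPLES = {
  interval:  '*/30 * * * *', # every 30 minutes
  cron_like: '0 3 * * *',    # every day at 03:00
  weekdays:  '0 6 * * 1-5',  # Monday to Friday at 06:00
}
```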
The major design decision for this component, besides its functionality and its interfaces to other system components, was where it should reside.
We found three possible solutions that would suit our system:
1. A scheduler component within our administration/configuration/metadata system, tied to the sources registered there
2. A scheduler component within the component that handles the deployment of importer tasks
3. A scheduler within the importer framework
We decided on a mix of 1) and 2). Metadata and configuration management take place in the relational metadata system, where authors of importers register and configure their sources and importers. That is why we decided that the settings for importing schedules and intervals should also live there. On the other hand, it is desirable to keep the scheduling itself as close as possible to the component that handles the deployment and distribution of importing tasks.
At the time we had to decide how to design the importer, we had already agreed on using Kubernetes as the system that manages the deployment and distribution of importer tasks. Kubernetes comes with a scheduler of its own that also provides an API. We therefore decided that the interval settings are made within the configuration of data sources, while the scheduling itself is handled by Kubernetes: the management system calls the Kubernetes scheduler API with the settings made by the user, and Kubernetes takes care of the scheduling itself.
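To illustrate the chosen approach, the following is a minimal sketch, written in Ruby since the management system is a Ruby on Rails application, of what registering a per-source schedule with the Kubernetes API could look like. The `Source` struct, the namespace, the importer image, and the use of the kubeclient gem are assumptions for illustration, not our actual implementation; kubeclient generates its `create_*` methods from API discovery, so the exact call depends on the cluster's batch API version.

```ruby
require 'kubeclient'  # assumption: using the kubeclient gem to talk to the Kubernetes API

# Stand-in for whatever the metadata system stores about a registered source.
Source = Struct.new(:id, :slug, :cron_schedule)
source = Source.new(42, 'air-quality-vienna', '0 3 * * *')

# CronJob manifest derived from the user's schedule settings.
manifest = {
  apiVersion: 'batch/v1',                      # 'batch/v1beta1' on older clusters
  kind: 'CronJob',
  metadata: { name: "importer-#{source.slug}", namespace: 'importers' },
  spec: {
    schedule: source.cron_schedule,            # the cron expression configured by the author
    jobTemplate: {
      spec: {
        template: {
          spec: {
            containers: [{
              name:  'importer',
              image: 'registry.example.org/importer:latest',
              args:  ['--source-id', source.id.to_s]
            }],
            restartPolicy: 'OnFailure'
          }
        }
      }
    }
  }
}

# kubeclient derives its methods from API discovery, so for the batch group
# this becomes create_cron_job; rescheduling would use update_cron_job or
# patch_cron_job, and deactivating a source would delete the CronJob.
client = Kubeclient::Client.new(
  'https://kubernetes.default/apis/batch', 'v1',
  auth_options: { bearer_token: ENV['K8S_TOKEN'] }
)
client.create_cron_job(Kubeclient::Resource.new(manifest))
```

With this shape, the management system only transfers configuration once per change, and all actual triggering stays inside Kubernetes.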
A standalone version of 2) would have required our own implementation of a scheduler component. While it could handle the scheduling, it would still need to call the deploying component with the information about what to deploy and probably further information, e.g. which rule triggered the process. So Kubernetes would not have been free of adjustments either, since it still performs the deployment in the end. While communication between the management system and Kubernetes cannot be avoided, in our chosen approach this means transferring only configuration information, not task information every time a task is triggered by the scheduler. We nevertheless had a very concrete plan for building a scheduler this way: it would have used Sidekiq as an asynchronous background processor and the sidekiq-scheduler extension to schedule the jobs. Sidekiq-scheduler would have had the advantage of managing schedules dynamically from within the Ruby on Rails instance.
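For completeness, a minimal sketch of how that discarded design could have looked using sidekiq-scheduler's documented dynamic-schedule mode. The worker class and the `ImporterDeployer` call are hypothetical placeholders for the call into the deploying component.

```ruby
require 'sidekiq'
require 'sidekiq-scheduler'

# Hypothetical worker: it does not run the import itself but asks the
# deploying component to start an importer task for the given source.
class TriggerImportJob
  include Sidekiq::Worker

  def perform(source_id)
    ImporterDeployer.deploy(source_id)  # placeholder for the call towards Kubernetes
  end
end

# sidekiq-scheduler's dynamic mode (the scheduler must run with `dynamic: true`):
# whenever an author saves schedule settings in the management UI, the Rails
# app registers or updates the corresponding schedule entry.
Sidekiq.set_schedule(
  'import-source-42',
  { 'cron' => '0 3 * * *', 'class' => 'TriggerImportJob', 'args' => [42] }
)
```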
As mentioned in 3), there was also the possibility of having a scheduler within the importing framework. Our importer framework makes use of Spring Batch (<<<<>>>>>), and with Spring Cloud Dataflow the Spring framework also provides a component that could be used to launch and schedule our importing Spring tasks. This, however, would not make too much sense in our case, as the Spring Batch tasks, while written with the Spring framework, are themselves automatically coordinated and deployed by Kubernetes. It is therefore not guaranteed that a Spring instance is running at all times. We could enforce that by keeping a Spring Cloud Dataflow process running, but this would not allow using Kubernetes for orchestrating and automating the deployment of importers.