Application Layer: Batch Processing

I-On Integration uses batch processing techniques to acquire and process all unstructured data. To apply these techniques we have opted for the Spring Batch framework, which enables the development of batch applications that run without user interaction.

Spring Batch is part of the Spring ecosystem, so it provides the same core features as the Spring Framework (inversion of control, dependency injection, etc.) while adding functionality that supports the implementation of Extract, Transform and Load (ETL) processes, such as transaction management, job processing statistics, job restart, and resource management.

It is scalable: it can handle a simple use case such as this project's (reading a file, transforming the data, and loading it into a shared file repository) as well as more complex ones (such as reading data in multiple chunks and transforming high volumes of data). Additionally, it provides an extensive set of features such as asynchronous processing, parallelism, retries, and conditional flows.

In essence, Spring Batch is a state machine in which states and transitions compose a batch job. Steps are the most common form of state and come in two types:

  • Tasklet-based: runs its execute method within the scope of a transaction, in a loop, until execute tells the step to stop (a minimal sketch follows this list).
  • Chunk-based: is intended for item-based processing. Each chunk-based step has up to three main parts: an ItemReader, an ItemProcessor (optional) and an ItemWriter.
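For concreteness, a tasklet-based step reduces to a single execute method. The following Kotlin sketch shows the shape such a step could take, using the "Download PDF" step described later as an example; the class name and body are assumptions, not the project's actual code.

```kotlin
import org.springframework.batch.core.StepContribution
import org.springframework.batch.core.scope.context.ChunkContext
import org.springframework.batch.core.step.tasklet.Tasklet
import org.springframework.batch.repeat.RepeatStatus

// Illustrative tasklet: execute runs inside a transaction and is re-invoked
// until it returns FINISHED (returning CONTINUABLE would loop again).
class DownloadPdfTasklet : Tasklet {
    override fun execute(contribution: StepContribution, chunkContext: ChunkContext): RepeatStatus {
        // ...fetch the file, validate it, and stage it for the next step...
        return RepeatStatus.FINISHED
    }
}
```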

The central piece of Spring Batch is the JobRepository, which maintains job state and metrics. It is shared by all components of the framework and, in this implementation, is backed by a PostgreSQL database.

Job execution is the responsibility of the JobLauncher. It is configured to use a new thread for each invocation, relying on a ThreadPoolTaskExecutor to execute each job.

Because the git implementation makes it challenging to maintain consistency across multiple parallel commits, and performance is not a crucial factor, the executor is limited to a single worker thread.
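A launcher configured along these lines could look as follows. This is a minimal sketch assuming Spring Batch 4's SimpleJobLauncher, not the project's exact wiring.

```kotlin
import org.springframework.batch.core.launch.support.SimpleJobLauncher
import org.springframework.batch.core.repository.JobRepository
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor

@Configuration
class JobLauncherConfig {

    @Bean
    fun jobLauncher(jobRepository: JobRepository): SimpleJobLauncher {
        val executor = ThreadPoolTaskExecutor().apply {
            corePoolSize = 1   // single worker thread: jobs are queued, never run in parallel
            maxPoolSize = 1
            initialize()
        }
        return SimpleJobLauncher().apply {
            setJobRepository(jobRepository)
            setTaskExecutor(executor)   // each launch runs asynchronously on the pool
            afterPropertiesSet()
        }
    }
}
```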

Defining a Standard Structure of a Batch Job

Each institution provides the information we aim to process in different formats, and in the case of ISEL each department supplies different PDF files, each with its own structure and specific content. Retrieving data from these files requires different implementations of the extraction process, described in the next sections.

Even accounting for these differences, a common set of steps has been defined for every processed PDF file, as shown below.

The "Download PDF" tasklet retrieves the file from the specified URI and confirms if it is a valid PDF file. Also verifies if it has been processed already by comparing its metadata against records of previous successful job executions.

Information extraction from the PDF file is performed by the "Extract PDF" tasklet, using tools such as iText and Tabula, resulting in a raw data object.

The "Create Business Objects" tasklet transforms the raw data into business objects, as per the model defined for each job type. This requires applying business logic to parse the raw data obtained previously.

Next, the "Create DTOs" tasklet converts the business objects into DTOs that will be sent to git. The DTOs reflect the output formats described in the Integration Data Model section.

Finally, the "Serialize and Push to Git" tasklet transfers the serialized DTOs to the file repository through the dispatcher component, following the folder structure defined for the job type.
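Put together, the common steps could be wired into a Spring Batch job roughly as follows. The bean names and tasklet parameters are illustrative assumptions (in a real context each tasklet bean would need a qualifier), and the listener is the NotificationListener described next.

```kotlin
import org.springframework.batch.core.Job
import org.springframework.batch.core.JobExecutionListener
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory
import org.springframework.batch.core.step.tasklet.Tasklet
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

// Hypothetical composition of the common steps into a single sequential job.
@Configuration
class TimetableJobConfig(
    private val jobs: JobBuilderFactory,
    private val steps: StepBuilderFactory
) {
    @Bean
    fun timetableJob(
        downloadPdf: Tasklet,
        extractPdf: Tasklet,
        createBusinessObjects: Tasklet,
        createDtos: Tasklet,
        serializeAndPush: Tasklet,
        notificationListener: JobExecutionListener
    ): Job =
        jobs.get("timetableJob")
            .listener(notificationListener)   // before/after job hooks
            .start(steps.get("downloadPdf").tasklet(downloadPdf).build())
            .next(steps.get("extractPdf").tasklet(extractPdf).build())
            .next(steps.get("createBusinessObjects").tasklet(createBusinessObjects).build())
            .next(steps.get("createDtos").tasklet(createDtos).build())
            .next(steps.get("serializeAndPushToGit").tasklet(serializeAndPush).build())
            .build()
}
```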

Once the job completes, successfully or not, the NotificationListener is triggered. It is an implementation of JobExecutionListener, which defines two methods: beforeJob and afterJob. The afterJob method logs the ExitStatus of the job and can be used to trigger notifications to other channels, depending on the result.
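A minimal sketch of such a listener, with the logging details as assumptions:

```kotlin
import org.slf4j.LoggerFactory
import org.springframework.batch.core.BatchStatus
import org.springframework.batch.core.JobExecution
import org.springframework.batch.core.JobExecutionListener
import org.springframework.stereotype.Component

@Component
class NotificationListener : JobExecutionListener {
    private val log = LoggerFactory.getLogger(NotificationListener::class.java)

    override fun beforeJob(jobExecution: JobExecution) {
        log.info("Starting job ${jobExecution.jobInstance.jobName}")
    }

    override fun afterJob(jobExecution: JobExecution) {
        // Log the outcome; this is also the hook for external notifications.
        log.info("Job finished with exit status ${jobExecution.exitStatus.exitCode}")
        if (jobExecution.status == BatchStatus.FAILED) {
            // e.g. push a failure notification to other channels here
        }
    }
}
```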

Job Engine

Acting as the main contact point for the Application Layer, the JobEngine exposes methods that allow creating new job executions, obtaining the list of currently running jobs, and retrieving details about a specific job. Every new job execution must be submitted through JobEngine::runJob, passing an AbstractJobRequest argument. AbstractJobRequest is a sealed class containing three parameters:

  • OutputFormat: An enum class containing the supported output file formats (JSON or YAML).
  • InstitutionModel: Object representing the institution whose data is to be obtained.
  • JobType: Another enum containing the desired job type to be run (timetable, evaluations, or calendar).

AbstractJobRequest is subclassed for each supported job type (a sketch of the hierarchy follows the list):

  • TimetableJobRequest: Adds a ProgrammeModel argument as one must be specified for each timetable job.
  • CalendarJobRequest: Does not require extra parameters.
  • EvaluationsJobRequest: Like TimetableJobRequest, it also requires a ProgrammeModel argument.
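In Kotlin, the hierarchy might look like the following sketch. The constructor shapes are assumptions based on the description above, and InstitutionModel and ProgrammeModel are assumed to be defined elsewhere in the domain model.

```kotlin
enum class OutputFormat { JSON, YAML }
enum class JobType { TIMETABLE, EVALUATIONS, CALENDAR }

// Sealed base class: the three common parameters shared by all job requests.
sealed class AbstractJobRequest(
    val format: OutputFormat,
    val institution: InstitutionModel,
    val type: JobType
)

// Timetable and evaluations jobs additionally require a programme.
class TimetableJobRequest(
    format: OutputFormat,
    institution: InstitutionModel,
    val programme: ProgrammeModel
) : AbstractJobRequest(format, institution, JobType.TIMETABLE)

class CalendarJobRequest(
    format: OutputFormat,
    institution: InstitutionModel
) : AbstractJobRequest(format, institution, JobType.CALENDAR)

class EvaluationsJobRequest(
    format: OutputFormat,
    institution: InstitutionModel,
    val programme: ProgrammeModel
) : AbstractJobRequest(format, institution, JobType.EVALUATIONS)
```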

When creating the new job instance, all relevant parameters are explicitly added and passed on to Spring Batch, making them available during execution and persisting them in the supporting database through the framework's internal mechanisms. This allows reliable, effortless saving of execution parameters, which can then be obtained through a regular database query. This data is also the source for both job queries exposed by the API: listing all running jobs and obtaining job details by ID.
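For illustration, mapping a request onto Spring Batch's JobParameters could look like this; the property accessors on the domain objects (identifier, acronym) are hypothetical.

```kotlin
import org.springframework.batch.core.Job
import org.springframework.batch.core.JobExecution
import org.springframework.batch.core.JobParametersBuilder
import org.springframework.batch.core.launch.JobLauncher

// Sketch: build JobParameters from the request and launch the job.
// Spring Batch persists these parameters in the JobRepository database.
fun runJob(request: TimetableJobRequest, job: Job, launcher: JobLauncher): JobExecution {
    val params = JobParametersBuilder()
        .addString("format", request.format.name)
        .addString("institution", request.institution.identifier)   // hypothetical property
        .addString("programme", request.programme.acronym)          // hypothetical property
        .addLong("timestamp", System.currentTimeMillis())           // distinguishes job instances
        .toJobParameters()
    return launcher.run(job, params)
}
```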

Dispatcher

The Dispatcher component, defined by the functional interface IDispatcher, routes parsed data to the target storage location, providing an abstraction layer over both the type of storage and how the data is submitted. The dispatcher's sole method, dispatch, expects three arguments:

  • A ParsedData object.
  • A String indicating the desired file name.
  • An OutputFormat.

ParsedData acts as a wrapper around data produced by the application, adding an identifier property and a getDirectory method that can be used to determine the file path where the output is saved.
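A sketch of this contract, with the exact signatures as assumptions (OutputFormat is the enum from the earlier sketch):

```kotlin
// Single-method (functional) interface: route one parsed object to storage.
fun interface IDispatcher {
    fun dispatch(data: ParsedData, filename: String, format: OutputFormat)
}

// Wrapper over parsed application data: adds an identifier and the
// directory the output file should be written to.
abstract class ParsedData(val identifier: String) {
    abstract fun getDirectory(): String
}
```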

DispatcherImpl is the current implementation of the IDispatcher interface and expects two additional arguments (injected by Spring):

  • An IFileWriter
  • An IGitHandlerFactory

IFileWriter is a generic functional interface whose purpose is to write input objects to disk using the requested OutputFormat. Its implementation, FileWriter, expects an ISerializer, also injected through Spring's dependency management. The ISerializer serializes input objects into Strings in the desired format (YAML or JSON).
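A minimal sketch of this serializer/writer pair, with the method names as assumptions:

```kotlin
import java.nio.file.Files
import java.nio.file.Path

// Turns an object into a String in the requested format (YAML or JSON).
fun interface ISerializer {
    fun serialize(obj: Any, format: OutputFormat): String
}

// Writes a serialized object to disk at the given path.
fun interface IFileWriter<T : Any> {
    fun write(obj: T, path: Path, format: OutputFormat)
}

class FileWriter<T : Any>(private val serializer: ISerializer) : IFileWriter<T> {
    override fun write(obj: T, path: Path, format: OutputFormat) {
        path.parent?.let { Files.createDirectories(it) }   // ensure target folder exists
        Files.writeString(path, serializer.serialize(obj, format))
    }
}
```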

The dispatcher's purpose is to orchestrate these components and guarantee that each received object is serialized, committed, and pushed to the target git repository.

Each object submission generates a new git commit that is then pushed onto the remote git branch through the IGitHandler interface, defined in the infrastructure layer.
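Tying the pieces together, an implementation could look like the sketch below, reusing the types from the sketches above. The git interfaces are hypothetical shapes; the real IGitHandler and IGitHandlerFactory live in the infrastructure layer and may differ.

```kotlin
import java.nio.file.Paths

// Hypothetical stand-ins for the infrastructure-layer git abstraction.
interface IGitHandler {
    fun commit(message: String)
    fun push()
}
fun interface IGitHandlerFactory {
    fun handler(): IGitHandler
}

// Illustrative orchestration: write the serialized object into the local
// repository tree, then commit and push it to the remote branch.
class DispatcherImpl(
    private val writer: IFileWriter<ParsedData>,
    private val gitHandlerFactory: IGitHandlerFactory
) : IDispatcher {
    override fun dispatch(data: ParsedData, filename: String, format: OutputFormat) {
        val path = Paths.get(data.getDirectory(), filename)
        writer.write(data, path, format)
        val git = gitHandlerFactory.handler()
        git.commit("Add ${data.identifier}")   // one commit per dispatched object
        git.push()
    }
}
```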