Comparison of Viable Data Import Frameworks - OpenData-tu/documentation GitHub Wiki

Authorship

| Version | Date | Modified by | Summary of changes |
|---------|------|-------------|--------------------|
| 0.1 | 2017-05-28 | Rohullah & Jawid | Initial version |
| 0.1a | 2017-06-06 | Andres Ardila | Formatting (mostly) |

In this document we compare several existing frameworks, measured against the established criteria for data import frameworks, on top of which we could build our own framework. Both functional and non-functional aspects of the frameworks are considered.

Spring Batch

  1. Reusable architecture framework
  2. Lightweight framework for enterprise batch job processing
  3. Open Source
  4. Reusable functions
    • logging/tracing
    • transaction management
    • job processing statistics
    • job restart
    • skip
    • resource management

Functionality

  • Commit batch process periodically
  • Concurrent batch processing: parallel processing of a job
  • Staged, enterprise message-driven processing
  • Massively parallel batch processing
  • Manual or scheduled restart after failure
  • Sequential processing of dependent steps (with extensions to workflow-driven batches)
  • Partial processing: skip records (e.g. on rollback)
  • Whole-batch transaction: for cases with a small batch size or existing stored procedures/scripts
  • The ability to stop/start/restart jobs and maintain state between executions.
  • Simple deployment model: the architecture JARs are built using Maven.
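
Several of the items above (periodic commits, skipping bad records, job statistics) can be illustrated with a minimal sketch in plain Java. This is not the Spring Batch API: the class `ChunkSketch`, its `run` method, and the `commitInterval` and `skipLimit` parameters are hypothetical names chosen for illustration, and the "commit" here is only an in-memory hand-off to a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Plain-Java illustration of chunk-oriented processing; not the Spring Batch API. */
class ChunkSketch {

    /** Counters comparable to the job statistics a batch framework records. */
    static final class Stats {
        int written;
        int skipped;
        int commits;
    }

    /**
     * Processes items one by one and "commits" (hands a chunk to the sink)
     * every commitInterval successfully processed items. Items whose
     * processing throws are skipped, up to skipLimit of them.
     */
    static <I, O> Stats run(List<I> items, Function<I, O> processor,
                            List<O> sink, int commitInterval, int skipLimit) {
        Stats stats = new Stats();
        List<O> chunk = new ArrayList<>();
        for (I item : items) {
            try {
                chunk.add(processor.apply(item));
            } catch (RuntimeException e) {
                if (++stats.skipped > skipLimit) {
                    throw new IllegalStateException("Skip limit exceeded", e);
                }
                continue; // skip the bad record and keep going
            }
            if (chunk.size() == commitInterval) {
                commit(chunk, sink, stats);
            }
        }
        if (!chunk.isEmpty()) {
            commit(chunk, sink, stats); // flush the final, partial chunk
        }
        return stats;
    }

    private static <O> void commit(List<O> chunk, List<O> sink, Stats stats) {
        sink.addAll(chunk);
        stats.written += chunk.size();
        stats.commits++;
        chunk.clear();
    }

    public static void main(String[] args) {
        List<Integer> sink = new ArrayList<>();
        Stats s = run(List.of("1", "2", "x", "4", "5"), Integer::parseInt, sink, 2, 1);
        System.out.println(sink + " written=" + s.written
                + " skipped=" + s.skipped + " commits=" + s.commits);
    }
}
```

In this sketch the unparsable record `"x"` is skipped and the remaining four records are written in two commits of two items each. A real framework would additionally wrap each commit in a transaction and persist the job state so that a failed job can be restarted.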

Architecture

Advantages

  • Scalability options
    • ranges from multi-threading within a single JVM to scaling across multiple JVMs
  • Integration with Spring Integration
  • Big data support
    • works well with Hadoop, YARN, Pig, Hive, MapReduce, SQLite etc.
  • Java or XML based configuration

Limitations

  • Not a scheduling framework
    • it has to be used in conjunction with a scheduler such as Quartz

Comparing Java EE with Spring Batch

  • The Java package javax.batch (JSR-352) can be used to write components similar to those of Spring Batch, but Spring Batch goes beyond what JSR-352 specifies: it adds dependency injection, supports inheritance, and supports both Java and XML configuration (JSR-352 supports XML configuration only).
  • Spring Batch requires fewer lines of code than Java EE to build a batch application.
  • Note: JSR-352 is included in Java EE 7.

Quartz - Job Scheduler

  1. A job scheduling library which can be integrated within virtually any Java application.
    • Simple and complex jobs
  2. Open source, licensed under the Apache License 2.0
  3. Always requires at least the slf4j-api JAR file
  4. Can run embedded within another free-standing application
  5. Jobs are scheduled to run when a given Trigger occurs
    • on certain times of the day
    • on certain days of the week
  6. There are no known competing open source projects (there are a few other open source schedulers, but they are basically just Cron replacements written in Java).
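
To make the idea of a calendar-based trigger concrete, here is a minimal sketch in plain Java that computes the next fire time for a job that should run on certain days of the week at a certain time of day. It is not the Quartz API: `WeeklyTriggerSketch` and `nextFireTimeAfter` are hypothetical names used only for illustration.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.EnumSet;
import java.util.Set;

/** Plain-Java illustration of a calendar-based trigger; not the Quartz API. */
class WeeklyTriggerSketch {
    private final Set<DayOfWeek> days;
    private final LocalTime timeOfDay;

    WeeklyTriggerSketch(Set<DayOfWeek> days, LocalTime timeOfDay) {
        if (days.isEmpty()) {
            throw new IllegalArgumentException("at least one day is required");
        }
        this.days = EnumSet.copyOf(days);
        this.timeOfDay = timeOfDay;
    }

    /** Returns the first fire time strictly after the given date-time. */
    LocalDateTime nextFireTimeAfter(LocalDateTime after) {
        LocalDateTime candidate = after.toLocalDate().atTime(timeOfDay);
        // At most eight candidates are checked: today plus the following seven days.
        while (!candidate.isAfter(after) || !days.contains(candidate.getDayOfWeek())) {
            candidate = candidate.plusDays(1);
        }
        return candidate;
    }

    public static void main(String[] args) {
        WeeklyTriggerSketch trigger = new WeeklyTriggerSketch(
                EnumSet.of(DayOfWeek.MONDAY, DayOfWeek.THURSDAY), LocalTime.of(2, 30));
        // 2017-05-28 was a Sunday, so the next fire time is Monday at 02:30.
        System.out.println(trigger.nextFireTimeAfter(LocalDateTime.of(2017, 5, 28, 12, 0)));
    }
}
```

A scheduling library adds what this sketch leaves out: persisting triggers, running jobs on a thread pool, and handling missed fire times.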

Why not use java.util.Timer?

  • Timers have no persistence mechanism.
  • Timers have inflexible scheduling (only able to set start-time & repeat interval, nothing based on dates, time of day, etc.)
  • Timers don't utilize a thread-pool (one thread per timer)
  • Timers have no real management schemes - you'd have to write your own mechanism for being able to remember, organize and retrieve your tasks by name, etc.
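
The points above can be seen in a short sketch that uses only the JDK. The class name `TimerSketch` and the method `runRepeatedly` are hypothetical; `java.util.Timer` and `TimerTask` are the standard classes. Note that the only scheduling inputs are a start delay and a repeat period, and that the schedule exists only in memory.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Demonstrates the scheduling inputs java.util.Timer accepts: a start delay and a period. */
class TimerSketch {

    /**
     * Runs a task repeatedly at a fixed period and reports whether the
     * requested number of runs completed within the timeout.
     */
    static boolean runRepeatedly(int runs, long periodMillis, long timeoutMillis) {
        CountDownLatch remaining = new CountDownLatch(runs);
        Timer timer = new Timer("import-timer", true); // one daemon thread serves this timer
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                remaining.countDown();
            }
        }, 0L, periodMillis); // start immediately, then repeat every periodMillis
        try {
            return remaining.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        } finally {
            timer.cancel(); // nothing is persisted; the schedule is lost here
        }
    }

    public static void main(String[] args) {
        System.out.println("completed: " + runRepeatedly(3, 50, 2000));
    }
}
```

Expressing "every Monday at 02:30" with this API would require computing the first start time by hand and tracking the task ourselves, which is the kind of work a scheduler such as Quartz is meant to take over.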

What Quartz isn't

  • It is not a job queue.
  • It is not a grid computation engine.
  • It is not a job execution service; it is a code library that we can embed (in our case, into Spring Batch) to schedule tasks.

Easy Batch

  1. A framework that simplifies batch processing with Java
  2. can be configured with a Java API or embedded in an application server

Functionality

  • Handling resource I/O
  • Data filtering/validation
  • Type conversion
  • Object marshalling/unmarshalling
  • Transaction management
  • Logging/reporting
  • Job monitoring/scheduling

Limitations

Compared with Easy Batch, Spring Batch offers advanced features such as retry on failure, remoting, flows, and data partitioning, and it implements JSR-352; Easy Batch does not.

Summer Batch

  1. Open source bulk processing framework for the .NET Framework
    • takes advantage of C#
    • a batch solution for Microsoft-based environments
    • relies on Unity 3.5 as its dependency injection container and on NLog 4.1.2 for logging
  2. Supports the features of the JSR-352 Java Batch Standard

Features

  • Repeatable and customizable batch jobs
  • Multi step jobs, with simple step sequences or conditional logic between them
  • In-memory or persisted job repository
  • Support for a Read-Process-Write logic, as well as arbitrary batchlet steps for a more complete control on behavior
  • Chunk-processed steps, with checkpoint management and restartability
  • Step partitioning used for parallel processing
  • Database readers and writers, with support for Microsoft® SQL Server, IBM® DB2 and Oracle® databases
  • Flat file readers and writers
  • Easy mapping between readers and writers and your domain classes
  • Batch contexts at step level and job level
  • XML design for main batch architecture, C# design for step properties
  • FTP operations support
  • Email sending support
  • SQL Scripts invocation support