Data Import Framework Evaluation Criteria - OpenData-tu/documentation GitHub Wiki

Authorship

| Version | Date | Modified by | Summary of changes |
|---------|------|-------------|--------------------|
| 0.1 | 2017-05-25 | Andres Ardila | Initial version |

What are we looking for?

We're looking for a framework that takes care of common data-import use cases (e.g. parsing CSV files) so that these don't have to be implemented from scratch. Secondarily, a supplemental scheduling component is required to automate the execution of jobs that must run at defined points in time. We need something that can easily be split into smaller components and distributed in the cloud (as opposed to a monolithic, does-it-all application).
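To make the "smaller components" idea concrete, here is a minimal sketch (all names hypothetical, not taken from any candidate framework) of an import pipeline split into a source, a parser, and a sink — three stages that could later run as separate processes or containers:

```python
import csv
import io

def read_source(stream):
    """Source stage: yields raw lines from any file-like object."""
    for line in stream:
        yield line

def parse_csv(lines):
    """Parser stage: turns raw CSV lines into dictionaries."""
    reader = csv.DictReader(lines)
    for row in reader:
        yield row

def load(rows, sink):
    """Sink stage: appends parsed rows to a destination."""
    for row in rows:
        sink.append(row)
    return sink

# Illustrative run against an in-memory CSV extract
raw = io.StringIO("city,temp\nVienna,21\nGraz,19\n")
result = load(parse_csv(read_source(raw)), [])
print(result)  # → [{'city': 'Vienna', 'temp': '21'}, {'city': 'Graz', 'temp': '19'}]
```

A framework meeting the requirements below would provide stages like these out of the box and let us swap in our own.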

Framework Requirements

  1. Open source (this isn't spelled out in the slides, but Pallas mentioned it — probably a good idea to validate). At the very least, an open license is required.
    • Active community
  2. Extensibility
    • Easy to inject user code
    • Easy for users to override behavior (inheritance, configuration, etc.)
  3. Distributed characteristics of the system
    • Not a monolith
    • Provides the ability to distribute at different points in the flow: e.g. allocation/execution of jobs should be deferrable; that is, we should be able to choose whether to spawn a new process, run in a Docker container, etc.
    • Scalability? Availability?
  4. Provides a flexible scheduler for job execution (i.e. on a time schedule, or run on demand)
    • This could potentially be a different component but easiest if it’s all covered in one
  5. Abstraction of well-known data access patterns/protocols
    • Local file
    • HTTP(S)
    • FTP
    • [Database connections: OLEDB, JDBC, etc.] Probably not critical, since users will be remote to the framework and direct database connections across such boundaries are rather uncommon; we would more likely receive data extracts in the formats listed above. Nonetheless, direct database connections would probably be useful for private-cloud deployments.
  6. Support for well-known data formats
    • JSON
    • Delimiter-separated values (DSV):
      • CSV
      • TSV
    • XML
    • Plain Text
    • Spreadsheet
      • Excel (XLS, OOXML...)
      • OpenDocument
  7. Checkpointing
    • In order to orchestrate distributed imports, each processor should be able to checkpoint its work and notify some monitor/controller
    • If halted, a process could be restarted at a specific point in the import without starting from zero (nice-to-have)
  8. Logging, success/failure notifications
  9. Ability to add new sources (i.e. "jobs") easily
    • Without needing to re-deploy
    • Without needing to rebuild
    • Ideally via an API
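The checkpointing requirement (item 7) can be sketched as follows — a processor persists the index of the last completed record, so a restarted run resumes where it left off rather than starting from zero. This is an illustrative toy, not any particular framework's API; the file-based checkpoint and the `run_import` function are assumptions for the example:

```python
import json
import os
import tempfile

def run_import(records, checkpoint_path, fail_at=None):
    """Process records, writing a checkpoint after each one.
    `fail_at` simulates a crash at that record index."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    processed = []
    for i in range(start, len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        processed.append(records[i])  # real import work would go here
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + 1}, f)  # notify/persist progress
    return processed

ckpt = os.path.join(tempfile.mkdtemp(), "import.ckpt")
data = ["row0", "row1", "row2", "row3"]
try:
    run_import(data, ckpt, fail_at=2)  # processes rows 0-1, then "crashes"
except RuntimeError:
    pass
resumed = run_import(data, ckpt)       # resumes at row 2
print(resumed)  # → ['row2', 'row3']
```

In a distributed setup the checkpoint would go to a shared store and the monitor/controller would read it, but the resume logic stays the same.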