Aleph2 basic concepts

Data

Units of data in Aleph2 are called "data objects". This includes things as diverse as:

  • web pages - whether raw or annotated by natural language processing
  • video files
  • log records - individual or aggregated
  • objects generated by analytics on existing data
  • KML overlays
  • aircraft tracks
  • business transactions

Aleph2 describes a set of logical ways ("data services") in which data can be stored, indexed, and retrieved, eg:

  • Document: As an annotated "document" (a JSON object with a formatted sub-object describing entities, associations between entities, user comments etc)
  • Search Index: As a searchable object
  • Columnar: As a related set of columns
  • Graph: As a collection of nodes and edges
  • Storage Layer: As a set of "opaque" objects within a file
  • Temporal/Geo-spatial: enables time and geo-specific processing
  • Data warehouse: a relational view of the data well-suited to traditional OLAP-type processing

How an object is handled by these Aleph2 data services is defined by its "schema" (DataSchemaBean). The schema describes the properties relevant to each service - for example: which columns should be stored in columnar fashion, how the graph should be constructed from the objects, for how long objects should be stored, etc.
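
As an illustration, the per-service settings can be pictured as a nested configuration object, one sub-object per data service. The sketch below expresses that idea as a plain Java map - the key names are assumptions chosen to mirror the concepts above, not the actual DataSchemaBean fields:

```java
import java.util.List;
import java.util.Map;

// Illustrative only: the keys below mirror the *kinds* of per-service settings a
// DataSchemaBean carries (searchability, columnar fields, retention, raw storage);
// they are not the actual bean field names.
public class SchemaSketch {

    public static Map<String, Object> exampleSchema() {
        return Map.of(
            "search_index", Map.of("enabled", true),                                // make objects searchable
            "columnar", Map.of("field_list", List.of("src_ip", "dst_ip", "bytes")), // which fields to store as columns
            "temporal", Map.of("max_age", "30 days"),                               // how long objects should be stored
            "storage", Map.of("raw_enabled", true)                                  // keep the raw "opaque" objects
        );
    }

    public static void main(String[] args) {
        System.out.println(exampleSchema());
    }
}
```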

Import

Data is imported into Aleph2 via "buckets" (DataBucketBean). Buckets combine the following aspects of data ingest:

  • The data schema to be applied to all objects in this bucket
  • "Harvesting" - taking serialized data from whatever transport layer, and returning a set of (possibly opaque - but usually JSON) objects
  • "Enrichment" - taking the data from the harvest, filtering unwanted objects, formatting/creating the desired fields, applying internal or external functionality (eg geo-location, natural processing, lookups via other buckets, arbitrary business logic etc)

Data buckets also have standard metadata:

  • A set of access rights, as described below under "Security".
  • Grouping metadata - buckets can be grouped in a number of different ways:
    • Each bucket has a filesystem hierarchy (that actually physically maps onto where data is stored in the "Storage Service" ie HDFS) - multiple buckets can be referenced by parent folders.
    • Each bucket can also be assigned a non-unique alias, which can then refer to multiple buckets
    • Finally, a specific "multi-bucket" can be created that is a collection of other buckets.
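
Putting the above together, a bucket definition can be sketched roughly as follows. This is an illustration only - the key names are assumptions standing in for the real DataBucketBean structure:

```java
import java.util.List;
import java.util.Map;

// Illustrative only: an in-memory stand-in for the kind of information a
// DataBucketBean holds; key names are assumptions, not the actual bean fields.
public class BucketSketch {

    public static Map<String, Object> exampleBucket() {
        return Map.of(
            "full_name", "/security/logs/proxy",                             // filesystem-style hierarchy, maps onto the Storage Service (HDFS)
            "aliases", List.of("proxy_logs"),                                // non-unique alias, can be shared with other buckets
            "access_rights", Map.of("read", List.of("analyst_group")),       // security metadata (see "Security" below)
            "data_schema", Map.of("temporal", Map.of("max_age", "30 days")), // per-service settings, as in the schema sketch above
            "harvest_configs", List.of(                                      // "Harvest Technology"-specific JSON configuration
                Map.of("technology", "log_collector", "config", Map.of("port", 5140))
            )
        );
    }

    public static void main(String[] args) {
        System.out.println(exampleBucket());
    }
}
```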

Harvesting configuration consists of 3 different parts:

  • A "Harvest Technology" a JVM JAR implementing IHarvestTechologyModule whose callbacks are invoked whenever pre-defined actions occur on a bucket (eg it is created). The Harvester is then free to do whatever processing it wants (typically launching or re-configuring external processes such as Hadoop/Flume/Logstash/a web crawler) to ingest objects using either the HDFS file interface (for batch operations), or functionality provided by an injected IHarvestContext object for streaming operations.
  • Optionally, a set of "Harvest Module" JVM JARs whose format is defined by the author of the "Harvest Technology". Aleph2 enables the upload/access-permissions/discovery/retrieval of "Harvest Module" "Libraries". These libraries will typically provide deframing of the data from its transport layer and JSON-ification.
  • A list of "Harvest Technology"-specific JSON configuration objects

As data is harvested, its enrichment can take one of two forms:

  • Streaming enrichment, where each object is processed as soon as it is received
  • Batch enrichment, where enrichment is performed on sets of objects (which is more efficient but introduces latency, ie is not suitable for alerting)

(Streaming enrichment will use the Storm framework together with Kafka for messaging; Batch enrichment will use the Hadoop (YARN)/Spark framework).

Typically only one of the two is used, though both are supported (eg you might want to take log records, perform batch processing on them and then store them efficiently, but also perform a smaller set of enrichment processes in near-real-time, discarding most objects but broadcasting "alerts" to listeners - pub/sub is described below).

Enrichment consists of two lists (one for batch and one for streaming); each entry in a list is a set of the following artefacts:

  • A list of JVM JARs obtained from the "Enrichment Module" "Library". One of the JARs in this list must implement IEnrichmentBatchModule or IEnrichmentStreamingModule. (The others can be in arbitrary format and can be used to provide functional libraries - eg the Stanford NLP set of JARs, internal utilities etc)
  • A JSON configuration object passed into the module at startup.
  • The dependencies between the modules (which for batch processing can be used to enrich the objects in parallel)

As with the harvester, enrichment modules have an IEnrichmentModuleContext injected, enabling interaction with the core framework (eg to filter objects, log errors, etc).
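
As a toy picture of what a batch enrichment module does, the sketch below filters unwanted records and adds a derived field. It is a stand-in only - the real IEnrichmentBatchModule/IEnrichmentModuleContext contracts are not shown here:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy stand-in for a batch enrichment step: filter unwanted objects, then add a
// derived field. Not the real IEnrichmentBatchModule contract.
public class EnrichmentSketch {

    static List<Map<String, Object>> enrichBatch(List<Map<String, Object>> batch) {
        return batch.stream()
                .filter(obj -> obj.containsKey("src_ip"))        // filtering: discard records with no source IP
                .map(obj -> {
                    Map<String, Object> enriched = new HashMap<>(obj);
                    enriched.put("src_country", lookupCountry((String) obj.get("src_ip"))); // enrichment: eg geo-location
                    return enriched;
                })
                .collect(Collectors.toList());
    }

    // Stand-in for an external lookup (geo-IP database, another bucket, business logic, ...)
    static String lookupCountry(String ip) {
        return ip.startsWith("10.") ? "internal" : "unknown";
    }

    public static void main(String[] args) {
        List<Map<String, Object>> batch = List.of(
                Map.of("src_ip", "10.0.0.1", "bytes", 512),
                Map.of("bytes", 9)); // no src_ip -> filtered out
        System.out.println(enrichBatch(batch));
    }
}
```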

At the end of the enrichment stage (whether batch or streaming), the extracted, transformed, and enriched object is automatically passed to each of the data services defined in its schema for storage and indexing. It can also be broadcast across an "object bus" for analytics or API listeners to process - this is described below.

Note finally that a bucket can be created with no harvest/enrichment configuration. Such a bucket can point to an existing collection in the database, or simply be an empty bucket that is then populated manually or via analytic threads (discussed below).

Analytics

Conceptually, an "Analytic Thread" (AnalyticThreadBean) takes data from one or more populated buckets, applies arbitrary further processing via user-defined technologies (eg Hadoop, Spark, Storm, Mahout, Gephi).

Analytic Threads are contained in the bucket corresponding to the output location of their results.

Each "Analytic Thread" comprises the following:

  • The name of an "Analytic Technology", JVM JAR implementing IAnalyticsTechologyModule whose callbacks are invoked whenever a pre-defined action occurs (eg: user interaction, another analytic thread completes, a bucket obtains more data, on a regular schedule). The analytic technology will be responsible for queuing the desired analytics (defined by the remaining items on this list)
  • A set of inputs together with associated queries (in the "language" of whichever "Data Service" is being used - eg this could be "search term" queries, temporal queries, geo-spatial queries, "graph" queries etc)
  • A list of the following sub-objects:
    • A list of "Analytic Modules", JVM JARs managed by the Aleph2 "Library" whose format is defined by the corresponding "Analytic Technology"
    • A configuration object describing the details of the analytics (input/output/etc)
    • A set of dependencies within the analytic thread (eg run module1, then module2 etc)

The "Analytic Thread" will then run over the specified data and dump the output into one of more buckets with the appropriate data schemas. The output can treat existing data in the output buckets in one of the following ways:

  • Wipe and start again each time
  • Add data incrementally
  • Merge with existing data
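
The three output policies can be illustrated against a toy in-memory "bucket" keyed by object id; the names below are assumptions for illustration, not the Aleph2 API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the three output policies above, applied to an in-memory
// "bucket" keyed by object id. Names are assumptions, not the Aleph2 API.
public class AnalyticOutputSketch {

    enum OutputPolicy { WIPE_AND_REPLACE, APPEND, MERGE }

    static Map<String, Map<String, Object>> apply(OutputPolicy policy,
                                                  Map<String, Map<String, Object>> existing,
                                                  Map<String, Map<String, Object>> results) {
        Map<String, Map<String, Object>> out = new HashMap<>();
        if (policy != OutputPolicy.WIPE_AND_REPLACE) {
            out.putAll(existing);                // APPEND and MERGE keep the existing contents
        }
        results.forEach((id, obj) -> {
            if (policy == OutputPolicy.MERGE && out.containsKey(id)) {
                Map<String, Object> merged = new HashMap<>(out.get(id));
                merged.putAll(obj);              // merge the new fields into the existing object
                out.put(id, merged);
            } else {
                out.put(id, obj);                // add incrementally (or straight replace)
            }
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> existing = Map.of("id1", Map.of("count", 1));
        Map<String, Map<String, Object>> results  = Map.of("id1", Map.of("score", 0.9), "id2", Map.of("count", 3));
        System.out.println(apply(OutputPolicy.MERGE, existing, results));
    }
}
```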

Instead of taking data "At Rest" from a bucket, objects can be streamed "In Flight" for real-time or near-real-time analytics and alerting. In this case the analytic thread registers a bucket name and the pipeline stage (before enrichment, after enrichment, or after a named enrichment module, ie "in the middle").
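
A hypothetical sketch of that registration and the resulting pub/sub flow is shown below - the enum, interface, and method names are assumptions, not the actual Aleph2 API (in practice the messaging layer is Kafka, as noted above):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "In Flight" pub/sub on a bucket's object bus; names are
// assumptions, not the Aleph2 API.
public class InFlightSketch {

    enum PipelineStage { BEFORE_ENRICHMENT, AFTER_ENRICHMENT, AFTER_NAMED_MODULE }

    interface ObjectListener {
        void onObject(String jsonObject); // called for each object passing the chosen stage
    }

    // (bucket, stage) -> listeners; a real deployment would back this with message queues
    private final Map<String, List<ObjectListener>> subscriptions = new HashMap<>();

    void subscribe(String bucketFullName, PipelineStage stage, ObjectListener listener) {
        subscriptions.computeIfAbsent(bucketFullName + "#" + stage, k -> new ArrayList<>()).add(listener);
    }

    // Called by the (imaginary) pipeline as objects flow through the named stage
    void publish(String bucketFullName, PipelineStage stage, String jsonObject) {
        subscriptions.getOrDefault(bucketFullName + "#" + stage, List.of())
                     .forEach(l -> l.onObject(jsonObject));
    }

    public static void main(String[] args) {
        InFlightSketch bus = new InFlightSketch();
        bus.subscribe("/security/logs/proxy", PipelineStage.AFTER_ENRICHMENT,
                json -> System.out.println("alert candidate: " + json));
        bus.publish("/security/logs/proxy", PipelineStage.AFTER_ENRICHMENT, "{\"src_ip\":\"10.0.0.1\"}");
    }
}
```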

TODO some concrete examples

Access

TODO

In addition to its imported or analytics-derived data, "Data Buckets" will have 3 manually generated sets of associated artefacts:

  • "Related Documents" - (TODO) "Office" documents/PDFs/web pages related to the data within the bucket. This store will be accessed separately to the standard bucket data, but will be searchable.
  • "Spreadsheets" - (TODO) some abstraction of an Excel spreadsheet that enables "macros" to be inserted from the different data services associated with the bucket (eg columnar operations)
  • "Knowledge Graph" - (TODO) uses the graph db service (if enabled) to enable users to build or modify the graph db representation of the data within the bucket.

Security

Security in Aleph2 is delegated to a separate service, typically invoking an existing security scheme (eg Kerberos, IKANOWv1, OAuth2).

The proposed security architecture is described here.

Libraries

In the above overview, the following "plugin functionality" could be configured from libraries (SharedLibraryBean):

  • Harvest Technologies
  • Harvest Modules
  • Enrichment Modules
  • Analytic Technologies
  • Analytic Modules
  • Access Modules

Aleph2 provides a library upload, storage, and retrieval service. Libraries are tagged for discovery and have access tokens assigned to them to restrict who can use them. This means, for example, that different analytics or APIs can be restricted based on "user group" (eg commercial tiers for SaaS, by division in a large organization, etc).
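
For example, the per-library metadata this implies (cf SharedLibraryBean) can be sketched minimally as below; the field names and the access check are assumptions for illustration:

```java
import java.util.Set;

// Minimal sketch of per-library metadata (cf SharedLibraryBean); field names are
// assumptions, not the actual bean.
public record LibrarySketch(String path,               // where the uploaded JAR is stored
                            String moduleType,         // eg "harvest_technology", "enrichment_module"
                            Set<String> tags,          // tags for discovery
                            Set<String> accessTokens)  // restricts who can use the library
{
    // A caller may use the library if it holds at least one of its access tokens
    public boolean accessibleTo(Set<String> callerTokens) {
        return accessTokens.stream().anyMatch(callerTokens::contains);
    }

    public static void main(String[] args) {
        LibrarySketch lib = new LibrarySketch("/app/aleph2/library/nlp_enrichment.jar",
                "enrichment_module", Set.of("nlp"), Set.of("tier_premium"));
        System.out.println(lib.accessibleTo(Set.of("tier_premium"))); // true
    }
}
```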

Currently, for security reasons, only admins are able to upload libraries. The admin then sets the access tokens that decide who can use them.