Service schema generic settings

Overview

Each of the data services has a schema that defines how data objects are handled by that service.

This section describes these generic formats:

  • Search Index Service
  • Columnar Schema
  • Temporal Schema
  • Document Schema
  • Storage Schema
  • Data Warehouse Schema

(Note that, in addition, each technology (eg elasticsearch for the search index schema) has a set of more specific (non-generic) settings available - these are described here)

All schema

All of the schemas support the following 3 fields:

  • enabled: true|false - defaults to true if the root object is present
  • service_name: string - leave out unless the bucket/service is bound to a non-default service
  • technology_override_schema: object - format is service-and-technology-specific, described under "advanced settings"

For example:

{
   "enabled": true,
   "technology_override_schema": {  }
}
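
If the bucket/service is bound to a non-default service, the service_name field would be added as well (the service name shown here is purely illustrative):

{
   "enabled": true,
   "service_name": "my_non_default_service",
   "technology_override_schema": {  }
}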

The remainder of this page describes the generic service-specific parameters.

Search index schema

  • tokenize_by_default: Search indexes normally support tokenization (eg "the word" -> "the","word"), and by default this is enabled. Use this field to override that, ie set to false to treat each field like a single token/value (ie like a conventional database field).
  • tokenization_override: By default, two tokenization schemes are supported: "_none_" (no tokenization, as above) and "_default_" (whatever the search index's default is, eg decompose into words on " "). Different technologies (and configurations) might support additional tokenization schemes. This field is a map from tokenization scheme to a "columnar schema" (see below under "Columnar Schema", and also the example) that specifies which fields use that scheme.
    • (Normally only one of "_default_"/"_none_" is needed since all other fields default to tokenized/non-tokenized based on the tokenize_by_default field).
    • (Elasticsearch supports dual tokenization of fields to support both columnar-type operations and tokenized searches, see under Advanced schema configuration for details.)
  • type_override: Maps types to "columnar schemas" (see below under "Columnar Schema", and also the example).
    • All search indexes support "string"/"date"/"double"/"long"/"number". Elasticsearch also supports "ip".
  • target_index_size_mb: A user preference for the maximum size of the index files generated (in Megabytes) (defaults to no limit)
  • target_write_settings: A collection of write settings described below under "generic writer configuration"
"search_index_schema": {
   "tokenization_override": {
      "_default_": {
         "field_include_list": [ "tokenize_me" ],
         "field_include_pattern_list": [ "tokenize_me_*" ]
      },
      "_none_": {
         "field_include_list": [ "untokenize_me" ],
         "field_include_pattern_list": [ "untokenize_me_*" ]
      }
   },
   "type_override": {
      "ip": {
         "field_include_pattern_list": [ "*_addr" ]
      },
      "date": {
         "field_include_list": [ "ts", "te" ]
      }
   },
   "target_index_size_mb": 1000,
   "target_write_settings": {
      "target_write_concurrency": 10,
      "batch_flush_interval": 5
   }
}
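
As a further illustration (a sketch - the field names are purely illustrative), the same schema can instead disable tokenization by default and whitelist only the fields that should be tokenized:

"search_index_schema": {
   "tokenize_by_default": false,
   "tokenization_override": {
      "_default_": {
         "field_include_list": [ "text", "description" ],
         "field_include_pattern_list": [ "free_text_*" ]
      }
   }
}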

See here for technology-specific configuration for elasticsearch.

Columnar schema

The fields specified in the following include lists are treated as columns - ie take up (often significantly) more system resources but can have columnar operations such as multi-dimensional aggregation applied on/across them.

The exclude list is similar but will preclude fields that otherwise would be treated as columnar.

  • field_include_list: [ string, string, ... ] - a list of fully specified fieldnames to treat as columnar
  • field_exclude_list: [ string, string, ... ] - a list of fully specified fieldnames to exclude from columnar treatment
  • field_include_pattern_list: [ string, string, ... ] - a list of globs (eg value.string*, nested.**.last_field) to treat as columnar
  • field_exclude_pattern_list: [ string, string, ... ] - a list of globs to exclude from columnar treatment
  • field_type_include_list: [ string, string, ... ] - a list of field types (string/date/number) to default as columnar
  • field_type_exclude_list: [ string, string, ... ] - a list of field types to exclude from columnar treatment

NOTE: if an enabled but otherwise empty columnar schema is specified, then it applies the system defaults of "field_type_include_list": [ "string", "number", "date" ], ie everything.

For example:

"columnar_schema": {
   "field_include_list": [ "@timestamp", "other_date" ],
   "field_type_include_list": [ "string", "number" ],
   "field_exclude_pattern_list": [ "nested_object.**.high_volume_low_value.*" ],
   "field_exclude_list": [ "top_level_field_to_exclude" ]
}
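
And, per the note above, an enabled but otherwise empty schema simply applies the system defaults (ie treats all string/number/date fields as columnar):

"columnar_schema": {
   "enabled": true
}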

See here for technology-specific configuration for elasticsearch.

Temporal schema

  • time_field: "string" - the field to use as the time (will just use the current time if not populated)
  • grouping_time_period: "hourly"|"daily"|"weekly"|"monthly"|"yearly" - a string describing the time period granularity for searches
  • exist_age_max: "string" - a duration (eg "1 year", "3 months", "60 days") after which the data is aged out of the system

For example:

"temporal_schema": {
   "time_field": "@timestamp",
   "grouping_time_period": "monthly",
   "exist_age_max": "12 months"
}

See here for technology-specific configuration for elasticsearch.

Document schema

IMPORTANT: unless the document schema sets at least one of deduplication_policy or deduplication_fields, objects emitted into Aleph2 that would replace existing objects will be discarded.

  • deduplication_policy: one of "leave", "update", "overwrite", "custom", "custom_update"; described further in the javadocs
    • (if this is left blank but deduplication_fields is specified, then defaults to "leave", else performs no deduplication)
  • deduplication_fields: an array of fields that defines a unique key for the document (on which deduplication occurs). Partial keys are supported, however if a document has no matching fields it is discarded. If not specified (but deduplication_policy is set), then defaults to [ "_id" ]
  • deduplication_contexts: an array of bucket paths (including globs) that defines all the buckets over which deduplication occurs (eg [ "/bucket/harvest_set_*/deduplication_set_1/**", "/bucket/analytics_set_*/deduplication_set_1/**" ]). Note that odd behavior can result if the different buckets all have different deduplication_contexts.
    • Defaults to just the bucket being processed
  • custom_deduplication_configs: a list of enrichment configurations (currently: only the first is executed) that are run on the following data:
    • if custom_finalize_all_objects is false (the default): all incoming duplicate objects (in the custom_update case, excluding those whose timestamp is the same as or older than the existing object's)
    • if custom_finalize_all_objects is true: as above, plus all non-duplicate objects
    • Note that each duplicate key corresponds to a single call to onObjectBatch in the enrichment module, with batch being (in order) all incoming duplicates and all existing duplicates, and grouping_key set to the deduplication field/fields.
      • batch_size is somewhat different: if there are DB-side duplicates then it counts the incoming objects; if there aren't (ie in "finalize mode") it is empty; it can usefully be thought of as "a counter until you hit a duplicate DB object".
        • Unfortunately there is not currently a way of counting the number of DB objects without iterating through them.
        • Finally note that the IBatchRecord objects for DB-side duplicates in batch have the property injected set to true (and false for incoming objects)
  • custom_policy: lax, strict (default), or very_strict:
    • if lax: any objects of any type can be emitted from the custom enrichment module
    • if strict: if you emit a "new" object (ie one that doesn't have the same _id as one of the DB duplicates), then it must have a different set of deduplication fields.
    • if very_strict: only one object can be emitted from each call to onObjectBatch (and also delete_unhandled_duplicates defaults to true)
  • custom_finalize_all_objects: see above (only applies if deduplication_policy is custom or custom_update). See also the advanced schema.
  • delete_unhandled_duplicates: if true (defaults to false unless custom_policy is very_strict) then any duplicate objects that aren't overwritten by an incoming object are deleted.
    • So for example, in the custom case: if there were 3 objects with the same deduplication fields in the database, an incoming one is received, and the custom enrichment updates and emits the first, then the other two objects would be deleted.

For example:

{
   "deduplication_timing": "custom", //(currently only supported field)
   "deduplication_policy": "custom",
   "custom_policy": "strict",
   "deduplication_fields": [ "url", "source" ],
   "dedupliation_contexts": [ "/my/bucket", "/someone/else/s/bucket" ],
   "custom_finalize_all_objects": true,
   "delete_unhandled_duplicates": false,
   "custom_deduplication_configs": [ { "module_name_or_id": "/app/aleph2/library/my_custom_dedup_logic.jar" } ]
}
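
For comparison, a minimal non-custom configuration (a sketch - the field name and bucket path are illustrative) in which incoming duplicates simply overwrite the existing document, keyed on a single field:

{
   "deduplication_policy": "overwrite",
   "deduplication_fields": [ "url" ],
   "deduplication_contexts": [ "/my/bucket" ]
}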

Storage schema

The storage schema consists of 3 identical sub-schemas, one for each of the 3 stages of a data object's ETL lifecycle:

  • raw - the object as it is received from the external harvester
  • json - the object after it has been converted into a JSON object, but before any enrichment has occurred
  • processed - the object after all enrichment processes have been completed (eg including transforms, annotations etc)

The 3 sub-schemas have the following parameters:

  • grouping_time_period ("houly"|"daily"|"weekly"|"monthly"|"yearly"), which defines the granularity of the directories in which data objects are stored
  • exist_age_max (string), which defines how long before data objects are aged out
  • codec (string, optional) the compression codec - defaults to none, other supported values are: gz or gzip, sz or snappy, fr.sz or snappy_framed
  • (Not available in the August 2015 release) target_write_settings - a generic writer object described below

For example:

"storage_schema": {
   "enabled": true,
   "raw": {
      "grouping_time_period": "yearly",
      "exist_age_max": "5 years"
   },
   "json": {
      "grouping_time_period": "monthly",
      "exist_age_max": "1 year",
      "codec": "gz"
   },
   "processed": {
      "codec": "snappy_framed"
   }
}

Data warehouse schema

  • main_table: an object (format described below) that defines the SQL/HQL schema
  • views (NOT CURRENTLY SUPPORTED): a list of the same "table" objects as "main_table"; allows a set of views to be defined, each with a different subset of fields and an associated SQL query (or a technology-specific query via technology_override_schema) that acts on a subset of the data.

The table format is as follows:

{
   "database_name": string,
   "name_override": string,
   "view_name": string,
   "sql_query": string,
   "table_format": { ... }
}

Where:

  • database_name: the database name in which the table will be placed (defaults to "default").
  • name_override: by default the table name is the standard bucket signature - this field allows admins to override that with a more human-readable name (see the example after this list).
  • view_name (NOT CURRENTLY SUPPORTED): for views, the view name (which is used as a "sub collection" string at the end of the bucket signature unless name_override is set).
  • sql_query (NOT CURRENTLY SUPPORTED): for views, the SQL query to apply to get a subset of the data.
  • table_format: a JSON object that describes the schema. In the future it will be possible to leave it blank and have the system infer the format from the existing data, but this is not currently supported.
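
For example, a (hypothetical) main_table placed in a non-default database under a more readable table name:

"data_warehouse_schema": {
   "main_table": {
      "database_name": "netflow_db",
      "name_override": "netflow_summary",
      "table_format": { ... }
   }
}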

The table_format JSON format is as follows:

{
   // primitive:
   <key>: "CHAR"|"VARCHAR"|"DATE"|"TIMESTAMP"|"BINARY"|"STRING"|"DOUBLE"|"FLOAT"|"BOOLEAN"|"BIGINT"|"INT"|"SMALLINT"|"TINYINT",
   // object/struct
   <key>: { ... },
   // array
   <key>: [ string | { ... } ],
   // map
   <key>: [ string, 
            string | { ... } ],
   // union
   <key>: [ {},
            string | { ... }, 
            ...
          ],
}

Where:

  • <key> is the field name in the SQL/HQL table, and the value determines the type of that field:
    • if the value is a string, one of the values listed above, then the type is a primitive of that type
      • (note that "DECIMAL[(precision,scale)]" is not currently supported)
    • If the value is an array with a single value, then the type is either an array of primitives (single value is a string) or objects (single value is a JSON object)
    • If the value is an object then the type is an object (aka struct), with the format of that object described by the contents inside the {...}
    • If the value is an array of size 2 whose first element is a primitive type string, then the type is a map where the key (first element) is a primitive of the designated type and the value (second element) can be either a primitive or an object, as above
    • (If the value is an array whose first element is an empty object {}, then the type is a union of the data types indicated by the subsequent elements)
      • (eg [ {}, "BIGINT", "DOUBLE" ] would be a union of "BIGINT" and "DOUBLE")

(Note that the format is recursive, so each {...} object has the same format as the top-level object.)

Some examples:

A simple flat format:

                "data_warehouse_schema": {
                    "enabled": true,
                    "main_table": {"table_format": {
                        "da": "STRING",
                        "dp": "STRING",
                        "ibyt": "BIGINT",
                        "ipkt": "BIGINT",
                        "obyt": "BIGINT",
                        "opkt": "BIGINT",
                        "sa": "STRING",
                        "sp": "STRING",
                        "td": "DOUBLE",
                        "datet": "TIMESTAMP"
                    }},
                    "technology_override_schema": { ... }
                 }                 

A nested format:

                "data_warehouse_schema": {
                    "main_table": {
                        "table_format": {
                            "address": {
                                "city": "STRING",
                                "country": "STRING",
                                "state": "STRING",
                                "original": "STRING"
                            },
                            "email": {
                                "emailid": "STRING"
                            },
                            "metadata": {
                                "sourceurl": "STRING"
                            },
                            "name": {
                                "first": "STRING",
                                "last": "STRING"
                            }
                        }
                    }
                }
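
Finally, a sketch (with purely illustrative field names) combining the array, map, and union forms described above:

"data_warehouse_schema": {
   "main_table": {
      "table_format": {
         // array of primitives
         "tags": [ "STRING" ],
         // array of structs
         "events": [ { "name": "STRING", "ts": "TIMESTAMP" } ],
         // map with STRING keys and BIGINT values
         "counters": [ "STRING", "BIGINT" ],
         // union of BIGINT and DOUBLE
         "value": [ {}, "BIGINT", "DOUBLE" ]
      }
   }
}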

See also the advanced schema.

Generic writer configuration

Any schema that supports bulk write (currently search_index_service and storage_service) supports a set of generic write options, normally under the fieldname target_write_settings:

  • batch_max_objects: When writing data out in batches, the (ideal) max number of objects per batch write
  • batch_max_size_kb: When writing data out in batches, the (ideal) max size per batch write (in KB)
  • batch_flush_interval: When writing data out in batches, the (ideal) max time between batch writes (in seconds)
  • target_write_concurrency: A user preference for the number of threads that will be used to write data (defaults to 1)

eg

{
//...
   "target_write_settings": {
      "batch_max_size_kb": 10000000,
      "batch_max_objects": 10000,
      "target_write_concurrency": 10,
      "batch_flush_interval": 300
   }
//...
}

The defaults depend on the specific technology and service.
