Service schema: technology overrides - IKANOW/Aleph2 GitHub Wiki
Each element in bucket.data_schema, in addition to the generic fields (described here), contains a technology_specific_overrides map (stored as a JSON object), which enables bucket owners to take advantage of underlying features of the implementation of the service (at the expense of portability).
This section describes the formats used so far. Note that typically these fields can be configured per bucket or globally - the specifics are case-by-case and are described below.
Elasticsearch - search_index_schema, temporal_schema, columnar_schema, document_schema, data_warehouse_schema
The following parameters go in the technology_override_schema sub-object of data_schema.search_index_service. Note that these objects can also be specified globally in the properties file, using the format ElasticsearchIndexService.search_technology_override.<fields>=<value>
- dual_tokenize_by_default: (false by default). If true, then all string fields generate two fields in Elasticsearch: "X", which is either tokenized or not depending on the generic schema, and "X.raw" (if tokenization is the default) or "X.token" (otherwise).
- dual_tokenization_override: allows the user to override the dual tokenization default using the "columnar schema", where *include* turns dual tokenization on, and *exclude* turns it off.
- collide_policy: error | new_type - (defaults to new_type). Elasticsearch by default will error if it receives objects with inconsistent inferred schemas (eg field "a" is first a string, then a number). If new_type is set, then the platform automatically generates a new type using the type_name_or_prefix below, eg "type_1", "type_2", etc.
- type_name_or_prefix: string - (defaults to "data_object" for "collide_policy": "error", to "type_" for "collide_policy": "new_type") the type used to store the objects in the given index.
  - NOTE that if collide_policy is set to "new_type" then this is treated as the prefix (the types are <prefix>1, <prefix>2, etc). As a result only the "_default_" mapping is copied into the template, all others are ignored.
- verbose: true | false - if true then the generated elasticsearch mapping is returned when the bucket is updated.
- index_name_overrride: string - (EXPERIMENTAL, requires admin access) points the bucket to an existing named index, instead of using the built-in Aleph2-generated name.
- target_max_index_size_mb: string - the default size at which an index is segmented (is overwritten by the standard target_write_settings under search_index_schema), defaults to unlimited.
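The naming rules in the list above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the platform's implementation: the function names resolve_type_name and string_field_names are hypothetical, and the defaults are copied from the descriptions above.

```python
from typing import List, Optional


def resolve_type_name(collide_policy: str = "new_type",
                      type_name_or_prefix: Optional[str] = None,
                      index: int = 1) -> str:
    """Return the Elasticsearch type name used for the index-th generated type."""
    if collide_policy == "error":
        # One fixed type; Elasticsearch rejects objects with conflicting schemas.
        return type_name_or_prefix or "data_object"
    if collide_policy == "new_type":
        # The configured value is treated as a prefix: <prefix>1, <prefix>2, ...
        return "{}{}".format(type_name_or_prefix or "type_", index)
    raise ValueError("collide_policy must be 'error' or 'new_type'")


def string_field_names(field: str,
                       tokenized_by_default: bool,
                       dual_tokenize: bool = False) -> List[str]:
    """Return the Elasticsearch field names generated for one string field."""
    if not dual_tokenize:
        return [field]
    # The second field holds the "other" view of the data.
    suffix = ".raw" if tokenized_by_default else ".token"
    return [field, field + suffix]


print(resolve_type_name())                   # default policy and prefix
print(resolve_type_name("error"))            # fixed default type
print(string_field_names("X", True, True))   # dual tokenization enabled
```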
More advanced:
- settings: object - for each type (specified by type_name_or_prefix, or "_default_" for all types), applies the elasticsearch settings to the template used to generate indexes for this bucket. (See example below)
- aliases: object - for each type (specified by type_name_or_prefix, or "_default_" for all types), applies the elasticsearch aliases to the template used to generate indexes for this bucket. (See example below)
- mappings and mapping_overrides: the difference between the two is that the entire object from mappings overwrites any defaults, whereas fields are copied from mapping_overrides field by field, and where the field is not specified, the default value is used.
  - mappings: object - for each type (specified by type_name_or_prefix, or "_default_" for all types), applies the elasticsearch mapping to the template used to generate indexes for this bucket. (See example below)
    - (Note that only the mappings object from the type specified in type_name_or_prefix is used, or "_default_" if none specified)
    - NOTE see comments below if specifying custom _all or _source fields in the mapping
  - mapping_overrides: object (as above, specified per type/"_default_") - allows users to specify top-level fields to be applied to the template used to generate indexes for this bucket (see example below).
    - NOTE if mappings overrides _all or _source then a custom mapping_overrides must be specified without that field set (otherwise the defaults of {"_all": { "enabled": false }, "_source": { "enabled": true } } will overwrite the user settings)
      - eg set "mapping_overrides": { "_all": { "enabled": false } } if setting a custom _source in mappings
Note that the index name generated is in the format:
- <bucket-name-summary>_<uuid> or <bucket-name-summary>_<uuid>_<dateformat>, where:
  - <bucket-name-summary> is formed by joining the first, penultimate, and last directories in the bucket path using "_" as a separator, converting to lower case, replacing all non-alphanumeric characters except "-" with "_", and collapsing "__"s into "_".
  - the UUID is the last 12 characters of the type 3 UUID generated from the bucket's full name
  - the <dateformat> is only used when the temporal service is enabled, and is one of "{yyyy-MM-dd-HH}", "yyyy-MM-dd", "YYYY.ww", "yyyy-MM", "yyyy"
- For example: "/test+1-1/another__test/VERY/long/string" -> "test_1-1_long_string__2711e659d5a6"
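The bucket-name summary rule can be sketched as follows. This is a minimal Python illustration, not the platform's code: the function name bucket_name_summary is hypothetical, the "_" separator and replacement character are inferred from the example above, and the handling of paths with three or fewer directories (all of them are used) is an assumption. The UUID suffix is deliberately left out, because the exact inputs to the type 3 UUID are not given here.

```python
import re


def bucket_name_summary(full_name: str) -> str:
    """Build the <bucket-name-summary> part of the index name."""
    parts = [p for p in full_name.split("/") if p]
    if len(parts) > 3:
        # First, penultimate, and last directories only.
        parts = [parts[0], parts[-2], parts[-1]]
    summary = "_".join(parts).lower()
    # Replace everything that is not alphanumeric or "-" with "_"
    # (existing underscores are kept as they are).
    summary = re.sub(r"[^a-z0-9_-]", "_", summary)
    # Collapse runs of underscores into a single "_".
    return re.sub(r"_{2,}", "_", summary)


print(bucket_name_summary("/test+1-1/another__test/VERY/long/string"))
```

For the example path this yields the "test_1-1_long_string" prefix of the index name shown above.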
As an example, here is the default configuration hardwired into the platform. For more details see the appropriate elasticsearch documentation:
{
  "collide_policy": "new_type",
  //(default prefix is _type, but don't specify it here since then it will mess up if override to collide_policy:"error")
  "settings" : {
    "index.refresh_interval" : "5s",
    "index.indices.fielddata.cache.size": "10%" // (note, does not apply to doc values)
  },
  "aliases": {
    "my-alias-name": {} // (in this format because the {} can contain more advanced settings)
  },
  // Top level mapping fields, can be against a type, _default_, or * (catch all)
  "mapping_overrides": {
    "*": { // applied as a back-stop
      "_all" : {"enabled" : false},
      "_source": {"enabled" : true}
    }
  },
  "mappings" : {
    // (fielddata is overwritten unless a matching column is not specified, in which case the defaults here are used)
    // (note that non-string fields should be explicitly marked as "index": "not_analyzed" so the system knows it can use "doc_values", if so desired)
    "_default_" : {
      "dynamic_templates" : [
        {
          "string_fields" : {
            "match" : "*",
            "match_mapping_type" : "string",
            "mapping" : {
              "type" : "string", "index" : "analyzed", "omit_norms" : true, "fielddata": { "format": "disabled" },
              "fields" : {
                "raw" : {"type": "string", "index" : "not_analyzed", "ignore_above" : 256, "fielddata": { "format": "disabled" }}
              }
            }
          }
        },
        {
          "all_other_fields": {
            "match": "*",
            "mapping": {
              "type": "{dynamic_type}",
              "index": "not_analyzed",
              "fielddata": { "format": "disabled" }
            }
          }
        }
      ],
      "properties" : {
        "@timestamp": { "type": "date", "fielddata": { "format": "doc_values" }, "index": "not_analyzed" }
      }
    }
  }
}
- enabled_field_data_analyzed: object - the top-level keys are the data object fieldnames, or "_default_". The values are objects in the elasticsearch "fielddata" configuration format - this configuration is applied to analyzed fields that are designated columnar via the generic configuration
- enabled_field_data_notanalyzed: object - as above: this configuration is applied to non-analyzed fields that are designated columnar via the generic configuration
- default_field_data_analyzed: object - as above: this configuration is applied to analyzed fields that are designated not-columnar via the generic configuration
- default_field_data_notanalyzed: object - as above: this configuration is applied to non-analyzed fields that are designated not-columnar via the generic configuration
As an example, here is the default configuration hardwired into the platform. For more details see the appropriate elasticsearch documentation:
{
  "enabled_field_data_analyzed": {
    "_default_": {
      "format": "fst"
    }
  },
  "enabled_field_data_notanalyzed": {
    "_default_": {
      "format": "doc_values"
    }
  },
  "default_field_data_analyzed": {
    "_default_": {
      "format": "disabled"
    }
  },
  "default_field_data_notanalyzed": {
    "_default_": {
      "format": "disabled"
    }
  }
}
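How the four maps above select a "fielddata" configuration can be sketched as a small lookup. This is an illustrative Python sketch under stated assumptions, not the platform's implementation: the function name field_data_for is hypothetical, and the rule that a field-specific key takes precedence over "_default_" is an assumption based on the description of the top-level keys.

```python
from typing import Any, Dict, Optional

# The default configuration shown above, as a Python dict.
DEFAULT_FIELD_DATA = {
    "enabled_field_data_analyzed": {"_default_": {"format": "fst"}},
    "enabled_field_data_notanalyzed": {"_default_": {"format": "doc_values"}},
    "default_field_data_analyzed": {"_default_": {"format": "disabled"}},
    "default_field_data_notanalyzed": {"_default_": {"format": "disabled"}},
}


def field_data_for(field: str,
                   analyzed: bool,
                   columnar: bool,
                   config: Optional[Dict[str, Any]] = None) -> Optional[Dict[str, Any]]:
    """Pick the fielddata settings for one field."""
    config = DEFAULT_FIELD_DATA if config is None else config
    # "enabled_*" applies to columnar fields, "default_*" to all others.
    key = "{}_field_data_{}".format(
        "enabled" if columnar else "default",
        "analyzed" if analyzed else "notanalyzed",
    )
    per_field = config.get(key, {})
    # Assumption: a field-specific entry wins over "_default_".
    return per_field.get(field, per_field.get("_default_"))


print(field_data_for("message", analyzed=True, columnar=True))
print(field_data_for("bytes", analyzed=False, columnar=True))
```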
There are currently no technology overrides for the elasticsearch implementation of the temporal service.
One issue with the deduplication functionality provided by the document schema is that it assumes that it is possible to perform exact ("term") matches on the specified fields. With elasticsearch this can be problematic because of the "analyzer" step, which can make querying on some fields "fuzzy". Elasticsearch enables each field to have different "views" of the data (eg "fieldA" analyzed, "fieldA.raw" non-analyzed).
To support that, the document_schema technology_override can be configured as follows:
//...
"document_schema": {
  "technology_override": {
    "default_modifier": string,
    "field_override": {
      string: string
    }
  }
}
//...
Where:
- default_modifier: a string appended to any string not matching the field_override map (see below), eg ".raw" will map "fieldA" to "fieldA.raw"
- field_override: a string/string map that converts the key to the value. Note "dot notation" is supported for both key and value, but with ":" replacing ".". Eg { "obj.fieldA": "obj.fieldA.numeric" } would map "obj.fieldA" to "obj.fieldA.numeric".
Note that currently there is no default transform, but soon Elasticsearch will deduce a sensible field based on the schema.
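The mapping from a deduplication field to the Elasticsearch field actually queried can be sketched as below. This is a hedged Python illustration, not the platform's code: the function name dedup_field_name is hypothetical, and accepting ":" as a stand-in for "." in both keys and values is an assumption about how the "dot notation" note is meant to be read.

```python
from typing import Dict, Optional


def dedup_field_name(field: str,
                     default_modifier: str = "",
                     field_override: Optional[Dict[str, str]] = None) -> str:
    """Return the Elasticsearch field to use for exact-match deduplication."""
    # Assumption: ":" in the override map stands in for "." in field paths.
    overrides = {
        key.replace(":", "."): value.replace(":", ".")
        for key, value in (field_override or {}).items()
    }
    if field in overrides:
        # An explicit override wins over the default modifier.
        return overrides[field]
    # Otherwise append the default modifier (empty means no transform).
    return field + default_modifier


print(dedup_field_name("fieldA", ".raw"))
print(dedup_field_name("obj.fieldA", ".raw", {"obj:fieldA": "obj:fieldA:numeric"}))
```

With no default_modifier and no matching override the field name is returned unchanged, which matches the note that there is currently no default transform.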
The Elasticsearch data warehouse service implementation is based on Hive, and specifically the elasticsearch-hadoop Hive integration.
See also these integration notes.
The technology_override_schema has the following format:
{
  "types": [ string ],
  "url_query": string,
  "json_query": { ... },
  "name_mappings": { string: string }
}
Where:
- types: an array of strings that specifies the elasticsearch types that will be included in the Hive table. By default, the data warehouse service will try to use all types available in the bucket (which it currently only checks each time the bucket is updated).
- url_query: (only one of this or the json_query can be specified) a "URL style query" for Elasticsearch, eg "q=FIELD:TERM"
- json_query: (only one of this or the url_query can be specified) a full (standard) Elasticsearch query object
- name_mappings: maps between ES field names and Hive-compatible ones (eg @timestamp in the example below) - search for es.mapping.names in the elasticsearch-hadoop documentation linked above.
Here's an example of using the technology_override_schema to map the Hive column datet onto the Elasticsearch field @timestamp:
"data_warehouse_schema": {
  "enabled": true,
  "main_table": {"table_format": {
    "da": "STRING",
    "dp": "STRING",
    "ibyt": "BIGINT",
    "ipkt": "BIGINT",
    "obyt": "BIGINT",
    "opkt": "BIGINT",
    "sa": "STRING",
    "sp": "STRING",
    "td": "DOUBLE",
    "datet": "TIMESTAMP"
  }},
  "technology_override_schema": {
    "table_overrides": {
      "main_table": {
        "name_mappings": { "datet": "@timestamp" }
      }
    }
  }
},
Note that the field types have to match the corresponding Elasticsearch types: no conversion happens, including between "long" ("BIGINT") and "int" ("INT"), "float" ("FLOAT") and "double" ("DOUBLE"), "date" ("TIMESTAMP") and "long", etc.
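The per-table override rules can be checked with a short Python sketch. It is illustrative only: the function names validate_query_override and es_mapping_names are hypothetical, and whether the platform builds the es.mapping.names value in exactly this way is an assumption. The "hive_column:es_field" comma-separated form is the one described for es.mapping.names in the elasticsearch-hadoop documentation.

```python
from typing import Any, Dict


def validate_query_override(override: Dict[str, Any]) -> None:
    """Reject overrides that set both url_query and json_query."""
    if "url_query" in override and "json_query" in override:
        raise ValueError("only one of url_query or json_query can be specified")


def es_mapping_names(name_mappings: Dict[str, str]) -> str:
    """Render name_mappings as an es.mapping.names style string."""
    # Each entry is "<hive column>:<elasticsearch field>".
    return ", ".join(
        "{}:{}".format(hive_col, es_field)
        for hive_col, es_field in name_mappings.items()
    )


table_override = {"name_mappings": {"datet": "@timestamp"}}
validate_query_override(table_override)
print(es_mapping_names(table_override["name_mappings"]))
```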