Service schema: technology overrides - IKANOW/Aleph2 GitHub Wiki

Overview

Each element in bucket.data_schema contains, in addition to the generic fields (described here), a technology_override_schema map (stored as a JSON object) that lets bucket owners take advantage of features of the underlying service implementation (at the expense of portability).

This section describes the formats used so far. Note that typically these fields can be configured per bucket or globally - the specifics are case-by-case and are described below.

Elasticsearch - search_index_schema, temporal_schema, columnar_schema, document_schema, data_warehouse_schema

search_index_schema

The following parameters go in the technology_override_schema sub-object of data_schema.search_index_service. Note that these objects can also be specified globally in the properties file, using the format ElasticsearchIndexService.search_technology_override.<fields>=<value>

  • dual_tokenize_by_default: (false by default). If true, then all string fields generate two fields in Elasticsearch, "X" which is either tokenized or not depending on the generic schema, and "X.raw" (if tokenization is the default), or "X.token" (otherwise).
  • dual_tokenization_override: Allows the user to override the dual tokenization default using the "columnar schema", where *include* turns dual tokenization on, and *exclude* turns it off.
  • collide_policy: error|new_type - (defaults to new_type). Elasticsearch by default will error if it receives objects with inconsistent inferred schemas (eg field "a" is first a string, then a number). If new_type is set, then the platform automatically generates a new type using the type_name_or_prefix below, eg "type_1", "type_2", etc.
  • type_name_or_prefix: string - (defaults to "data_object" for "collide_policy": "error", to "type_" for "collide_policy": "new_type") the type used to store the objects in the given index.
    • NOTE that if collide_policy is set to "new_type" then this is treated as the prefix (the types are <prefix>1, <prefix>2, etc). As a result only the "_default_" mapping is copied into the template, all others are ignored.
  • verbose: true|false - if true then the generated elasticsearch mapping is returned when the bucket is updated.
  • index_name_override: string - (EXPERIMENTAL, requires admin access) points the bucket to an existing named index, instead of using the built-in Aleph2-generated name
  • target_max_index_size_mb: string - the default size at which an index is segmented (is overwritten by the standard target_write_settings under search_index_schema), defaults to unlimited.
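
The defaulting rules for collide_policy and type_name_or_prefix, and the field names produced by dual tokenization, can be sketched as follows. This is only an illustration of the rules as described above, not the platform's code: resolve_type_names and dual_field_names are hypothetical helper names, and the n_types argument simply stands in for "however many types the platform has had to create so far".

```python
from typing import Dict, List, Tuple


def resolve_type_names(override: Dict[str, str], n_types: int = 1) -> List[str]:
    """Return the Elasticsearch type names implied by a search_index_schema override.

    Hypothetical helper: it only mirrors the defaults listed in the text above.
    """
    policy = override.get("collide_policy", "new_type")
    if policy not in ("error", "new_type"):
        raise ValueError("collide_policy must be 'error' or 'new_type'")
    # The default name depends on the policy: a fixed type vs. a prefix.
    default_name = "data_object" if policy == "error" else "type_"
    name = override.get("type_name_or_prefix", default_name)
    if policy == "error":
        return [name]  # a single type, used as-is
    # With "new_type" the value is a prefix: <prefix>1, <prefix>2, ...
    return [f"{name}{i}" for i in range(1, n_types + 1)]


def dual_field_names(field: str, tokenized_by_default: bool) -> Tuple[str, str]:
    """Return the two field names generated when dual tokenization is on."""
    suffix = ".raw" if tokenized_by_default else ".token"
    return field, field + suffix


print(resolve_type_names({}, n_types=2))
print(resolve_type_names({"collide_policy": "error"}))
print(dual_field_names("message", tokenized_by_default=True))
```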

More advanced:

  • settings: object - for each type (specified by type_name_or_prefix, or "_default_" for all types), applies the elasticsearch settings to the template used to generate indexes for this bucket. (See example below)
  • aliases: object - for each type (specified by type_name_or_prefix, or "_default_" for all types), applies the elasticsearch aliases to the template used to generate indexes for this bucket. (See example below)
  • mappings and mapping_overrides: the difference between the 2 is that the entire object from mappings overwrites any defaults, whereas fields are copied from mapping_overrides field by field, and where the field is not specified, the default value is used
    • mappings: object - for each type (specified by type_name_or_prefix, or "_default_" for all types), applies the elasticsearch mapping to the template used to generate indexes for this bucket. (See example below)
      • (Note that only the mappings object from the type specified in type_name_or_prefix is used, or "_default_" if none specified)
      • (NOTE: see comments below if specifying custom _all or _source fields in the mapping)
    • mapping_overrides: object (as above, specified per type/"_default_") - allows users to specify top-level fields to be applied to the template used to generate indexes for this bucket (see example below).
      • NOTE: if mappings overrides _all or _source then a custom mapping_overrides must be specified without that field set (otherwise the defaults of {"_all": { "enabled": false }, "_source": { "enabled": true } } will overwrite the user settings)
        • eg set "mapping_overrides": { "_all": { "enabled": false } } if setting a custom _source in mappings
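
The interaction between mappings and mapping_overrides described in the NOTE above can be illustrated with a small sketch. It models one reading of the text (top-level fields from mapping_overrides are copied over the mapping, and the platform defaults are used only when the user supplies no mapping_overrides at all); the function name effective_top_level_fields is hypothetical and the real merge logic in the platform may differ.

```python
import copy
from typing import Any, Dict, Optional

# Platform defaults quoted in the NOTE above.
DEFAULT_MAPPING_OVERRIDES = {"_all": {"enabled": False}, "_source": {"enabled": True}}


def effective_top_level_fields(
    user_mapping: Dict[str, Any],
    user_overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Apply mapping_overrides on top of a mapping, one top-level field at a time.

    Assumption: if the user gives no mapping_overrides, the platform defaults apply.
    """
    overrides = DEFAULT_MAPPING_OVERRIDES if user_overrides is None else user_overrides
    merged = copy.deepcopy(user_mapping)
    # Each top-level field in the overrides replaces the same field in the mapping.
    merged.update(copy.deepcopy(overrides))
    return merged


custom = {"_source": {"enabled": True, "excludes": ["raw_payload"]}}

# No mapping_overrides: the default _source clobbers the custom one.
print(effective_top_level_fields(custom))

# Custom mapping_overrides without _source: the custom _source survives.
print(effective_top_level_fields(custom, {"_all": {"enabled": False}}))
```

The field name raw_payload is made up for the example.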

Note that the index name generated is in the format:

  • <bucket-name-summary>__<uuid> or <bucket-name-summary>__<uuid>_<dateformat>, where:
    • <bucket-name-summary> is formed by joining the first, penultimate, and last directories of the bucket path using "_" as a separator, converting to lower case, replacing all non-alphanumeric characters except "-" with "_", and collapsing runs of "__" into "_".
    • the UUID is the last 12 characters of the type 3 UUID generated from the bucket's full name
    • the <dateformat> is only used when the temporal service is enabled, and is one of "{yyyy-MM-dd-HH}", "yyyy-MM-dd", "YYYY.ww", "yyyy-MM", "yyyy"

For example: "/test+1-1/another__test/VERY/long/string" -> "test_1-1_long_string__2711e659d5a6"
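
The naming rule can be sketched in a few lines. This is a reconstruction from the description and the worked example above, not the platform's implementation. Assumptions: "_" is both the join separator and the replacement character, the summary and UUID are separated by "__" (as in the example), short paths use whatever directories exist, and the type 3 UUID is the un-namespaced MD5-based form (as produced by Java's UUID.nameUUIDFromBytes) over the UTF-8 bytes of the full name. The function names are hypothetical.

```python
import hashlib
import re
from typing import Optional


def bucket_name_summary(full_name: str) -> str:
    """Summarise a bucket path using its first, penultimate, and last directories."""
    parts = [p for p in full_name.split("/") if p]
    if len(parts) > 3:
        parts = [parts[0], parts[-2], parts[-1]]
    summary = "_".join(parts).lower()
    summary = re.sub(r"[^a-z0-9_-]", "_", summary)  # keep alphanumerics, "-" and "_"
    return re.sub(r"_{2,}", "_", summary)  # collapse runs of underscores


def uuid_suffix(full_name: str) -> str:
    """Last 12 hex characters of an MD5-based (type 3) UUID of the name.

    The version and variant bits of the UUID sit in earlier bytes, so the last
    12 characters are just the tail of the MD5 digest.
    """
    return hashlib.md5(full_name.encode("utf-8")).hexdigest()[-12:]


def index_name(full_name: str, date_suffix: Optional[str] = None) -> str:
    name = f"{bucket_name_summary(full_name)}__{uuid_suffix(full_name)}"
    return name if date_suffix is None else f"{name}_{date_suffix}"


print(bucket_name_summary("/test+1-1/another__test/VERY/long/string"))
print(index_name("/test+1-1/another__test/VERY/long/string", "2015-01-01"))
```

Whether the UUID part reproduces the value in the example depends on exactly which bytes the platform hashes, so treat the suffix as illustrative.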

As an example, here is the default configuration hardwired into the platform. For more details see the appropriate elasticsearch documentation:

{
  "collide_policy": "new_type",
  // (the default prefix is "type_", but don't specify it here since it would then be wrong if collide_policy is overridden to "error")
  
  "settings" : {
    "index.refresh_interval" : "5s",
    "index.indices.fielddata.cache.size": "10%" // (note, does not apply to doc values)
  },
  
  "aliases": {
     "my-alias-name": {} // (in this format because the {} can contain more advanced settings)
  },

  // Top level mapping fields, can be against a type, _default_, or * (catch all)
  "mapping_overrides": {
  	"*": { // applied as a back-stop
  		"_all" : {"enabled" : false},
  		"_source": {"enabled" : true}
  	}
  },
  
  "mappings" : { 
  	// (fielddata is overwritten when a matching column is specified; otherwise the defaults here are used)
  	// (note that non-string fields should be explicitly marked as "index": "not_analyzed" so the system knows it can use "doc_values", if so desired) 
    "_default_" : {
       "dynamic_templates" : [
       {
         "string_fields" : {
           "match" : "*",
           "match_mapping_type" : "string",
           "mapping" : {
             "type" : "string", "index" : "analyzed", "omit_norms" : true, "fielddata": { "format": "disabled" },
              "fields" : {
                 "raw" : {"type": "string", "index" : "not_analyzed", "ignore_above" : 256, "fielddata": { "format": "disabled" }}
               }
           }
         }
       },
       {
       	"all_other_fields": {
       		"match": "*",
       		"mapping": {
       			"type": "{dynamic_type}",
       			"index": "not_analyzed",
       			"fielddata": { "format": "disabled" } 
       		}
       	}
       } 
       ],
       "properties" : {
         "@timestamp": { "type": "date", "fielddata": { "format": "doc_values" }, "index": "not_analyzed" }
       }
    }
  }
}

columnar_schema

  • enabled_field_data_analyzed: object - the top-level keys are the data object fieldnames, or "_default_". The values are objects in the elasticsearch "fielddata" configuration format - this configuration is applied to analyzed fields that are designated columnar via the generic configuration
  • enabled_field_data_notanalyzed: object - as above: this configuration is applied to non-analyzed fields that are designated columnar via the generic configuration
  • default_field_data_analyzed: object - as above: this configuration is applied to analyzed fields that are designated not-columnar via the generic configuration
  • default_field_data_notanalyzed: object - as above: this configuration is applied to non-analyzed fields that are designated not-columnar via the generic configuration

As an example, here is the default configuration hardwired into the platform. For more details see the appropriate elasticsearch documentation:

{
	"enabled_field_data_analyzed": {
		"_default_": {
			"format": "fst"
		}
	},
	"enabled_field_data_notanalyzed": {
		"_default_": {
			"format": "doc_values"
		}
	},
	"default_field_data_analyzed": {
		"_default_": {
			"format": "disabled"
		}
	},
	"default_field_data_notanalyzed": {
		"_default_": {
			"format": "disabled"
		}
	}
}
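
To make the four keys concrete, here is a sketch of how a fielddata configuration could be picked for a single field. The lookup order (exact field name first, then "_default_") is an assumption based on the description of the top-level keys, and select_fielddata is a hypothetical helper rather than a platform function.

```python
from typing import Any, Dict

# The default configuration shown above.
DEFAULT_COLUMNAR_OVERRIDE = {
    "enabled_field_data_analyzed": {"_default_": {"format": "fst"}},
    "enabled_field_data_notanalyzed": {"_default_": {"format": "doc_values"}},
    "default_field_data_analyzed": {"_default_": {"format": "disabled"}},
    "default_field_data_notanalyzed": {"_default_": {"format": "disabled"}},
}


def select_fielddata(
    override: Dict[str, Dict[str, Any]], field: str, analyzed: bool, columnar: bool
) -> Dict[str, Any]:
    """Pick the fielddata settings for one field.

    "enabled_*" applies to fields marked columnar in the generic schema,
    "default_*" to all other fields.
    """
    prefix = "enabled" if columnar else "default"
    suffix = "analyzed" if analyzed else "notanalyzed"
    section = override.get(f"{prefix}_field_data_{suffix}", {})
    # Assumed precedence: a per-field entry wins over "_default_".
    return section.get(field, section.get("_default_", {}))


print(select_fielddata(DEFAULT_COLUMNAR_OVERRIDE, "host", analyzed=False, columnar=True))
print(select_fielddata(DEFAULT_COLUMNAR_OVERRIDE, "message", analyzed=True, columnar=False))
```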

temporal_schema

There are currently no technology overrides for the elasticsearch implementation of the temporal service.

document_schema

One issue with the deduplication functionality provided by the document schema is that it assumes that it is possible to perform exact ("term") matches on the specified fields. With elasticsearch this can be problematic because of the "analyzer" step, which can make querying on some fields "fuzzy". Elasticsearch enables each field to have different "views" of the data (eg "fieldA" analyzed, "fieldA.raw" non-analyzed).

To support this, the document_schema technology_override can be configured as follows:

//...
"document_schema": {
   "technology_override": {
      "default_modifier": string,
      "field_override": {
         string: string
      }
   }
}
//...

Where:

  • default_modifier: a string appended to any field name that has no entry in the field_override map (see below), eg ".raw" will map "fieldA" to "fieldA.raw"
  • field_override: a string/string map that converts the key to the value. Note "dot notation" is supported for both key and value, but with ":" replacing ".". Eg { "obj.fieldA": "obj.fieldA.numeric" } would map "obj.fieldA" to "obj.fieldA.numeric".

Note that currently there is no default transform, but soon Elasticsearch will deduce a sensible field based on the schema.
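
The two settings combine roughly as in the sketch below. It is only an illustration of the rules listed above: dedup_field is a hypothetical name, and accepting ":" as a stand-in for "." in both keys and values is an assumption about how the dot-notation remark is meant.

```python
from typing import Any, Dict


def dedup_field(field: str, tech_override: Dict[str, Any]) -> str:
    """Return the Elasticsearch field to use for exact ("term") matching."""
    raw = tech_override.get("field_override", {})
    # Assumption: ":" may be written instead of "." in keys and values.
    overrides = {k.replace(":", "."): v.replace(":", ".") for k, v in raw.items()}
    if field in overrides:
        return overrides[field]  # an explicit mapping wins
    # Otherwise append the default modifier (empty if none is configured).
    return field + tech_override.get("default_modifier", "")


config = {
    "default_modifier": ".raw",
    "field_override": {"obj:fieldA": "obj:fieldA:numeric"},
}
print(dedup_field("fieldB", config))
print(dedup_field("obj.fieldA", config))
```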

data_warehouse_schema

The Elasticsearch data warehouse service implementation is based on Hive, and specifically the elasticsearch-hadoop Hive integration.

See also these integration notes.

The technology_override_schema has the following format:

{
    "types": [ string ],
    "url_query": string,
    "json_query": { ... },
    "name_mappings": { string: string }
}

Where:

  • types is an array of strings that specifies the elasticsearch types that will be included in the Hive table. By default, the data warehouse service will try to use all types available in the bucket (which it currently only checks each time the bucket is updated).
  • url_query: (only one of this or the json_query can be specified) a "URL style query" for Elasticsearch, eg "q=FIELD:TERM"
  • json_query: (only one of this or the url_query can be specified) a full (standard) Elasticsearch query object
  • name_mappings: maps between ES field names and Hive-compatible ones (eg @timestamp in the example below) - search for es.mapping.names in the elasticsearch-hadoop documentation linked above.
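
As a rough illustration of the constraints above, a pre-flight check of the override object might look like this. It is a sketch only: validate_dw_override is a hypothetical helper, and the platform may validate differently (or not at all) when the bucket is submitted.

```python
from typing import Any, Dict


def validate_dw_override(override: Dict[str, Any]) -> Dict[str, Any]:
    """Check the data warehouse override against the rules described above."""
    if "url_query" in override and "json_query" in override:
        raise ValueError("only one of url_query and json_query can be specified")
    types = override.get("types", [])
    if not isinstance(types, list) or not all(isinstance(t, str) for t in types):
        raise TypeError("types must be an array of strings")
    mappings = override.get("name_mappings", {})
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in mappings.items()):
        raise TypeError("name_mappings must be a string/string map")
    return override


ok = validate_dw_override({
    "types": ["type_1"],
    "url_query": "q=FIELD:TERM",
    "name_mappings": {"datet": "@timestamp"},
})
print(ok["url_query"])
```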

Here's an example of using the technology_override_schema to override a field name, so that the Hive column datet reads from the Elasticsearch field @timestamp:

"data_warehouse_schema": {
    "enabled": true,
    "main_table": {"table_format": {
        "da": "STRING",
        "dp": "STRING",
        "ibyt": "BIGINT",
        "ipkt": "BIGINT",
        "obyt": "BIGINT",
        "opkt": "BIGINT",
        "sa": "STRING",
        "sp": "STRING",
        "td": "DOUBLE",
        "datet": "TIMESTAMP"
    }},
    "technology_override_schema": {
        "table_overrides": {
            "main_table": {
                "name_mappings": { "datet": "@timestamp" }
            }
        }
    }
},

Note that the field types must match the corresponding Elasticsearch types. No conversion happens, including between "long" ("BIGINT") and "int" ("INT"), "float" ("FLOAT") and "double" ("DOUBLE"), or "date" ("TIMESTAMP") and "long".
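
Because no conversion happens, it can be worth checking a table_format against the known Elasticsearch field types before enabling the table. The sketch below uses only the type pairs named in the note above; mismatched_columns is a hypothetical helper, the es_types argument is assumed to be a column-name to Elasticsearch-type map you have obtained yourself, and name_mappings are ignored for simplicity.

```python
from typing import Dict, Tuple

# Type pairs taken from the note above; other types are not checked.
ES_TO_HIVE = {
    "long": "BIGINT",
    "int": "INT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "date": "TIMESTAMP",
}


def mismatched_columns(
    table_format: Dict[str, str], es_types: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return {column: (declared Hive type, expected Hive type)} for mismatches."""
    problems = {}
    for column, hive_type in table_format.items():
        expected = ES_TO_HIVE.get(es_types.get(column, ""))
        if expected is not None and expected != hive_type.upper():
            problems[column] = (hive_type, expected)
    return problems


print(mismatched_columns(
    {"ibyt": "BIGINT", "td": "FLOAT"},
    {"ibyt": "long", "td": "double"},
))
```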
