Operator property specification - Texera/texera GitHub Wiki

This document describes the properties of each operator in Texera. It defines the format in which Texera-GUI and Texera-Web exchange operators and query plans.

Authors: Zuozhi Wang, Kishore Narendran


Unless noted otherwise, every operator below has one required property, "attributes", and two optional properties, "limit" and "offset":

{
	"attributes" : "attr1_name, attr2_name, attr3_name",
	"limit" : "10 (this property is optional)",
	"offset" : "5 (this property is optional)"
}
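As a minimal sketch of how a consumer might read these shared properties, the Python snippet below splits "attributes" into a list and converts "limit" and "offset" to integers. It assumes that all values arrive as strings, as in the sample above; the function name parse_common_properties is hypothetical and not part of Texera.

```python
def parse_common_properties(props):
    """Split the shared operator properties into typed Python values."""
    # "attributes" is required: a comma-separated string of attribute names.
    if "attributes" not in props:
        raise ValueError('missing required property "attributes"')
    attributes = [name.strip() for name in props["attributes"].split(",") if name.strip()]

    # "limit" and "offset" are optional; None means the property was not given.
    # Assumption: numbers are transmitted as strings, e.g. "10".
    limit = int(props["limit"]) if "limit" in props else None
    offset = int(props["offset"]) if "offset" in props else None
    return attributes, limit, offset


example = {"attributes": "attr1_name, attr2_name, attr3_name", "limit": "10", "offset": "5"}
print(parse_common_properties(example))
```

Returning None for a missing "limit" or "offset" keeps "not given" distinct from an explicit value of 0.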

Matcher Operators:
{
	"operator_type" : "KeywordMatcher",
	"keyword" : "a_keyword",
	"matching_type" : "one of: [conjunction, phrase, substring]"
}

{
	"operator_type" : "DictionaryMatcher",
	"dictionary" : "dict_entry_1, dict_entry_2, dict_entry_3",
	"matching_type" : "one of: [conjunction, phrase, substring]"
}

{
	"operator_type" : "RegexMatcher",
	"regex" : "a_regex"
}

{
	"operator_type" : "FuzzyTokenMatcher",
	"query" : "a query of fuzzy token matcher",
	"threshold_ratio" : "0.8"
}

{
	"operator_type" : "NlpExtractor",
	"nlp_type" : "one of: [Noun, Verb, Adjective, Adverb, NE_ALL, Number, Location, Person, Organization, Money, Percent, Date, Time] (case insensitive)"
}
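Because "nlp_type" is case insensitive, a receiver may want to map any spelling back to the canonical name listed above. The following Python sketch shows one way to do that; the helper normalize_nlp_type is hypothetical and only illustrates the lookup, it is not taken from the Texera code base.

```python
# Canonical spellings, copied from the "nlp_type" list above.
NLP_TYPES = [
    "Noun", "Verb", "Adjective", "Adverb", "NE_ALL", "Number", "Location",
    "Person", "Organization", "Money", "Percent", "Date", "Time",
]

# Lower-cased name -> canonical name, built once for case-insensitive lookup.
_NLP_LOOKUP = {name.lower(): name for name in NLP_TYPES}


def normalize_nlp_type(value):
    """Return the canonical spelling of an nlp_type, ignoring case."""
    key = value.strip().lower()
    if key not in _NLP_LOOKUP:
        raise ValueError("unknown nlp_type: " + value)
    return _NLP_LOOKUP[key]


print(normalize_nlp_type("ne_all"))
print(normalize_nlp_type("PERSON"))
```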


{
	"operator_type" : "Join",
	"inner_attribute" : "inner_attr_name",
	"outer_attribute" : "outer_attr_name",
	"predicate_type" : "one of: [CharacterDistance, SimilarityJoin]",
	"threshold" : "10"
}
Note that Join does not have "attributes"; instead, it has "inner_attribute" and "outer_attribute".

{
	"operator_type" : "Projection",
	"attributes" : "attr_1_name, attr_2_name"
}

Source Operators:
The Keyword, Regex, FuzzyToken, and Dictionary matchers each have a corresponding source operator, which adds one more property, "data_source".

{
	"operator_type" : "KeywordSource",
	"data_source" : "data_source_name",
	"keyword" : "a_keyword",
	"matching_type" : "one of: [conjunction, phrase, substring]"
}

{
	"operator_type" : "DictionarySource",
	"data_source" : "data_source_name",
	"dictionary" : "dict_entry_1, dict_entry_2, dict_entry_3",
	"matching_type" : "one of: [conjunction, phrase, substring]"
}

{
	"operator_type" : "RegexSource",
	"data_source" : "data_source_name",
	"regex" : "a_regex"
}

{
	"operator_type" : "FuzzyTokenSource",
	"data_source" : "data_source_name",
	"query" : "a query of fuzzy token matcher",
	"threshold_ratio" : "0.8"
}
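Since each source operator carries the same properties as its matcher plus a data source, one can be derived from the other mechanically. The Python sketch below illustrates this under two assumptions: the property key is the snake_case "data_source", and the hypothetical helper to_source_operator is for illustration only.

```python
# Matcher type -> corresponding source type, as listed in this section.
SOURCE_TYPES = {
    "KeywordMatcher": "KeywordSource",
    "DictionaryMatcher": "DictionarySource",
    "RegexMatcher": "RegexSource",
    "FuzzyTokenMatcher": "FuzzyTokenSource",
}


def to_source_operator(matcher_props, data_source):
    """Copy a matcher's properties and turn the copy into its source operator."""
    matcher_type = matcher_props["operator_type"]
    if matcher_type not in SOURCE_TYPES:
        raise ValueError(matcher_type + " has no corresponding source operator")
    source = dict(matcher_props)  # shallow copy; the input is left unchanged
    source["operator_type"] = SOURCE_TYPES[matcher_type]
    source["data_source"] = data_source  # assumed key name
    return source


matcher = {"operator_type": "RegexMatcher", "regex": "a_regex"}
print(to_source_operator(matcher, "data_source_name"))
```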

Sink Operators:
{
	"operator_type" : "FileSink",
	"file_path" : "file_path"
}

{
	"operator_type" : "IndexSink",
	"index_path" : "index_path",
	"index_name" : "name_of_index"
}

{
	"operator_type" : "TupleStreamSink"
}


An operator graph (query plan) is represented in the following JSON format:
{
        "operators" : [
        {
                "operator_id" : "operator_1_id",
                "operator properties as mentioned above" : "some properties"
        },
        {
                "operator_id" : "operator_2_id",
                "operator properties as mentioned above" : "some properties"
        }
        ],
        "links" : [
        {
                "from" : "operator_1_id",
                "to" : "operator_2_id"
        }
        ]
}
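To make the graph format concrete, the Python sketch below assembles a two-operator plan and checks that every link refers to a declared "operator_id" before serializing it. The helper build_query_plan and its checks are assumptions made for illustration; this specification does not state which validations Texera-Web actually performs.

```python
import json


def build_query_plan(operators, links):
    """Assemble the operator graph and verify that links use known operator ids."""
    ids = [op["operator_id"] for op in operators]
    if len(ids) != len(set(ids)):
        raise ValueError("operator_id values must be unique")
    for link in links:
        for end in ("from", "to"):
            if link[end] not in ids:
                raise ValueError("link refers to unknown operator: " + link[end])
    return {"operators": operators, "links": links}


plan = build_query_plan(
    operators=[
        {
            "operator_id": "operator_1_id",
            "operator_type": "KeywordSource",
            "data_source": "data_source_name",
            "keyword": "a_keyword",
            "matching_type": "phrase",
            "attributes": "attr1_name, attr2_name",
        },
        # The sink lists no "attributes", mirroring the sink samples above.
        {"operator_id": "operator_2_id", "operator_type": "TupleStreamSink"},
    ],
    links=[{"from": "operator_1_id", "to": "operator_2_id"}],
)
print(json.dumps(plan, indent=2))
```

Each operator entry is a flat object: "operator_id" sits next to the operator's own properties rather than wrapping them in a nested object.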