myDistiller's extraction language - KurtEnglmeier/myDistiller GitHub Wiki

myDistiller's extraction language

For information identification, myDistiller follows the principle of pattern recognition in texts. The definition of patterns starts with basic patterns expressed as regular expressions and aggregates gradually into larger compounds of patterns. For aggregation purposes, myDistiller provides its own syntax to shield users from the definition of complex Regular Expressions. The identified information is marked by XML-tags. The tags are labelled with names as defined by the user. When information identification is completed, myDistiller extracts the tags and transforms them into an XML presentation.

There are five processes that constitute data extraction in myDistiller:

Process 1: Conversion of numbers and dates

This process transforms numbers and dates expressed in words like “five hundred seventy six” or “February seven, twenty hundred fourteen” into the corresponding numeric representations: “576” and “2014-2-7”.

In this process, myDistiller uses the following external data:
Numeric expressions in words: numbers.words or numerales.palabras
Word representations of months: months.words or meses.palabras
Word representations of days: weekdays.words or dias.palabras
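
The number conversion described above can be sketched roughly as follows. This is a minimal illustration, not myDistiller's implementation: the function name `words_to_number` and the hard-coded English vocabulary are assumptions made for this example, whereas myDistiller reads its vocabulary from files such as numbers.words.

```python
# Hypothetical vocabulary for illustration; myDistiller loads its own
# word lists from external files such as numbers.words.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_number(phrase: str) -> int:
    """Convert a number written in English words into an integer."""
    total = 0    # value of completed "thousand" groups
    current = 0  # value of the group currently being built
    for word in phrase.lower().replace("-", " ").split():
        if word in UNITS:
            current += UNITS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
        elif word == "and":
            continue
        else:
            raise ValueError(f"unknown number word: {word!r}")
    return total + current


print(words_to_number("five hundred seventy six"))  # prints 576
print(words_to_number("twenty hundred fourteen"))   # prints 2014
```

Date conversion would build on the same idea by first mapping month names to numbers and then assembling year, month, and day.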

myDistiller extracts data from all text files whose file extensions are listed in extensions.list.

Connector patterns

The first patterns used are the connector patterns. These are regular expressions reflecting basic data types like “date”, “numeric”, etc. These expressions are not applied directly; instead, they are used by the subsequent extraction processes, mainly those addressing named entities.
Examples:
numeric,numerics=\d+([-.\,]\d+)*
decimal,decimales=(\d{1,3}[.,]?)+-?;;\d+[,.]\d+
WORD,WORDS=\b\p{Lu}+([-\p{Lu}']+)?\b
Word,Words=\b\p{Lu}[\p{Ll}']*(-?\p{Lu}[\p{Ll}']*)?\b
word,words=\b\p{L}*(-?\p{L}*)?\b
year,years=[1-2]\d{3}(-\d{1,2})?

There is one special syntax introduced here by myDistiller: “;;” separates alternative patterns that shall be applied for the same pattern definition.

All patterns get names in singular and plural. This supports the natural use of terms and concepts when word patterns need to be described.
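
To make the definition format concrete, here is a small sketch of how a line with singular and plural names and “;;”-separated alternatives could be read. The function `parse_definition` is hypothetical and only illustrates the syntax; it says nothing about how myDistiller parses its pattern files internally.

```python
import re


def parse_definition(line: str):
    """Split 'singular,plural=alt1;;alt2' into names and compiled alternatives."""
    # The first '=' separates the names from the pattern body.
    names_part, _, body = line.partition("=")
    names = [name.strip() for name in names_part.split(",")]
    # ';;' separates alternative regular expressions for the same pattern.
    alternatives = [re.compile(alt) for alt in body.split(";;")]
    return names, alternatives


names, alternatives = parse_definition(
    r"decimal,decimales=(\d{1,3}[.,]?)+-?;;\d+[,.]\d+"
)
print(names)              # prints ['decimal', 'decimales']
print(len(alternatives))  # prints 2
```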

Process 2: Basic named entities

Basic named entities are the first elements to be identified and tagged. They do not use connector patterns. However, much like connector patterns they are also expressed using regular expressions.
Example:
date,dates=\d{4}-\d{1,2}-\d{1,2};;\d{1,2}[/.]\d{1,2}[./]\d{4};;\d{4}-\d{1,2}

This process tags basic named entities in texts. You will find these elements in XML tags.
Example: <date>2011-7-29</date>
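
The tagging step can be illustrated with the date definition above. This is a sketch under assumptions: the helper `tag_basic_entity` is invented for this example, and the output format (an XML element named after the pattern) is inferred from the description rather than taken from myDistiller's code.

```python
import re

# The three alternatives of the "date" definition, split at ';;'.
DATE_ALTERNATIVES = [
    r"\d{4}-\d{1,2}-\d{1,2}",
    r"\d{1,2}[/.]\d{1,2}[./]\d{4}",
    r"\d{4}-\d{1,2}",
]


def tag_basic_entity(text: str, name: str, alternatives: list) -> str:
    """Wrap every match of any alternative in <name>...</name>."""
    # Alternatives are tried in the order given, so the most specific
    # form (year-month-day) wins over the shorter year-month form.
    combined = re.compile("|".join(f"(?:{alt})" for alt in alternatives))
    return combined.sub(lambda m: f"<{name}>{m.group(0)}</{name}>", text)


print(tag_basic_entity("Signed on 2011-7-29 in Madrid.", "date", DATE_ALTERNATIVES))
# prints: Signed on <date>2011-7-29</date> in Madrid.
```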

Process 3: Named entities

Named entities represent the first more complex patterns. They are made of regular expressions, fixed expressions, and connector patterns.
Example:
time from="from".year:year
time to="to".year:year
time from="after".words.year:year
lifetime,lifetimes="(".date:birthdate.date:death date.")"
title,titles=phrase:german."(".phrase:english.")"

myDistiller provides its own syntax for the definition of these patterns. Again, the patterns at the bottom layer are the “connector patterns”. They link data concepts with their instances in the texts. The syntax for the definition of word patterns is quite simple. It supports a handful of operators that users employ to define a pattern as a sequence of elements. The following abstract definition of a pattern summarizes myDistiller’s syntax. The table below explains the operators in more detail.

concept,concepts = element1.element2;element3. (element4,element5,element6):name.?element7

On the left side of the equation the user defines the name of the pattern. The right side of the equation lists the sequence of elements the discovery service locates in data in order to indicate (and extract) the information items. The pattern can be assigned to a singular term and optionally also to the corresponding plural term, because people intuitively apply both forms when describing their patterns.

Operator Function
. The dot indicates a strong sequence (“followed by”). The information item indicated before the dot must be located before the item indicated after the dot.
, The comma indicates a weak sequence. All elements indicated must be located in the data, but they may appear in any order (inclusive combination).
; The semicolon indicates an exclusive combination. Just one of the elements needs to be located.
: Labeling operator: the name after the colon is assigned to an element or a group of elements. Labeling serves the implicit introduction of (local) nested patterns.
(...) Parentheses indicate a group of elements. Grouping only makes sense together with the labeling operator.
? The question mark indicates that an element is optional, that is, the corresponding item may, but need not, be located in the data sources.
"constant" Expressions between double quotation marks are treated as fixed terms or key words. An expression can be a single word, a sequence of words, or a phrase. Any expression in quotes is treated as written: myDistiller translates it into a Regular Expression, but it does not interpret key words as Regular Expressions. That is, myDistiller escapes all regex characters between quotes. Please also be aware that myDistiller is case-insensitive when it comes to constants.
'regex' Expressions between single quotation marks are treated as Regular Expressions. Please be careful when using groups: mark them as non-capturing groups (using ?:)! myDistiller internally uses capturing groups and named capturing groups, so your groups may interfere with myDistiller's groups and produce confusing results. For myDistiller, dots also match line breaks.
# Comment: any statement after the hash sign is not interpreted as an instruction and is thus treated as a comment.
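
As a rough illustration of how such a definition might translate into a Regular Expression, the sketch below handles only three operators: the dot, double-quoted constants, and the labeling colon. Everything here is an assumption made for the example: `compile_sequence` and `CONNECTORS` are hypothetical names, whitespace is used as the only allowed gap between elements, and constants containing dots (such as "Inc.") or labels containing spaces are not supported.

```python
import re

# Connector pattern from the examples above, with its group made
# non-capturing so it does not interfere with the named groups below.
CONNECTORS = {"year": r"[1-2]\d{3}(?:-\d{1,2})?"}


def compile_sequence(definition: str) -> re.Pattern:
    """Translate a simplified strong-sequence definition into a regex."""
    parts = []
    for element in definition.split("."):
        element, _, label = element.partition(":")
        if element.startswith('"') and element.endswith('"'):
            # Constants are taken literally, so regex characters are escaped.
            regex = re.escape(element[1:-1])
        else:
            # Anything else is looked up as a connector pattern.
            regex = CONNECTORS[element]
        if label:
            # The label becomes a named capturing group.
            regex = f"(?P<{label}>{regex})"
        parts.append(regex)
    # Assumption: elements are separated by whitespace only.
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


pattern = compile_sequence('"from".year:year')
match = pattern.search("He lived there From 1912 onwards")
print(match.group("year"))  # prints 1912
```

The IGNORECASE flag mirrors the statement that constants are matched case-insensitively.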

The result of this step is a set of further elements tagged in the text.
Example:
don BELARMINO EMILIO RUZ JEREZ

Process 4: Location-based patterns (optional)

There are expressions that can best be identified based on their location in the text. This means they appear before, after, or between other significant expressions and follow typical patterns. For example, the adjective that follows the name of a person and precedes her civil status is assumed to express the person’s nationality. The brackets “>” and “<” indicate whether the term appears after or before the expression. The “<”- or “>”-expressions may appear anywhere in the statement.
Example:
profession,professions=birthday<.words.>address
The words in between the expressions for birthday and address stand then for the profession of the person.
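
The idea of locating text between two already tagged elements can be sketched as follows. The helper `between` and the sample text are invented for this illustration, and the tag format is assumed to be plain XML elements named after the patterns.

```python
import re


def between(text: str, before: str, after: str) -> list:
    """Return the text found between a closing 'before' tag and an opening 'after' tag."""
    pattern = re.compile(
        rf"</{re.escape(before)}>(.*?)<{re.escape(after)}>", re.DOTALL
    )
    return [m.group(1).strip() for m in pattern.finditer(text)]


tagged = "<birthday>1970-3-12</birthday> carpenter <address>Calle Mayor 5</address>"
print(between(tagged, "birthday", "address"))  # prints ['carpenter']
```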

Process 5: Word patterns

This process is the flagship among myDistiller’s information extraction processes. It handles the most complex word patterns. The syntax is the same as in the named entities process.
Examples:
birthplace="born".Words
author,authors=Words:name.lifetime."german"

This process results in more complex information patterns identified in and extracted from texts.
Example:
Franz Kafka(<birthdate>1883-7-3</birthdate> – <death date>1924-6-3</death date>) was a German

Different documents, different patterns

Many named entities, all basic named entities, and all connector patterns can be applied to all kinds of documents. However, the more complex a word pattern is, the more likely it applies to just one kind of document. In general, the patterns listed as named entities or word patterns are applied to all kinds of documents. These are generic patterns. If they appear below a specific header indicating a certain type of document, they are applied exclusively to documents of this type (specific patterns). All patterns up to the first >>>header are considered generic.
Example:
>>>authors
birthplace="born".Words
author,authors=Words:name.lifetime."german"
move,moves=date."move".words:destination
death,deaths=words."dies";"died".?words,date
play,plays="production";"premiere";"drama";"works".titles

The identifiers following the three brackets (“>>>”) represent unique variable names for the document collection “authors”. All patterns listed before the first set of brackets are considered as generic patterns, i.e. they will be applied to all document collections.

myDistiller classifies documents according to their most significant header, as defined by the user.
Example:
authors=WIKIPEDIA
The classifiers are listed in the file identifiers.config. In the absence of such a unique identifier myDistiller assigns the classifier "ANY" and applies all generic patterns to these documents.
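
The split into generic and specific patterns can be pictured with the following sketch. The functions `split_sections` and `patterns_for` are hypothetical and only model the rule described above: definitions before the first >>>header are generic, and a document without a known classifier ("ANY") receives only the generic ones.

```python
def split_sections(lines):
    """Group pattern definitions by '>>>' header.

    Definitions before the first header are returned as generic patterns.
    """
    generic, specific = [], {}
    target = generic
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">>>"):
            target = specific.setdefault(line[3:].strip(), [])
        else:
            target.append(line)
    return generic, specific


def patterns_for(doc_type, generic, specific):
    """Generic patterns always apply; specific ones only to their type."""
    return generic + specific.get(doc_type, [])


PATTERN_FILE = [
    'time from="from".year:year',
    ">>>authors",
    'birthplace="born".Words',
    'author,authors=Words:name.lifetime."german"',
]
generic, specific = split_sections(PATTERN_FILE)
print(len(patterns_for("authors", generic, specific)))  # prints 3
print(len(patterns_for("ANY", generic, specific)))      # prints 1
```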

Regular Expressions

Connector patterns and basic named entities are the place where you apply Regular Expressions. They host the most primitive elements for your information extraction. Typical examples are "date", "email address", "social security number", or "numeric data". Remember that only basic named entities are tagged during extraction. Connector patterns are treated as variables that shield users from Regular Expressions. For all other patterns (named entities, word patterns, location-based patterns), Regular Expressions are not necessary, but you can use them there as well. Your Regular Expression must appear between single quotation marks! For instance, in those patterns you can express the grammatical variants of a verb in two ways. You can list the variants explicitly as constants: "live";"lives";"lived". Alternatively, you can use a Regular Expression as an abbreviated version: 'live[sd]?'
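
The equivalence of the two notations, and the effect of escaping constants, can be checked with a few lines of Python. This is only an illustration using Python's `re` module; the helper `constants_to_regex` is a made-up name, and myDistiller's own translation may differ in detail.

```python
import re


def constants_to_regex(constants):
    """Build one case-insensitive regex from a list of literal constants."""
    # Each constant is escaped, so characters like '.' are taken literally.
    return re.compile("|".join(re.escape(c) for c in constants), re.IGNORECASE)


from_constants = constants_to_regex(["live", "lives", "lived"])
from_regex = re.compile(r"live[sd]?")

for word in ["live", "lives", "lived"]:
    # Both notations accept exactly the same three verb forms.
    print(word, bool(from_constants.fullmatch(word)), bool(from_regex.fullmatch(word)))

# An escaped constant does not behave like a regex: the dot matches only a dot.
company = constants_to_regex(["Inc."])
print(bool(company.fullmatch("Inc.")), bool(company.fullmatch("Incx")))  # prints True False
```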

Special characters in text: 'new line (NL)' and 'carriage return (CR)'

Probably you don't see them, but they are there: special characters like 'new line'. Systems use them to separate lines or to format text. Sometimes you may ignore them and sometimes (if you have to analyse text in forms, for instance) you may consider them useful. By default, myDistiller ignores these characters. However, if you add a line in the pattern files you can indicate that they should be treated as white space.

The important things in between: what may appear between pattern elements

If you define a pattern like myExpression="foo".word (using the definition of word in the example above), then myDistiller locates "foo" in your text, adds the subsequent word, and tags the result as myExpression: foo whatever. If you put word in the plural and write myExpression="foo".words, the result looks different: foo whatever is written after foo. "words" consumes as many words as it can get after "foo". Sometimes "words" consumes too much. Two things make "words" stop consuming. The first is the "white space" between words. By default, it is the blank space and nothing else, not even a new line. If there is a line break between "is" and "written", "words" stops after "is". The second way to avoid excessive consumption is to put a limiting expression after words: myExpression="foo".words."foo". The resulting tag is the same, but words stops consuming at the word "after".

If you want to allow different characters as white space, between words for example, then change or add the following line in your configuration file Distiller4.config: LIMITdefault-space-plurals=[ ]
By default, myDistiller allows just blanks appearing between pattern elements of the same type, that is, types set in plural (like words). It allows up to four blanks between them.
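
The effect of the white-space setting on a plural element can be sketched as follows. The names `WORD` and `consume_words` are assumptions for this example, `WORD` is only a rough stand-in for the connector pattern "word", and the limit of four separators is modelled with the quantifier {1,4}.

```python
import re

# Rough stand-in for the connector pattern "word": letters, optionally hyphenated.
WORD = r"[^\W\d_]+(?:-[^\W\d_]+)?"


def consume_words(text, keyword, space_class="[ ]", max_spaces=4):
    """Return the run of words following 'keyword', or None if there is none.

    Words are joined only by characters from 'space_class'.
    """
    sep = f"{space_class}{{1,{max_spaces}}}"
    pattern = re.compile(
        f"{re.escape(keyword)}{sep}({WORD}(?:{sep}{WORD})*)", re.IGNORECASE
    )
    match = pattern.search(text)
    return match.group(1) if match else None


text = "foo whatever is\nwritten after foo"

# Blanks only: consumption stops at the line break after "is".
print(repr(consume_words(text, "foo")))
# prints 'whatever is'

# A wider character class lets consumption continue across the line break.
print(repr(consume_words(text, "foo", space_class=r"[ \t\n\r]")))
# prints 'whatever is\nwritten after foo'
```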

If you want to handle white space in a more context-sensitive way, you should add your own definitions. Say you want to include line breaks and tabs. In the configuration file, add a line like: LIMITincludeLineBreaksAndTabs=[ \t\n\r]
Next you add the following instruction in, say, your definitions for named entities:
limit-plurals=includeLineBreaksAndTabs
name,names=Words."Inc.";"company";"Gmbh"
...
You may define it as a general instruction for all named entities, but you can also limit it to one of your content-specific sections:
>>>authors
limit-plurals=includeLineBreaksAndTabs
lifetime=date:birthday.date:death day
...

Probably the more interesting question is: What's happening between pattern elements of different types like "Address", name, street, city, country, "birth day", date etc.? In general, myDistiller ignores everything between tagged elements and key expressions, no matter how far they are apart. Between Address and John Smith myDistiller ignores everything unless it's a tag or a constant like Birth day.
