myDistiller Tutorial - KurtEnglmeier/myDistiller GitHub Wiki
myDistiller helps you extract the information you need from text or any other source of unstructured data. Information usually follows a particular pattern; its meaning emerges from unique patterns in the data. myDistiller helps you describe these patterns. However, myDistiller does not stop there. Usually you are not looking for just one particular thing. Instead you need a bundle of data that covers the broad scope of the information you require. For instance, if you are looking for a new phone, you have a couple of qualities in mind that the phone should have, such as the resolution of the camera (in megapixels), the resolution of the screen, the operating system it is equipped with, and the size of the internal storage. If you have to do that search only once, you probably don’t see any benefit in having a system do it for you automatically. However, things look different if you have to do such a search on a regular basis or if you have to check numerous documents containing this information.
myDistiller helps you organize the automatic extraction of data relevant to your individual search. If you follow this tutorial, you get a first impression of myDistiller and a feeling for the potential of its pattern description language.
We composed a couple of text snippets from different contextual backgrounds, namely a report, product descriptions, and a biography. You find the example texts in the file “tutorial test text.txt” under the download section (or at the end of this page).
Let us prepare the automatic extraction step by step.
These patterns are your most basic stock of patterns. They refer to numeric items, decimals, words (upper, mixed, or lower case), and phrases. If you are not familiar with Regular Expressions, leave them as they are. Please remember that all other patterns build on these patterns. They are never shown explicitly as annotated data, nor do they appear in the XML file!
numeric,numerics=\d+([-.,]\d+)*
decimal,decimals=(\d{1,3}[.,]?)+-?;;\d+[,.]\d+
WORD,WORDS=\b\p{Lu}+([-\p{Lu}']+)?\b
Word,Words="?\p{Lu}[\p{L}']*(-?\p{Lu}[\p{L}']*)?"?
word,words=\b[\p{L}\p{N}][\p{L}\p{N}']*(-[\p{L}\p{N}][\p{L}\p{N}'-]*)?\b
phrase,phrases=\b["'\p{L}\p{N}]*([, /-"'\p{L}\p{N}]*)?\b
date,dates=(\d{4}|\d{2})[.-/]\d{1,2}[.-/]\d{1,2};;\d{1,2}[.-/]\d{1,2}[.-/](\d{4}|\d{2})
month,months=\d{1,2}-\d{4};;\d{4}-\d{1,2}
# Pattern for British zip codes:
zipcode,zipcodes=\b[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}\b
Please remember that all expressions representing numbers such as “five”, “four point six” or “July” are converted into their corresponding number representations: “5”, “4.6”, “7”. Dates are formatted into Day-Month-Year (dd-MM-yyyy).
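To get a feeling for what the ASCII-only expressions listed above accept, you can try them outside myDistiller. The following is a minimal sketch in Python and not part of myDistiller. It copies the `numeric` and `zipcode` expressions verbatim and checks them against a few hand-picked strings. The helper name `matches_fully` and the sample strings are assumptions made for this illustration. The expressions that use `\p{Lu}` and similar classes are left out because Python's built-in `re` module does not support that notation.

```python
import re

# Expressions copied verbatim from the pattern list above.
NUMERIC = re.compile(r"\d+([-.,]\d+)*")
ZIPCODE = re.compile(r"\b[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}\b")


def matches_fully(pattern, text):
    """Return True if the whole string is one instance of the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    # Hand-picked sample strings (assumptions, not taken from the test text).
    for sample in ["30", "136,6", "4.2.2", "x70"]:
        print(sample, matches_fully(NUMERIC, sample))
    for sample in ["SW1A 1AA", "M1 1AE", "12345"]:
        print(sample, matches_fully(ZIPCODE, sample))
```

Both `136,6` and `4.2.2` count as `numeric` here, while `x70` and the five-digit `12345` are rejected by their respective expressions.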
You may ask yourself: again Regular Expressions? What is the difference between “basic named entities” and “connector patterns”? myDistiller explicitly tags items that correspond to basic named entities. Connector patterns are not tagged, but you can work with them in the subsequent patterns. Again, if you are not in the mood to handle Regular Expressions just skip this step.
Now it’s your turn! Have a look at the following patterns:
lifetime=date:birthday.date:death day
person,persons=Words:name."(".Words:nickname.")".lifetime
person,persons=Words:name.lifetime
os="operating system";"betriebssystem".Words.decimal
work,works=Words:german."(".Words:english.")"
If you apply them to the test text you see that they work fine for the final section, the biography section:
<person><name>Franz Kafka</name> (<lifetime><birthday><date>3-7-1883</date></birthday> – <death day><date>3-6-1924</date></death day></lifetime></person>) was a German-language writer of novels and short stories, regarded by critics as one of the most influential authors of the 20th century. Kafka strongly influenced genres such as existentialism. Most of his works, such as <work><german>"Die Verwandlung"</german> (<english>"The Metamorphosis"</english>)</work>, ...
However, you can also see that the German section of the smartphone descriptions contains an item that matches your pattern for work. Quite likely, this is not what you expected. Therefore, we recommend splitting texts whose patterns differ in meaning. We generate three text files:
- economy.txt, where we put the economic report,
- smartphones.txt, which gets the phone descriptions, and
- authors.txt, which will host the biography.
This way you can better concentrate on the specifics of each data set!
We have three text files now. Start organizing your automatic extraction by thinking about significant titles. In the example, we can consider “Biography”, “descriptions”, and “report” as significant titles. You can also take “smartphone descriptions” if you like.
Take the file “identifiers.config”. Here we say that we have three different types of documents, identified by their significant headers that appear somewhere at the beginning of the texts.
We call the first document type “authors” and assign the significant title “bio” (or “biography”).
When completed, the content of the text file “identifiers.config” looks like this:
authors=bio
economy=report
smartphones=descriptions
Everything to the right of the equals sign is treated as a Regular Expression, in case you need more sophisticated identifiers.
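The lookup that “identifiers.config” describes can be pictured as follows. This is only a sketch in Python under stated assumptions, not myDistiller's actual implementation: the function name `identify_document_type`, the 200-character window standing in for “somewhere at the beginning”, the case-insensitive search, and the sample texts are all hypothetical.

```python
import re

# Document type -> identifier, mirroring the three lines of identifiers.config.
IDENTIFIERS = {
    "authors": r"bio",
    "economy": r"report",
    "smartphones": r"descriptions",
}


def identify_document_type(text, window=200):
    """Return the first document type whose identifier occurs near the start.

    Assumption: only the first `window` characters are inspected and the
    search ignores case; the real tool may behave differently.
    """
    head = text[:window]
    for doc_type, identifier in IDENTIFIERS.items():
        if re.search(identifier, head, flags=re.IGNORECASE):
            return doc_type
    return None  # no identifier found in the inspected window


if __name__ == "__main__":
    print(identify_document_type("Biography\nFranz Kafka (3 July 1883 ...)"))
    print(identify_document_type("Smartphone descriptions\nDisplay: ..."))
```

Because the right-hand side is a Regular Expression, a short identifier such as `bio` also matches the longer title “Biography”.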
Please return to your named entities now and add the following line before the first pattern:
>>>authors
If you run myDistiller again, you see that Kafka’s biography is correctly tagged whereas the other texts remain untouched. However, if you move the “work” pattern above the “>>>authors” line, the “work” tag reappears in smartphones.txt. Why is that? myDistiller treats all patterns above the first section-specific header (the line with the leading “>>>”) as generic patterns, i.e. as patterns that are applicable to all texts irrespective of their title, type, etc. So we had better move the “work” pattern back to the “authors” section.
Now we turn to the smartphone descriptions. There are some generic elements like dimension, data transfer speed, etc. However, each one also has its variants. If we want to mark the dimensions indicated in the descriptions, we should use the following two instructions:
>>>smartphones
dimension,dimensions=decimal.?word."x".decimal.?word."x".decimal.?word
dimension,dimensions=decimal."by";"x".decimal
With these patterns we correctly identify all variants of dimension, like: <dimension>136,6 x 70,6 x 8,6 mm</dimension>
We put the two patterns below the heading “>>>smartphones” because we don’t want to apply them to the other documents.
If you take a closer look at the two definitions, you recognize that the second one would also match the beginning of dimensions that fit the first pattern! Applied first, it would leave the remainder of such items unidentified. To avoid that, we put the longer pattern before the shorter one. Whenever an item is identified, its inner elements can no longer be matched by subsequent patterns!
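The effect of pattern order can be reproduced with a small sketch in Python. It is not myDistiller's matching algorithm; it only imitates the rule stated above that patterns are tried in the order given and that text already claimed by one pattern is closed to later ones. The regular expressions are simplified stand-ins for the two `dimension` instructions, and the names `annotate`, `LONG`, and `SHORT` are made up for this illustration.

```python
import re

# Simplified stand-ins (assumptions) for the two dimension instructions.
NUM = r"\d+(?:[.,]\d+)?"
LONG = re.compile(rf"{NUM}\s*x\s*{NUM}\s*x\s*{NUM}(?:\s*mm)?")
SHORT = re.compile(rf"{NUM}\s*(?:by|x)\s*{NUM}")


def annotate(text, patterns):
    """Try the patterns in order; skip matches that overlap claimed text."""
    claimed = []  # (start, end) spans that are already annotated
    for pattern in patterns:
        for m in pattern.finditer(text):
            if all(m.end() <= s or m.start() >= e for s, e in claimed):
                claimed.append(m.span())
    return sorted(text[s:e] for s, e in claimed)


if __name__ == "__main__":
    text = "Size: 136,6 x 70,6 x 8,6 mm"
    print(annotate(text, [LONG, SHORT]))  # longer pattern first
    print(annotate(text, [SHORT, LONG]))  # shorter pattern first
```

With the longer pattern first, the whole triple is annotated. With the order reversed, only `136,6 x 70,6` is claimed, the longer pattern can no longer match, and the third value stays outside the annotation.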
We (or our colleagues from IT) defined decimals as figures where the combination of decimal point and further figures may occur many times. This means 136,6 is a decimal much like 4.2.2, a pattern used in version numbers.
With this in mind we can locate all versions mentioned in the description with a simple pattern:
version,versions="version";"v".decimal:version number
The paragraph from the economic article contains some patterns typical of this class of articles. Such articles talk about the decrease, increase, or duration of economic phenomena. These patterns mostly contain key expressions with numbers, like “shrank by 30%”.
We first locate all instances of numbers representing percentages. We achieve this by a simple pattern: percent=numeric."%";"percent"
Next we define patterns for the phenomenon “decrease” which consists of the respective key words and the sub-pattern “percent” we defined before.
Now we can associate these number-related patterns with key expressions to further specify economic phenomena. For instance, we qualify the recent development in unemployment. We define the corresponding pattern in the word.pattern file: unemployment="unemployment".decrease;no change
Word patterns are only a bit different from named entities. They are a bit more “tolerant” concerning the whitespace between elements: they treat everything that appears between elements as whitespace. They consider an item an instance of a pattern even if its constituent elements appear at opposite ends of a text. For our economic text we defined a couple of patterns:
>>>economy
duration,durations=numeric."years"
percent=numeric."%";"percent"
decrease,decreases="shrink","decrease","decreased","shrank".?words.percent
no change="remains";"remained".?words.percent,wort
With these instructions myDistiller produces the following annotation file (we just show a snippet of it):
... economic situation in Greece is devastating; GDP <decrease>shrank by almost <percent>30 percent</percent></decrease> in the past <duration>6 years</duration>, the <unemployment>unemployment rate <no change>remains above <percent>25%</percent></no change></unemployment>; and youth unemployment can be called only dramatic.
However, <duration>6 years</duration> of recession have made clear that enforcing ...
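How the `percent`, `decrease`, and `duration` instructions build on each other can be approximated with plain regular expressions. The sketch below is written in Python and is an illustration under assumptions rather than myDistiller's engine: the optional `.?words` element is approximated by up to three arbitrary words, matching ignores case, and all variable names are hypothetical.

```python
import re

NUMERIC = r"\d+(?:[-.,]\d+)*"
# percent=numeric."%";"percent"
PERCENT = rf"{NUMERIC}\s*(?:%|percent)"
# decrease=<key word>.?words.percent ; ".?words" approximated by 0-3 words
DECREASE = rf"(?:shrink|decreased|decrease|shrank)\s+(?:\w+\s+){{0,3}}?{PERCENT}"
# duration=numeric."years"
DURATION = rf"{NUMERIC}\s*years"

if __name__ == "__main__":
    text = "GDP shrank by almost 30 percent in the past 6 years"
    for name, pattern in [("decrease", DECREASE), ("percent", PERCENT),
                          ("duration", DURATION)]:
        # Each pattern is searched independently; nesting is not modeled.
        found = re.findall(pattern, text, flags=re.IGNORECASE)
        print(name, found)
```

The `decrease` expression reuses the `percent` expression as its last element, which corresponds to the nesting of `<percent>` inside `<decrease>` in the annotation above.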
The patterns mark durations and all instances of phenomena that rise, fall, or stay on the same level. If we want to link these changes with phenomena, we define a word pattern that may take the form unemployment="unemployment".decrease;no change
Then we get the following XML file:
The patterns we are interested in represent chunks of significant data with “whitespace” in between. If we talk about words and phrases, “whitespace” between words can be blanks, tabs, commas, or the (usually) invisible symbol for “new line” or “carriage return”. This “something-in-between” may range from a single symbol to a (theoretically) unlimited sequence of characters.
Concerning named entities, the rationale is that only a couple of insignificant symbols appear between the building blocks of the corresponding patterns. If a named entity consists of a key expression and a basic pattern, we assume that we have only a couple of insignificant characters between the two parts, for instance, up to four blanks, tabs, and the “new line” sign. By default, myDistiller considers space, tab, new line, carriage return, and form feed as “whitespace characters”. This is mostly reasonable if we consider the space between elements of the same type. We usually express these elements in the plural.
Consider the line “Franz Kafka (3 July 1883 – 3 June 1924)”. The space between the words representing the name is just blanks. So if we define name=Words myDistiller locates “Franz Kafka”.
Things may be different if we approach the dates indicating Kafka’s lifetime. There are dashes and parentheses. One may say, these characters are also whitespace. Between elements of the same type, the so-called plurals, these characters are quite rare. However, between elements of different types it’s quite common to see more instances of whitespace. In our example, we treat parentheses as whitespace. In fact, myDistiller treats anything between elements as whitespace! This means it applies the Regular Expression (“.”) as long as it doesn’t encounter the next element.
Using these defaults for whitespace between elements of different types and elements in plural, i.e. elements of the same type, myDistiller correctly identifies Kafka’s lifetime by the instruction lifetime=.date:birthday.date:death day
We get the following annotation: Franz Kafka (<lifetime><birthday><date>3-7-1883</date></birthday> – <death day><date>3-6-1924</date></death day></lifetime>)
If we now want to include Kafka’s name, we expand our instruction to: person,persons=Words:name.lifetime
Note that we use :name to mark the Words as names. This instruction results in <person><name>Biography Franz Kafka</name> (<lifetime><birthday><date>3-7-1883</date></birthday> – <death day><date>3-6-1924</date></death day></lifetime></person>).
Oops, here we got a bit too much! This happens because the default whitespace between the elements of “Words” includes the new line character that is located between “Biography” and “Franz”. In many cases, as in forms, the new line character separates elements of different types from each other. Thus, we cannot treat “new line” as whitespace here.
We need to instruct myDistiller to apply a different set of characters for whitespace. This can be achieved in two steps:
In myDistiller’s configuration file Distiller4.config we define our whitespace by adding the following line:
LIMITnoNL=[ \t\f,–]
Yes, the right-hand side is a Regular Expression. It includes space, tab, form feed, dash, and even comma. The name has to start with the word “LIMIT”; the rest is optional, and it’s up to you what name you choose. Our whitespace is called “noNL”, for “without new line”.
In the file containing your named entities you add the line limit-plurals=noNL. You can do this above any section heading (indicated by “>>>”). Then your whitespace policy is applied to all named entities. If you put it into the “authors” section, then the application of this policy is restricted to just this section:
>>>authors
limit-plurals=noNL
lifetime=.date:birthday.date:death day
person,persons=Words:name.lifetime
You can even put the whitespace definition just before the “person” instruction. Then your policy is valid from here on.
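Why the choice of whitespace matters for plurals such as `Words` can be seen in a short Python sketch. It is a simplified model, not myDistiller's code: a run of capitalized words is joined by a configurable separator class, once with `\s` (which includes new line) and once with the `noNL` class defined above. The function name `find_words_runs` and the ASCII-only word expression are assumptions made for the illustration.

```python
import re

WORD = r"[A-Z][a-z]+"  # simplified, ASCII-only stand-in for "Word"


def find_words_runs(text, separator):
    """Find runs of capitalized words joined by the given separator class."""
    pattern = rf"{WORD}(?:{separator}+{WORD})*"
    return re.findall(pattern, text)


if __name__ == "__main__":
    text = "Biography\nFranz Kafka (3 July 1883 – 3 June 1924)"
    print(find_words_runs(text, r"\s"))         # default: new line joins words
    print(find_words_runs(text, r"[ \t\f,–]"))  # noNL: new line separates
```

With `\s` the first run is “Biography”, new line, “Franz Kafka” in one piece. With the `noNL` class, “Biography” and “Franz Kafka” are found as two separate runs.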
You may ask now, “Is there a way back to the default whitespace?” Yes, there is! It requires two additional lines in the configuration file, which we strongly recommend:
LIMITdefault-whitespace-plurals=\s
With these two instructions you can set your whitespace policy back to default by a further instruction, after the “person” pattern, for instance:
>>>authors
lifetime=.date:birthday.date:death day
limit-plurals=noNL
person,persons=Words:name.lifetime
limit-plurals=default-whitespace-plurals
If you think your whitespace policies should be valid in general, for all your patterns, then add the following lines in the configuration file:
LIMITdefault-space-plurals=\s
myDistiller takes the policy marked by “LIMITdefault-space-plurals” as the default whitespace policy for all data sets.
In particular, when you start with myDistiller we recommend that you set your default whitespace policies in both cases to space, tab, new line, carriage return, form feed. You may use the Regex “\s” or the more explicit (and equivalent) definition “[ \t\n\r\f]”. The latter allows more flexible handling of “new line”.
With a little bit of training and experimenting you’re going to find ways to write descriptive patterns that handle even the most intricate data. Addresses are prominent representatives of such data. In the “Smartphones” example we have a couple of them.
You immediately recognize that you cannot define a pattern like street=Words.numeric. It returns “Xample Company 16” as street and similar rubbish.
Patterns that contain keywords are always a good starting point. Fortunately, you will usually find some helpful keywords. Streets have such keywords: there is always “street”, “St.”, “Ave.”, or something similar around. With the pattern street,streets=numeric.Word."Street" myDistiller correctly detects “16 Xample Street”. Don’t forget to put “Street” in quotes. Otherwise myDistiller thinks you want to apply a Regex!
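The difference between the unanchored and the keyword-anchored street pattern can be imitated with two regular expressions. This is a sketch in Python with simplified stand-ins, not myDistiller's matching: the address line is a hypothetical example, the expressions `NAIVE` and `ANCHORED` are assumptions, and only the keyword is matched without regard to case.

```python
import re

WORD = r"[A-Z][a-z]+"  # simplified, ASCII-only stand-in for "Word"
NUMERIC = r"\d+"

# street=Words.numeric : capitalized words followed by a number
NAIVE = rf"{WORD}(?: {WORD})* {NUMERIC}"
# street=numeric.Word."Street" : anchored on the keyword
ANCHORED = rf"{NUMERIC} {WORD} (?i:street)"

if __name__ == "__main__":
    line = "Xample Company 16 Xample Street"  # hypothetical address line
    print(re.findall(NAIVE, line))
    print(re.findall(ANCHORED, line))
```

The unanchored expression picks up “Xample Company 16”, whereas the keyword pins the match to “16 Xample Street”.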
Of course, you can use Regular Expressions even though this goes beyond the idea of myDistiller. If you cannot refrain from doing so, please keep in mind that myDistiller treats dot, comma, question mark, etc. as part of its own syntax. Check myDistiller’s syntax again to avoid any irritating results.
myDistiller’s matching algorithm is case-insensitive! If you have “street” in your pattern it matches “street” and “Street” alike.
Variables like “Words” are flexible and very powerful, but they also have a drawback: they consume all items that fit their patterns. However, they stop at keywords and at data that are already annotated. To prevent flexible patterns from consuming too much, apply them where much of the surrounding data is already annotated, or combine them with keywords.
We put the British zip code, for example, into the “Basic Named Entities”. It is quite unique, and you can count on it not being confused with data that represent anything other than British zip codes. These patterns are applied before any Named Entities or Word Patterns. US zip codes are slightly less unique, and things look completely different for German zip codes.
Much like streets, company names mostly have unique keywords, too. We use these keywords again as anchors. With company names, streets, and zip codes correctly located we can be more “generous” when addressing city names that obviously do not have any unique keywords in most cases. The US city, for instance, we locate as a combination of words with zip codes followed by further words standing for the country.
Finally we identify data that represent company addresses with the most flexible pattern myDistiller supports: company,companies=name,street,suite,city