Home - KurtEnglmeier/myDistiller GitHub Wiki
In one form or another, information extraction systems developed by IT professionals have always shaped the IT landscape. And for as long as anyone can remember, the gap between business needs and IT’s understanding of these needs has been a source of trouble. We want to encourage you, as an information consumer, to assume more responsibility in information management. myDistiller supports self-service information discovery and presents the findings in a format the corporate information architecture can easily ingest. With a little training, people can instruct myDistiller to detect certain information in texts, even if it is distributed over multiple sources, and to compose this information into snippets tailored to individual needs.
Many discovery requests are individual and ad hoc by nature. Some even serve a one-time purpose, and their findings are disposable artifacts. Adapting data warehouses or information retrieval, mining, or extraction systems to these requests is prohibitively expensive. As a result, these user needs are relegated to mundane manual work.
There are many examples of such requests that are handled better and more efficiently by self-service discovery:
- Checking call center notes to detect the essentials of a problem with a particular product or service.
- Specifications of machines or machine parts in customer complaints.
- The actors of a purchase, or their representatives, in contracts.
- Specific information from product descriptions, like price, package size, weight etc.
- Text sections addressing particular symptoms from patients’ health reports.
- Certain data like date, number, and total amount from invoices (pdf, doc, or text files).
- Extract and combine essential information from balance sheets and related documents.
- Expressions of certain sentiments for and against products, services, or policies, in email, blogs, etc.
myDistiller also discovers information when data are scattered over multiple sources: it identifies the documents that belong together in order to satisfy a particular request. It could take a nurse 20 minutes to collect the data needed for a treatment assessment. Likewise, it may take a secretary half an hour to read contracts and certificates to find the matching documents. In both cases, people have to scan or even page through documents and check them for the data essential to the request at hand. If this task has to be repeated, it may cost the person hours of precious time every day. Instead, she or he can write an instruction and delegate the job to myDistiller, which reliably discovers the requested information.
To illustrate myDistiller’s potential, we delegated a request to myDistiller and let it operate on the (English) Wikipedia collection. We wanted to extract the birth and death dates of German writers and to list their works with the original title and the corresponding English translation of the title. By checking a sample of Wikipedia pages, we familiarized ourselves with how these facts are represented (Figure 1) and defined the corresponding descriptive patterns using myDistiller’s instruction language (Figure 2). Its engine applied our instructions to this collection and returned the extracted information in XML format (Figure 3).
Figure 1. Original text sections containing the requested information on the authors’ birth and death date and on the titles of their works.
Figure 2. Instructions to discover information on birth and death dates of (German) writers and the titles of their works (original title and English translation).
The example also illustrates the rationale for data integration and information sharing behind myDistiller, which supports the collaborative development of a metadata schema. This schema can constitute the semantic skeleton of an information ecosystem at the group or organizational level. In this context, myDistiller supports “active compliance”, that is, the collaborative agreement on a unified overarching metadata schema. This schema reflects organizational reality if its creators are mainly the information consumers, i.e. the domain experts.
Figure 3. Results rendered in XML.
Self-service discovery addresses, among other things, ad hoc requests for discovery that, as in the example above, are not too complex. Consolidating metadata schemas on a broader organizational level is not always the most important integration issue. Integration into the technical environment can matter more: which input formats does the discovery service support, and how are the results presented? For the time being, our service accepts input such as HTML, PDF, or plain text documents. The output is simply rendered in XML, to enable smooth integration into follow-up processes for data analytics, visualization, reporting and the like.
Before you start, we recommend that you familiarize yourself with [myDistiller's extraction language](myDistiller's extraction language). myDistiller's [Tutorial](myDistiller tutorial) may give further insights.
If you want to process myDistiller's XML file further, we recommend its tool called searchClient. It lets you locate, extract, and unify sections from and across XML files. For more information, please check out [myDistiller's search client](myDistiller's search client).
Check out our [Tutorial](myDistiller tutorial) to learn more about myDistiller's information extraction.
Copy `myDistiller.jar` to the location of your choice.
In the same directory, create a text file called `Distiller4.config`. Please make sure that the extension “config” is not altered by your system. Some editors, for instance, add “txt” because they assume that is the required extension.
We recommend that you copy `Distiller4.config` from the example and edit this copy.
`Distiller4.config` has four entries:
- The input directory, where myDistiller finds its input data (the original texts, for instance).
- The output directory, where it stores the extracted data.
- The configuration directory, where it expects its auxiliary data and the patterns it shall operate on.
- The language selection (the international language string). For the time being it handles `en_US` and `es_ES`.
The structure of the entries:
CONFIG_DIR=your configuration directory
INPUT_DIR=your input directory
OUTPUT_DIR=your output directory
LOCALE=your locale
Example of the entries:
CONFIG_DIR=/Users/Literature/
INPUT_DIR=/Users/Literature/IN
OUTPUT_DIR=/Users/Literature/OUT
LOCALE=en_US
Please make sure that you have the correct language code in your configuration file! This is very important because myDistiller converts all numbers expressed in words into numbers. For instance, it turns "five" into "5". Of course, this conversion is language dependent. If you apply the Spanish version of this conversion (triggered by the code entry in your configuration file) to English texts, your results will be disappointing.
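Because a wrong or missing entry only shows up later as disappointing results, it can help to check the file before running an extraction. The following is a minimal sketch of such a check, not part of myDistiller: the class `ConfigCheckSketch` and its method are hypothetical, and it assumes the file can be read as plain `KEY=value` lines with `java.util.Properties`. Note that `Properties` treats backslashes as escape characters, so Windows-style paths would need forward slashes or doubled backslashes in this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper, not shipped with myDistiller.
class ConfigCheckSketch {
    // The four entries named on this page.
    static final List<String> REQUIRED_KEYS =
            List.of("CONFIG_DIR", "INPUT_DIR", "OUTPUT_DIR", "LOCALE");
    // The two locales this page lists as currently handled.
    static final Set<String> SUPPORTED_LOCALES = Set.of("en_US", "es_ES");

    /** Parses KEY=value lines and rejects a missing entry or an unlisted locale. */
    static Properties parse(String configText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.isBlank()) {
                throw new IllegalArgumentException("Missing entry: " + key);
            }
        }
        String locale = props.getProperty("LOCALE").trim();
        if (!SUPPORTED_LOCALES.contains(locale)) {
            throw new IllegalArgumentException("Unsupported locale: " + locale);
        }
        return props;
    }

    public static void main(String[] args) {
        String example = String.join("\n",
                "CONFIG_DIR=/Users/Literature/",
                "INPUT_DIR=/Users/Literature/IN",
                "OUTPUT_DIR=/Users/Literature/OUT",
                "LOCALE=en_US");
        Properties props = parse(example);
        System.out.println("Input directory: " + props.getProperty("INPUT_DIR"));
    }
}
```

In practice you would pass the contents of your own `Distiller4.config` to `parse` instead of the inline example.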
myDistiller identifies the files it has to process by their file extensions. Valid extensions are listed in `extensions.list`.
For example:
txt
original
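To see which files would pass such an extension check, the idea can be sketched in a few lines of Java. This is an illustration under assumptions, not myDistiller's own code: the class `ExtensionFilterSketch` is hypothetical, the list is assumed to hold one extension per line without a leading dot, and matching is assumed to be case-sensitive.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not shipped with myDistiller.
class ExtensionFilterSketch {
    /** Reads one extension per line, ignoring blank lines and surrounding spaces. */
    static Set<String> parseExtensions(String listText) {
        return listText.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toSet());
    }

    /** True if the text after the last dot of the file name is a listed extension. */
    static boolean isAccepted(String fileName, Set<String> extensions) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all
        }
        return extensions.contains(fileName.substring(dot + 1));
    }

    public static void main(String[] args) {
        Set<String> extensions = parseExtensions("txt\noriginal\n");
        for (String name : List.of("goethe.txt", "kafka.original", "notes.pdf")) {
            System.out.println(name + " accepted: " + isAccepted(name, extensions));
        }
    }
}
```

With the two extensions from the example above, `goethe.txt` and `kafka.original` are accepted and `notes.pdf` is skipped.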
myDistiller requires three auxiliary files. They serve to convert months, weekdays, and numbers that are expressed in words into numeric representations.
For English language support these files are
months.words
weekdays.words
number.words
For Spanish language support use:
meses.palabras
dias.palabras
numerales.palabras
These auxiliary data have the form `number in words=number`. myDistiller replaces `number in words` by `number`.
Example:
nineteen=19 in numbers.words
march=3 in months.words
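The replacement described above can be sketched as follows. This is a simplified illustration, not myDistiller's actual conversion: the class `WordNumberSketch` is hypothetical, it assumes case-insensitive matching of single whole words, and it does not handle multi-word numbers such as "twenty one".

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not shipped with myDistiller.
class WordNumberSketch {
    // A "word" here is a run of letters; this is an assumption of the sketch.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    /** Parses lines of the form "number in words=number" into a lookup table. */
    static Map<String, String> parseMappings(String fileText) {
        Map<String, String> mappings = new HashMap<>();
        for (String line : fileText.split("\\R")) {
            int eq = line.indexOf('=');
            if (eq <= 0) {
                continue; // skip blank or malformed lines
            }
            String word = line.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            mappings.put(word, line.substring(eq + 1).trim());
        }
        return mappings;
    }

    /** Replaces every whole word found in the table by its numeric form. */
    static String replaceWords(String text, Map<String, String> mappings) {
        return WORD.matcher(text).replaceAll(match -> {
            String number = mappings.get(match.group().toLowerCase(Locale.ROOT));
            return Matcher.quoteReplacement(number != null ? number : match.group());
        });
    }

    public static void main(String[] args) {
        Map<String, String> numbers = parseMappings("five=5\nnineteen=19\n");
        System.out.println(replaceWords("She wrote five novels.", numbers));
        // prints "She wrote 5 novels."
    }
}
```

Because only whole words are looked up, a word such as "nineteenth" is left untouched unless it has its own entry.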
Information extraction usually addresses different document collections with different and partly unique characteristics in terms of the information they cover. Invoices, for example, contain particular information (e.g. the addressee) that follows certain patterns. The same information (e.g. the same person) may be represented in a different way in contracts. There may be a generic pattern that covers both forms of representation. However, you may also need different patterns, depending on your particular purpose. Very specific and complex patterns, for instance, tend to appear in just one data collection and nowhere else.
Each document class (invoices, emails, articles, etc.), which corresponds to a document collection, has a unique identifier or title. myDistiller needs this title to tell incoming documents apart. You are therefore required to list the unique titles in the file `identifiers.config`. In addition, you choose a unique name for the collection that has this identifier.
In the absence of a declarative title the system automatically adds "ANY" as classification title and applies all generic patterns to this document.
Example:
We have a number of Wikipedia articles on authors. We call this collection “authors”. All its articles have the title “WIKIPEDIA” somewhere at the top of the article. We therefore add to the identifiers’ list:
authors=WIKIPEDIA
The label “authors” can also be used in the pattern files if we have specific patterns that should be applied only to documents from the collection containing Wikipedia articles on authors.
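The classification rule described in this section, including the "ANY" fallback, can be sketched as follows. This is only an illustration of the idea, not myDistiller's implementation: the class `CollectionClassifierSketch` is hypothetical, and since the page only says the title appears "somewhere at the top", the number of leading lines to inspect (`headLines`) is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not shipped with myDistiller.
class CollectionClassifierSketch {
    // Fallback classification named on this page.
    static final String FALLBACK = "ANY";

    /** Parses "collection=TITLE" lines into a title-to-collection table. */
    static Map<String, String> parseIdentifiers(String fileText) {
        Map<String, String> titleToCollection = new LinkedHashMap<>();
        fileText.lines().forEach(line -> {
            int eq = line.indexOf('=');
            if (eq > 0) {
                String title = line.substring(eq + 1).trim();
                if (!title.isEmpty()) {
                    titleToCollection.put(title, line.substring(0, eq).trim());
                }
            }
        });
        return titleToCollection;
    }

    /** Returns the collection whose title occurs in the first headLines lines, else "ANY". */
    static String classify(String documentText,
                           Map<String, String> titleToCollection,
                           int headLines) {
        String head = documentText.lines()
                .limit(headLines)
                .collect(Collectors.joining("\n"));
        for (Map.Entry<String, String> entry : titleToCollection.entrySet()) {
            if (head.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        Map<String, String> identifiers = parseIdentifiers("authors=WIKIPEDIA\n");
        System.out.println(classify("WIKIPEDIA\nArticle text ...", identifiers, 5));
        System.out.println(classify("Untitled note\nSome text ...", identifiers, 5));
    }
}
```

The first call returns the label `authors`, the second falls back to `ANY`, which mirrors the behavior described for documents without a declared title.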
- Java installed on your computer