TextAnalysisConfiguration - NatLibFi/Skosmos GitHub Wiki

Starting from version 1.4, Skosmos relies exclusively on the jena-text index for text searches (as long as JenaText is set as the SPARQL dialect in config.inc). This means that the jena-text analyzer configuration can be adjusted to support different matching strategies.

NOTE: Using alternative analyzers is an experimental feature and hadn't been tested much at the time of the 1.4 release. Please try it and report your experiences in the skosmos-users group or as issues here on GitHub!

The analyzer is set in the Fuseki configuration file. Note that the analyzer is set in three places, separately for each SKOS label property (prefLabel, altLabel, hiddenLabel). Always set the same analyzer for each property!

The default analyzer is LowerCaseKeywordAnalyzer and it is configured like this:

           text:analyzer [ a text:LowerCaseKeywordAnalyzer ]
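
For orientation, the three places mentioned above sit in the entity map of the Fuseki configuration, roughly as in the sketch below. This is not a complete configuration file: the resource name <#entMap> and the field names "pref", "alt" and "hidden" are assumptions chosen for illustration and may differ in your setup.

```turtle
# Sketch of a text:EntityMap with one entry per SKOS label property.
# <#entMap> and the field names are illustrative assumptions.
@prefix text: <http://jena.apache.org/text#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<#entMap> a text:EntityMap ;
    text:entityField "uri" ;
    text:defaultField "pref" ;   # must be one of the fields defined in text:map
    text:langField "lang" ;
    text:map (
        [ text:field "pref" ;
          text:predicate skos:prefLabel ;
          text:analyzer [ a text:LowerCaseKeywordAnalyzer ] ]
        [ text:field "alt" ;
          text:predicate skos:altLabel ;
          text:analyzer [ a text:LowerCaseKeywordAnalyzer ] ]
        [ text:field "hidden" ;
          text:predicate skos:hiddenLabel ;
          text:analyzer [ a text:LowerCaseKeywordAnalyzer ] ]
    ) .
```

When switching analyzers, the text:analyzer value is the only part that changes, and it should change identically in all three entries.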

Note that you will need to rebuild the text index every time you change the analyzer configuration! You can do so either by reloading the RDF data to Fuseki (easier but may take time) or by using the jena.textindexer utility - see the jena-text documentation for details on how to do that.

Matching individual words

The default configuration of Skosmos treats each label as a single token and doesn't distinguish words within labels. This means that e.g. fra* doesn't match academic fraud, because the label as a whole starts with "academic" (you need to use *fra* instead).

This can be changed by setting the jena-text analyzer configuration to use SimpleAnalyzer, which splits the labels into words at non-letter characters (whitespace, commas etc.) and then matches individual words.

           text:analyzer [ a text:SimpleAnalyzer ]

Another alternative is StandardAnalyzer, which does more intelligent tokenizing, including a list of (English language) stop words and heuristics for acronyms, numbers, words with apostrophes etc.

           text:analyzer [ a text:StandardAnalyzer ]

Language-specific analyzers

Jena-text can intelligently choose an analyzer based on the language of labels using a MultilingualAnalyzer. Some of these analyzers perform stemming and/or use language-specific stop word lists. See the jena-text documentation for details on how to configure this.
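
As a rough sketch of what this involves, based on the jena-text documentation: multilingual support is switched on in the index definition, and the entity map needs a language field. The resource names and the directory path below are placeholders, and the details may vary between Jena versions, so treat this as a starting point rather than a tested configuration.

```turtle
# Sketch only: <#indexLucene>, <#entMap> and the directory path are placeholders.
@prefix text: <http://jena.apache.org/text#> .

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:/path/to/lucene-index> ;
    text:multilingualSupport true ;   # pick an analyzer based on each label's language tag
    text:entityMap <#entMap> .

# The entity map referenced above must define a language field, e.g.:
#     text:langField "lang" ;
```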

Accent folding i.e. matching regardless of diacritics

Searches can be made diacritic-insensitive (e.g. a search for deja vu will match déjà vu) by using a ConfigurableAnalyzer which is configured to use an ASCIIFoldingFilter. This filter converts non-ASCII characters to their nearest ASCII equivalents where one exists, for example éïèåäö will become eieaao. The downside of this simple algorithm is that in many languages, some diacritics are more significant than others. For example, in Finnish "paatos" and "päätös" are completely different words, but with this algorithm a search for either of them will also match the other. A more sophisticated, language-aware analyzer would be needed to avoid this kind of false match.

This requires support for ConfigurableAnalyzer as well as support for AnalyzingQueryParser, which was added to Fuseki 1.4.0/2.4.0-SNAPSHOT in JENA-1134. You should thus use a recent (2016-04-07 or later) 1.4.0-SNAPSHOT or 2.4.0-SNAPSHOT version, available from the download directories for Fuseki 1.4.0-SNAPSHOT and 2.4.0-SNAPSHOT.

Configuration:

<#indexLucene> a text:TextIndexLucene ;
    text:queryParser text:AnalyzingQueryParser ;
    # other settings for the index: text:directory, text:entityMap, text:storeValues ...

# analyzer configuration in text:map
           text:analyzer [
             a text:ConfigurableAnalyzer ;
             text:tokenizer text:KeywordTokenizer ;
             text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
           ] 
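
The two fragments above belong to different parts of the Fuseki configuration: the query parser is set on the index, while the analyzer goes into each entry of text:map. A minimal sketch of how they might fit together is shown below for skos:prefLabel only. The resource names, the directory path and the field name "pref" are assumptions for illustration.

```turtle
# Sketch only: resource names, path and field name are illustrative assumptions.
@prefix text: <http://jena.apache.org/text#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:/path/to/lucene-index> ;
    text:queryParser text:AnalyzingQueryParser ;   # wildcard and prefix queries also pass through the analyzer
    text:entityMap <#entMap> .

<#entMap> a text:EntityMap ;
    text:entityField "uri" ;
    text:defaultField "pref" ;
    text:langField "lang" ;
    text:map (
        [ text:field "pref" ;
          text:predicate skos:prefLabel ;
          text:analyzer [
              a text:ConfigurableAnalyzer ;
              text:tokenizer text:KeywordTokenizer ;
              text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
          ] ]
        # repeat the same text:analyzer block for skos:altLabel and skos:hiddenLabel
    ) .
```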

Accent folding plus individual words

This is the same as above, but using LetterTokenizer to split the label into individual words.

           # remember to set text:queryParser as above

           text:analyzer [
             a text:ConfigurableAnalyzer ;
             text:tokenizer text:LetterTokenizer ;
             text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
           ] 

Selective accent folding plus individual words

This is similar to the one above, but uses SelectiveFoldingFilter to fold only some of the non-ASCII characters to their nearest ASCII equivalents. You can configure which non-ASCII characters are kept in their original form in the text index.

# add the following to your text:TextIndexLucene instance
      text:defineAnalyzers (
          [ text:defineFilter <#selectiveFoldingFilter> ;
              text:filter [
                  a text:GenericFilter ;
                  text:class "org.apache.jena.query.text.filter.SelectiveFoldingFilter" ;
                  text:params (
                      [ text:paramName "whitelisted" ;
                        text:paramType text:TypeSet ;
                        text:paramValue ("Å" "Ä" "Ö" "å" "ä" "ö") ] )
              ]
          ]
      )

Then you can use the following analyzer in your entityMap:

          text:analyzer [
            a text:ConfigurableAnalyzer ;
            text:tokenizer text:KeywordTokenizer ;
            text:filters (<#selectiveFoldingFilter> text:LowerCaseFilter)
            ]
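
Note that the analyzer above uses KeywordTokenizer, which indexes each label as a single token. To also match individual words, as in the previous section, the tokenizer can presumably be swapped for LetterTokenizer. The variant below is an untested sketch under that assumption and relies on <#selectiveFoldingFilter> being defined as shown earlier.

```turtle
# Untested variant: same filters as above, but splitting labels into words.
# remember to set text:queryParser as in the accent folding section
text:analyzer [
    a text:ConfigurableAnalyzer ;
    text:tokenizer text:LetterTokenizer ;
    text:filters (<#selectiveFoldingFilter> text:LowerCaseFilter)
]
```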