Home - RBGKew/String-Transformers GitHub Wiki

Welcome to the String-Transformers wiki!

String Transformers

String Transformers is a collection of Java classes which implement a transform method, taking a character string and changing it into another. They are useful as part of a process for matching character strings against each other and deciding whether the things they represent are the same. For example, in a messy database we probably want “Royal Botanic Gardens, Kew” to match against “royal botanic gardens kew”, “ROYAL BOTANIC GARDENS KEW” and maybe even “R.B.G. Kew”.

Some of the transformers are generic: CapitalLettersExtractor removes non-capital letters from a string. Others are geared towards handling scientific names, like StripBasionymAuthorTransformer.

Examples

To transform

  • “Royal Botanic Gardens, Kew”
  • “royal botanic gardens kew”
  • “ROYAL BOTANIC GARDENS KEW”

so that they all match, we can chain two transformers.

A LowerCaseTransformer followed by a StripNonAlphanumericCharacters would turn all three strings into “royal botanic gardens kew”:

| Original | After LowerCaseTransformer | After StripNonAlphanumericCharacters |
|---|---|---|
| “Royal Botanic Gardens, Kew” | “royal botanic gardens, kew” | “royal botanic gardens kew” |
| “royal botanic gardens kew” | “royal botanic gardens kew” | “royal botanic gardens kew” |
| “ROYAL BOTANIC GARDENS KEW” | “royal botanic gardens kew” | “royal botanic gardens kew” |
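The chain in the table can be sketched in plain Java. This is only an illustration of the idea: `NormaliseSketch`, `LOWER`, `STRIP` and `normalise` are hypothetical names, not the library's classes, and the exact rules the real transformers apply (for instance how they treat whitespace) are assumed here rather than taken from the API.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Illustrative stand-ins only: these are NOT the library's classes.
class NormaliseSketch {
    // Stand-in for LowerCaseTransformer.
    static final Function<String, String> LOWER = s -> s.toLowerCase(Locale.ROOT);

    // Stand-in for StripNonAlphanumericCharacters (assumed behaviour):
    // drop everything except ASCII letters, digits and spaces,
    // then collapse runs of spaces and trim the ends.
    static final Function<String, String> STRIP = s ->
            s.replaceAll("[^\\p{Alnum} ]", "").replaceAll(" +", " ").trim();

    // Apply the two steps in the same order as the table.
    static String normalise(String s) {
        return LOWER.andThen(STRIP).apply(s);
    }

    public static void main(String[] args) {
        for (String s : List.of("Royal Botanic Gardens, Kew",
                                "royal botanic gardens kew",
                                "ROYAL BOTANIC GARDENS KEW")) {
            System.out.println(normalise(s)); // royal botanic gardens kew
        }
    }
}
```

Composing the steps with `andThen` keeps each one small and makes the order explicit, which matters because the transformations are generally not interchangeable.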

If we also want to match “R.B.G. Kew” then we could use StripNonAlphanumericCharacters, then TitleCaseTransformer, then CapitalLettersExtractor, ending with “R B G K” in each case. This would also match “Rather big grey koala”, though, so we must be careful!
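The risk of over-matching is easy to see in a small sketch. The code below does not call the three transformers; it assumes their combined effect is "first letter of each word, upper-cased, separated by spaces" and implements that directly. `InitialsSketch` and `initials` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustrative only: approximates the combined effect of the three
// transformers named above; it does not use the library itself.
class InitialsSketch {
    static String initials(String s) {
        return Arrays.stream(s.split("[^\\p{Alnum}]+")) // split on punctuation and spaces
                .filter(w -> !w.isEmpty())              // ignore an empty leading token
                .map(w -> w.substring(0, 1).toUpperCase(Locale.ROOT))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(initials("Royal Botanic Gardens, Kew")); // R B G K
        System.out.println(initials("R.B.G. Kew"));                 // R B G K
        System.out.println(initials("Rather big grey koala"));      // R B G K
    }
}
```

All three inputs collapse to the same key, so a transformation this aggressive is probably best used to find candidate matches that are then checked by a stricter comparison.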

To transform these scientific names

  • Coffea sapinii
  • Coffea sapini

into the same string we might use an A2BTransformer with the search pattern (A) (\\w)\\1, which means any letter (strictly, any word character: a letter, digit or underscore) followed by the same character again, and the replace pattern (B) $1, which means the character captured by the first pair of brackets. This replaces each doubled letter with a single letter, so both names become “Cofea sapini”.
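The pattern can be tried out with the standard `String.replaceAll` method. This sketch assumes A2BTransformer applies the search and replace patterns in the same way as `replaceAll`; `DoubledLettersSketch`, `collapsePairs` and `collapseRuns` are hypothetical names. It also shows a variant pattern, `(\\w)\\1+`, which may be preferable if runs of three or more identical letters should collapse to one.

```java
// Illustrative only: uses java.lang.String.replaceAll rather than A2BTransformer.
class DoubledLettersSketch {
    // Each pair of identical adjacent word characters becomes one character.
    static String collapsePairs(String s) {
        return s.replaceAll("(\\w)\\1", "$1");
    }

    // Variant: a run of any length (two or more) becomes one character.
    static String collapseRuns(String s) {
        return s.replaceAll("(\\w)\\1+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapsePairs("Coffea sapinii")); // Cofea sapini
        System.out.println(collapsePairs("Coffea sapini"));  // Cofea sapini
        System.out.println(collapsePairs("aaa"));            // aa
        System.out.println(collapseRuns("aaa"));             // a
    }
}
```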

To transform

  • “(De Wild.) A.P.Davis”
  • “A.Davis”

we first use a RemoveBracketedTextTransformer to remove the basionym author, which is often absent. We could add an A2BTransformer to remove the initials, but there are specific transformers geared towards botany: here we want a SurnameExtractor, which should convert both author strings to “Davis”.
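A rough sketch of these two steps follows. It is not how the library's transformers are implemented: `AuthorSketch`, `removeBracketed` and `surname` are hypothetical names, and the surname rule used here (keep whatever follows the last full stop) is a deliberately crude assumption that happens to work for these two examples.

```java
// Illustrative only: crude stand-ins for RemoveBracketedTextTransformer
// and SurnameExtractor, not the library's own logic.
class AuthorSketch {
    // Remove any "(...)" group, e.g. a basionym author, then trim.
    static String removeBracketed(String s) {
        return s.replaceAll("\\([^)]*\\)", "").trim();
    }

    // Assumed rule: the surname is the text after the last full stop.
    // If there is no full stop, the whole string is returned.
    static String surname(String s) {
        String t = s.trim();
        return t.substring(t.lastIndexOf('.') + 1).trim();
    }

    public static void main(String[] args) {
        System.out.println(surname(removeBracketed("(De Wild.) A.P.Davis"))); // Davis
        System.out.println(surname(removeBracketed("A.Davis")));              // Davis
    }
}
```

Real author strings are messier than this (multiple authors, trailing abbreviations and so on), which is the reason to prefer the purpose-built botanical transformers over a hand-written pattern.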

Usage

See Usage with OpenRefine.

See the API Documentation for the full list of Transformers.