Home - RBGKew/String-Transformers GitHub Wiki
Welcome to the String-Transformers wiki!
String Transformers
String Transformers is a collection of Java classes that each implement a transform method, which takes a character string and changes it into another. They are useful as one step in a process for matching character strings against each other and deciding whether the things they represent are the same. For example, in a messy database we probably want “Royal Botanic Gardens, Kew” to match “royal botanic gardens kew”, “ROYAL BOTANIC GARDENS KEW” and maybe even “R.B.G. Kew”.
Some of the transformers are generic: CapitalLettersExtractor removes non-capital letters from a string. Others are geared towards handling scientific names, like StripBasionymAuthorTransformer.
Examples
To transform
- “Royal Botanic Gardens, Kew”
- “royal botanic gardens kew”
- “ROYAL BOTANIC GARDENS KEW”

so that they match, a LowerCaseTransformer followed by a StripNonAlphanumericCharacters would turn all three strings into “royal botanic gardens kew”:
| Original | After LowerCaseTransformer | After StripNonAlphanumericCharacters |
|---|---|---|
| “Royal Botanic Gardens, Kew” | “royal botanic gardens, kew” | “royal botanic gardens kew” |
| “royal botanic gardens kew” | “royal botanic gardens kew” | “royal botanic gardens kew” |
| “ROYAL BOTANIC GARDENS KEW” | “royal botanic gardens kew” | “royal botanic gardens kew” |
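
The two-step chain in the table can be sketched with plain Java, without the library. The class ChainSketch and its LOWER and STRIP operators are hypothetical stand-ins written for illustration; they are not the library's LowerCaseTransformer or StripNonAlphanumericCharacters, whose exact behaviour (for example, how punctuation next to a space is handled) may differ.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class ChainSketch {
    // Hypothetical stand-in for a lower-casing step.
    static final UnaryOperator<String> LOWER = s -> s.toLowerCase(Locale.ROOT);

    // Hypothetical stand-in for a strip step: every run of characters that are
    // not ASCII letters or digits becomes one space, then the ends are trimmed.
    static final UnaryOperator<String> STRIP =
            s -> s.replaceAll("[^\\p{Alnum}]+", " ").trim();

    // Apply the steps in the same order as the table: lower-case, then strip.
    static String normalise(String s) {
        return STRIP.apply(LOWER.apply(s));
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "Royal Botanic Gardens, Kew",
                "royal botanic gardens kew",
                "ROYAL BOTANIC GARDENS KEW");
        for (String name : names) {
            System.out.println(normalise(name)); // royal botanic gardens kew
        }
    }
}
```

Order matters less here than in later examples, but keeping each step as a separate function makes it easy to inspect the intermediate strings, as the table does.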
If we also want to match “R.B.G. Kew” then we could use StripNonAlphanumericCharacters, then TitleCaseTransformer, then CapitalLettersExtractor to end with “R B G K” in each case. This will also match “Rather big grey koala”, however, so we must be careful!
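
A rough sketch of this three-step chain, again in plain Java rather than with the library's classes, shows both the intended matches and the false positive. InitialsSketch and its methods are hypothetical, and the sketch assumes that stripped characters are replaced by a space and that only lower-case letters are removed in the last step.

```java
import java.util.Locale;

public class InitialsSketch {
    // Assumed strip behaviour: runs of non-alphanumerics become one space.
    static String stripNonAlnum(String s) {
        return s.replaceAll("[^\\p{Alnum}]+", " ").trim();
    }

    // Assumed title-case behaviour: first letter of each word upper-case, rest lower-case.
    static String titleCase(String s) {
        StringBuilder out = new StringBuilder();
        for (String word : s.split(" ")) {
            if (word.isEmpty()) continue;
            if (out.length() > 0) out.append(' ');
            out.append(Character.toUpperCase(word.charAt(0)))
               .append(word.substring(1).toLowerCase(Locale.ROOT));
        }
        return out.toString();
    }

    // Assumed extractor behaviour: drop lower-case ASCII letters, keep everything else.
    static String capitals(String s) {
        return s.replaceAll("\\p{Lower}", "");
    }

    static String initials(String s) {
        return capitals(titleCase(stripNonAlnum(s)));
    }

    public static void main(String[] args) {
        System.out.println(initials("R.B.G. Kew"));                 // R B G K
        System.out.println(initials("Royal Botanic Gardens, Kew")); // R B G K
        System.out.println(initials("Rather big grey koala"));      // R B G K
    }
}
```

The last line of output is the false positive: reducing names to initials discards so much information that unrelated strings collide.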
To transform these scientific names
- “Coffea sapinii”
- “Coffea sapini”
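into the same string we might use an A2BTransformer with the search pattern (A) (\\w)\\1, which means any word character (a letter, digit or underscore) followed by the same character again, and the replace pattern (B) $1, which means the contents of the first bracketed group. This replaces each doubled letter with a single letter, so both names become “Cofea sapini”. Under standard Java regular expression semantics, a run of three or more identical letters would need a pattern such as (\\w)\\1+ to be reduced to a single letter in one pass.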
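
The effect of the search and replace patterns can be checked with the standard String.replaceAll method. This is only a sketch of the regular expression itself; DoubleLetterSketch is a hypothetical class, and it is an assumption that A2BTransformer applies its patterns in the same replace-all fashion.

```java
public class DoubleLetterSketch {
    // Replace each pair of identical word characters with one copy.
    static String collapsePairs(String s) {
        return s.replaceAll("(\\w)\\1", "$1");
    }

    // Variant that reduces a run of any length to one copy.
    static String collapseRuns(String s) {
        return s.replaceAll("(\\w)\\1+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapsePairs("Coffea sapinii")); // Cofea sapini
        System.out.println(collapsePairs("Coffea sapini"));  // Cofea sapini
        System.out.println(collapsePairs("ooo"));            // oo
        System.out.println(collapseRuns("ooo"));             // o
    }
}
```

The doubled backslashes are Java string-literal escaping; the regular expression the engine sees is (\w)\1.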
To transform
- “(De Wild.) A.P.Davis”
- “A.Davis”
we first use a RemoveBracketedTextTransformer to remove the often-absent basionym author. We could add an A2BTransformer to remove the initials, but there are specific transformers geared towards botany: here we want a SurnameExtractor, which should convert both author strings to “Davis”.
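
As a very simplified illustration of the idea, the two steps can be imitated with plain Java string handling. AuthorSketch and its methods are hypothetical; in particular, taking the text after the last full stop is a naive assumption that happens to work for these two strings, and the library's SurnameExtractor presumably handles many more author formats.

```java
public class AuthorSketch {
    // Remove any text in round brackets, brackets included, then trim.
    static String removeBracketed(String s) {
        return s.replaceAll("\\([^)]*\\)", "").trim();
    }

    // Naive surname guess: whatever follows the last full stop.
    static String surname(String s) {
        String t = s.trim();
        return t.substring(t.lastIndexOf('.') + 1).trim();
    }

    static String normaliseAuthor(String s) {
        return surname(removeBracketed(s));
    }

    public static void main(String[] args) {
        System.out.println(normaliseAuthor("(De Wild.) A.P.Davis")); // Davis
        System.out.println(normaliseAuthor("A.Davis"));              // Davis
    }
}
```

Removing the bracketed text first matters: otherwise the full stop inside “(De Wild.)” would be one more thing for the surname step to deal with.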
Usage
See the API Documentation for the full list of Transformers.