title: Decomposing ... Part Due
link: https://onemoretech.wordpress.com/2009/04/09/decomposing-part-due/
author: lembobro
description:
post_id: 342
created: 2009/04/09 03:24:36
created_gmt: 2009/04/09 03:24:36
comment_status: open
post_name: decomposing-part-due
status: publish
post_type: post
Decomposing ... Part Due
So everything looked like it was coming together, especially after Text::Unaccent built and tested without error on the Solaris 8 server at work.
Unfortunately I hit a speed bump.
Mr. Roßberg.
You see, that ß character isn’t a “b”. In German it actually sounds more like “s”, as in “glass”. As I was to learn later, the accepted transliteration, at least into English, is “ss”. The Unicode Code Chart for Latin-1 describes it as “LATIN SMALL LETTER SHARP S”. The actual code point is “U+00DF”. Uppercase would simply be “SS”, a double Latin capital “S”.
And it isn’t included in the precompiled mappings for Text::Unaccent.
Fortunately the module didn’t munge the data or spit back an escaped character code sequence, but just passed on the string unmolested.
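For anyone who wants to poke at the character itself, here is a minimal sketch that uses only core Perl modules (not Text::Unaccent). It looks up the Unicode name for U+00DF, uppercases it, and runs it through canonical decomposition (NFD) next to an ordinary accented letter for comparison. The variable names are mine, and the `utf8::upgrade` calls are an assumption I make so that Perl applies Unicode rules to these one-byte code points.

```perl
use strict;
use warnings;
use charnames ':full';          # loads the Unicode name tables for viacode()
use Unicode::Normalize qw(NFD); # in the Perl core since 5.8

my $sharp_s = "\x{00DF}";       # the sharp s
my $e_acute = "\x{00E9}";       # e with acute, for comparison

# Force character (Unicode) semantics for these one-byte code points.
utf8::upgrade($sharp_s);
utf8::upgrade($e_acute);

my $name        = charnames::viacode(0x00DF); # official Unicode name
my $upper       = uc $sharp_s;                # full Unicode case mapping
my $e_acute_nfd = NFD($e_acute);              # base letter + combining mark
my $sharp_s_nfd = NFD($sharp_s);              # no canonical decomposition exists

printf "U+%04X %s\n", ord($sharp_s), $name;
printf "uppercase: %s\n", $upper;
printf "NFD(e-acute) length: %d\n", length($e_acute_nfd);
printf "NFD(sharp s) %s\n", $sharp_s_nfd eq $sharp_s ? "is unchanged" : "changed";
```

The point of the comparison is that an accented letter splits into a base letter plus a combining mark that can then be stripped, while the sharp s has nothing to split into, so any "ss" result has to come from an explicit mapping.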
But I still needed to be able to decompose it. To be honest I was a little bit bummed by the whole thing. Then I discovered the pure perl rework of Text::Unaccent, called…
Thankfully, PurePerl.pm sets out all the character mappings in nicely formatted code blocks. Scrolling down, I found the last place where “LATIN SMALL LETTER S” was mentioned and inserted the following:
`
# 00DF LATIN SMALL LETTER SHARP S
# -> 0073 DOUBLE LATIN SMALL LETTER S
"\x{00DF}" => "ss",
`
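To show how an entry like that takes effect, here is a rough, self-contained sketch of a table-driven replacement. It is not the actual internals of Text::Unaccent::PurePerl: the `%MAP` hash, its three entries, and the `unaccent_sketch` function are hypothetical stand-ins I made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical, heavily trimmed stand-in for a character mapping table.
my %MAP = (
    # 00E9 LATIN SMALL LETTER E WITH ACUTE
    "\x{00E9}" => "e",
    # 00F6 LATIN SMALL LETTER O WITH DIAERESIS
    "\x{00F6}" => "o",
    # 00DF LATIN SMALL LETTER SHARP S
    "\x{00DF}" => "ss",
);

# Replace each non-ASCII character that has a table entry;
# characters without an entry pass through untouched.
sub unaccent_sketch {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/exists $MAP{$1} ? $MAP{$1} : $1/ge;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';  # so the sharp s prints as UTF-8

my $name = "Donald Ro\x{00DF}berg";
print "$name\n";
print unaccent_sketch($name), "\n";
```

Note the backslash in `"\x{00DF}"`: without it the key is just the seven literal characters `x{00DF}` and would never match the sharp s.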
Running my script again, I was relieved to see:
Donald Roßberg
Donald Rossberg
The only problem I have now is that Text::Unaccent::PurePerl requires perl 5.8 or higher and the Solaris box is still on 5.6.1, which means there’s a perl upgrade in my future.
POSTSCRIPT:
To keep things clean in my environment, I’ve relocated the modified version of this module to Custom::Text::Unaccent::PurePerl.
Copyright 2004-2019 Phil Lembo