20090409 decomposing part due - plembo/onemoretech GitHub Wiki

title: Decomposing ... Part Due
link: https://onemoretech.wordpress.com/2009/04/09/decomposing-part-due/
author: lembobro
description:
post_id: 342
created: 2009/04/09 03:24:36
created_gmt: 2009/04/09 03:24:36
comment_status: open
post_name: decomposing-part-due
status: publish
post_type: post

Decomposing ... Part Due

So everything looked like it was coming together. Especially after Text::Unaccent built and tested without error on the Solaris 8 server at work.

Unfortunately I hit a speed bump.

Mr. Roßberg.

You see, that ß character isn’t a “b”. In German it sounds like “s”, as in “glass”. As I was to learn later, the accepted transliteration, at least into English, is “ss”. The Unicode code chart for Latin-1 names it “LATIN SMALL LETTER SHARP S”, and its code point is U+00DF. Uppercased, it simply becomes “SS”, a double Latin capital “S”.
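For anyone who wants to check those facts from Perl itself, here is a minimal sketch. It assumes Perl 5.8 or later with the core `charnames` module, and it upgrades the string explicitly so that `uc` applies Unicode rules rather than byte rules.

```perl
use strict;
use warnings;
use charnames ':full';                  # loads charnames, which provides viacode()
binmode(STDOUT, ':encoding(UTF-8)');    # print characters as UTF-8

my $sharp_s = "\x{00DF}";
utf8::upgrade($sharp_s);                # force Unicode (not byte) semantics for uc()

# Look up the official Unicode name for the code point.
printf "U+%04X %s\n", ord($sharp_s), charnames::viacode(ord $sharp_s);

# Full case mapping turns the single character into two capital letters.
print "uppercase: ", uc($sharp_s), "\n";   # uppercase: SS
```

The first line of output should show the code point and its name, and the second shows that uppercasing changes the length of the string, which is part of why this character gets special treatment.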

And it isn’t included in the precompiled mappings for Text::Unaccent.

Fortunately the module didn’t munge the data or spit back an escaped character code sequence, but just passed on the string unmolested.

But I still needed to be able to decompose it. To be honest, I was a little bit bummed by the whole thing. Then I discovered the pure-Perl rework of Text::Unaccent, called…

Text::Unaccent::PurePerl

Thankfully, PurePerl.pm sets out all the character mappings in nicely formatted code blocks. Scrolling down I found the last place where “LATIN SMALL LETTER S” was mentioned and inserted the following:

```
# 00DF LATIN SMALL LETTER SHARP S
# ->   0073 DOUBLE LATIN SMALL LETTER S
"\x{00DF}" => "ss",
```

Running my script again, I was relieved to see:

Donald Roßberg
Donald Rossberg
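To illustrate what a mapping entry like that does, here is a self-contained sketch that does not use the module at all. The `%map` hash and the `unaccent_sketch` function are hypothetical stand-ins of my own, not the module's actual code, and only two mappings are included.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # this source file contains a literal ß
binmode(STDOUT, ':encoding(UTF-8)');    # print characters as UTF-8

# Hypothetical stand-in for the module's mapping table (two entries only).
my %map = (
    "\x{00DF}" => "ss",   # LATIN SMALL LETTER SHARP S
    "\x{00E9}" => "e",    # LATIN SMALL LETTER E WITH ACUTE
);

# Replace each character that has a mapping; pass everything else through.
sub unaccent_sketch {
    my ($string) = @_;
    $string =~ s/(.)/exists $map{$1} ? $map{$1} : $1/gse;
    return $string;
}

my $name = "Donald Roßberg";
print "$name\n";
print unaccent_sketch($name), "\n";
```

Run as written, this prints the same two lines shown above. Characters without an entry in the table come back unchanged, which matches the pass-through behavior I saw before adding the ß mapping.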

The only problem I have now is that Text::Unaccent::PurePerl requires Perl 5.8 or higher, and the Solaris box is still on 5.6.1, which means there’s a Perl upgrade in my future.

POSTSCRIPT:

To keep things clean in my environment, I’ve relocated the modified version of this module to the Custom::Text::Unaccent::PurePerl namespace.

Copyright 2004-2019 Phil Lembo