
title: Dealing with special characters in names
link: https://onemoretech.wordpress.com/2015/04/21/dealing-with-special-characters-in-names/
author: phil2nc
description:
post_id: 9596
created: 2015/04/21 16:34:17
created_gmt: 2015/04/21 20:34:17
comment_status: closed
post_name: dealing-with-special-characters-in-names
status: publish
post_type: post

Dealing with special characters in names

Most applications nowadays handle Unicode in text seamlessly. If only that were true. Below is a partial solution from one directory manager.

For around 12 years I managed an LDAP directory service for a big multinational that held tens of thousands of active entries, and twice that in inactive ones. In the early 2000s, when I started, we had around 200 names, mostly in Northern Europe, that had special characters. Because we were a Sun shop at the time we were running Sun's Directory Server, which fortunately stored such data encoded in UTF-8 (even though the operating system defaulted to Latin-1). Now, many years later, the number of "special" names is just under 2,000 active users. Too many to ignore, and enough to warrant greater attention.

My early solution was to create custom attributes to store the decomposed (ASCII-ized) value of any givenname or sn that had special characters in it, and then add them to the user's entry as additional attributes. Applications could then add these attributes to their search filters to find those entries using ASCII characters. This was done using a maintenance routine written in Perl that ran every day as a batch job.

It turns out that this solution still works well today, but with huge improvements in Perl's UTF-8 handling, it was time to refactor my routines for creating and managing the attributes. My original script used Text::Unaccent::PurePerl, and was applied to all name strings indiscriminately. By using Test::utf8 in the new script I was able to make it more efficient by only processing names that contain non-ASCII characters.

One factor I didn't consider all those years ago was local differences in how certain characters are decomposed. For example, throughout most of Europe a ü is simply translated into "u". But in Germany (DE), Austria (AT) and Switzerland (CH) the preference of most German speakers is to translate it to "ue".
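To make the regional difference concrete, here is a minimal sketch of country-aware decomposition. It is not the routine from my script: it assumes the core Unicode::Normalize module in place of Text::Unaccent::PurePerl, and the names `has_non_ascii`, `ascii_name`, `%german` and `%fallback` are invented for illustration. The fallback table covers only a handful of letters that have no canonical decomposition, so treat it as a starting point rather than a complete mapping.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # the literals below are written in UTF-8
use Unicode::Normalize qw(NFD);    # core module, used here as a stand-in

# Hypothetical table: digraphs preferred by German speakers.
my %german = (
    'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
    'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
    'ß' => 'ss',
);
my %german_country = map { $_ => 1 } qw(DE AT CH);

# Hypothetical table: letters that NFD cannot split into base + accent.
my %fallback = ( 'ß' => 'ss', 'ø' => 'o', 'Ø' => 'O', 'æ' => 'ae', 'Æ' => 'AE' );

# Return 1 when the string contains at least one non-ASCII character.
sub has_non_ascii {
    my ($s) = @_;
    return $s =~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Return an ASCII-ized form of $name. $country is a two-letter code ("DE").
sub ascii_name {
    my ($name, $country) = @_;
    return $name unless has_non_ascii($name);    # nothing to do for pure ASCII

    # German-speaking countries: apply the digraph substitutions first.
    if (defined $country && $german_country{ uc $country }) {
        $name =~ s/([äöüÄÖÜß])/$german{$1}/g;
    }

    $name = NFD($name);                          # split base letters from accents
    $name =~ s/\p{Mn}//g;                        # drop the combining marks
    $name =~ s/([ßøØæÆ])/$fallback{$1}/g;        # letters NFD leaves untouched
    return $name;
}

print ascii_name("Müller", "DE"), "\n";    # Mueller
print ascii_name("Müller", "FR"), "\n";    # Muller
```

The German substitutions run before the generic step on purpose: once the umlaut has been stripped there is no way to tell that the plain "u" should have become "ue".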
Although someone could write a completely new Perl module to replace Text::Unaccent::PurePerl, one that takes a country or language code as a parameter, I decided to take the easy way out and write a wrapper routine around it. For name strings associated with people in DE, AT or CH I pass the value through a "for" loop that does the German substitution before handing it to Text::Unaccent::PurePerl's more comprehensive unac_string method. I moved all unaccent processing to its own subroutine to modularize the process, and added paged searching as an additional efficiency measure.

My complete script follows.

[code language="perl"]
#!/usr/bin/perl
# decomp2usr.pl Script to decompose Unicode characters in user names and
# then push decomposed form of name back into LDAP. Creates 2 new attribs
# for every user, corpdecompgn and corpdecompsn.
# Originally created 1/14/02 by P Lembo, Refactored 4/21/15.

use strict;
use Net::LDAP;
use Net::LDAP::Entry;
use Net::LDAP::LDIF;
use Net::LDAP::Control::Paged;
use Net::LDAP::Constant qw( LDAP_CONTROL_PAGED );
use Test::utf8;
use Text::Unaccent::PurePerl;
use File::Copy;
use String::Util qw(trim);

my $HOME = $ENV{'HOME'};
my $BIN = "/usr/local/bin";

# Connection settings are loaded from a local config file.
our ($dirUsr, $dirHost, $dirPass, $dirPort);
require "$HOME/etc/ldapapp.conf";

my $time = localtime();

my $errFile = "$HOME/data/logs/decomp2usr.log";
my $changefile = "$HOME/data/import/decompchg.ldif";
my $changefileName = "decompchg.ldif";

open LOGZ, ">$errFile" or die $!;

print LOGZ "$time\tStart UserName Decomp Process\n";

decompnames();
update_ldap();
clean_up();

$time = localtime();
print LOGZ "$time\tCompleted UserName Decomp Process\n";

close LOGZ;

sub decompnames {

    open FH, ">$changefile" or die $!;

    # uid is requested as well because it is written to the log below.
    my @attrs = qw(uid sn givenname corpdecompgn corpdecompsn c);
    my $basedn = "ou=People,dc=corp,dc=com";
    # Presence filters: only entries that have both a givenname and an sn.
    my $query = "(&(objectclass=inetorgperson)(givenname=*)(sn=*))";

    my $ldap = Net::LDAP->new($dirHost, port => $dirPort);
    my $mesg = $ldap->start_tls(verify => 'none', sslversion => 'tlsv1');
    $mesg = $ldap->bind($dirUsr, password => $dirPass) or die $!;

    my $page = Net::LDAP::Control::Paged->new( size => 1000 ) or die $!;

    my @args = (
        base    => $basedn,
        scope   => 'sub',
        filter  => $query,
        attrs   => \@attrs,
        control => [ $page ],
    );

    my $cookie;
    while (1) {
        $mesg = $ldap->search( @args ) or die $!;

        while (my $entry = $mesg->shift_entry()) {
            my $dn = $entry->dn;
            my $uid = $entry->get_value('uid');
            my $sn = $entry->get_value('sn');
            my $givenname = $entry->get_value('givenname');
            # Default to an empty string when the attribute is not set yet.
            my $corpdecompsn = $entry->get_value('corpdecompsn') // '';
            my $corpdecompgn = $entry->get_value('corpdecompgn') // '';
            my $c = $entry->get_value('c');

            $sn = trim($sn);
            $givenname = trim($givenname);

            my $sntest = is_within_ascii($sn);
            my $gntest = is_within_ascii($givenname);

            if (($sntest eq '1') && ($gntest eq '1')) {
                next;
            } else {
                # If both first and last are non-ASCII
                if (($sntest eq '0') && ($gntest eq '0')) {
                    my $dLname = cust_unac($c, $sn);
                    my $dFname = cust_unac($c, $givenname);
                    # Does result match both existing? If not, make change.
                    if ((lc($dLname) ne lc($corpdecompsn)) || (lc($dFname) ne lc($corpdecompgn))) {
                        if (($dLname =~ /.+/) && ($dFname =~ /.+/)) {
                            print FH "dn: $dn\n";
                            print FH "changetype: modify\n";
                            print FH "replace: corpdecompsn\n";
                            print FH "corpdecompsn: $dLname\n";
                            print FH "-\n";
                            print FH "replace: corpdecompgn\n";
                            print FH "corpdecompgn: $dFname\n";
                            print FH "\n";
                            print LOGZ $uid . " full name " . $givenname . " " . $sn . " decomped name is " . $dFname . " " . $dLname, "\n";
                        }
                    }
                }
                # If last name is non-ASCII but first name is ASCII
                elsif (($sntest eq '0') && ($gntest eq '1')) {
                    my $dLname = cust_unac($c, $sn);
                    # Does result match existing? If not, make change.
                    if (lc($dLname) ne lc($corpdecompsn)) {
                        if ($dLname =~ /.+/) {
                            print FH "dn: $dn\n";
                            print FH "changetype: modify\n";
                            print FH "replace: corpdecompsn\n";
                            print FH "corpdecompsn: $dLname\n";
                            print FH "\n";
                            print LOGZ $uid . " last name " . $sn . " decompsn is " . $dLname, "\n";
                        }
                    }
                }
                # If last name is ASCII but first name is non-ASCII
                elsif (($sntest eq '1') && ($gntest eq '0')) {
                    my $dFname = cust_unac($c, $givenname);
                    # Does result match existing? If not, make change
                    if (lc($dFname) ne lc($corpdecompgn)) {
                        if ($dFname =~ /.+/) {
                            print FH "dn: $dn\n";
                            print FH "changetype: modify\n";
                            print FH "replace: corpdecompgn\n";
                            print FH "corpdecompgn: $dFname\n";
                            print FH "\n";
                            print LOGZ $uid . " first name " . $givenname . " decompgn is " . $dFname, "\n";
                        }
                    }
                }
            } # if some non-ASCII
        } # while search

        # Stop on error, or when the server returns no further page cookie.
        $mesg->code and last;
        my ( $resp ) = $mesg->control( LDAP_CONTROL_PAGED ) or last;
        $cookie = $resp->cookie or last;
        $page->cookie( $cookie );
    } # while paging

    if ($cookie) {
        $page->cookie($cookie);
        $page->size(0);