
title: an iconv-nient truth: use perl Text::Iconv; link: https://onemoretech.wordpress.com/2009/05/04/an-iconv-nient-truth-use-perl-texticonv/ author: lembobro description: post_id: 327 created: 2009/05/04 14:51:06 created_gmt: 2009/05/04 14:51:06 comment_status: open post_name: an-iconv-nient-truth-use-perl-texticonv status: publish post_type: post

an iconv-nient truth: use perl Text::Iconv;

A short time ago I discovered that the system [iconv](http://manpages.ubuntu.com/manpages/intrepid/en/man1/iconv.1.html) utility was much more reliable than the old Unicode::Lite module in perl for translating text from UTF-8 to Latin-1 (well, actually ISO-8859-15). It also allowed me to re-encode entire files in one shot.

Over the weekend I had iconv fail on a particularly big file, forcing me to take another look at the issue.

The error I consistently got was:

“iconv: /u01/data/international/allusers.csv: cannot convert”

The root cause of the failure is still unknown, but I’m pretty sure it has something to do with the improper encoding of a particular attribute value or values in the data. Later examination found that the street and city values for a particular set of users in eastern Europe were badly munged.
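A minimal sketch of one way to hunt down records like these. It assumes perl's core Encode module rather than anything I used at the time, and the `bad_line_numbers` helper and the sample strings are made up for illustration. It flags any line that is either malformed UTF-8 or contains a character with no ISO-8859-15 equivalent:

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return the 1-based positions of the lines that
# cannot make the trip from UTF-8 bytes to ISO-8859-15.
sub bad_line_numbers {
    my @lines = @_;
    my @bad;
    for my $i (0 .. $#lines) {
        my $ok = eval {
            # Dies if the bytes are not well-formed UTF-8
            my $text = decode('UTF-8', $lines[$i], FB_CROAK | LEAVE_SRC);
            # Dies if a character has no ISO-8859-15 equivalent
            encode('ISO-8859-15', $text, FB_CROAK | LEAVE_SRC);
            1;
        };
        push @bad, $i + 1 unless $ok;
    }
    return @bad;
}

# Sample data: plain ASCII, a valid e-acute, a stray 0xFF byte, and
# U+0151 (o with double acute), which ISO-8859-15 does not contain.
my @flagged = bad_line_numbers("abc", "caf\xC3\xA9", "\xFF", "\xC5\x91");
print "suspect line: $_\n" for @flagged;
```

Against a real file the same helper could be fed the lines read from the data file, which would at least narrow the problem down to specific records.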

After doing a little research I decided to give perl’s [Text::Iconv](http://search.cpan.org/~mpiotr/Text-Iconv-1.7/) module a try, especially since all my systems had now been upgraded to perl 5.10.0, whose improved handling of encoding and decoding international characters (i18n) was making all kinds of things work the way they’re supposed to.

It worked.

My code snippet:

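    # Convert data file from UTF-8 to ISO-8859-15
    use strict;
    use Text::Iconv;
    use File::Copy;

    my $HOME = $ENV{'HOME'};
    my $datFile = "$HOME/data.csv";

    my $converter = Text::Iconv->new("UTF-8", "ISO-8859-15");
    # Read the original file and write the converted lines to a temp file
    open FH, "<", $datFile or die $!;
    open FH1, ">", "$datFile.tmp" or die $!;
    while (<FH>) {

        chomp;
        my $line = $_;
        my $converted = $converter->convert($line);
        # convert() returns undef on failure, so skip lines with no content
        if (defined $converted && $converted =~ /./) {
            print FH1 "$converted\n";
        }
    }
    close FH1;
    close FH;
    # Replace the original file with the converted copy
    move("$datFile.tmp", "$datFile");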

In the above script $datFile is the data file whose contents will be re-encoded, one line at a time. The $converter object takes two arguments on creation: the existing encoding and the new encoding for the data. Each line first has its trailing newline chomped off, and is then passed as string data through the module’s convert method. Once converted, the line is printed to a temp file. After the entire file has been processed, the temp file is given the name of the original data file using the move function from the File::Copy module.

There is one major difference in how the system iconv and Text::Iconv handle problems with the data: iconv will throw an error (“cannot convert”), while Text::Iconv will simply print a blank (its convert method returns undef when a conversion fails, which prints as an empty string). If you’re re-encoding a whole line, then the perl module will just print a blank for the entire line. Either result creates its own downstream problems, but at least for now I’d prefer to get blank lines for a small number of badly coded records rather than abend the whole process. Note that the above code checks for some, any, content in a converted line. This prevents a failed conversion from printing a blank line in the file.
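Silently dropping failed lines makes it hard to know which records went missing. Here is a rough sketch of keeping track of them, assuming only that convert returns undef on failure. The `convert_lines` helper and the sample strings are hypothetical and not part of my original script:

```perl
use strict;
use warnings;
use Text::Iconv;

# Hypothetical helper: convert each line and return two array refs,
# the converted lines and the 1-based positions of the lines that failed.
sub convert_lines {
    my ($converter, @lines) = @_;
    my (@good, @bad);
    my $n = 0;
    for my $line (@lines) {
        $n++;
        my $converted = $converter->convert($line);
        if (defined $converted) {
            push @good, $converted;
        } else {
            push @bad, $n;    # convert() returned undef: conversion failed
        }
    }
    return (\@good, \@bad);
}

my $converter = Text::Iconv->new("UTF-8", "ISO-8859-15");

# Sample data: the third string contains a byte that is not valid UTF-8
my ($good, $bad) = convert_lines($converter,
    "plain ascii", "caf\xC3\xA9", "bad \xFF byte");

print STDERR "could not convert line $_\n" for @$bad;
```

Text::Iconv also has a raise_error setting that makes a failed conversion throw an exception instead of returning undef, which may suit jobs where stopping on the first bad record is the better choice.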