
title: Converting file encoding with iconv
link: https://onemoretech.wordpress.com/2009/01/20/converting-file-encoding-with-iconv/
author: lembobro
description:
post_id: 394
created: 2009/01/20 22:53:30
created_gmt: 2009/01/20 22:53:30
comment_status: open
post_name: converting-file-encoding-with-iconv
status: publish
post_type: post

Converting file encoding with iconv

For several years I’ve tried various methods of processing data that needs to be imported into an LDAP directory. One of the biggest headaches has been dealing with input files from around the globe that have different character encodings.

In the past my approach has been to read the file in and then convert the character set one element at a time (for example, each value on every line of a comma-separated values file).

I’ve had mixed success with this due to the way perl handles such things internally and issues with some of the older modules I’ve used for the task. It doesn’t help that my “production” platform is Solaris 8 running perl 5.6.1 (not native, from sunfreeware).

While revisiting the problem today I stumbled on the iconv utility, almost by accident. As any good Unix admin should know:

The iconv program converts the encoding of characters in inputfile, or from the standard input if no filename is specified, from one coded character set to another.

That is from the iconv man page on Ubuntu Intrepid.

I am not a good Unix admin.

Of course once I’d read the man page I knew my problem was solved. With iconv I could simply convert the entire input file from whatever “foreign” encoding it was in to something more perl and LDAP friendly, like UTF-8.

The syntax is iconv -f input_encoding -t output_encoding [inputfile].

So now all I have to do is add a line to my scripts to shell out and run a command like:

iconv -f ISO-8859-1 -t UTF-8 myansifile.csv >myutf8file.csv
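To see what a conversion like this actually does to the bytes, here is a minimal sketch that builds a tiny ISO-8859-1 file and converts it. The temporary directory and the file names latin1.csv and utf8.csv are made up for the illustration, and it assumes an iconv that accepts the ISO-8859-1 and UTF-8 labels.

```shell
# Create a small ISO-8859-1 sample: "caf" followed by byte 0xE9 (e-acute in Latin-1).
tmpdir=$(mktemp -d)
printf 'caf\351\n' > "$tmpdir/latin1.csv"

# Convert to UTF-8 and record whether iconv reported success.
if iconv -f ISO-8859-1 -t UTF-8 "$tmpdir/latin1.csv" > "$tmpdir/utf8.csv"; then
    status=ok
else
    status=failed
fi

# In UTF-8 the accented character takes two bytes (0xC3 0xA9) instead of one,
# so the converted file should be one byte longer than the original.
latin1_bytes=$(wc -c < "$tmpdir/latin1.csv" | tr -d ' ')
utf8_bytes=$(wc -c < "$tmpdir/utf8.csv" | tr -d ' ')
echo "$status: $latin1_bytes bytes -> $utf8_bytes bytes"
```

Comparing byte counts is only a quick sanity check; it shows that something was re-encoded, not that the source encoding was guessed correctly.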

For safety’s sake, I’m going to use a system() wrapper for this, so the perl would be:

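# Shell out to iconv (the redirect needs a shell) and stop if the command fails.
system("/usr/bin/iconv -f ISO-8859-1 -t UTF-8 $infile >$outfile") == 0 or die "iconv failed: $?";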
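One thing to watch when shelling out is that a failed conversion can leave a truncated output file behind. A rough sketch of a shell helper that guards against that follows; the function name convert_to_utf8 and the temporary file naming are assumptions for illustration, not part of iconv.

```shell
# Hypothetical helper: convert_to_utf8 FROM_ENCODING INFILE OUTFILE
convert_to_utf8() {
    from=$1
    infile=$2
    outfile=$3
    tmp="$outfile.tmp.$$"

    # Write to a temporary file first so a failed run never leaves
    # a half-written output file in place.
    if iconv -f "$from" -t UTF-8 "$infile" > "$tmp"; then
        mv "$tmp" "$outfile"
    else
        rm -f "$tmp"
        echo "iconv failed for $infile" >&2
        return 1
    fi
}

# Example call (file names taken from the command above):
#   convert_to_utf8 ISO-8859-1 myansifile.csv myutf8file.csv
```

The calling script then only has to check a single exit status, and the output file either holds a complete conversion or does not exist.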

To list all the encodings iconv knows about (important to know if you are going to specify the correct label on the command line), you need to run it with the “--list” or “-l” switch (Solaris 8 only recognizes “-l”). Here’s what I got on Ubuntu Intrepid (heavily excerpted):

me@mybox:~$ iconv --list
The following list contain all the coded character sets known. This does not necessarily mean that all combinations of these names can be used for the FROM and TO command line parameters. One coded character set can be listed with several different names (aliases).
…
ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-9E, ISO-8859-10, ISO-8859-11,
…
L7, L8, L10, LATIN-9, LATIN-GREEK-1, LATIN-GREEK, LATIN1, LATIN2, LATIN3, LATIN4, LATIN5, LATIN6, LATIN7, LATIN8, LATIN10, LATINGREEK, LATINGREEK1,
…
UCS-4, UCS-4BE, UCS-4LE, UCS2, UCS4, UHC, UJIS, UK, UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF7, UTF8, UTF16, UTF16BE, UTF16LE, UTF32, UTF32BE, UTF32LE, VISCII, WCHAR_T, WIN-SAMI-2, WINBALTRIM, WINDOWS-31J, WINDOWS-874, WINDOWS-936, WINDOWS-1250, WINDOWS-1251, WINDOWS-1252, WINDOWS-1253, WINDOWS-1254, WINDOWS-1255, WINDOWS-1256,
…

Well, you get the idea. All the usual suspects are covered. Which just about makes my day.
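Rather than scanning that list by eye, a script can simply ask iconv whether it accepts a given label. The sketch below does this by converting empty input, which should fail only when the name is unknown. The function name encoding_supported and the bogus label NO-SUCH-ENCODING are made up for the example, and which labels are accepted will vary from one platform to the next.

```shell
# Hypothetical helper: succeeds if iconv accepts NAME as a source encoding.
encoding_supported() {
    # Converting empty input does no real work, but iconv still
    # rejects encoding names it does not recognize.
    iconv -f "$1" -t UTF-8 < /dev/null > /dev/null 2>&1
}

for name in ISO-8859-1 WINDOWS-1252 NO-SUCH-ENCODING; do
    if encoding_supported "$name"; then
        echo "$name: supported"
    else
        echo "$name: not supported"
    fi
done
```

This avoids parsing the listing itself, whose layout differs between iconv implementations.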