Charsets - skilchen/bots GitHub Wiki

Character sets, encoding and Unicode

The good news: in general Bots handles character sets well.
Incoming files are read using the character-set specified in the 'incoming' syntax, outgoing files are written using the character-set as specified in the 'outgoing' syntax. You do not need to do anything in the mapping, Bots does the character-set conversion automatically.

Notes:

Lots of information about this topic can be found on wikipedia.
Information about the different character-sets in Python is on the Python site (the documentation of the codecs module lists the standard encodings).
Outgoing character sets can be set in syntax; this can be done on envelope level, messagetype level or per partner.
Sometimes character-set conversion is not possible, eg utf-8->ascii, as utf-8 can contain characters that are not in ascii.
For edifact there is an option to do character-set mapping (eg é->e); this is also possible for x12 (bit more work).
Note that x12 can be tricky: the separators used cannot occur in the data. There is no good workaround for this; the best way is to change the separators used.
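
As a rough illustration of two of the notes above (plain Python, not Bots code): converting to ascii fails as soon as the text holds a character outside ascii, and a check for separators inside x12 data is easy to sketch. The helper name `separators_in_data` and the sample separators `*~:` are assumptions made for this example only.

```python
text = "café"

# utf-8 can represent every character, so this always works.
utf8_bytes = text.encode("utf-8")

# ascii has only 128 characters; 'é' is not one of them.
try:
    text.encode("ascii")
    ascii_possible = True
except UnicodeEncodeError:
    ascii_possible = False

# Forcing the conversion loses information: 'é' becomes '?'.
lossy = text.encode("ascii", errors="replace")


def separators_in_data(values, separators="*~:"):
    """Return the separators that occur inside any data value (hypothetical helper)."""
    return sorted({sep for sep in separators for value in values if sep in value})
```

A non-empty result from `separators_in_data` would be a reason to pick other separators for that partner.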

Specifics of different character-sets

Most used in edi are ascii, iso-8859-1 and utf-8 (xml!).
Most familiar is ascii. Note that ascii has only 128 characters! One character is one byte.
The extended ascii character-sets.
- One character is always one byte.
- The upper 128 (above ascii) are used as special characters (eg éëè).
- These different character-sets are about displaying (on screen, print etc). If the text is in iso-8859-1 but displayed as eg IBM850, it looks 'wrong'. Note that the content (the bytes) is still the same; it's only the display that is different.
- In general these extended ascii character-sets will not be problematic in translations, as segment ID's/tags are in ascii and one character is one byte. In Bots the content of the message will just be fetched and passed to the outgoing message. So if an iso-8859-3 file is handled as iso-8859-1, that will generally not be problematic.
- Examples: iso-8859-1, windows-1252, iso-8859-2, iso-8859-3, IBM850, UNOC, latin1.
Unicode.
- Examples: utf-8, utf-16, utf-32, UCS-2
- Unicode is designed to accommodate many more characters, think of eg Greek, Japanese, Chinese and Klingon characters.
- One character is not always one byte (the different Unicode character-sets use different representation schemes).
- Much used is utf-8. Advantage of utf-8 is that the first 128 (ascii) characters are one byte; this is why utf-8 is 'upward compatible' with ascii: a file with only ascii characters is the same in ascii and utf-8.
- In Microsoft environments often utf-16 (and sometimes utf-32) is used.
- Fixed files can not use the utf-8 character-set, as one character might be more than one byte.
EBCDIC: related to (extended) ascii in that one character is one byte. There are some variations of EBCDIC (eg extended EBCDIC).
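
The points above can be checked with a few lines of plain Python (a small sketch, independent of Bots; the sample strings are made up): the same byte displays differently under two extended ascii character-sets, utf-8 uses a variable number of bytes per character, and a fixed-width field no longer has a fixed byte length in utf-8.

```python
# One byte, two extended ascii character-sets: the content is identical,
# only the interpretation (display) differs.
raw = b"\xe9"
as_latin1 = raw.decode("iso-8859-1")
as_cp850 = raw.decode("cp850")  # Python's name for IBM850

# utf-8: ascii characters take one byte, other characters take more.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ("e", "é", "€")}

# A file with only ascii characters is byte-for-byte the same in ascii and utf-8.
ascii_only = "UNH+1+ORDERS"
same_bytes = ascii_only.encode("ascii") == ascii_only.encode("utf-8")

# A fixed-width field of 6 characters: 6 bytes in iso-8859-1, but not in utf-8.
field = "café".ljust(6)
latin1_len = len(field.encode("iso-8859-1"))
utf8_len = len(field.encode("utf-8"))
```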

A nasty situation is eg when one partner sends Unicode (eg utf-8) and another sends extended ascii (eg iso-8859-1). The extended characters are encoded quite differently in these character-sets, so Bots has to know which character-set is used before reading the file. Best way to solve this is to receive these files via different routes (or route parts). Bots does no 'guessing' of character-sets, as this is not appropriate for business data.
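
Why the character-set must be known up front can be shown in plain Python. This is only a sketch: the route names and the `ROUTE_CHARSETS` table are hypothetical stand-ins for the syntax settings that each Bots route would use.

```python
# Hypothetical route names; in Bots the character-set comes from the syntax used per route.
ROUTE_CHARSETS = {
    "partner_a_in": "utf-8",
    "partner_b_in": "iso-8859-1",
}


def read_incoming(raw: bytes, route: str) -> str:
    """Decode with the character-set configured for the route; never guess."""
    return raw.decode(ROUTE_CHARSETS[route])


utf8_file = "café".encode("utf-8")
latin1_file = "café".encode("iso-8859-1")

# utf-8 bytes read as iso-8859-1: no error, but the text is silently garbled.
garbled = utf8_file.decode("iso-8859-1")

# iso-8859-1 bytes read as utf-8: the lone 0xE9 byte is not valid utf-8.
try:
    latin1_file.decode("utf-8")
    wrong_charset_detected = False
except UnicodeDecodeError:
    wrong_charset_detected = True
```

The first mismatch is the dangerous one: nothing fails, the data is just wrong. That is the reason to separate the partners by route instead of guessing.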

x12 incoming

The character set as specified in the syntax of the envelope is used; if it is not set there, the default value of grammar.py is used (ascii).

Some typical issues and solutions:

If a partner sends x12 as eg iso-8859-1, just specify this in the syntax of the envelope (usersys/grammars/x12/x12.py). Note that this is a system-wide setting: it will be used for all incoming x12. This should not be a problem, as iso-8859-1 is a superset of ascii. The character-set of outgoing files should also be set to handle the extended characters. Also check your ERP software: what can it handle?
The same solution works for utf-8.
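
A minimal sketch of what such a setting might look like in the envelope grammar. It assumes the syntax entry is named `charset`; check the grammar file in your own installation before relying on this, and leave the other entries in that file untouched.

```python
# Fragment of usersys/grammars/x12/x12.py (envelope grammar); other content omitted.
# Assumption: the syntax key for the character-set is called 'charset'.
syntax = {
    'charset': 'iso-8859-1',   # or 'utf-8' if the partner sends utf-8
}
```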

edifact incoming

Edifact has its own character-sets: UNOA, UNOB, UNOC, etc.
In default bots setup:

UNOA and UNOB have own character-set mapping in usersys/charsets.
other edifact character-sets are aliased in config/bots.ini, section charsets.
default UNOA and UNOB are not 100% strict but allow some extra characters like .
there are some variations of default UNOA/UNOB in sourceforge downloads.

Some typical issues and solutions:

Edifact files have 'UNOA' in them, but in fact the partner sends more (UNOC). Solution: use 'unoa_like_unoc.py' from downloads, save in usersys/charsets and rename to unoa.py
Partner sends UNOC character-set, but my system can only handle ascii. Solution: use 'unoa_like_unoc.py' from downloads; if you open this file you can see a character mapping (note that the incoming mapping is different from the outgoing mapping). You can map eg incoming é->e, etc.
Our system uses iso-8859-1, but partner can handle only UNOB. Solution: use 'unoa_like_unoc.py' from downloads; if you open this file you can see a character mapping (note that the incoming mapping is different from the outgoing mapping). You can map eg outgoing é->e, etc.
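
To give an idea of how such a character mapping works, here is a sketch built on Python's standard charmap codec functions. It assumes the charset files follow the usual Python charmap pattern with separate decoding (incoming) and encoding (outgoing) maps; the map contents and function names below are made up for illustration and are not copied from 'unoa_like_unoc.py'.

```python
import codecs

# Incoming (decoding): byte value -> Unicode code point.
# Start with plain ascii, then map the iso-8859-1 bytes for é, è, ë to 'e'.
decoding_map = codecs.make_identity_dict(range(128))
decoding_map.update({0xE9: ord("e"), 0xE8: ord("e"), 0xEB: ord("e")})

# Outgoing (encoding): Unicode code point -> byte value.
# This is a separate map, so incoming and outgoing can differ.
encoding_map = codecs.make_identity_dict(range(128))
encoding_map.update({ord("é"): ord("e"), ord("è"): ord("e"), ord("ë"): ord("e")})


def decode_incoming(raw: bytes) -> str:
    """Decode incoming bytes, replacing mapped accented characters."""
    text, _ = codecs.charmap_decode(raw, "strict", decoding_map)
    return text


def encode_outgoing(text: str) -> bytes:
    """Encode outgoing text; unmapped characters raise UnicodeEncodeError."""
    raw, _ = codecs.charmap_encode(text, "strict", encoding_map)
    return raw
```

With 'strict' error handling a character that is not in the map still raises an error, so every character the partner or your own system really uses needs its own entry.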