Shift JIS - trigger-segfault/TriggersTools.CatSystem2 GitHub Wiki

Shift JIS (Codepage 932)

🚧 This page is a work in progress

CatSystem1 and CatSystem2, like many Japanese visual novel engines, use Shift JIS, the standard Japanese ANSI code page on Windows. This causes many headaches for fan translations, as users may have to emulate the Japanese locale when running these games. Although CatSystem2 appears to run under other system locales, many of its internals still depend on the Shift JIS text encoding.

Terminology

  • Code page, encoding - Represents how characters are stored in or read from data
  • ANSI - The default system code page on Windows
  • System Locale - The current localization on Windows, changes the ANSI code page (different from System Language)
  • Code point - The numeric value assigned to a single character; on this page the term also covers the 1 or 2 bytes that encode it
  • CJK - Chinese, Japanese, and Korean characters
  • Fullwidth, halfwidth - Alternative forms of characters in CJK fonts, sized to match the width of other fixed-width characters

Important notes

  • Shift JIS characters (code points) are always 1 or 2 bytes in length.
  • ASCII Backslashes \ (0x5C) are displayed as a (half-width) Yen sign ¥ in Japanese fonts.
  • The Yen sign ¥ serves the same purpose as the Backslash \ (e.g. escapes, path separator).
  • All ASCII+Control characters (0x00-0x7D), with the exception of the backslash, share the same code points as ASCII.
  • Bytes in the ASCII range (0x40-0x7E) can appear as the second byte of a two-byte code point, so a byte that looks like an ASCII character is not always one.
  • The byte 0x00 can always safely be treated as a null terminator, since it never appears inside a two-byte code point.
  • Full-width spaces (U+3000, bytes 0x81 0x40 in Shift JIS) are commonly found where one may expect a regular space. These are used in CatSystem2 scripts (even English translations) to avoid whitespace-triggered actions and syntaxes (e.g. character names in scripts, choice text).
  • Shift JIS has no support for accented Latin letters (letters with diacritical marks). This is the bane of many localization teams working with more complex alphabets (like Italian, where specifying an accent can make a critical difference). Some teams go the extra step and create a custom font that visually replaces unused Japanese characters with the desired characters.
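
Several of the notes above can be checked directly. The following is a minimal sketch that assumes Python's built-in `cp932` codec is an acceptable stand-in for the code page the engine uses; the helper name `can_encode` is made up for illustration.

```python
# Illustrative checks using Python's built-in "cp932" codec.

# Characters encode to 1 or 2 bytes.
assert len("A".encode("cp932")) == 1
assert len("あ".encode("cp932")) == 2

# The second byte of a two-byte character can fall in the ASCII range:
# 表 encodes to 0x95 0x5C, and 0x5C is also the backslash byte.
hyou = "表".encode("cp932")
print(hyou.hex())        # 955c
print(b"\\" in hyou)     # True, although the text contains no backslash

# The full-width space is U+3000 in Unicode and 0x81 0x40 in Shift JIS.
print("\u3000".encode("cp932").hex())   # 8140


def can_encode(text: str) -> bool:
    """Hypothetical helper: True if every character exists in cp932."""
    try:
        text.encode("cp932")
        return True
    except UnicodeEncodeError:
        return False


# Accented Latin letters have no Shift JIS representation.
print(can_encode("perche"))   # True
print(can_encode("perché"))   # False
```

A check like `can_encode` may be useful for scanning translated script lines before they are written back in Shift JIS.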

Points of failure

  • When converting to and from Shift JIS, Backslashes \ and (half-width) Yen signs ¥ should always be mapped to each other, since both correspond to the byte 0x5C. This is usually done by the converter, but it is a point of failure to be aware of.
  • Shift JIS encoded text is often found mangled in file formats, especially music metadata and older ZIP archives. In these cases, the resulting text often contains invalid Shift JIS characters. It is sometimes possible to reverse this by looking directly at the byte values.
  • Make sure your text editor is reading and writing your files in the correct encoding. Your editor will probably not recognize the encoding unless explicitly told to. Changes saved in this state will mangle the text and make reversing the mistake a difficult process. This is easy to miss when editing XML and script files that have no visible CJK characters in view.
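
As a rough sketch of reversing mangled text, the idea is to turn the garbled string back into the original bytes and decode those bytes as Shift JIS. The function name `repair_mojibake` is hypothetical, and the default of `cp437` is an assumption chosen because Python's `zipfile` module decodes file names that are not flagged as UTF-8 with that code page; other tools may have used a different wrong encoding.

```python
def repair_mojibake(garbled: str, wrong_encoding: str = "cp437") -> str:
    """Undo a wrong decode: recover the raw bytes, then decode as cp932.

    Assumes the text was originally cp932 and was decoded losslessly
    with `wrong_encoding` (no characters were replaced or dropped).
    """
    raw = garbled.encode(wrong_encoding)
    return raw.decode("cp932")


# Simulate a Shift JIS file name that was read as cp437.
original = "日本語.txt"
garbled = original.encode("cp932").decode("cp437")

print(garbled == original)                   # False, the name is mangled
print(repair_mojibake(garbled) == original)  # True
```

This only works while the mangled text still carries every original byte. Once an editor or tool has substituted replacement characters, or the text has been saved again in the wrong encoding, the bytes may no longer be recoverable.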

Helpful snippets

Some helpful snippets taken from other sources and articles

Yen sign > Code Points - Source

In JIS X 0201, of which Shift JIS is an extension, the yen sign has the same byte value (0x5C) as the backslash in ASCII. This standard was widely adopted.

Japanese-language locales of Microsoft operating systems use the code page 932 character encoding, which is a variant of Shift JIS. Hence, 0x5C is displayed as a yen sign in Japanese-locale fonts on Windows.[1] It is nonetheless used wherever a backslash is used, such as the directory separator character (for example, in C:¥) and as the general escape character (¥n).[1] It is mapped onto the Unicode U+005C REVERSE SOLIDUS (i.e. backslash),[2] while Unicode U+00A5 YEN SIGN is given a one-way "best fit" mapping to 0x5C in code page 932,[1] and 0x5C is displayed as a backslash in Microsoft's documentation for code page 932,[3] essentially making it a backslash given the appearance of a yen sign by localized fonts.
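
The mapping described above can be made explicit when encoding. This is a minimal sketch, not a description of any particular converter: the helper name `encode_cp932` and the example path are made up, and the replacement is done by hand so the result does not depend on whether a given codec applies the "best fit" mapping for U+00A5.

```python
YEN_SIGN = "\u00a5"  # Unicode YEN SIGN, distinct from backslash U+005C


def encode_cp932(text: str) -> bytes:
    """Encode text as cp932, writing any Yen sign as the byte 0x5C."""
    return text.replace(YEN_SIGN, "\\").encode("cp932")


# A path typed with Yen signs ends up with plain 0x5C separators.
print(encode_cp932("C:¥games¥cs2"))      # b'C:\\games\\cs2'

# Decoding 0x5C from cp932 yields a backslash, not a Yen sign.
print(b"C:\\games".decode("cp932") == "C:\\games")   # True
```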

Understanding Character Sets > Nonshifting DBCSs - Source

Nonshifting DBCSs use ranges of codepoints, specified by the character set definition, to determine whether a particular byte represents one character or is part of a two-byte character.

In nonshifting DBCSs, the two bytes that are used to form a character are called lead bytes and trail bytes. The lead byte is the first in a two-byte character, and the trail byte is the last. Nonshifting DBCSs differentiate single-byte characters from double-byte characters by the numerical value of the lead byte. For example, in the Japanese Shift-JIS encoding, if a byte is in the range 0x81-0x9F or 0xE0-0xFC, then it is a lead byte and must be paired with the following byte to form a complete character.

The most popular client-side Japanese code page, Shift-JIS, uses this lead byte/trail byte encoding scheme, as do most Microsoft Windows and Unix/Linux ASCII-based double-byte character sets that represent Chinese, Japanese, and Korean characters. Contrary to its name, Shift-JIS is a nonshifting double-byte character set.
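
The lead byte ranges quoted above are enough to walk a Shift JIS byte string one character at a time. The sketch below uses hypothetical function names and does not validate trail bytes; it shows why a search for the raw byte 0x5C has to respect character boundaries.

```python
def is_lead_byte(b: int) -> bool:
    """True if the byte starts a two-byte Shift JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_chars(data: bytes) -> list:
    """Split Shift JIS bytes into 1- or 2-byte characters (no validation)."""
    chars = []
    i = 0
    while i < len(data):
        # A lead byte at the very end has no trail byte; keep it alone.
        width = 2 if is_lead_byte(data[i]) and i + 1 < len(data) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def find_backslashes(data: bytes) -> list:
    """Offsets of 0x5C bytes that are real single-byte characters."""
    offsets = []
    pos = 0
    for ch in split_chars(data):
        if ch == b"\x5c":
            offsets.append(pos)
        pos += len(ch)
    return offsets


data = "表\\ソ".encode("cp932")               # bytes: 95 5C 5C 83 5C
print([c.hex() for c in split_chars(data)])   # ['955c', '5c', '835c']
print(find_backslashes(data))                 # [2]
```

A plain `data.count(b"\x5c")` would report three matches here, two of which are trail bytes of 表 and ソ rather than backslashes.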
