utf.h - acli/trn GitHub Wiki

utf.c and utf.h are a newly written, unit-testable module created to implement logic related to UTF-8 handling. The module supports conversion from a small number of ISO-Latin1-class character sets to UTF-8.

Unit tests for utf.c are in tests/test_utf.c.

Rationale

Trn was originally written on the assumption that every character is a single byte of ASCII. That assumption shows up in practices such as the pervasive use of the ++ operator to advance pointers by one byte and the clearing of “grey space” (control and 8-bit characters), all of which corrupt UTF-8. Adding UTF-8 support therefore means finding every place where the assumption is made. Because UTF-8 is variable-width (and because more character sets will eventually need to be supported), the byte and pointer manipulations needed to process it correctly are too involved to fit neatly into macros, so new functions are needed.
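To illustrate why byte-wise stepping misleads, here is a minimal sketch that counts characters by skipping UTF-8 continuation bytes (bit pattern 10xxxxxx). The function name count_utf8_chars is hypothetical and is not part of utf.c; the sketch also assumes well-formed input.

```c
#include <stddef.h>

/* Hypothetical helper (not in utf.c): counts characters in a
 * well-formed UTF-8 string by counting every byte that is NOT a
 * continuation byte (10xxxxxx). */
size_t count_utf8_chars(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;    /* first byte of a character */
    }
    return n;
}
```

For the string "naïve" encoded as UTF-8, strlen() reports 6 bytes while this function reports 5 characters, so any loop that advances with s++ and counts one column per step is off by one after the "ï".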

Exported types

CODE_POINT

An unsigned long used to represent a Unicode code point.

Exported constants

Numeric identifiers for supported character sets

Constant	Meaning
CHARSET_ASCII	The US-ASCII character set.
CHARSET_ISO8859_1	The ISO 8859-1 (Latin1) character set.
CHARSET_ISO8859_15	The ISO 8859-15 (Latin9) character set.
CHARSET_UNKNOWN	An unknown character set.
CHARSET_UTF8	The UTF-8 character set.
CHARSET_WINDOWS_1252	The Windows-1252 character set.

String labels for supported character sets

Constant	Meaning
TAG_ASCII	The US-ASCII character set.
TAG_UTF8	The UTF-8 character set.
TAG_ISO8859_1	The Latin1 character set.
TAG_ISO8859_15	The Latin9 character set.
TAG_WINDOWS_1252	The Windows-1252 character set.

Others

Constant	Meaning
INVALID_CODE_POINT	An invalid code point.

Exported functions

at_norm_char

bool at_norm_char(const char *s)
s: string to check
returns: whether the character at *s should not be replaced by a space

Drop-in replacement for the AT_NORM_CHAR macro in util.h. Checks whether the (potentially non-ASCII) character at *s is a “normal” character (i.e., should not be replaced by a space). Returns 1 if the character at *s is “normal”, 0 if not.

This should be called through either the AT_NORM_CHAR or AT_GREY_SPACE macros in util.h, which have been modified to use the at_norm_char() function.

byte_length_at

int byte_length_at(const char *s)
s: string to check
returns: number of bytes taken up by the character at *s

Determines how many bytes the (potentially non-ASCII) character at *s takes up. Returns an int from 0 to 6.

(0 should only ever be returned if s is NULL; otherwise byte_length_at should return a value from 1 to 6.)
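The length rule described above can be sketched from the lead byte alone. This is an illustration, not the code in utf.c: the name sketch_byte_length_at is hypothetical, and returning 1 for a stray continuation byte or for 0xFE/0xFF is an assumption made so that a caller always advances.

```c
#include <stddef.h>

/* Hypothetical stand-in for byte_length_at(); not the code in utf.c.
 * Classifies the lead byte using the original 1-to-6-byte UTF-8 scheme. */
int sketch_byte_length_at(const char *s)
{
    unsigned char lead;

    if (s == NULL)
        return 0;                           /* the only case that yields 0 */
    lead = (unsigned char)*s;
    if (lead < 0x80)           return 1;    /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;    /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;    /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;    /* 11110xxx */
    if ((lead & 0xFC) == 0xF8) return 5;    /* 111110xx: legacy form */
    if ((lead & 0xFE) == 0xFC) return 6;    /* 1111110x: legacy form */
    return 1;   /* assumption: malformed lead byte, step over one byte */
}
```

A caller would then advance with s += sketch_byte_length_at(s) instead of s++.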

code_point_at

CODE_POINT code_point_at(const char *s)
s: string to check
returns: code point at start of s

Returns the Unicode code point for the character at *s, as an unsigned long. Returns INVALID_CODE_POINT if s is NULL or if *s contains a bit pattern that’s invalid for UTF-8.
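A decoder with this contract might look like the sketch below. All names prefixed with SKETCH_ or sketch_ are hypothetical, the value chosen for the invalid marker is an assumption (the real INVALID_CODE_POINT is defined in utf.h), and the sketch does not reject overlong encodings.

```c
#include <stddef.h>

typedef unsigned long SKETCH_CODE_POINT;            /* mirrors CODE_POINT */
#define SKETCH_INVALID ((SKETCH_CODE_POINT)-1)      /* assumed value */

/* Hypothetical stand-in for code_point_at(); not the code in utf.c. */
SKETCH_CODE_POINT sketch_code_point_at(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    SKETCH_CODE_POINT cp;
    int extra, i;

    if (p == NULL)
        return SKETCH_INVALID;
    if (p[0] < 0x80)
        return p[0];                                 /* plain ASCII */
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else if ((p[0] & 0xFC) == 0xF8) { cp = p[0] & 0x03; extra = 4; }
    else if ((p[0] & 0xFE) == 0xFC) { cp = p[0] & 0x01; extra = 5; }
    else
        return SKETCH_INVALID;      /* continuation byte or 0xFE/0xFF */

    for (i = 1; i <= extra; i++) {
        /* Each trailing byte must be 10xxxxxx. A '\0' fails this test,
         * so the loop never reads past the end of the string. */
        if ((p[i] & 0xC0) != 0x80)
            return SKETCH_INVALID;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    return cp;
}
```

For example, the three bytes E2 82 AC decode to 0x20AC (the euro sign), while a lone 0x80 yields the invalid marker.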

input_charset_name

const char *input_charset_name();

Returns a short label representing the currently active input character set. Used in current_charsubst() in charsubst.c to display the current conversion on screen and in decode_header() in cache.c to save system state.

insert_unicode_at

int insert_unicode_at(char *s, CODE_POINT c)
s: buffer to modify
c: Unicode code point to insert
returns: number of bytes written

Inserts one UTF-8 character with code point c at the buffer pointed to by s; returns an int representing the number of bytes written. Caller must ensure there is enough space for the worst-case of 6 bytes plus 1 (6 for the character, 1 for the terminating '\0'). This is used for implementing numerical character references.
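The encoding step and the 7-byte buffer requirement can be sketched as follows. The name sketch_insert_unicode_at is hypothetical; the sketch assumes that the terminating '\0' is written but not counted in the return value, and that 0 is returned for a NULL buffer or a code point too large for 6 bytes. None of these details are confirmed by utf.c.

```c
#include <stddef.h>

typedef unsigned long SKETCH_CODE_POINT;    /* mirrors CODE_POINT */

/* Hypothetical stand-in for insert_unicode_at(); not the code in utf.c.
 * Writes c as UTF-8 at s, appends '\0', returns the byte count
 * excluding the terminator. s must have room for 7 bytes. */
int sketch_insert_unicode_at(char *s, SKETCH_CODE_POINT c)
{
    int n, i;
    unsigned char lead;

    if (s == NULL)
        return 0;
    if (c < 0x80UL)            { n = 1; lead = 0x00; }
    else if (c < 0x800UL)      { n = 2; lead = 0xC0; }
    else if (c < 0x10000UL)    { n = 3; lead = 0xE0; }
    else if (c < 0x200000UL)   { n = 4; lead = 0xF0; }
    else if (c < 0x4000000UL)  { n = 5; lead = 0xF8; }
    else if (c < 0x80000000UL) { n = 6; lead = 0xFC; }
    else
        return 0;               /* does not fit the 6-byte scheme */

    /* Fill continuation bytes from the end, 6 payload bits each. */
    for (i = n - 1; i > 0; i--) {
        s[i] = (char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    s[0] = (char)(lead | c);    /* remaining bits go in the lead byte */
    s[n] = '\0';
    return n;
}
```

A caller expanding a numerical character reference such as &#8364; would declare char buf[7] and pass 8364 (0x20AC), receiving 3 and the bytes E2 82 AC.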

output_charset_name

const char *output_charset_name();

Returns a short label representing the currently active output character set. Used in current_charsubst() in charsubst.c to display the current conversion on screen and in decode_header() in cache.c to save system state.

put_char_adv

int put_char_adv(char **sp, bool_int outputok)
sp: pointer to string to output
outputok: whether to actually output anything
returns: number of character cells written (or that would have been written)

Displays one UTF-8 character at **sp, then increments the pointer at *sp by the number of bytes the character takes up, then returns the number of character cells the character takes up. If outputok is FALSE the character is not actually written to stdout.

This (e.g., w = put_char_adv(&s, TRUE);) is intended as a not-quite-drop-in replacement for both putchar(*s++) and i = putsubstchar(c, limit, outputok).

Example:

Old code	Unicode-safe equivalent
putchar(*s);	i += put_char_adv(&s, TRUE) - 1; s--;

Note: The existing code often assumes counters and pointers are always incremented by one. Until these assumptions are completely rooted out, it is necessary to resort to kludges like the s-- in the example.
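Once a loop no longer assumes one byte per step, the kludge disappears and the call can drive the loop directly. The sketch below shows that pattern with a stand-in function, since the real put_char_adv() lives in utf.c. Both function names are hypothetical, the stand-in assumes well-formed UTF-8 and a width of one cell for every character, and a plain int is used in place of bool_int.

```c
#include <stdio.h>

/* Stand-in, NOT trn's put_char_adv(): assumes well-formed UTF-8 and
 * one character cell per character. Must not be called at '\0'. */
static int sketch_put_char_adv(char **sp, int outputok)
{
    char *s = *sp;
    int len = 1;

    /* Take in the continuation bytes (10xxxxxx) after the first byte. */
    while (((unsigned char)s[len] & 0xC0) == 0x80)
        len++;
    if (outputok)
        fwrite(s, 1, (size_t)len, stdout);
    *sp = s + len;      /* advance the caller's pointer by whole bytes */
    return 1;           /* simplified width: always one cell */
}

/* Prints s one character at a time; returns the cells consumed. */
int sketch_print_line(char *s, int outputok)
{
    int col = 0;

    while (*s != '\0')
        col += sketch_put_char_adv(&s, outputok);
    return col;
}
```

Here the column counter advances by the returned width and the pointer by the byte length, so neither needs a compensating - 1 or s--.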

utf_init

int utf_init(const char *f, const char *t);
f: input character set
t: output character set (must be "utf-8")
returns: an int representing the active input character set

Sets the input character set to f, if supported, and output character set to t, if supported. Returns an int representing the currently active input character set, which could correspond to either f (if supported) or the previously active input character set.

visual_length_of

int visual_length_of(const char *s)
s: string to measure
returns: number of character cells the string takes up on screen

Determines how many character cells the (potentially non-ASCII) string at s takes up on screen, as an int. The byte equivalent is just strlen().

visual_width_at

int visual_width_at(const char *s)
s: string to check
returns: number of character cells taken up by the character at *s

Determines how many character cells the (potentially non-ASCII) character at *s takes up on screen. Returns an int from 0 to 2.
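The relationship between the two width functions can be sketched as a per-character width summed over the string. Everything below is hypothetical and heavily simplified: only the combining diacritical marks U+0300 to U+036F are treated as width 0 and only the CJK unified ideographs U+4E00 to U+9FFF as width 2, whereas a real implementation would need a much larger table. It is not the code in utf.c.

```c
#include <stddef.h>

/* Decodes one UTF-8 sequence of up to 4 bytes and stores its byte
 * length in *len (always at least 1 when p[0] is not '\0'). */
static unsigned long sketch_decode(const unsigned char *p, int *len)
{
    unsigned long cp;
    int i;

    if (p[0] < 0x80)                { cp = p[0];        *len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; *len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; *len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; *len = 4; }
    else                            { cp = p[0];        *len = 1; }
    for (i = 1; i < *len; i++) {
        if ((p[i] & 0xC0) != 0x80) {    /* truncated sequence: stop here */
            *len = i;
            break;
        }
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    return cp;
}

/* Simplified width rule returning 0, 1 or 2 cells. */
static int sketch_width(unsigned long cp)
{
    if (cp >= 0x0300UL && cp <= 0x036FUL) return 0;  /* combining marks */
    if (cp >= 0x4E00UL && cp <= 0x9FFFUL) return 2;  /* CJK ideographs */
    return 1;
}

/* Hypothetical stand-in for visual_length_of(): sums per-character widths. */
int sketch_visual_length_of(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int cells = 0, len;

    if (p == NULL)
        return 0;
    while (*p != '\0') {
        cells += sketch_width(sketch_decode(p, &len));
        p += len;
    }
    return cells;
}
```

Under these rules the 6-byte string for two CJK ideographs measures 4 cells, and "e" followed by a combining acute accent (3 bytes) measures 1 cell, which is why strlen() cannot be used for column alignment.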