Unicode - adesutherland/CREXX GitHub Wiki

Normalisation Algorithms

The REXX code is a code generator that outputs C code for the re2c lexer generator, with the goal of producing efficient Unicode processors. The script reads the Unicode data file, processes attributes such as canonical combining classes and decompositions, and generates C code that encodes these attributes as a series of character classes and conditions.

NFD Normalizer

Preprocessing Steps

  1. Read Unicode Data File: Parse each line from the Unicode data file to extract various fields.

    • Reference: Unicode Standard, Section 4.1, "Unicode Character Database"
  2. Handle Hangul Syllables: When a line marks the start of the Hangul syllable range, algorithmically add all Hangul syllables to the list of code points.

    • Reference: Unicode Standard, Section 3.12, "Conjoining Jamo Behavior"
  3. Store Record: Store attributes like name, general category, canonical combining class, etc., for each code point.

    • Reference: Unicode Standard, Section 4.1, "Unicode Character Database"
  4. Process Decompositions: Recursively process the canonical and compatibility decompositions for each code point.

    • Reference: Unicode Standard, Section 3.7, "Decomposition"
  5. Generate Character Classes: Create character classes for characters grouped by their canonical combining class.

    • Reference: Unicode Standard, Section 3.6, "Combining Characters"
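
The Hangul expansion in step 2 is purely arithmetic, so no per-syllable records are needed. A minimal C sketch of the standard's decomposition formula (Section 3.12) might look like this; the constant names follow the standard, while the function name is illustrative:

```c
#include <assert.h>

/* Algorithmic Hangul syllable decomposition (Unicode Standard, Section 3.12). */
enum { SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
       VCount = 21, TCount = 28, NCount = VCount * TCount /* 588 */ };

/* Decompose one precomposed syllable into L, V, and optional T jamo.
   Returns the number of code points written to out (2 or 3). */
static int hangul_decompose(unsigned s, unsigned out[3]) {
    unsigned index = s - SBase;
    out[0] = LBase + index / NCount;             /* leading consonant */
    out[1] = VBase + (index % NCount) / TCount;  /* vowel */
    if (index % TCount == 0) return 2;           /* no trailing consonant */
    out[2] = TBase + index % TCount;
    return 3;
}
```

For example, U+AC00 (가) decomposes to the two jamo U+1100 U+1161, while U+AC01 (각) gains the trailing consonant U+11A8.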

Main Loop

  1. Start Main Loop: Iterate through the entire input string.
    • Reference: Unicode Standard, Section 3.11, "Normalization Forms"

Handling Different Cases

  1. Characters with No Canonical Decomposition: Identify code points that have no canonical decomposition.

    • Action: Go to the starter label if the Canonical Combining Class (CCC) is zero, or to the single label otherwise.
    • Reference: Unicode Standard, Section 3.6, "Combining Characters"
  2. Characters with Canonical Decomposition: Identify code points with a canonical decomposition.

    • Action: Navigate to one of the following labels based on decomposition and CCCs: starter, single, starter_then_non_starter, or complex.
    • Reference: Unicode Standard, Section 3.7, "Decomposition"
  3. Complex Cases: Handle characters with complex canonical decompositions involving multiple code points with different CCCs.

    • Action: Navigate to the complex label.
    • Reference: Unicode Standard, Section 3.11, "Normalization Forms"
  4. Starter Followed by Non-Starter: Manage sequences where a starter is followed by one or more non-starters.

    • Action: Navigate to the starter_then_non_starter label.
    • Reference: Unicode Standard, Section 3.11, "Normalization Forms"
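
The decision order above can be sketched as a small C dispatch function. The flag arguments stand in for properties that the generated re2c character classes encode directly; this illustrates the branching logic, not the generated code itself:

```c
#include <assert.h>

typedef enum { STARTER, SINGLE, STARTER_THEN_NON_STARTER, COMPLEX } target;

/* Which branch target the main loop jumps to for one code point.
   The flags are hypothetical stand-ins for the hardcoded property tables. */
static target choose_target(int ccc, int has_decomp,
                            int mixed_ccc, int trailing_non_starter) {
    if (!has_decomp)                       /* case 1: no decomposition */
        return ccc == 0 ? STARTER : SINGLE;
    if (mixed_ccc)                         /* case 3: multiple CCCs */
        return COMPLEX;
    if (trailing_non_starter)              /* case 4: starter + non-starter(s) */
        return STARTER_THEN_NON_STARTER;
    return STARTER;                        /* case 2: decomposes to starters */
}
```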

Error Handling and End of Input

  1. Anything Else: Catch any characters that don't fit into the defined categories, print an error message, and return 1.

  2. Invalid Input: Similar to the above but specifically for invalid input. Print an error message and return 1.

  3. End of Input: Handle the end of the input stream by breaking the loop.

Branch Targets for Special Cases

  1. Starter: Handle single codepoint starters and sequences that begin and end with a starter.

  2. Single: Manage single codepoint non-starters by adding them to a queue sorted by CCC.

  3. Starter Then Non-Starter: Handle sequences that start with a starter and are followed by a non-starter.

  4. Complex: Manage complex cases, such as code points before a starter, between starters, or following the last starter.

Final Write-Out

  1. Write Out Any Queued Code Points: After all the looping and branching, write out any remaining queued code points.
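
The single branch target's queue, sorted by CCC, is the Canonical Ordering Algorithm in miniature: a stable insertion that keeps equal CCC values in input order, as the standard requires. A sketch (queue_insert and the mark struct are illustrative names):

```c
#include <assert.h>

#define QMAX 32
typedef struct { unsigned cp; int ccc; } mark;

/* Stable insertion into the pending combining-mark queue, ordered by CCC
   (Canonical Ordering Algorithm, Unicode Standard Section 3.11).
   Marks with equal CCC keep their relative input order. */
static void queue_insert(mark q[], int *n, mark m) {
    int i = *n;
    /* shift right past any mark with a strictly greater CCC */
    while (i > 0 && q[i - 1].ccc > m.ccc) { q[i] = q[i - 1]; i--; }
    q[i] = m;
    (*n)++;
}
```

Inserting U+0301 (CCC 230), U+0323 (CCC 220), then U+0300 (CCC 230) yields the canonically ordered sequence U+0323 U+0301 U+0300, with the two CCC-230 marks still in their original order.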

Performance Optimization

  • The use of goto labels and pre-computed arrays optimizes performance.

  • Utilizing re2c for generating the state machine ensures high efficiency in the normalization process.

By automating the tedious task of writing lexer conditions and actions for Unicode normalization, this REXX script makes it easier to create an efficient and accurate NFD normalizer.

NFKD Normalizer: A Canonical vs. Compatibility Variant of NFD

The NFKD (Normalization Form KD) normalizer is fundamentally similar to the NFD (Normalization Form D) normalizer in its algorithmic framework. Both are designed to decompose Unicode characters into their constituent parts for easier processing. However, the key difference lies in the types of decompositions they handle:

Canonical vs. Compatibility Decompositions

  • NFD: Focuses solely on Canonical Decompositions, adhering to the Unicode Standard, Section 3.7, "Decomposition."

  • NFKD: Extends this to include Compatibility Decompositions, in line with the Unicode Standard, Section 3.7, "Decomposition."

This distinction allows NFKD to handle a broader range of Unicode characters, including those that have compatibility mappings, making it more versatile for certain applications.

By understanding this core difference, one can appreciate how NFKD serves as an extension of NFD, built on the same algorithmic foundation but designed to handle additional types of Unicode characters.

NFC Normalizer

Logical Overview

Processing Principles

  1. Initialize State Machine: Utilize re2c to generate a state machine tailored for Unicode normalization tasks, adhering to the principles of optimization and efficiency.

  2. UTF-8 Encoding: Ensure that the input string is encoded in UTF-8, prioritizing this encoding format for all text processing steps.

  3. Compile-time Hardcoding: Use REXX pre-processors to generate C code with hardcoded Unicode properties, thereby eliminating the need for runtime database lookups.

New Definitions Required by This Algorithm (Not in the Unicode Standard)

Inert Starter

An Inert Starter is a Starter character that does not participate in any canonical composition to form a Primary Composite. In other words, it is a Starter character for which there exists no sequence <Inert Starter, C> or <L, Inert Starter> that can be replaced by a Primary Composite in the Canonical Composition Algorithm.

Leftmost Inert Starter

A Leftmost Inert Starter is a Starter character that does not participate in any canonical composition as the first character in a sequence to form a Primary Composite. In other words, it is a Starter character for which there exists no sequence <Leftmost Inert Starter, C> that can be replaced by a Primary Composite in the Canonical Composition Algorithm.

Rightmost Inert Starter

A Rightmost Inert Starter is a Starter character that does not participate in any canonical composition as the second character in a sequence to form a Primary Composite. In essence, it is a Starter character for which there exists no sequence <L, Rightmost Inert Starter> that can be replaced by a Primary Composite in the Canonical Composition Algorithm.

Inert Non-Starter

An Inert Non-Starter is a non-starter character (Canonical Combining Class > 0) that does not participate in any canonical composition as either the preceding or succeeding character in a composition pair. While non-starters are generally found to the right of the starter they combine with in the Canonical Composition Algorithm, an Inert Non-Starter will not combine with any starter or non-starter, regardless of its position relative to them.
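
These four definitions reduce to two questions about a code point: does it ever appear as the first (L) or as the second (C) character of a canonically composable pair? A sketch using a hypothetical two-bit flag table (the real generator hardcodes these as re2c character classes):

```c
#include <assert.h>

/* Hypothetical per-code-point flags: whether the character ever appears
   as the first or the second element of a primary-composite pair. */
enum { LEFT_COMPOSES = 1, RIGHT_COMPOSES = 2 };

static int is_inert_starter(int flags, int ccc) {
    return ccc == 0 && (flags & (LEFT_COMPOSES | RIGHT_COMPOSES)) == 0;
}
static int is_leftmost_inert_starter(int flags, int ccc) {
    return ccc == 0 && (flags & LEFT_COMPOSES) == 0;
}
static int is_rightmost_inert_starter(int flags, int ccc) {
    return ccc == 0 && (flags & RIGHT_COMPOSES) == 0;
}
static int is_inert_non_starter(int flags, int ccc) {
    return ccc > 0 && (flags & (LEFT_COMPOSES | RIGHT_COMPOSES)) == 0;
}
```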

Input Analysis and Decomposition

  1. Read and Analyze Input: Read the UTF-8 encoded input string and use the re2c-generated state machine to determine if it can be short-circuited or requires full decomposition.

    • Short-Circuit Cases: Identify sequences that are already in their composed form or that can be transformed without full decomposition, and output them as-is or with minimal changes.

    • Full Decomposition: For other cases, decompose the string into its constituent code points. If multiple combining characters are present, sort them by their Canonical Combining Class (CCC) per the Canonical Ordering Algorithm in Section 3.11 of the Unicode Standard.
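
A simple example of a short-circuit test: every code point below U+0300 has CCC 0 and is left unchanged by NFC, so a string made entirely of such code points can be emitted as-is. The generated character classes are finer-grained than this, but the shape of the check is the same:

```c
#include <assert.h>

/* Fast path: a string whose code points all lie below U+0300 contains
   no combining marks and no NFC-changing characters, so it is already
   in NFC.  This is a coarse sketch of one short-circuit case. */
static int nfc_fastpath(const unsigned *s, int n) {
    for (int i = 0; i < n; i++)
        if (s[i] >= 0x0300) return 0;  /* needs the full state machine */
    return 1;                          /* already NFC: emit unchanged */
}
```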

Composition and Output Generation

  1. Initialize Composition Buffer: Create a dynamic buffer to hold code points as they are processed, aiming for possible composition.

  2. Dynamic Composition and Buffer Management:

    • Add decomposed code points to the buffer.

    • Continuously check for composition opportunities based on predefined rules, which are hardcoded via REXX pre-processors.

    • If a composition is possible, replace the relevant code points in the buffer with the composed code point.

    • Upon encountering a starter code point, flush the buffer to the output string, as it marks the end of a possible composition sequence.

  3. Generate and Validate Output: Construct the output string from the buffer, ensuring it is in a normalized form. Implement error handling mechanisms based on Section 3.11 of the Unicode Standard to ensure compliance and robustness.

By adhering to this logical flow, the algorithm aims to provide an efficient, optimized, and Unicode Standard-compliant method for text normalization.
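
The buffer-management steps above might look like the following sketch. Here compose_pair is a one-entry stand-in for the hardcoded pair table (A + COMBINING GRAVE ACCENT forms U+00C0), and blocking by intervening marks is ignored for brevity:

```c
#include <assert.h>

#define BMAX 16
typedef struct { unsigned cp[BMAX]; int n; } comp_buffer;

/* One-entry stand-in for the hardcoded composition pair table:
   A + COMBINING GRAVE ACCENT -> U+00C0 (LATIN CAPITAL LETTER A WITH GRAVE). */
static unsigned compose_pair(unsigned starter, unsigned mark) {
    if (starter == 0x0041 && mark == 0x0300) return 0x00C0;
    return 0;  /* no composition */
}

/* Add a code point to the buffer, composing it with the block's starter
   when the pair table allows it.  is_starter is 1 when its CCC is zero. */
static void buffer_add(comp_buffer *b, unsigned cp, int is_starter) {
    if (!is_starter && b->n > 0) {
        unsigned c = compose_pair(b->cp[0], cp);
        if (c) { b->cp[0] = c; return; }  /* replace the starter in place */
    }
    b->cp[b->n++] = cp;                   /* otherwise queue the code point */
}
```

Feeding in A then the combining grave leaves a single composed U+00C0 in the buffer; a mark with no table entry simply stays queued behind it until the buffer is flushed.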

Composition Buffer Management: Starter-to-Starter and Inert Character Segmentation

  1. Role of Starters and Inert Characters:

    • In Unicode normalization, a "starter" is a character with a Canonical Combining Class (CCC) of zero. It serves as the base character upon which subsequent combining characters (non-starters) are layered.
    • Inert characters, both starters and non-starters, do not participate in any canonical composition and can be processed directly without further checks.
  2. Buffer Segmentation:

    • The composition buffer is segmented into blocks, each starting with either a starter or an inert character and ending just before the next starter or inert character. This block of characters is the unit of composition or decomposition.
  3. Real-time Composition:

    • As each new code point is read and added to the buffer, the algorithm checks for possible compositions within the current block. If a composition is possible, it is executed immediately, and the buffer is updated. For inert characters, no such check is necessary.
  4. Buffer Flushing:

    • When a new starter or inert character is encountered, it signifies the end of the current block and the beginning of a new one. At this point, the algorithm flushes the composed or decomposed characters from the current block to the output string, making room for the next block in the buffer.
  5. Efficiency and Compliance:

    • This approach allows the algorithm to operate efficiently without backtracking, while still adhering to the principles laid out in Sections 3.7 and 3.11 of the Unicode Standard. The inclusion of inert characters in the segmentation further optimizes the algorithm.

By understanding and implementing this starter-to-starter and inert character segmentation in the composition buffer, the algorithm optimizes its operations, making it both efficient and compliant with the Unicode Standard.

Optimization Strategies in Unicode Normalization

  1. Between Inert Starters:

    • Pre-calculate entire sequences that start and end with an inert starter. This ensures that the sequence between these inert starters can be directly replaced by its precomputed normalized form without further checks.
    • Reference: Unicode Standard Section 3.11.
  2. Inert Starter Followed by Non-Starters:

    • For sequences beginning with an inert starter followed by non-starters, calculate the immediate composition or decomposition up to the point where a new starter or non-starter is introduced. This ensures that any subsequent characters will not affect the already normalized sequence.
    • Reference: Canonical Ordering in Section 3.11 of the Unicode Standard.
  3. Inert Characters (Starters and Non-Starters):

    • Directly process inert characters without any need for further checks or transformations. These characters do not participate in any canonical composition. For inert non-starters, it's essential to note that while they typically appear to the right of the starters they combine with, the normalization process should be prepared for any valid Unicode sequence.
    • Reference: Unicode Standard Annex #15, Unicode Normalization Forms, Section 1.1.

By implementing these optimization strategies, the normalization algorithm can significantly improve its efficiency while ensuring adherence to the Unicode Standard.

Logical Outline of the End-to-End NFC Algorithm with Short-Circuits

Preprocessing Steps

  1. Canonical Decomposition Mapping: Pre-calculate the canonical decomposition for each Unicode character that has one. Store these in a lookup table for quick access.

    • Standard Reference: Unicode Standard Annex #15, Unicode Normalization Forms, Section 1.2.
  2. Canonical Composition Mapping: Pre-calculate the canonical composition for each pair of Unicode characters that can be composed into a single character. Store these in another lookup table.

    • Standard Reference: Unicode Standard Annex #15, Unicode Normalization Forms, Section 1.1.
  3. Non-Combinable Codepoints: Identify codepoints that never combine with others and store them in a separate lookup table.

    • Standard Reference: Unicode Standard Annex #15, Unicode Normalization Forms, Section 1.1.
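
The decomposition lookup in step 1 must be recursively expanded, since canonical mappings can chain. A sketch with a three-entry toy table (U+1EA6 maps to U+00C2 + U+0300, which in turn expands to U+0041 U+0302 U+0300); the real generator precomputes the fully expanded forms:

```c
#include <assert.h>

typedef struct { unsigned cp; unsigned d[3]; int n; } decomp;

/* Tiny stand-in for the precomputed canonical-decomposition table. */
static const decomp table[] = {
    { 0x00C0, { 0x0041, 0x0300 }, 2 },  /* À = A + grave      */
    { 0x00C2, { 0x0041, 0x0302 }, 2 },  /* Â = A + circumflex */
    { 0x1EA6, { 0x00C2, 0x0300 }, 2 },  /* Ầ = Â + grave      */
};

/* Recursively expand one code point; returns the count written to out. */
static int full_decompose(unsigned cp, unsigned *out) {
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].cp == cp) {
            int n = 0;
            for (int j = 0; j < table[i].n; j++)
                n += full_decompose(table[i].d[j], out + n);
            return n;
        }
    *out = cp;  /* no mapping: the code point is its own decomposition */
    return 1;
}
```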

Algorithm Steps

  1. Initialization:

    • Initialize a composition buffer to hold sequences from a starter up to the next starter.
    • Initialize output buffer and set output length to zero.
  2. Input Loop:

    • Loop through each Unicode character in the input string.
  3. Non-Combinable Check (goto non_combinable):

    • If the character is in the non-combinable lookup table, directly add it to the output buffer without further checks.
    • Standard Reference: Section 3.11.
  4. Immediate Composition Check (goto starter_to_starter):

    • If a sequence starts and ends with a starter and has a precomputed composed form, replace it directly with the composed character.
    • Standard Reference: Section 3.11.
  5. Starter Followed by Non-Starter Check (goto starter_then_non_starter):

    • If a starter is followed by one or more non-starters, calculate the immediate composition or decomposition.
    • Standard Reference: Section 3.11.
  6. Individual Starter Mapping (goto individual_starter):

    • If an individual starter maps directly to another starter, replace it immediately.
    • Standard Reference: Section 3.7.
  7. Composition Buffer (goto composition_buffer):

    • Add the character to the composition buffer. If it's a starter, attempt to compose the sequence in the buffer, then flush it to the output.
  8. Output (goto output):

    • Write the composed or decomposed characters to the output buffer.
  9. End of Input (goto end_of_input):

    • Once the end of the input is reached, flush any remaining characters in the composition buffer to the output.
  10. Return Output:

    • Return the output buffer and its length.
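
Pulling the steps together, a heavily simplified C skeleton of the loop might look like the following. The pair table, the non-starter test, and the handling of blocked marks are all toy stand-ins for what the generated re2c code hardcodes:

```c
#include <assert.h>

/* One-entry stand-in for the composition table: A + grave -> À. */
static unsigned compose_pair(unsigned s, unsigned m) {
    if (s == 0x0041 && m == 0x0300) return 0x00C0;
    return 0;
}
/* Toy non-starter test: just the basic combining-diacritics block. */
static int is_non_starter(unsigned cp) { return cp >= 0x0300 && cp <= 0x036F; }

/* Skeleton of the main loop: hold one starter, compose marks into it,
   flush when a new starter begins the next block or input ends. */
static int nfc(const unsigned *in, int n, unsigned *out) {
    int outn = 0, pending = 0;
    unsigned starter = 0;
    for (int i = 0; i < n; i++) {
        unsigned cp = in[i];
        if (!is_non_starter(cp)) {                    /* new block begins */
            if (pending) out[outn++] = starter;       /* flush old block  */
            starter = cp; pending = 1;
        } else if (pending) {
            unsigned c = compose_pair(starter, cp);
            if (c) { starter = c; continue; }         /* composed in place */
            out[outn++] = starter; out[outn++] = cp;  /* give up on block  */
            pending = 0;
        } else {
            out[outn++] = cp;                         /* stray leading mark */
        }
    }
    if (pending) out[outn++] = starter;               /* end of input */
    return outn;
}
```

With the toy table, the input A + combining grave + B normalizes to À B, while A + combining acute (no table entry) passes through unchanged.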

By incorporating these short-circuits, the NFC algorithm can normalize Unicode strings more efficiently while still adhering to the Unicode Standard.