Unicode Implementation Details - adesutherland/CREXX GitHub Wiki

Normalisation Algorithms

Read Unicode Data File: Read each line from the Unicode data file and parse the various fields.
- Reference: Unicode Standard, Section 4.1, "Unicode Character Database"
Handle Hangul Syllables: If the line marks the start of the Hangul syllable range, generate all 11,172 Hangul syllables (U+AC00 to U+D7A3) algorithmically and add them to the list of code points.
- Reference: Unicode Standard, Section 3.12, "Conjoining Jamo Behavior"
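
The Hangul step can be sketched with the arithmetic from Section 3.12. This is a minimal illustration in Python, not the CREXX generator itself; the function names `hangul_decomposition` and `expand_hangul_range` are hypothetical, while the constants are those defined by the Unicode Standard.

```python
# Constants defined in the Unicode Standard, Section 3.12.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def hangul_decomposition(cp):
    """Return the full canonical decomposition of a precomposed syllable."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    # A trailing index of zero means the syllable has no trailing consonant.
    return [lead, vowel] if trail == T_BASE else [lead, vowel, trail]


def expand_hangul_range(first=S_BASE, last=S_BASE + S_COUNT - 1):
    """Yield (code point, decomposition) for every syllable in the range.

    Hypothetical helper: a generator of this kind would be called once
    the data file signals the start of the Hangul syllable range.
    """
    for cp in range(first, last + 1):
        yield cp, hangul_decomposition(cp)
```

Because the mapping is pure arithmetic, the syllables never need individual records in the data file.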
Store Record: For each code point, store its various attributes like name, general category, canonical combining class, and so on.
- Reference: Unicode Standard, Section 4.1, "Unicode Character Database"
Process Decompositions: For each code point, process its canonical and compatibility decompositions recursively.
- Reference: Unicode Standard, Section 3.7, "Decomposition"
Generate Character Classes: Generate character classes for characters with no canonical or compatibility decomposition, grouped by their canonical combining class.
- Reference: Unicode Standard, Section 3.6, "Combination"
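
The generation steps above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the actual generator: it assumes the data file uses the semicolon-separated layout of UnicodeData.txt, the sample records are cut off after the decomposition field, and the names `parse`, `full_decomposition` and `classes_by_ccc` are hypothetical.

```python
from collections import defaultdict

# Sample records, truncated after field 5 (the decomposition mapping).
# Fields used: 0 code point, 1 name, 2 general category, 3 CCC, 5 mapping.
SAMPLE = """\
0020;SPACE;Zs;0;WS;
0041;LATIN CAPITAL LETTER A;Lu;0;L;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020
00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A
01FA;LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE;Lu;0;L;00C5 0301
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;
030A;COMBINING RING ABOVE;Mn;230;NSM;
0323;COMBINING DOT BELOW;Mn;220;NSM;
"""


def parse(text):
    """Read each line and store one record per code point."""
    records = {}
    for line in text.splitlines():
        fields = line.split(";")
        tokens = fields[5].split()
        # A leading <tag> marks a compatibility (not canonical) mapping.
        compat = bool(tokens) and tokens[0].startswith("<")
        if compat:
            tokens = tokens[1:]
        records[int(fields[0], 16)] = {
            "name": fields[1],
            "category": fields[2],
            "ccc": int(fields[3]),
            "compat": compat,
            "decomp": [int(t, 16) for t in tokens],
        }
    return records


def full_decomposition(cp, records, compatibility=False):
    """Apply decomposition mappings recursively until none remain."""
    rec = records.get(cp)
    if rec is None or not rec["decomp"]:
        return [cp]
    if rec["compat"] and not compatibility:
        return [cp]  # canonical mode ignores compatibility mappings
    out = []
    for part in rec["decomp"]:
        out.extend(full_decomposition(part, records, compatibility))
    return out


def classes_by_ccc(records):
    """Group code points that have no decomposition by their CCC."""
    classes = defaultdict(list)
    for cp, rec in sorted(records.items()):
        if not rec["decomp"]:
            classes[rec["ccc"]].append(cp)
    return dict(classes)


records = parse(SAMPLE)
classes = classes_by_ccc(records)
```

With these sample records, U+01FA first maps to U+00C5 U+0301, and the recursion then expands U+00C5 to U+0041 U+030A.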

Start Main Loop: Loop through the entire input string.
- Reference: Unicode Standard, Section 3.11, "Normalization Forms"

Check for Characters with No Canonical Decomposition: For each code point that has no canonical decomposition, branch on its Canonical Combining Class (CCC).
- Action: If the CCC is zero, go to the starter label. If the CCC is non-zero, go to the single label.
- Reference: Unicode Standard, Section 3.6, "Combination"

Check for Characters with Canonical Decomposition: For each code point, check if it has a canonical decomposition.
- Action: Depending on the decomposition and CCCs, go to one of the following labels: starter, single, starter_then_non_starter, or complex.
- Reference: Unicode Standard, Section 3.7, "Decomposition"
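
The two checks above amount to sorting every code point into one of four cases. The sketch below shows one plausible reading of the labels; the exact criteria used by the CREXX implementation are an assumption here, and Python's `unicodedata` module stands in for the pre-computed decomposition data. The function name `classify` is hypothetical.

```python
import unicodedata


def classify(cp):
    """Return the (assumed) label a code point would be dispatched to."""
    # Full canonical decomposition; a character without one maps to itself.
    decomp = unicodedata.normalize("NFD", chr(cp))
    cccs = [unicodedata.combining(ch) for ch in decomp]
    if len(cccs) == 1:
        # One resulting code point: a starter (CCC 0) or a lone non-starter.
        return "starter" if cccs[0] == 0 else "single"
    if cccs[0] == 0 and all(ccc != 0 for ccc in cccs[1:]):
        return "starter_then_non_starter"
    # Anything else, for example a decomposition made only of non-starters.
    return "complex"
```

Under this reading, U+0041 is a starter, U+0301 goes to single, U+00C5 (A with ring above) goes to starter_then_non_starter, and U+0344, which decomposes into two non-starters, goes to complex.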

Handle Complex Cases: This case covers characters whose canonical decompositions involve multiple code points with different CCCs.
- Action: Go to the complex label and handle the decomposition accordingly.
- Reference: Unicode Standard, Section 3.11, "Normalization Forms"

Handle Starter Followed by Non-Starter: This case covers sequences in which a starter is immediately followed by one or more non-starters.
- Action: Go to the starter_then_non_starter label and handle the sequence accordingly.
- Reference: Unicode Standard, Section 3.11, "Normalization Forms"

End of String: Once the end of the string is reached, finalize the output.
- Reference: Unicode Standard, Section 3.11, "Normalization Forms"
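
Taken together, the main loop can be sketched as a single pass that decomposes each code point, holds non-starters in a buffer until the next starter, and flushes that buffer once more at the end of the string. This is a simplified Python sketch of canonical decomposition (NFD) as described in Section 3.11, not the goto-based state machine. The names `decompose` and `nfd` are hypothetical, `unicodedata` replaces the pre-computed tables, and Hangul syllables are left out for brevity.

```python
import unicodedata


def decompose(ch):
    """Recursively expand canonical mappings (Hangul is not handled here)."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return [ch]  # no mapping, or a compatibility mapping
    parts = []
    for token in mapping.split():
        parts.extend(decompose(chr(int(token, 16))))
    return parts


def nfd(text):
    out = []      # finished output
    pending = []  # (ccc, char) non-starters awaiting canonical ordering

    def flush():
        # list.sort is stable, so marks with equal CCC keep their order.
        pending.sort(key=lambda item: item[0])
        out.extend(ch for _, ch in pending)
        pending.clear()

    for ch in text:
        for part in decompose(ch):
            ccc = unicodedata.combining(part)
            if ccc == 0:
                flush()  # a starter ends the current run of marks
                out.append(part)
            else:
                pending.append((ccc, part))
    flush()  # end of string: emit any marks still buffered
    return "".join(out)
```

The final `flush()` is the finalization step: without it, combining marks at the very end of the input would be lost.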

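The goto labels and the pre-computed arrays of code points, lengths, and CCCs are there for performance: each code point is dispatched directly to the code that handles its case, and its decomposition data is read from the arrays rather than computed at run time.
The state machine is generated with re2c, which emits the matcher as directly executable code, so no regular expression has to be interpreted while normalizing.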
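
One way to picture the pre-computed arrays is as flat, parallel tables: every decomposition is stored end to end in one array of code points, a second array holds the matching CCCs, and each decomposable code point records where its entry starts and how long it is. The layout below is an assumption made for illustration and is not taken from the CREXX sources; `build_tables` and `lookup` are hypothetical names, and `unicodedata` again stands in for the parsed data file.

```python
import unicodedata
from array import array


def build_tables(code_points):
    """Pack canonical decompositions into parallel flat arrays."""
    offsets = {}            # code point -> start index in the flat arrays
    lengths = {}            # code point -> length of its decomposition
    flat_cps = array("L")   # all decompositions, stored end to end
    flat_cccs = array("B")  # CCC of each entry in flat_cps
    for cp in code_points:
        decomp = unicodedata.normalize("NFD", chr(cp))
        if decomp == chr(cp):
            continue  # no canonical decomposition, nothing to store
        offsets[cp] = len(flat_cps)
        lengths[cp] = len(decomp)
        for ch in decomp:
            flat_cps.append(ord(ch))
            flat_cccs.append(unicodedata.combining(ch))
    return offsets, lengths, flat_cps, flat_cccs


def lookup(cp, tables):
    """Return (code point, CCC) pairs for cp using only the tables."""
    offsets, lengths, flat_cps, flat_cccs = tables
    if cp not in offsets:
        return [(cp, unicodedata.combining(chr(cp)))]
    start = offsets[cp]
    end = start + lengths[cp]
    return list(zip(flat_cps[start:end], flat_cccs[start:end]))


tables = build_tables(range(0x00C0, 0x0180))
```

With tables like these, handling a decomposable character at run time is an offset lookup followed by a copy of `lengths[cp]` entries, with the CCCs already at hand for reordering.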