Unicode Implementation Details - adesutherland/CREXX GitHub Wiki
Normalisation Algorithms
Algorithm Description for NFD Normalization with Preprocessing
Preprocessing
- Read Unicode Data File: Read each line of the Unicode data file and parse its fields.
  - Reference: Unicode Standard, Section 4.1, "Unicode Character Database"
- Handle Hangul Syllables: If the line marks the start of the Hangul syllable range, add all Hangul syllables to the list of code points algorithmically.
  - Reference: Unicode Standard, Section 3.12, "Conjoining Jamo Behavior"
- Store Record: For each code point, store its attributes, such as name, general category, and canonical combining class.
  - Reference: Unicode Standard, Section 4.1, "Unicode Character Database"
- Process Decompositions: For each code point, process its canonical and compatibility decompositions recursively.
  - Reference: Unicode Standard, Section 3.7, "Decomposition"
- Generate Character Classes: Generate character classes for characters with no canonical or compatibility decomposition, grouped by their canonical combining class.
  - Reference: Unicode Standard, Section 3.6, "Combination"
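The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the CREXX implementation: the function names, the record layout, and the embedded `SAMPLE` lines (written in the `UnicodeData.txt` field layout) are all assumptions made for the example. Only the Hangul range marker is handled; other range markers are ignored.

```python
# Illustrative sketch; function and field names are hypothetical, not CREXX's.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
T_COUNT = 28
N_COUNT = 21 * T_COUNT  # V_COUNT * T_COUNT = 588

# A few sample lines in the UnicodeData.txt field layout (semicolon-separated).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
01FA;LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE;Lu;0;L;00C5 0301;;;;N;;;;01FB;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
2460;CIRCLED DIGIT ONE;No;0;ON;<circle> 0031;;1;1;N;;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""


def hangul_decompose(s):
    """Arithmetic decomposition of a Hangul syllable into L, V and optional T jamo."""
    s_index = s - S_BASE
    l = L_BASE + s_index // N_COUNT
    v = V_BASE + (s_index % N_COUNT) // T_COUNT
    t = T_BASE + s_index % T_COUNT
    return [l, v] if t == T_BASE else [l, v, t]


def parse_records(text):
    """Parse data-file lines into {code point: record}, expanding the Hangul range."""
    records = {}
    first = None
    for line in text.splitlines():
        fields = line.split(";")
        cp, name, gc = int(fields[0], 16), fields[1], fields[2]
        ccc, decomp = int(fields[3]), fields[5]
        if name == "<Hangul Syllable, First>":
            first = cp
            continue
        if name == "<Hangul Syllable, Last>":
            for s in range(first, cp + 1):
                records[s] = {"name": "HANGUL SYLLABLE", "gc": gc, "ccc": ccc,
                              "canonical": hangul_decompose(s), "compat": []}
            continue
        # A leading <tag> marks a compatibility mapping; otherwise it is canonical.
        is_compat = decomp.startswith("<")
        mapping = [int(h, 16) for h in decomp.split() if not h.startswith("<")]
        records[cp] = {"name": name, "gc": gc, "ccc": ccc,
                       "canonical": [] if is_compat else mapping,
                       "compat": mapping if is_compat else []}
    return records


def full_canonical(cp, records):
    """Recursively expand the canonical decomposition of one code point."""
    rec = records.get(cp)
    if rec is None or not rec["canonical"]:
        return [cp]
    out = []
    for part in rec["canonical"]:
        out.extend(full_canonical(part, records))
    return out


def classes_by_ccc(records):
    """Group code points that have no decomposition at all by their CCC."""
    classes = {}
    for cp, rec in records.items():
        if not rec["canonical"] and not rec["compat"]:
            classes.setdefault(rec["ccc"], []).append(cp)
    return classes
```

With the sample data, `full_canonical(0x01FA, parse_records(SAMPLE))` expands U+01FA through U+00C5 down to `[0x41, 0x30A, 0x301]`, while U+2460 is left unchanged because its mapping is compatibility-only.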
Main Loop
- Start Main Loop: Loop through the entire input string.
  - Reference: Unicode Standard, Section 3.11, "Normalization Forms"
Case 1: Characters with No Canonical Decomposition
- Check for Characters with No Canonical Decomposition: For each code point that has no canonical decomposition, look up its Canonical Combining Class (CCC).
- Action: If the CCC is zero, go to the `starter` label. If the CCC is non-zero, go to the `single` label.
- Reference: Unicode Standard, Section 3.6, "Combination"
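As a rough illustration of the Case 1 decision, the sketch below uses Python's standard `unicodedata` module in place of the pre-computed tables. The function names are hypothetical, and the returned strings simply mirror the label names used above. Hangul syllables, which decompose algorithmically, would need separate handling.

```python
import unicodedata


def has_canonical_decomposition(ch):
    """True if the character has a canonical (untagged) decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")


def case1_label(ch):
    """Return 'starter' or 'single' for Case 1 characters, or None otherwise."""
    if has_canonical_decomposition(ch):
        return None  # not a Case 1 character
    return "starter" if unicodedata.combining(ch) == 0 else "single"
```

For example, `case1_label("A")` gives `"starter"`, and `case1_label("\u0301")` (COMBINING ACUTE ACCENT, CCC 230) gives `"single"`.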
Case 2: Characters with Canonical Decomposition
- Check for Characters with Canonical Decomposition: For each code point, check if it has a canonical decomposition.
- Action: Depending on the decomposition and the CCCs of its code points, go to one of the following labels: `starter`, `single`, `starter_then_non_starter`, or `complex`.
- Reference: Unicode Standard, Section 3.7, "Decomposition"
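The page does not spell out exactly how a decomposition maps to each label, so the following sketch rests on an assumed reading: a decomposition to one starter is `starter`, to one non-starter is `single`, to a starter followed only by non-starters is `starter_then_non_starter`, and anything else is `complex`. It uses `unicodedata.normalize` as a stand-in for the pre-computed full decompositions. The function name `classify` is hypothetical.

```python
import unicodedata


def classify(ch):
    """Assumed mapping from a character's full decomposition to a label name."""
    full = unicodedata.normalize("NFD", ch)  # stand-in for a pre-computed table
    cccs = [unicodedata.combining(c) for c in full]
    if len(full) == 1:
        return "starter" if cccs[0] == 0 else "single"
    if cccs[0] == 0 and all(c != 0 for c in cccs[1:]):
        return "starter_then_non_starter"
    return "complex"
```

Under this reading, U+2126 OHM SIGN (which decomposes to U+03A9) is a `starter`, U+0340 (which decomposes to U+0300) is `single`, U+00C5 is `starter_then_non_starter`, and U+0344 (which decomposes to two non-starters, U+0308 U+0301) is `complex`.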
Case 3: Complex Cases
- Handle Complex Cases: These are characters whose canonical decompositions involve multiple code points with different CCCs.
- Action: Go to the `complex` label and handle the decomposition accordingly.
- Reference: Unicode Standard, Section 3.11, "Normalization Forms"
Case 4: Starter Followed by Non-Starter
- Handle Starter Followed by Non-Starter: These are sequences in which a starter is immediately followed by one or more non-starters.
- Action: Go to the `starter_then_non_starter` label and handle the sequence accordingly.
- Reference: Unicode Standard, Section 3.11, "Normalization Forms"
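Whenever non-starters with different CCCs end up next to each other, they have to be put into canonical order. The sketch below follows the adjacent-exchange formulation of the Canonical Ordering Algorithm from Section 3.11: swap a pair (A, B) when ccc(A) > ccc(B) > 0. It is an illustration using Python's `unicodedata` module, not the code path CREXX uses, and `canonical_order` is a hypothetical name.

```python
import unicodedata


def canonical_order(chars):
    """Reorder a list of characters in place by exchanging adjacent pairs
    (A, B) whenever ccc(A) > ccc(B) > 0, until no such pair remains."""
    ccc = unicodedata.combining
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            a, b = chars[i], chars[i + 1]
            if ccc(a) > ccc(b) > 0:
                chars[i], chars[i + 1] = b, a
                swapped = True
    return chars
```

Because a pair is never exchanged when the second character is a starter (CCC 0), starters act as barriers, and marks with equal CCC keep their relative order.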
Finalization
- End of String: Once the end of the string is reached, finalize the output.
  - Reference: Unicode Standard, Section 3.11, "Normalization Forms"
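Taken together, the main loop and the finalization step might look like the sketch below. It is a simplified model under stated assumptions: it uses `unicodedata` lookups instead of pre-computed tables and `goto` labels, it omits Hangul syllables, and it buffers each run of non-starters and stable-sorts it by CCC. The names `decompose` and `nfd_sketch` are hypothetical.

```python
import unicodedata


def decompose(ch):
    """Full canonical decomposition of one character (Hangul omitted for brevity)."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return [ch]  # no canonical mapping: the character stands for itself
    out = []
    for h in mapping.split():
        out.extend(decompose(chr(int(h, 16))))
    return out


def nfd_sketch(text):
    """Decompose each character, then put each run of non-starters in CCC order."""
    out = []      # finished output
    pending = []  # current run of non-starters awaiting reordering

    def flush():
        pending.sort(key=unicodedata.combining)  # list.sort is stable
        out.extend(pending)
        pending.clear()

    for ch in text:
        for piece in decompose(ch):
            if unicodedata.combining(piece) == 0:
                flush()            # a starter ends the current run
                out.append(piece)
            else:
                pending.append(piece)
    flush()  # finalization: emit any non-starters still buffered at end of string
    return "".join(out)
```

For non-Hangul input, the result can be compared against `unicodedata.normalize("NFD", text)` as a sanity check.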
Performance Optimization
- The `goto` labels and pre-computed arrays for code points, lengths, and CCCs are used to optimize performance.
- The state machine is generated with re2c, which is intended to keep the normalization process efficient.
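To illustrate the idea of pre-computed arrays, the sketch below flattens the full decompositions of a small code point range into parallel arrays, so that a lookup at normalization time is a slice rather than a recursive expansion. The array names, the demo range, and the use of `unicodedata` to fill the tables are assumptions for the example. The actual table layout in CREXX may differ.

```python
import unicodedata

LIMIT = 0x0370  # small demo range; a real table would cover every code point

data = []    # all full decompositions, concatenated
offset = []  # offset[cp]: start index of cp's decomposition in data
length = []  # length[cp]: number of code points in cp's decomposition
ccc = []     # ccc[cp]: canonical combining class of cp

for cp in range(LIMIT):
    full = unicodedata.normalize("NFD", chr(cp))
    offset.append(len(data))
    length.append(len(full))
    data.extend(ord(c) for c in full)
    ccc.append(unicodedata.combining(chr(cp)))


def lookup(cp):
    """Return the pre-computed full decomposition of cp as a list of code points."""
    start = offset[cp]
    return data[start:start + length[cp]]
```

Here `lookup(0xC5)` returns `[0x41, 0x30A]` without any recursion at lookup time, and `ccc[0x301]` is `230`.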