Unicode Implementation Details - adesutherland/CREXX GitHub Wiki

Normalization Algorithms

Algorithm Description for NFD Normalization with Preprocessing

Preprocessing

  1. Read Unicode Data File: Read each line from the Unicode data file and parse the various fields.

    • Reference: Unicode Standard, Section 4.1, "Unicode Character Database"
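  2. Handle Hangul Syllables: If the line marks the start of the Hangul syllable range (U+AC00..U+D7A3), add all 11,172 Hangul syllables to the list of code points algorithmically, because the data file gives only the first and last code points of the range.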

    • Reference: Unicode Standard, Section 3.12, "Conjoining Jamo Behavior"
  3. Store Record: For each code point, store its attributes, such as name, general category, and canonical combining class.

    • Reference: Unicode Standard, Section 4.1, "Unicode Character Database"
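  4. Process Decompositions: For each code point, expand its canonical and compatibility decompositions recursively until no component can be decomposed further. NFD itself uses only the canonical decompositions; the compatibility decompositions are needed only for NFKD.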

    • Reference: Unicode Standard, Section 3.7, "Decomposition"
  5. Generate Character Classes: Generate character classes for characters with no canonical or compatibility decomposition, grouped by their canonical combining class.

    • Reference: Unicode Standard, Section 3.6, "Combination"
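
The preprocessing steps above can be sketched as follows. This is a minimal illustration in Python rather than the project's own code, and it assumes the data file uses the semicolon-separated UnicodeData.txt layout; `SAMPLE`, `Record`, `parse`, and `canonical_decompose` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# Sample lines in the UnicodeData.txt layout (fields separated by ';').
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
212B;ANGSTROM SIGN;Lu;0;L;00C5;;;;N;ANGSTROM UNIT;;;00E5;
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""

@dataclass
class Record:
    code: int
    name: str
    category: str
    ccc: int       # canonical combining class
    decomp: list   # decomposition mapping as code points; empty if none
    compat: bool   # True if the mapping carries a <tag> (compatibility)

def parse(text):
    """Parse the data lines and expand First/Last ranges into one record per code point."""
    records = {}
    range_start = None
    for line in text.splitlines():
        if not line.strip():
            continue
        f = line.split(";")
        code, name = int(f[0], 16), f[1]
        parts = f[5].split()
        compat = bool(parts) and parts[0].startswith("<")
        mapping = [int(p, 16) for p in parts if not p.startswith("<")]
        if name.endswith(", First>"):
            range_start = code      # remember where the range begins
            continue
        first = code
        if name.endswith(", Last>"):
            first = range_start     # fill in every code point of the range
            name = name[1:-len(", Last>")]
        for cp in range(first, code + 1):
            records[cp] = Record(cp, name, f[2], int(f[3]), mapping, compat)
    return records

def canonical_decompose(cp, records):
    """Recursively expand the canonical mapping of cp; compatibility mappings are skipped."""
    rec = records.get(cp)
    if rec is None or not rec.decomp or rec.compat:
        return [cp]
    out = []
    for part in rec.decomp:
        out.extend(canonical_decompose(part, records))
    return out
```

With the sample data, `canonical_decompose(0x212B, parse(SAMPLE))` expands ANGSTROM SIGN through U+00C5 to the two code points U+0041 and U+030A, while NO-BREAK SPACE is left alone because its mapping is a compatibility one.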

Main Loop

  1. Start Main Loop: Loop through the entire input string.

    • Reference: Unicode Standard, Section 3.11, "Normalization Forms"
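
As a point of comparison for the loop described here, the following is a minimal sketch of an NFD pass written in Python on top of the standard `unicodedata` module. It is not the re2c-generated state machine: it decomposes every character at run time and buffers non-starters until the next starter. The names `decompose`, `nfd`, and `pending` are hypothetical; the Hangul constants are the ones given in Section 3.12 of the Unicode Standard.

```python
import unicodedata

# Hangul constants from the Unicode Standard, Section 3.12.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT, S_COUNT = 21, 28, 11172

def decompose(ch):
    """Full canonical decomposition of one character, as a list of characters."""
    s_index = ord(ch) - S_BASE
    if 0 <= s_index < S_COUNT:  # Hangul syllable: arithmetic decomposition
        lead = chr(L_BASE + s_index // (V_COUNT * T_COUNT))
        vowel = chr(V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT)
        trail = s_index % T_COUNT
        return [lead, vowel] + ([chr(T_BASE + trail)] if trail else [])
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return [ch]  # no mapping, or a compatibility mapping (not used by NFD)
    out = []
    for part in mapping.split():
        out.extend(decompose(chr(int(part, 16))))
    return out

def nfd(text):
    out = []
    pending = []  # non-starters waiting to be put into canonical order
    for ch in text:
        for d in decompose(ch):
            if unicodedata.combining(d) == 0:
                # A starter ends the current run of non-starters.
                out.extend(sorted(pending, key=unicodedata.combining))
                pending.clear()
                out.append(d)
            else:
                pending.append(d)
    # End of string: flush whatever is still buffered.
    out.extend(sorted(pending, key=unicodedata.combining))
    return "".join(out)
```

Because `sorted` is stable, non-starters with equal CCC keep their relative order, which canonical ordering requires. The result can be checked against `unicodedata.normalize("NFD", text)`.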

Case 1: Characters with No Canonical Decomposition

  1. Check for Characters with No Canonical Decomposition: For each code point, check whether it lacks a canonical decomposition; if so, look up its Canonical Combining Class (CCC).

    • Action: If the CCC is zero, go to the starter label. If the CCC is non-zero, go to the single label.

    • Reference: Unicode Standard, Section 3.6, "Combination"
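
The Case 1 decision can be illustrated with a short Python sketch. It uses the standard `unicodedata` module for the lookups instead of pre-generated character classes, and the function name `case1_label` is hypothetical.

```python
import unicodedata

def case1_label(ch):
    """Return 'starter' or 'single' for a character without a canonical mapping, else None.

    Hangul syllables decompose arithmetically and are not reported by
    unicodedata.decomposition, so this sketch does not account for them.
    """
    mapping = unicodedata.decomposition(ch)
    if mapping and not mapping.startswith("<"):
        return None  # has a canonical decomposition, so Case 1 does not apply
    return "starter" if unicodedata.combining(ch) == 0 else "single"
```

For example, `case1_label("a")` gives `"starter"` and `case1_label("\u0301")` (COMBINING ACUTE ACCENT, CCC 230) gives `"single"`.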

Case 2: Characters with Canonical Decomposition

  1. Check for Characters with Canonical Decomposition: For each code point, check if it has a canonical decomposition.

    • Action: Depending on the decomposition and CCCs, go to one of the following labels: starter, single, starter_then_non_starter, or complex.

    • Reference: Unicode Standard, Section 3.7, "Decomposition"
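
The page does not spell out exactly how a decomposition is mapped to a label, so the sketch below shows one plausible reading and should be taken as an assumption: a single resulting code point is `starter` or `single` depending on its CCC, one starter followed only by non-starters is `starter_then_non_starter`, and everything else is `complex`. The function name `case2_label` is hypothetical, and `unicodedata.normalize` stands in for the pre-computed decomposition tables.

```python
import unicodedata

def case2_label(ch):
    """Classify a character by the CCCs of its full canonical decomposition.

    The criteria are an assumed reading of the four labels, not the
    project's confirmed rules.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    cccs = [unicodedata.combining(c) for c in decomposed]
    if len(cccs) == 1:
        return "starter" if cccs[0] == 0 else "single"
    if cccs[0] == 0 and all(c != 0 for c in cccs[1:]):
        return "starter_then_non_starter"
    return "complex"
```

Under these assumptions, U+2126 OHM SIGN maps to `starter`, U+0340 COMBINING GRAVE TONE MARK to `single`, U+00E9 to `starter_then_non_starter`, and U+0344 (which decomposes to two non-starters) to `complex`.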

Case 3: Complex Cases

  1. Handle Complex Cases: This case covers characters whose canonical decomposition contains multiple code points with different CCCs.

    • Action: Go to the complex label and handle the decomposition accordingly.

    • Reference: Unicode Standard, Section 3.11, "Normalization Forms"
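
When decomposed code points carry different CCCs, they have to end up in canonical order. Section 3.11 defines this in terms of reorderable pairs: two adjacent characters A and B are exchanged while ccc(A) > ccc(B) > 0. The Python sketch below applies that definition literally to an already decomposed sequence; the function name `canonical_order` is hypothetical, and the real implementation may well order the marks differently (for instance from pre-computed data).

```python
import unicodedata

def canonical_order(chars):
    """Swap adjacent pairs (A, B) while ccc(A) > ccc(B) > 0, until none remain."""
    chars = list(chars)
    ccc = unicodedata.combining
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            if ccc(chars[i]) > ccc(chars[i + 1]) > 0:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)
```

For the input `"a\u0301\u0323"` (acute accent with CCC 230, then dot below with CCC 220) the two marks are exchanged, giving `"a\u0323\u0301"`. Starters have CCC 0 and are never moved, so marks cannot cross them.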

Case 4: Starter Followed by Non-Starter

  1. Handle Starter Followed by Non-Starter: This case covers sequences in which a starter is immediately followed by one or more non-starters.

    • Action: Go to the starter_then_non_starter label and handle the sequence accordingly.

    • Reference: Unicode Standard, Section 3.11, "Normalization Forms"
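
One way to picture this case is as a scan over a run consisting of a starter and the non-starters that follow it. The Python sketch below finds the end of such a run and reports whether the non-starters already appear in non-decreasing CCC order, in which case they would need no reordering. The function name `scan_run` and the idea of returning an "already ordered" flag are assumptions made for illustration, not a description of the actual label's code.

```python
import unicodedata

def scan_run(text, start):
    """Return (end, ordered) for the run text[start:end].

    text[start] is assumed to be a starter; the run extends over the
    non-starters that follow it. 'ordered' is True when their CCCs are
    already non-decreasing.
    """
    ccc = unicodedata.combining
    end = start + 1
    ordered = True
    last = 0
    while end < len(text) and ccc(text[end]) != 0:
        if ccc(text[end]) < last:
            ordered = False
        last = ccc(text[end])
        end += 1
    return end, ordered
```

For `"e\u0323\u0301x"` the run starting at index 0 ends at index 3 and is already ordered (CCC 220, then 230); for `"e\u0301\u0323x"` it ends at the same place but is reported as not ordered.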

Finalization

  1. End of String: Once the end of the string is reached, finalize the output.

    • Reference: Unicode Standard, Section 3.11, "Normalization Forms"

Performance Optimization

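  • The goto labels and the pre-computed arrays of code points, lengths, and CCCs are intended to improve performance: each matched character jumps directly to its handling code, and decompositions are looked up rather than computed at run time.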

  • re2c generates the state machine as C code at build time, so that character matching in the normalization loop runs as compiled code rather than as table interpretation at run time.
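
To illustrate what pre-computed arrays for code points, lengths, and CCCs might look like, the Python sketch below packs the full canonical decompositions of a set of characters into flat arrays, with a per-character offset and length. The layout and the names `build_tables`, `offsets`, and `lengths` are assumptions for this example; the actual generated tables may be organized differently.

```python
from array import array
import unicodedata

def build_tables(chars):
    """Pack the full canonical decompositions of chars into flat arrays."""
    code_points = array("L")  # all decompositions, concatenated
    cccs = array("B")         # CCC of each entry in code_points
    offsets, lengths = {}, {}
    for ch in chars:
        decomposed = unicodedata.normalize("NFD", ch)
        offsets[ch] = len(code_points)   # where this character's data starts
        lengths[ch] = len(decomposed)    # how many code points it expands to
        for d in decomposed:
            code_points.append(ord(d))
            cccs.append(unicodedata.combining(d))
    return code_points, cccs, offsets, lengths
```

At run time a decomposition then becomes a slice, `code_points[offsets[ch]:offsets[ch] + lengths[ch]]`, with the matching CCCs available at the same indices, so no per-character decomposition work is repeated.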