character_sets - dwilson2547/wiki_demo GitHub Wiki
Character sets (also called charsets) are fundamental to how computers represent and interpret text. Here’s a detailed breakdown:
- 1. What is a Character Set?
- 2. Key Concepts
- 3. Common Character Sets
- 4. Why Unicode Dominates
- 5. Practical Implications
- 6. Example: Encoding "Hello"
- 7. Common Issues
A character set is a defined collection of characters (letters, numbers, symbols, etc.) that a computer system recognizes and can display. Each character is assigned a unique numerical value, allowing computers to store, transmit, and render text consistently.
- Each character in a set is assigned a code point, a unique numerical identifier (e.g., the code point for the uppercase letter "A" is 65 in the ASCII set).
- Code points are often represented in hexadecimal (e.g.,
U+0041
for "A").
- Encoding is the process of converting characters into binary (or other formats) for storage or transmission.
- Decoding is the reverse process, converting binary back into characters.
- Size: 128 characters (7-bit).
- Scope: Basic Latin letters (A-Z, a-z), digits (0-9), punctuation, and control characters (e.g., newline, tab).
- Limitation: Only supports English and lacks accents or special symbols for other languages.
- Size: 256 characters (8-bit).
- Scope: Adds symbols for European languages (e.g., é, ü, ñ).
- Limitation: Still limited; multiple extended ASCII variants exist, causing compatibility issues.
- Size: Over 144,000 characters (as of 2025) and growing.
- Scope: Supports almost all writing systems (Latin, Cyrillic, Arabic, Chinese, emojis, etc.).
-
Encoding Schemes:
- UTF-8: Variable-width (1–4 bytes per character). Backward-compatible with ASCII. Most widely used on the web.
- UTF-16: Variable-width (2 or 4 bytes per character). Used in Windows and Java.
- UTF-32: Fixed-width (4 bytes per character). Rarely used due to inefficiency.
- Size: 256 characters (8-bit).
- Scope: A family of 16 sets (e.g., ISO 8859-1 for Western European languages, ISO 8859-5 for Cyrillic).
- Limitation: Limited to specific language groups.
- Universality: Supports all major scripts and symbols.
- Flexibility: UTF-8 is space-efficient for ASCII and scalable for global text.
- Standardization: Avoids conflicts between regional encodings.
-
Compatibility: Mismatched encodings can cause "mojibake" (garbled text, e.g.,
é
instead ofé
). -
Web Development: HTML5 defaults to UTF-8. Always declare encoding (e.g.,
<meta charset="UTF-8">
). -
Programming: Strings in code are often UTF-8/UTF-16. Libraries handle encoding/decoding (e.g., Python’s
encode()
/decode()
).
Character | ASCII (Decimal) | UTF-8 (Hex) |
---|---|---|
H | 72 | 0x48 |
e | 101 | 0x65 |
l | 108 | 0x6C |
l | 108 | 0x6C |
o | 111 | 0x6F |
- Mojibake: Occurs when text is decoded using the wrong encoding.
- Legacy Systems: Older systems may only support ASCII or regional encodings.
- File Transfers: Ensure consistent encoding between sender/receiver.
Character sets define how text is represented digitally. Unicode (UTF-8) is the modern standard, ensuring global compatibility, while older sets like ASCII remain relevant for specific use cases.
Would you like a deeper dive into a specific aspect, like UTF-8’s variable-width design or how emojis are encoded?