character_sets - dwilson2547/wiki

Character sets (also called charsets) are fundamental to how computers represent and interpret text. Here’s a detailed breakdown:

1. What is a Character Set?
2. Key Concepts
- a. Code Points
- b. Encoding
3. Common Character Sets
4. Why Unicode Dominates
5. Practical Implications
6. Example: Encoding "Hello"
7. Common Issues
- Key Takeaway

1. What is a Character Set?

A character set is a defined collection of characters (letters, numbers, symbols, etc.) that a computer system recognizes and can display. Each character is assigned a unique numerical value, allowing computers to store, transmit, and render text consistently.

2. Key Concepts

a. Code Points

Each character in a set is assigned a code point, a unique numerical identifier (e.g., the code point for the uppercase letter "A" is 65 in the ASCII set).
Code points are often represented in hexadecimal (e.g., U+0041 for "A").

b. Encoding

Encoding is the process of converting characters into binary (or other formats) for storage or transmission.
Decoding is the reverse process, converting binary back into characters.

3. Common Character Sets

a. ASCII (American Standard Code for Information Interchange)

Size: 128 characters (7-bit).
Scope: Basic Latin letters (A-Z, a-z), digits (0-9), punctuation, and control characters (e.g., newline, tab).
Limitation: Only supports English and lacks accents or special symbols for other languages.

b. Extended ASCII

Size: 256 characters (8-bit).
Scope: Adds symbols for European languages (e.g., é, ü, ñ).
Limitation: Still limited; multiple extended ASCII variants exist, causing compatibility issues.

c. Unicode

Size: Over 144,000 characters (as of 2025) and growing.
Scope: Supports almost all writing systems (Latin, Cyrillic, Arabic, Chinese, emojis, etc.).
Encoding Schemes:
- UTF-8: Variable-width (1–4 bytes per character). Backward-compatible with ASCII. Most widely used on the web.
- UTF-16: Variable-width (2 or 4 bytes per character). Used in Windows and Java.
- UTF-32: Fixed-width (4 bytes per character). Rarely used due to inefficiency.

d. ISO/IEC 8859

Size: 256 characters (8-bit).
Scope: A family of 16 sets (e.g., ISO 8859-1 for Western European languages, ISO 8859-5 for Cyrillic).
Limitation: Limited to specific language groups.

4. Why Unicode Dominates

Universality: Supports all major scripts and symbols.
Flexibility: UTF-8 is space-efficient for ASCII and scalable for global text.
Standardization: Avoids conflicts between regional encodings.

5. Practical Implications

Web Development: HTML5 defaults to UTF-8. Always declare encoding (e.g., <meta charset="UTF-8">).
Programming: Strings in code are often UTF-8/UTF-16. Libraries handle encoding/decoding (e.g., Python’s encode()/decode()).

6. Example: Encoding "Hello"

Character	ASCII (Decimal)	UTF-8 (Hex)
H	72	0x48
e	101	0x65
l	108	0x6C
l	108	0x6C
o	111	0x6F

7. Common Issues

Mojibake: Occurs when text is decoded using the wrong encoding.
Legacy Systems: Older systems may only support ASCII or regional encodings.
File Transfers: Ensure consistent encoding between sender/receiver.

Key Takeaway

Character sets define how text is represented digitally. Unicode (UTF-8) is the modern standard, ensuring global compatibility, while older sets like ASCII remain relevant for specific use cases.

Would you like a deeper dive into a specific aspect, like UTF-8’s variable-width design or how emojis are encoded?

character_sets - dwilson2547/wiki_demo GitHub Wiki

1. What is a Character Set?

2. Key Concepts

a. Code Points

b. Encoding

3. Common Character Sets

a. ASCII (American Standard Code for Information Interchange)

b. Extended ASCII

c. Unicode

d. ISO/IEC 8859

4. Why Unicode Dominates

5. Practical Implications

6. Example: Encoding "Hello"

7. Common Issues

Key Takeaway

⚠️ GitHub.com Fallback ⚠️

character_sets - dwilson2547/wiki_demo GitHub Wiki

1. What is a Character Set?

2. Key Concepts

a. Code Points

b. Encoding

3. Common Character Sets

a. ASCII (American Standard Code for Information Interchange)

b. Extended ASCII

c. Unicode

d. ISO/IEC 8859

4. Why Unicode Dominates

5. Practical Implications

6. Example: Encoding "Hello"

7. Common Issues

Key Takeaway

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️