Unicode Strictness - yaml/YAML2 GitHub Wiki

This problem is being discussed on the mailing list in the YAML2 thread.

William Spitzak proposed:

PLEASE!!! This is the main reason we cannot use unaltered YAML:

SUPPORT FOR INVALID UTF-8 AND UTF-16

There needs to be a way to put an arbitrary byte sequence into a scalar without losing the ability to make valid byte sequences human-readable.

Currently YAML scalars can only contain byte sequences that are valid UTF-8, unless some transformation is applied that makes some (often all) Unicode unreadable in the YAML source. This has the counter-productive effect of discouraging the use of Unicode on any backend that uses bytes without guaranteeing that those bytes are valid UTF-8. Examples are all byte-based file formats, most internet protocols, Unix filenames, and Windows resource identifiers.

My recommendation is here. However, any solution that allows an arbitrary byte stream to be produced, while allowing valid UTF-8 bytes to be represented by the correct Unicode characters in the YAML source, would be acceptable:

  1. The backslash escape \xNN represents a "raw UTF-8 byte" with the given value. This differs from current YAML only for 0x80-0xFF; the sequence \u00NN must be used for actual Unicode code points in this range.
  2. An API that requests YAML scalars as UTF-8 gets these bytes as raw data, inserted between the UTF-8 encodings of all the other characters.
  3. An API that requests scalars in some other form, such as UTF-16, gets these bytes as unchanged code units. This makes \xNN work identically to current YAML/JSON when the UTF-16 API is used. It may also allow invalid forms of other encodings to be supported.
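As a rough illustration of rules 1 and 2, here is a minimal sketch of what the byte-oriented API could do with a scalar mixing ordinary text, \u00NN code-point escapes, and the proposed \xNN raw-byte escapes. This is not from the proposal; `decode_scalar_to_bytes` is a hypothetical name and the escape grammar is simplified:

```python
def decode_scalar_to_bytes(s: str) -> bytes:
    """Sketch of the proposed byte-oriented API: \\xNN emits one raw
    byte verbatim, \\u00NN emits the UTF-8 encoding of a code point,
    and all other characters are encoded as UTF-8."""
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == '\\' and i + 1 < len(s):
            if s[i + 1] == 'x':                    # raw byte: copied as-is
                out.append(int(s[i + 2:i + 4], 16))
                i += 4
                continue
            if s[i + 1] == 'u':                    # real code point
                out += chr(int(s[i + 2:i + 6], 16)).encode('utf-8')
                i += 6
                continue
        out += s[i].encode('utf-8')
        i += 1
    return bytes(out)

print(decode_scalar_to_bytes(r'a\xFFb'))   # raw 0xFF between 'a' and 'b'
print(decode_scalar_to_bytes(r'\u00FF'))   # U+00FF, encoded as 0xC3 0xBF
```

Note how the two escapes diverge: `\xFF` yields the single byte 0xFF (which is not valid UTF-8 on its own), while `\u00FF` yields the two-byte UTF-8 encoding of the character U+00FF.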

In addition, invalid UTF-16 must also be supported. Support for invalid UTF-16 is more common, due to its use on Windows and the resulting realization, even by programmers otherwise unaware of the issue, that it is impossible to work without supporting it. Technically the YAML spec does not allow invalid UTF-16, but my proposal here formalizes the actual support that exists in most (all?) YAML and JSON implementations:

  1. The backslash escape of \uNNNN for NNNN in the range 0xD800..0xDFFF represents a "raw UTF-16 code unit".
  2. An API that requests UTF-16 or other 16-bit code units will get these code units unchanged.
  3. An API that requests bytes will get three bytes for each of these; the three bytes match the encoding you would get from UTF-8 if it were extended to cover these invalid code points.
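The three-byte encoding described in rule 3 is the same one Python exposes as the `surrogatepass` error handler, which can serve as a reference for the intended behavior:

```python
# A lone high surrogate, as a raw \uD83D escape would decode to.
lone = '\ud83d'

# "surrogatepass" extends UTF-8 to the surrogate range D800-DFFF,
# producing exactly three bytes per code unit.
encoded = lone.encode('utf-8', 'surrogatepass')
print(encoded)                                   # b'\xed\xa0\xbd'

# The mapping is reversible, so the code unit survives a byte round-trip.
assert encoded.decode('utf-8', 'surrogatepass') == lone
```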

Devin Jeanpierre commented:

Mixing byte data into what is ostensibly Unicode seems like a bad idea. Either have a particular part of the document be Unicode, or bytes, not both.

That said, if you mix Unicode and bytes in the same file, it ceases to be exactly readable in a standard text editor, so I don't like that either. Maybe two separate types of YAML file (text and binary/compressed)? More than one protocol has had a "binary" version made of it.

Bill Spitzak replied:

The proposal does not use any bytes other than '\', 'x', 'u', the digits 0-9, and A-F/a-f. These are all ASCII characters, so the YAML files are still validly encoded.
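This point is easy to check: because the escapes are plain ASCII, a document using them survives transcoding between Unicode encodings unchanged. The document text below is a made-up example, not from the thread:

```python
# Hypothetical document text using the proposed \xFF raw-byte escape.
doc = r'name: "backup\xFF.dat"'

# The escape syntax itself is pure ASCII...
assert all(ord(c) < 128 for c in doc)

# ...so transcoding the document text to UTF-16 and back is lossless.
assert doc.encode('utf-16-le').decode('utf-16-le') == doc
```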


Some minor questions by Dennis Lissov

  • Is the suggestion to use the standard scalar forms, so that they produce either a string or a byte sequence depending on whether invalid Unicode is present?
  • Does it play well with encoding changes? As a user I'd assume I could transcode YAML documents from UTF-16 to UTF-8 and back with an iconv-like utility without any changes in the actual data. Is this assumption still valid?
  • And generally, may arbitrary byte sequences be safely stored in these scalars and retrieved exactly as-is, with no UTF-8/UTF-16 transcoding issues?

daxim comments: [UTF-8b](http://www.nntp.perl.org/group/perl.perl5.porters/;[email protected]) is for binary in UTF-8.
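For comparison, UTF-8b maps each invalid input byte 0xNN to the lone surrogate U+DCNN; Python ships this scheme as the `surrogateescape` error handler:

```python
# Bytes that are not valid UTF-8.
raw = b'abc\xff\xfe'

# UTF-8b / "surrogateescape": each bad byte 0xNN becomes U+DCNN.
text = raw.decode('utf-8', 'surrogateescape')
print(text == 'abc\udcff\udcfe')                 # True

# The mapping is reversible, so arbitrary bytes round-trip losslessly.
assert text.encode('utf-8', 'surrogateescape') == raw
```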