Invalid utf - yaml/YAML2 GitHub Wiki
Allow scalars to contain invalid UTF-8 and UTF-16
Current YAML cannot be used with any byte-oriented interface if there is any possibility that a byte string will be needed that is not valid UTF-8. This is always the case when the system being controlled does not enforce valid UTF-8, in particular systems that wish to transition from older encodings (where existing data is almost never valid UTF-8). Of particular concern are filenames on Unix/Linux systems.
There are exactly equivalent problems for invalid UTF-16, at least in the official YAML spec. However, because Windows programmers know that such sequences can physically exist, most (all?) implementations of YAML already support them in various ways. There needs to be official support for invalid UTF-16 as well, and any design should be compatible with existing implementations.
What is wrong with all proposed "solutions"
All proposed solutions involve storing the data in a different form than other strings: either every instance is encoded into a plain scalar, or a tag is used. A tag is a complex subject that many users of YAML, who just need strings and numbers, would otherwise never need to get into. It is also a misleading use of a tag, as the resulting data type is the same. YAML does not require a special tag to put NaN or infinity into a floating-point value, so there is already precedent for "invalid data" being written without a different data type.
The parser for this tagged data is not under YAML's control. This will prevent standardization. It also makes it far too easy for users to make mistakes, for instance designing a quoting mechanism that cannot represent all possible strings. Many of the designs users have produced have the effect of making all non-ASCII characters unreadable, and unreadable text is exactly contrary to YAML's design goals! One popular solution has the unfortunate result of not only making Unicode text unreadable, but also making ISO-8859-1-encoded bytes readable, thus encouraging the use of a legacy encoding!
Making UTF-8 a second-class citizen, and making it hard for users to use without knowing extremely complex portions of YAML such as "tags", is exactly counter-productive, in that it strongly discourages the use of Unicode! YAML should not be doing this.
Proposed changes to quoted scalars for invalid UTF-8
It is only possible to place invalid UTF-8 in a scalar by using the double-quoted notation. My proposal is to alter the meaning of \xNN where NN is in the range 0x80-0xFF to instead indicate a "raw code unit" value. The meaning of \xNN for NN less than 0x80 is unchanged (but in effect compatible with this description).
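A minimal sketch of this rule (plain Python, not from any YAML implementation; `scalar_to_bytes` and its regex are invented for illustration): expand the escapes of a double-quoted scalar body into a byte string, keeping `\xNN` with NN >= 0x80 as a single raw byte.

```python
import re

# \xNN (two hex digits) or \uNNNN (four hex digits)
_ESCAPE = re.compile(r'\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})')

def scalar_to_bytes(body: str) -> bytes:
    """Expand escapes in a double-quoted scalar body into UTF-8 bytes."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(body):
        out += body[pos:m.start()].encode('utf-8')
        if m.group(1) is not None:
            # raw code unit: exactly one byte, even for values >= 0x80
            out.append(int(m.group(1), 16))
        else:
            # ordinary \uNNNN: encode the code point as UTF-8 as usual
            out += chr(int(m.group(2), 16)).encode('utf-8')
        pos = m.end()
    out += body[pos:].encode('utf-8')
    return bytes(out)

print(scalar_to_bytes(r'ab\xFF'))    # raw 0xFF byte, not U+00FF
print(scalar_to_bytes(r'ab\u00FF'))  # U+00FF, i.e. the two bytes C3 BF
```

Note the contrast in the last two lines: under the proposal `\xFF` and `\u00FF` produce different byte strings, while `\x41` and `\u0041` still agree, which is why values below 0x80 are unaffected.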
Any API to a YAML-reading library that retrieves a scalar in UTF-8 encoding would get a byte string where each of these escapes becomes exactly this byte value, placed between the adjacent UTF-8 encoded characters or other raw byte values.
An API that returns UTF-16 (or UTF-32) would return code units (and thus Unicode characters) exactly equal to the value. This preserves compatibility with the current meaning of \xNN for such an API, and is also much easier to describe. (There are proposals to turn UTF-8 errors into 0xDC00+N in UTF-16, but I believe this would destroy the current useful ability to losslessly encode invalid UTF-16 into UTF-8 strings.)
A YAML-reading library may provide an API that returns the offsets of all raw code units within the returned scalar, so that a reader may take some other action with these escapes.
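A toy model of the three views just described, for the scalar "a\xFFb". The parsed form, function names, and offset API here are all hypothetical, invented for illustration, not quoted from any real YAML library.

```python
# Parsed form: decoded characters interleaved with raw code units.
scalar = [('char', 'a'), ('raw', 0xFF), ('char', 'b')]

def as_utf8(items):
    # UTF-8 view: a raw code unit stays a single byte, even >= 0x80
    return b''.join(t.encode('utf-8') if k == 'char' else bytes([t])
                    for k, t in items)

def as_utf32(items):
    # UTF-16/UTF-32 view: the raw value becomes the code point 0x00NN
    return ''.join(t if k == 'char' else chr(t) for k, t in items)

def raw_offsets(items):
    # Offsets of the raw code units (each item here is one code point,
    # so the list index doubles as the UTF-32 offset).
    return [i for i, (k, _) in enumerate(items) if k == 'raw']

print(as_utf8(scalar))      # the raw 0xFF byte sits between 'a' and 'b'
print(as_utf32(scalar))     # 'a', then U+00FF, then 'b'
print(raw_offsets(scalar))  # a reader can post-process position 1
```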
It is possible to arrange several \xNN sequences in a row so that the bytes form a valid UTF-8 encoded character. I think this should, technically, be invalid YAML input. However, implementations are unlikely to detect it, and are allowed either to interpret it the same as \uNNNN for the matching character, or to preserve the individual \xNN values. These rules allow an implementation to use UTF-8 internally, while not requiring it.
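The situation in question is easy to sketch (plain Python, no YAML library involved): a run of bytes collected from consecutive \xNN escapes either happens to decode as UTF-8 or it does not.

```python
def raw_run_is_valid_utf8(raw: bytes) -> bool:
    """True if consecutive \\xNN escape bytes accidentally spell valid UTF-8."""
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(raw_run_is_valid_utf8(b'\xc3\xa9'))  # True: "\xC3\xA9" spells U+00E9
print(raw_run_is_valid_utf8(b'\xff\xfe'))  # False: 0xFF never appears in UTF-8
```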
Proposed changes to quoted scalars for invalid UTF-16
It is only possible to place invalid UTF-16 in a scalar by using the double-quoted notation. My proposal is to alter the meaning of \uNNNN where NNNN is in the range 0xD800 through 0xDFFF from being an invalid sequence to being a "raw code unit" value. The meaning of all other \uNNNN values is unchanged.
Any API to a YAML-reading library that retrieves a scalar in UTF-16 encoding would get exactly these word values, placed between the adjacent UTF-16 encoded characters or other raw code units.
Any API to a YAML-reading library that retrieves a scalar as UTF-8 will turn each of these into three bytes, using the obvious UTF-8 encoding of the values.
Any API to a YAML-reading library that retrieves a scalar as UTF-32 will turn each of these into the exactly matching Unicode code point.
A YAML-reading library may provide an API that returns the offsets of all raw code units within the returned scalar, so that a reader may take some other action with these escapes.
It is possible to arrange two \uNNNN sequences in a row so that they form a valid UTF-16 encoded character. An implementation must not interpret this as the matching \U00NNNNNN character. For instance the UTF-8 API must return 6 bytes, not 4, for this pair. This rule is necessary to allow a YAML output routine to be able to write arbitrary UTF-8 without excessive complexity in the quoting rules.
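A sketch of the two UTF-16 rules above, using a hand-rolled encoder (Python's own UTF-8 codec correctly refuses surrogates, so the three bytes are built directly): each surrogate escape becomes three bytes, and two adjacent surrogate escapes stay 3 + 3 = 6 bytes rather than fusing into the real character's 4-byte encoding.

```python
def surrogate_escape_to_utf8(unit: int) -> bytes:
    """Encode one \\uNNNN raw code unit (0xD800-0xDFFF) as three UTF-8-style bytes."""
    assert 0xD800 <= unit <= 0xDFFF
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])

hi, lo = 0xD83D, 0xDE00   # this pair would spell U+1F600 in real UTF-16
raw = surrogate_escape_to_utf8(hi) + surrogate_escape_to_utf8(lo)
print(len(raw))                            # 6: the escapes stay separate
print(len('\U0001F600'.encode('utf-8')))   # 4: the real character's encoding
```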
Invalid encodings in YAML files
The YAML spec should state that invalid UTF-8 or UTF-16 in the input file is a parsing error.
However if an implementation decides (perhaps with an option switch) to not produce such errors, it must interpret each invalid UTF-8 byte exactly the same as \xNN notation described above, and invalid UTF-16 words exactly the same as \uNNNN notation described above.
Such an option is useful to read YAML files produced by code that did not correctly quote it, or to make very light-weight YAML parsers.
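A sketch of such a lenient mode for UTF-8 input (illustrative only; `lenient_scan` is an invented name, not any library's API): scan the input bytes, emit decoded characters for valid UTF-8 sequences, and treat each invalid byte exactly as if the file had contained its \xNN escape.

```python
def lenient_scan(data: bytes):
    """Yield ('char', s) for valid UTF-8 sequences and ('raw', byte) otherwise."""
    i = 0
    while i < len(data):
        # Try the four possible UTF-8 sequence lengths starting at i.
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode('utf-8')
                yield ('char', ch)
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:
            # Invalid byte: same meaning as the \xNN escape described above.
            yield ('raw', data[i])
            i += 1

print(list(lenient_scan(b'a\xffb')))  # the stray 0xFF comes through as a raw unit
```

Because Python's strict UTF-8 decoder rejects overlong sequences and surrogate bytes, those also fall through to the raw-byte branch, one byte at a time, which matches the "each invalid UTF-8 byte" wording above.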