TSV2 Proposal - oils-for-unix/oils GitHub Wiki
Now named QTT: Quoted, Typed Tables
QTT is an interchange format for tables that's a graceful upgrade of TSV. A TSV file is a valid QTT file.
It uses QSN (Quoted String Notation) as a building block.
TSV2 is a textual interchange format for tables, like JSON is for nested data. It's most easily described as a TSV file with some nice properties:
- The header row is required. It may or may not have type annotations.
- Every row is exactly one line (unlike CSV and TSV).
- It can represent any byte or character in a field. This is done with single quoted strings like
'foo\n'and'nul byte: \x00 !'and'\u1234'. Most fields should not be quoted.
In canonical form, a TSV2 file is a valid UTF-8 text file. There may be reasons to use non-canonical forms (e.g. for compatibility with old TSV readers).
NOTE: → are tab characters (written in HTML as →).
name → age:int → checksum:bytes → ratio:float → child:boolean 'Homer Simpson' → 10 → '\xab\xcd' → 0.12 → false Marge Simpson → 20 → '\xef\xff' → 0.34 → false Bart Simpson → 30 → '\xdd\xee' → 0.56789 → true 'Lisa Simpson' → 40 → '\xdd\xee' → 0.56789 → true
The default column type is string.
This example is inconsistent to illustrate the principle that single quotes are optional in some cases but required in others. They're recommended when a field contains spaces, but required when a field contains newlines or tabs.
- Oil's xargs "dialect" (each Keyword in Oil)
- It provides xargs with
rows, rather than faking it with-n 4for a row of 4 items.- xargs can similarly use the names. We could provide some alternative to the
-I {}string.
- xargs can similarly use the names. We could provide some alternative to the
- It provides xargs with
- awk "dialect" (See http://www.oilshell.org/blog/tags.html?tag=awk#awk)
- It provides awk with named columns instead of
$1 $2 $3. - It provides awk with types.
- It provides awk with named columns instead of
- Scripts that scrape coreutils. The output of
psandlscan be massaged into Tables. - Oil's augmentation of the
findcommand (TBD) - Oil's history mechanism (TODO: issue #320)
- ... any other tool that can write text to stdout :)
A line is not a record in CSV, e.g.
"record
with
spaces",10
is a single record. This doesn't work well with line-oriented shell tools.
Also, CSV in practice tends to be ill-specified. (e.g. hardware outputs really broken CSV files.)
- You can't have tabs in TSV fields.
- If you have newlines, you no longer have the property that a record is a line (i.e. the same problem that CSV has).
- TSV has no printable representation for
NULbytes like'\x00'.
Because JSON uses double quotes, and JSON represents unicode strings. TSV2 represents byte strings, which can be utf-8 encoded.
Strictly speaking, no. Hence the "major version number bump". However:
- A TSV2 reader can always read a TSV file correctly. Spaces are allowed in raw TSV2 fields, even though a more canonical representation is:
-
'spaces and single quotes'<TAB>'another field', rather than -
no single quotes<TAB>another field.
-
- If a TSV reader consumes a TSV2 file, then it will "work", except some fields may have leading
'characters. This only happens for "unprintable" characters and tabs, so in the common case there are NO corrections necessary.
Based on JSON primitive types, but less tied to JavaScript:
-
string(default) -
bytes(can represent arbitrary bytes like invalid UTF-8) - "number" is split into
intandfloat, because most languages have that distinction boolean
Nothing as far as the TSV2 format is concerned. string is an indication to the language's library to try to decode utf-8 and escape sequences into a string data structure, if applicable. (e.g. Python, Java, and JavaScript provide random code point access; Go and Rust use UTF-8 internally.)
See the "regular expressions" at json.org. We use true, false, and null.
- The file must have valid header.
- Column names should match
[a-zA-Z_][a-zA-Z0-9_]+. (Should we allow spaces? Language can convert to _ ?)
- Column names should match
- The number of columns in the TSV2 stream is determined the header. If any row has an invalid number of cells, a fatal error must be raised.
- Invalid string syntax, e.g.
- a field starts with
'but doesn't end with one. - invalid escape sequences
- a field starts with
- Invalid syntax in any other column:
- int
- float
- boolean
- The TSV2 file is valid UTF-8.
-
NULbytes are written'\x00'(only allow\xABinbytescolumns?)
-
-
'fields with spaces have single quotes'-- this makes TSV readers less likely to work on TSV2 files, but that's OK. You have the option to emit non-canonical TSV2. - empty strings should be
''!
- Python: if you do repr() and the string does not contain a single quote, Python will use single quotes. And then it will be a valid TSV2 string.
- Oil: TODO
- JavaScript: if it doesn't contain
\xFFcodes, it should work? Well there's also the "bell" character and such.
-
uxy, very similar, and scrapes existing tools. Discussed on lobste.rs.
alignandtrimare useful.
- String and Bytes columns can't contain
null-
intandbooleancolumns can't either
-
- What about
floatcolumns?nullcould be useful forNA. It can be represented by IEEE floats.