# Parquet

Parquet is an efficient binary file format for tabular data.

One of the main advantages of Parquet is that the format has an explicit schema embedded within the file - and that schema includes type information.
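You can check a file's schema without reading any rows. Here's a minimal sketch using pyarrow ("example.parquet" is a placeholder filename):

```python
import pyarrow.parquet as pq

# Reads only the footer metadata - no row data is loaded.
schema = pq.read_schema("example.parquet")
print(schema)  # column names along with their types
```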

The representation of types is also standardised. Parquet has both a date type and a datetime type (both sensibly recorded as integers in UTC).
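Here's a hedged sketch of that round trip with pandas (it assumes pyarrow is installed as the Parquet engine; the filename is made up):

```python
import pandas as pd

df = pd.DataFrame({"when": pd.to_datetime(["2024-01-01T12:00:00+00:00"])})
df.to_parquet("timestamps.parquet")

# The timezone-aware datetime survives the round trip intact.
print(pd.read_parquet("timestamps.parquet").dtypes)  # when: datetime64[ns, UTC]
```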

Parquet provides a single way to represent missing data - the null type.
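A small sketch of a null surviving a round trip (again assuming pyarrow; the filename is made up):

```python
import pandas as pd

# A nullable integer column - the missing value is written as a Parquet null.
df = pd.DataFrame({"x": [1, None, 3]}, dtype="Int64")
df.to_parquet("nulls.parquet")
print(pd.read_parquet("nulls.parquet"))  # the gap comes back as <NA>, not NaN
```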

Parquet is partly row oriented and partly column oriented. The data going into a Parquet file is broken up into "row groups" - largeish sets of rows. Inside a row group, each column is stored separately in a "column chunk" - this is the layout that makes the size-reducing tricks work. Compression works better when similar data is adjacent. Run-length encoding is possible. So is delta encoding.
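You can see the row groups and column chunks for yourself via the file metadata. A sketch with pyarrow - the filename is a placeholder, and the compression and encodings printed will vary from file to file:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")
print(pf.metadata.num_row_groups)      # how many row groups the file holds

rg = pf.metadata.row_group(0)          # metadata for the first row group
print(rg.num_rows)

col = rg.column(0)                     # first column chunk in that row group
print(col.compression, col.encodings)  # e.g. SNAPPY ('PLAIN', 'RLE', ...)
```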

At the end of the file is the index (the "footer", in Parquet terms), which contains references to all the row groups, column chunks, etc. Because the index is at the end of the file, you can't stream Parquet. Instead, you tend to split your data across multiple files (there is explicit support for this in the format) and then use the indexes to skip around to find the data you want. But again - that requires random access, not streaming.
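Here's what that skipping-around looks like in practice - a sketch with pyarrow, where both the filename and the column name "foo" are placeholders:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")

# The footer records where every chunk lives, so the reader can seek straight
# to one row group and one column chunk, ignoring the rest of the file.
table = pf.read_row_group(0, columns=["foo"])
```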

You can add `.parquet` to any csvbase table URL to get a Parquet file, so that's an easy way to try the format out. For example, load one straight into pandas:

```python
import pandas as pd

df = pd.read_parquet("https://csvbase.com/meripaterson/stock-exchanges.parquet")
```

To look inside the file, download it and run parquet-tools on it:

```bash
pip install -U parquet-tools
curl -O "https://csvbase.com/meripaterson/stock-exchanges.parquet"
parquet-tools inspect --detail stock-exchanges.parquet
```

That shows a lot of detail and, in conjunction with the spec, can help you understand exactly how the format is arranged.