Parquet Sink

The Parquet Sink is a specialization of the File Sink. As such, File Sink options may also be configured on a Parquet Sink. As always, the File Sink's path option must be configured. Please review the File Sink documentation if you are not already familiar with it.
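
For example, a minimal Parquet Sink definition might look like the following sketch (the path value is illustrative; use any location valid for the File Sink):

SAVE STREAM foo
TO PARQUET
OPTIONS(
  'path'='/data/foo'
);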

Options

compression

The compression codec to use when generating Parquet files. If not specified directly on the Parquet Sink, the value of spark.sql.parquet.compression.codec is used; that underlying Spark setting defaults to snappy. Valid values include: none, uncompressed, snappy, gzip, and lzo. Values of none and uncompressed are synonymous.

Defaults to spark.sql.parquet.compression.codec.

SAVE STREAM foo
TO PARQUET
OPTIONS(
  'compression'='gzip'
);
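
Alternatively, rely on the fallback by setting the underlying Spark property and leaving the sink option unset:

-- spark.properties: spark.sql.parquet.compression.codec=gzip
SAVE STREAM foo
TO PARQUET
OPTIONS();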

spark.hadoop.parquet.*

Allows specifying internal settings of the underlying ParquetOutputFormat. Typically, users only modify these settings for use cases requiring fine-grained tuning. Note that ParquetOutputFormat exposes its own compression setting, but users should prefer the compression option exposed directly by the Parquet Sink.

If you are unfamiliar with these settings, use the ParquetOutputFormat class as a reference.

-- spark.properties: spark.hadoop.parquet.memory.pool.ratio=0.1
SAVE STREAM foo
TO PARQUET
OPTIONS();
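
For example, row group and page sizes can be tuned the same way (parquet.block.size and parquet.page.size are standard ParquetOutputFormat settings; the values below are illustrative, not recommendations):

-- spark.properties: spark.hadoop.parquet.block.size=134217728
-- spark.properties: spark.hadoop.parquet.page.size=1048576
SAVE STREAM foo
TO PARQUET
OPTIONS();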

spark.sql.parquet.output.committer.class

The output committer class used by Parquet. The specified class needs to be a subclass of org.apache.hadoop.mapreduce.OutputCommitter. Typically, it's also a subclass of org.apache.parquet.hadoop.ParquetOutputCommitter. If it is not, then metadata summaries will never be created, irrespective of the value of parquet.enable.summary-metadata.

Defaults to org.apache.parquet.hadoop.ParquetOutputCommitter.

-- spark.properties: spark.sql.parquet.output.committer.class=com.acme.parquet.CustomParquetOutputCommitter
SAVE STREAM foo
TO PARQUET
OPTIONS();
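
Relatedly, metadata summaries can be disabled outright through the parquet.enable.summary-metadata setting mentioned above, a sketch assuming the default committer:

-- spark.properties: spark.hadoop.parquet.enable.summary-metadata=false
SAVE STREAM foo
TO PARQUET
OPTIONS();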

spark.sql.parquet.binaryAsString

Some Parquet-producing systems, in particular Impala and older versions of Spark, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark to interpret binary data as strings to provide compatibility with these systems.

Defaults to false.

-- spark.properties: spark.sql.parquet.binaryAsString=true
SAVE STREAM foo
TO PARQUET
OPTIONS();

spark.sql.parquet.int96AsTimestamp

Some Parquet-producing systems, in particular Impala, store timestamps as int96. Spark also stores timestamps as int96 to avoid losing precision in the nanoseconds field. This flag tells Spark to interpret int96 data as a timestamp to provide compatibility with these systems.

Defaults to true.

-- spark.properties: spark.sql.parquet.int96AsTimestamp=false
SAVE STREAM foo
TO PARQUET
OPTIONS();

spark.sql.parquet.writeLegacyFormat

When true, data is written in the legacy Parquet format used by Spark 1.4 and earlier; for example, decimal values are written in Parquet's fixed-length byte array format, which other systems such as Apache Hive and Apache Impala expect. When false, the newer format in Parquet is used.

Defaults to false.

-- spark.properties: spark.sql.parquet.writeLegacyFormat=true
SAVE STREAM foo
TO PARQUET
OPTIONS();

spark.sql.parquet.int64AsTimestampMillis

When true, timestamp values will be stored as int64 with TIMESTAMP_MILLIS as the extended type. In this mode, the microsecond portion of the timestamp value will be truncated.

Defaults to false.

-- spark.properties: spark.sql.parquet.int64AsTimestampMillis=true
SAVE STREAM foo
TO PARQUET
OPTIONS();
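
Putting it together, several of the settings above can be combined on a single sink. The following is a sketch, not a recommended configuration; all values are illustrative:

-- spark.properties: spark.sql.parquet.writeLegacyFormat=true
-- spark.properties: spark.hadoop.parquet.enable.summary-metadata=false
SAVE STREAM foo
TO PARQUET
OPTIONS(
  'compression'='gzip'
);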