Parquet Source
The Parquet Source is a specialization of the File Source. As such, File Source options may also be configured on a Parquet Source. As always, the File Source's path option must be configured. Please review the File Source documentation if you are not already familiar with it.
Options
spark.hadoop.parquet.*
Allows specifying internal settings for the underlying ParquetInputFormat. Typically, users will only modify these settings for use cases requiring fine-grained tuning. If you are unfamiliar with these settings, you can use the ParquetInputFormat class as a reference.
-- spark.properties: spark.hadoop.parquet.strict.typing=false
CREATE STREAM foo
FROM PARQUET
OPTIONS();
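Keys set this way pass through Spark's spark.hadoop. prefix, which copies them into the Hadoop Configuration that ParquetInputFormat reads its settings from. As a plain-Spark sketch of the same mechanism (the application name is hypothetical; parquet.task.side.metadata is an actual ParquetInputFormat key controlling where split metadata is read):

import org.apache.spark.sql.SparkSession

// Any key prefixed with "spark.hadoop." is copied into the Hadoop
// Configuration, which is where ParquetInputFormat looks up its settings.
val spark = SparkSession.builder()
  .appName("parquet-hadoop-conf-sketch") // hypothetical application name
  .config("spark.hadoop.parquet.task.side.metadata", "true")
  .getOrCreate()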
spark.sql.parquet.binaryAsString
Some Parquet-producing systems, in particular Impala and older versions of Spark, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark to interpret binary data as a string to provide compatibility with these systems.
Defaults to false.
-- spark.properties: spark.sql.parquet.binaryAsString=true
CREATE STREAM foo
FROM PARQUET
OPTIONS();
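To see what this flag changes, here is a minimal plain-Spark sketch (the path is hypothetical): with the flag enabled, BINARY columns written without a UTF8 annotation are inferred as strings rather than raw binary.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("binary-as-string-sketch") // hypothetical application name
  .config("spark.sql.parquet.binaryAsString", "true")
  .getOrCreate()

// Files written by Impala or older Spark store strings as plain BINARY.
// With the flag on, such columns are inferred as StringType; with the
// default (false), they surface as BinaryType.
spark.read.parquet("/data/impala-export").printSchema() // hypothetical path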
spark.sql.parquet.enableVectorizedReader
When true, enables vectorized Parquet decoding.
Defaults to true.
WARNING: Vectorized Parquet decoding is currently only partially supported in Spark. If all of the columns in the underlying schema are atomic types (e.g., primitives), vectorized reads are enabled. If any column in the underlying schema is a non-atomic type (e.g., a complex type such as an array, map, or struct), vectorized reads are disabled regardless of whether this option is enabled.
-- spark.properties: spark.sql.parquet.enableVectorizedReader=false
CREATE STREAM foo
FROM PARQUET
OPTIONS();
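As a plain-Spark illustration of the warning above (both paths are hypothetical), the fallback is decided for the scan as a whole, not per column:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("vectorized-reader-sketch") // hypothetical application name
  .config("spark.sql.parquet.enableVectorizedReader", "true")
  .getOrCreate()

// Every column atomic (e.g., INT64, DOUBLE, BINARY/UTF8):
// the vectorized decoder is used.
spark.read.parquet("/data/flat").count() // hypothetical path

// At least one complex column (array, map, or struct): Spark silently
// falls back to the row-based reader even though the option is true.
spark.read.parquet("/data/nested").count() // hypothetical path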
spark.sql.parquet.filterPushdown
When true, enables Parquet filter (e.g., predicate) push-down optimization.
Defaults to true.
-- spark.properties: spark.sql.parquet.filterPushdown=false
CREATE STREAM foo
FROM PARQUET
OPTIONS();
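For context on what push-down buys you, here is a plain-Spark sketch (the path and column names are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("filter-pushdown-sketch") // hypothetical application name
  .config("spark.sql.parquet.filterPushdown", "true")
  .getOrCreate()

// With push-down enabled, this equality predicate is handed to the
// Parquet reader, which checks it against row-group statistics and can
// skip non-matching row groups without decoding them.
spark.read.parquet("/data/events") // hypothetical path
  .filter(col("event_type") === "click") // hypothetical column
  .count()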
spark.sql.parquet.int96AsTimestamp
Some Parquet-producing systems, in particular Impala, store timestamps as INT96. Spark also stores timestamps as INT96 to avoid losing precision in the nanoseconds field. This flag tells Spark to interpret INT96 data as a timestamp to provide compatibility with these systems.
Defaults to true.
-- spark.properties: spark.sql.parquet.int96AsTimestamp=false
CREATE STREAM foo
FROM PARQUET
OPTIONS();
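A minimal plain-Spark sketch of the default behavior (the path is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("int96-timestamp-sketch") // hypothetical application name
  .config("spark.sql.parquet.int96AsTimestamp", "true") // the default
  .getOrCreate()

// Impala writes timestamps as INT96; with the flag on, such columns are
// inferred as TimestampType and are directly usable in time-based logic.
spark.read.parquet("/data/impala-events").printSchema() // hypothetical path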
spark.sql.parquet.int64AsTimestampMillis
When true, timestamp values will be stored as INT64 with TIMESTAMP_MILLIS as the extended type. In this mode, the microsecond portion of the timestamp value will be truncated.
Defaults to false.
-- spark.properties: spark.sql.parquet.int64AsTimestampMillis=true
CREATE STREAM foo
FROM PARQUET
OPTIONS();
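Because this flag governs how timestamps are written, a round-trip makes the truncation visible. A minimal plain-Spark sketch (the path is hypothetical):

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("timestamp-millis-sketch") // hypothetical application name
  .config("spark.sql.parquet.int64AsTimestampMillis", "true")
  .getOrCreate()
import spark.implicits._

// The source value carries microseconds...
val ts = Timestamp.valueOf("2018-01-01 12:00:00.123456")
Seq(ts).toDF("event_time").write.parquet("/tmp/ts-millis") // hypothetical path

// ...but TIMESTAMP_MILLIS keeps only milliseconds, so the value reads
// back as 2018-01-01 12:00:00.123.
spark.read.parquet("/tmp/ts-millis").show(truncate = false)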