# Practical value
Use the latest 1.9.x release series; the 1.8.x series cannot read the metadata.
```
./bin/hadoop jar /home/atr/parquet-tools-1.9.0.jar --help
usage: parquet-tools cat [option...] <input>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
-j,--json Show records in JSON format.
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
usage: parquet-tools head [option...] <input>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
-n,--records <arg> The number of records to show (default: 5)
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
usage: parquet-tools schema [option...] <input>
where option is one of:
-d,--detailed Show detailed information about the schema.
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the parquet file containing the schema to show
usage: parquet-tools meta [option...] <input>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
usage: parquet-tools dump [option...] <input>
where option is one of:
-c,--column <arg> Dump only the given column, can be specified more than
once
-d,--disable-data Do not dump column data
--debug Enable debug output
-h,--help Show this help string
-m,--disable-meta Do not dump row group and page metadata
-n,--disable-crop Do not crop the output based on console width
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
usage: parquet-tools merge [option...] <input> [<input> ...] <output>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the source parquet files/directory to be merged
<output> is the destination parquet file
```
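The same footer metadata that `parquet-tools schema` and `parquet-tools meta` print can also be read programmatically with the parquet-hadoop Java API. Below is a minimal sketch; it assumes parquet-hadoop 1.9.x on the classpath (the `readFooter` entry point used here exists in 1.9.x but is deprecated in later releases in favor of `ParquetFileReader.open`), and the class name and argument handling are illustrative, not part of parquet-tools:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ShowMeta {
    public static void main(String[] args) throws Exception {
        // File to inspect, passed as the first argument.
        Path path = new Path(args[0]);
        Configuration conf = new Configuration();
        // NO_FILTER reads the metadata of all row groups in the footer.
        ParquetMetadata footer =
                ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
        // The message schema, as printed by `parquet-tools schema`.
        System.out.println(footer.getFileMetaData().getSchema());
        // One summary line per row group, roughly what `parquet-tools meta` shows.
        for (BlockMetaData block : footer.getBlocks()) {
            System.out.println("row group: rows=" + block.getRowCount()
                    + " totalByteSize=" + block.getTotalByteSize());
        }
    }
}
```

Run it with the hadoop-client and parquet-hadoop jars on the classpath, passing the Parquet file path as the first argument.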
```
$ ./bin/hadoop jar /home/atr/parquet-tools-1.9.0.jar dump --disable-data -n /ex/s/1
17/11/30 15:05:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
row group 0
--------------------------------------------------------------------------------
intKey: INT32 UNCOMPRESSED DO:0 FPO:4 SZ:436/436/1.00 VC:100 ENC:BIT_PACKED,PLAIN
payload: BINARY UNCOMPRESSED DO:0 FPO:440 SZ:104887/104887/1.00 VC:100 ENC:BIT_PACKED,RLE,PLAIN
intKey TV=100 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[min: 17507697, max: 2145850160, num_nulls: 0] SZ:400 VC:100
payload TV=100 RL=0 DL=1
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:102807 VC:100
```
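Reading the dump: per column chunk, `DO` is the dictionary page offset (0 when there is no dictionary page), `FPO` the first data page offset, `SZ` the compressed/uncompressed sizes and their ratio, `VC` the value count, and `ENC` the set of encodings used; `TV` is the total number of values and `RL`/`DL` the maximum repetition and definition levels. Per page, `DLE`, `RLE`, and `VLE` are the definition-level, repetition-level, and value encodings, and `ST` the min/max/null-count statistics (absent here for the `payload` column). The column-chunk fields are reachable from the same footer object; a minimal sketch under the same 1.9.x assumptions as above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class DumpColumnChunks {
    public static void main(String[] args) throws Exception {
        ParquetMetadata footer = ParquetFileReader.readFooter(
                new Configuration(), new Path(args[0]), ParquetMetadataConverter.NO_FILTER);
        int rg = 0;
        for (BlockMetaData block : footer.getBlocks()) {
            System.out.println("row group " + rg++);
            for (ColumnChunkMetaData col : block.getColumns()) {
                // Mirrors the DO/FPO/SZ/VC/ENC fields of `parquet-tools dump`.
                System.out.println(col.getPath() + ": " + col.getType()
                        + " " + col.getCodec()
                        + " DO:" + col.getDictionaryPageOffset()
                        + " FPO:" + col.getFirstDataPageOffset()
                        + " SZ:" + col.getTotalSize() + "/" + col.getTotalUncompressedSize()
                        + " VC:" + col.getValueCount()
                        + " ENC:" + col.getEncodings());
            }
        }
    }
}
```

Per-page details (the `DLE`/`RLE`/`VLE`/`ST` lines) are not in the footer; parquet-tools reads the pages themselves to produce those, so the dump command remains the convenient way to see them.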
https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example shows how to use the C++ tooling to display more information.