Practical value - animeshtrivedi/notes GitHub Wiki

How to use the parquet tool

Use the latest release in the 1.9.x series; the 1.8.x series cannot read the metadata.

./bin/hadoop jar /home/atr/parquet-tools-1.9.0.jar --help 
usage: parquet-tools cat [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
    -j,--json      Show records in JSON format.
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools head [option...] <input>
where option is one of:
       --debug          Enable debug output
    -h,--help           Show this help string
    -n,--records <arg>  The number of records to show (default: 5)
       --no-color       Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools schema [option...] <input>
where option is one of:
    -d,--detailed  Show detailed information about the schema.
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file containing the schema to show

usage: parquet-tools meta [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools dump [option...] <input>
where option is one of:
    -c,--column <arg>  Dump only the given column, can be specified more than
                       once
    -d,--disable-data  Do not dump column data
       --debug         Enable debug output
    -h,--help          Show this help string
    -m,--disable-meta  Do not dump row group and page metadata
    -n,--disable-crop  Do not crop the output based on console width
       --no-color      Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools merge [option...] <input> [<input> ...] <output>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the source parquet files/directory to be merged
   <output> is the destination parquet file
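
The subcommands listed above all share the same invocation shape, so a small wrapper can save retyping the jar path. This is a minimal sketch rather than part of parquet-tools: the helper names `parquet_tools_argv` and `run_parquet_tools` are hypothetical, the hadoop launcher and jar paths are the ones used in these notes, and `/ex/s/1` is the example file from the dump section. Only flags that appear in the help output are used.

```python
import subprocess

# Paths taken from these notes; adjust them for your own installation.
HADOOP = "./bin/hadoop"
TOOLS_JAR = "/home/atr/parquet-tools-1.9.0.jar"


def parquet_tools_argv(subcommand, *args):
    """Build the argument list for one parquet-tools subcommand (hypothetical helper)."""
    return [HADOOP, "jar", TOOLS_JAR, subcommand, *args]


def run_parquet_tools(subcommand, *args):
    """Run the subcommand and return its stdout as text (needs hadoop and the jar)."""
    result = subprocess.run(
        parquet_tools_argv(subcommand, *args),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Argument lists for a few common inspections, using flags from the help text.
SCHEMA_CMD = parquet_tools_argv("schema", "-d", "/ex/s/1")       # detailed schema
HEAD_CMD = parquet_tools_argv("head", "-n", "10", "/ex/s/1")     # first 10 records
CAT_JSON_CMD = parquet_tools_argv("cat", "--json", "/ex/s/1")    # records as JSON
```

Calling `run_parquet_tools("meta", "/ex/s/1")` would then return the metadata listing as a string, provided hadoop and the jar are present at those paths.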

Show (min, max) metadata

$ ./bin/hadoop jar /home/atr/parquet-tools-1.9.0.jar dump --disable-data -n /ex/s/1
17/11/30 15:05:46 0 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
row group 0 
--------------------------------------------------------------------------------
intKey:   INT32 UNCOMPRESSED DO:0 FPO:4 SZ:436/436/1.00 VC:100 ENC:BIT_PACKED,PLAIN
payload:  BINARY UNCOMPRESSED DO:0 FPO:440 SZ:104887/104887/1.00 VC:100 ENC:BIT_PACKED,RLE,PLAIN

    intKey TV=100 RL=0 DL=0
    ----------------------------------------------------------------------------
    page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[min: 17507697, max: 2145850160, num_nulls: 0] SZ:400 VC:100

    payload TV=100 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:102807 VC:100
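
The per-page `ST:[...]` field in the dump above is where the (min, max) values appear. If you want them in a script, a rough sketch like the following can pull them out of the text. It assumes the dump layout matches the 1.9.0 output shown here (the format is not a stable interface), and `page_stats` is a hypothetical helper name. Values are kept as strings because their type depends on the column, and a min or max containing `]` would confuse this simple pattern.

```python
import re

# A column header inside a row group, e.g. "    intKey TV=100 RL=0 DL=0"
COLUMN_RE = re.compile(r"^\s*(\S+) TV=\d+ RL=\d+ DL=\d+")
# A page line; group 2 is whatever sits inside "ST:[...]"
PAGE_RE = re.compile(r"^\s*page (\d+):.*?ST:\[(.*?)\]")
# The statistics themselves, when present
STAT_RE = re.compile(r"min: (.*), max: (.*), num_nulls: (\d+)")

# Excerpt of the dump output shown in these notes.
SAMPLE = """\
row group 0
intKey:   INT32 UNCOMPRESSED DO:0 FPO:4 SZ:436/436/1.00 VC:100 ENC:BIT_PACKED,PLAIN
payload:  BINARY UNCOMPRESSED DO:0 FPO:440 SZ:104887/104887/1.00 VC:100 ENC:BIT_PACKED,RLE,PLAIN

    intKey TV=100 RL=0 DL=0
    page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[min: 17507697, max: 2145850160, num_nulls: 0] SZ:400 VC:100

    payload TV=100 RL=0 DL=1
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:102807 VC:100
"""


def page_stats(dump_text):
    """Return (column, page, min, max, num_nulls) tuples, one per page line.

    min, max, and num_nulls are None when the page carries no statistics.
    """
    results = []
    column = None
    for line in dump_text.splitlines():
        col = COLUMN_RE.match(line)
        if col:
            column = col.group(1)
            continue
        page = PAGE_RE.match(line)
        if not page:
            continue
        stats = STAT_RE.fullmatch(page.group(2))
        if stats:
            results.append((column, int(page.group(1)),
                            stats.group(1), stats.group(2), int(stats.group(3))))
        else:
            results.append((column, int(page.group(1)), None, None, None))
    return results


if __name__ == "__main__":
    for row in page_stats(SAMPLE):
        print(row)
```

In the sample, `intKey` has statistics while `payload` (a BINARY column) reports none, so the second tuple carries `None` values.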

This blog post describes how to use the C++ tool to display more information: https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example
