Manipulating Hive Tables - ignacio-alorre/Hive GitHub Wiki
1- Creating a Hive table on Avro files
CREATE EXTERNAL TABLE IF NOT EXISTS exampleTb
STORED AS AVRO
LOCATION '/data/raw/example/full/'
TBLPROPERTIES ('avro.schema.url'='hdfs:///apps/conf/avro/raw/example/example_schema.avsc');
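Since the columns come from the .avsc file, none are listed in the DDL. A quick way to check that Hive picked the schema up (a sketch, assuming the table name above):

```sql
-- Shows the columns Hive derived from the Avro schema,
-- plus the table location and TBLPROPERTIES
DESCRIBE FORMATTED exampleTb;

-- Sanity check that the Avro files under the location are readable
SELECT * FROM exampleTb LIMIT 10;
```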
2- Creating a Hive table on Parquet Files
First we need to create an Avro table, given the required schema as an .avsc file. Then we can create a Parquet table using the Avro one as a reference.
CREATE EXTERNAL TABLE IF NOT EXISTS exampleTb
STORED AS AVRO
LOCATION '/data/raw/example/full/'
TBLPROPERTIES ('avro.schema.url'='hdfs:///apps/conf/avro/raw/example/example_schema.avsc');
CREATE EXTERNAL TABLE IF NOT EXISTS exampleTbParquet
LIKE exampleTb
STORED AS PARQUET
LOCATION '/data/raw/example/full/parquetVersion/';
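The LIKE clause copies only the column definitions; it does not copy or convert any data. A minimal sketch of populating the Parquet table from the Avro one (assuming the two table names above):

```sql
-- Reads the Avro files through exampleTb and writes
-- Parquet files under the new table's location
INSERT OVERWRITE TABLE exampleTbParquet
SELECT * FROM exampleTb;
```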
3- Creating a Hive table wrapping full & incremental
The scenario is as follows: there was an initial load with all the historical files, which were placed under a subdirectory named full. Then, every day, new data arrives incrementally. This data is stored under incremental, in an independent subdirectory named after the ingestion date (dd-MM-yyyy). For example:
/data/raw/example/full
/data/raw/example/incremental/01-01-2020
/data/raw/example/incremental/02-01-2020
/data/raw/example/incremental/03-01-2020
...
We need to set the following flags, so the table looks for the Avro files nested under the root directory given when creating the Hive table, /data/raw/example/
set hive.mapred.supports.subdirectories=TRUE;
set mapred.input.dir.recursive=TRUE;
CREATE EXTERNAL TABLE IF NOT EXISTS exampleTb
STORED AS AVRO
LOCATION '/data/raw/example/'
TBLPROPERTIES ('avro.schema.url'='hdfs:////apps/conf/avro/raw/example/example_schema.avsc');
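Note that the two set commands only apply to the current session; to make the behavior permanent they would need to go into the Hive configuration. With recursion enabled, a single query transparently scans both full and every dated incremental subdirectory, e.g.:

```sql
-- Counts rows across /data/raw/example/full and each
-- /data/raw/example/incremental/dd-MM-yyyy subdirectory
SELECT COUNT(*) FROM exampleTb;
```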
4- Creating a Hive table which contains a new column with respect to the historical version
The scenario is as follows: for several months you have been receiving Avro files with 10 columns. But starting today, a new column is added to the Avro files (so 11 columns from now on). You need to create a table which is able to point to both the historical files with 10 columns and the new files with 11.
It is then required to give that column a default value in the Avro schema:
{
"name" : "EXPERIMENT",
"type" : [ "null", "string" ],
"default" : null
}
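For context, a minimal sketch of how such a field sits inside a complete .avsc record (the record name matches the table; any field other than EXPERIMENT is illustrative):

```json
{
  "type" : "record",
  "name" : "exampleTb",
  "fields" : [
    { "name" : "ID", "type" : [ "null", "string" ], "default" : null },
    { "name" : "EXPERIMENT", "type" : [ "null", "string" ], "default" : null }
  ]
}
```

Because EXPERIMENT is nullable with a null default, records from the historical 10-column files resolve against this schema with EXPERIMENT set to null.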
Otherwise, querying the table will trigger the following exception:
Bad status for request TFetchResultsReq... : TFetchResultsResp(status=TStatus(errorCode=0, errorMessage='java.io.IOException: org.apache.avro.AvroTypeException: Found topLevelRecord, expecting exampleTb, missing required field EXPERIMENT
Note: In the opposite scenario, where we need to drop a column from the original dataset (for example, a Hive table with 9 columns instead of 10), this thread shares an approach, though I am not sure it is the best one: https://stackoverflow.com/questions/25032097/hive-load-specific-columns