Hive SerDe - ayushmathur94/Spark GitHub Wiki
Serialization and deserialization (SerDe) are important concepts in Hive that allow Hive to interact with different data formats.
Serialization is the process of converting a complex data structure or object into a format that can be stored or transmitted. In Hive, this process is used to convert the data in a table into a format that can be stored on disk or transmitted over a network.
Deserialization is the opposite process, where the serialized data is converted back into its original form. In Hive, this process is used to read the data stored on disk or received over a network and convert it into a format that can be used by Hive.
Hive uses SerDe (Serializer and Deserializer) classes to handle different data formats. These classes are used to convert the data between the internal format used by Hive and the data's external format. Hive ships with a number of built-in SerDe classes that can be used to handle common data formats like CSV, JSON, and Avro. Developers can also write their own custom SerDe classes to handle other data formats.
For example, when a Hive table is created with a TEXTFILE data format, Hive uses the TEXTFILE SerDe by default. This SerDe can read and write data in the standard text format used by Hive. Similarly, when a table is created with a JSON data format, Hive uses the JSON SerDe to read and write data in JSON format.
When you create a table in Hive, you can specify the SerDe to be used for the table. For example, when creating a table that will store JSON data, you can specify the JSON SerDe to be used to read and write the data.
In summary, the role of serializer and deserializer in Hive is to convert the data between the internal format used by Hive and the data's external format, so that Hive can understand the data and interact with it properly.
Here is an example of creating a Hive table that stores JSON data and specifying the JSON SerDe to be used to read and write the data: CREATE TABLE json_table (json_data string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE; In this example, we are creating a table called "json_table" with a single column called "json_data" of data type string. The ROW FORMAT SERDE clause is used to specify the SerDe to be used for the table. In this case, we are specifying the 'org.openx.data.jsonserde.JsonSerDe' SerDe, which is a built-in SerDe for reading and writing JSON data in Hive. The STORED AS TEXTFILE clause is used to specify the storage format for the table. In this case, we are specifying that the data should be stored in a standard text format, which is the default format for Hive. Once this table is created, you can load data in json format into it and query it using HiveQL. Please note that the Serde package 'org.openx.data.jsonserde.JsonSerDe' is not part of the default hive distribution, you would need to download and add it to your classpath, in case it's not available in your hive distribution.
Here is an example of a Hive query that uses the ROW FORMAT SERDE clause to specify a custom SerDe for
reading and writing data in a table:
Copy code
CREATE TABLE custom_table (id int, name string, address string)
ROW FORMAT SERDE 'com.example.hive.CustomSerDe'
WITH SERDEPROPERTIES ('field.delimiter' = '|')
STORED AS TEXTFILE;
In this example, we are creating a table called "custom_table" with three columns: "id", "name", and "address".
The ROW FORMAT SERDE clause is used to specify the custom SerDe 'com.example.hive.CustomSerDe' to be
used for the table. This SerDe is used to read and write data in the table using a custom format, this
is a custom serde class you need to have in your classpath.
The WITH SERDEPROPERTIES clause is used to set properties for the SerDe. In this case, we are specifying
that the field delimiter for the data should be the '|' character.
The STORED AS TEXTFILE clause is used to specify the storage format for the table. In this case,
we are specifying that the data should be stored in a standard text format, which is the default format for Hive.
Once this table is created, you can load data into it and query it using HiveQL. This table will be read and
written by the custom SerDe specified in the ROW FORMAT SERDE clause, which will use the custom format and
field delimiter specified in the WITH SERDEPROPERTIES clause.
Here is an example of a Hive query that uses the ROW FORMAT SERDE clause to specify a built-in SerDe
provided by Hive for reading and writing data in a table:
CREATE TABLE csv_table (id int, name string, address string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',')
STORED AS TEXTFILE;
In this example, we are creating a table called "csv_table" with three columns: "id", "name", and "address".
The ROW FORMAT SERDE clause is used to specify the built-in SerDe 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
to be used for the table. This SerDe is provided by Hive and is used to read and write data in CSV format.
The WITH SERDEPROPERTIES clause is used to set properties for the SerDe. In this case, we are specifying that
the field separator for the CSV data should be the ',' character.
The STORED AS TEXTFILE clause is used to specify the storage format for the table. In this case, we are
specifying that the data should be stored in a standard text format, which is the default format for Hive.
Once this table is created, you can load CSV data into it and query it using HiveQL. This table will be
read and written by the built-in SerDe specified in the ROW FORMAT SERDE clause, which will use the CSV format
and field separator specified in the WITH SERDEPROPERTIES clause.
Please note that this Serde package 'org.apache.hadoop.hive.serde2.OpenCSVSerde' is part of the Hive
distribution but it's not available in all of the distribution.