EMR 013 Query DynamoDB Export Backup Data in Hive Presto Athena - qyjohn/AWS_Tutorials GitHub Wiki
AWS customers can use Data Pipeline to export the data in DynamoDB tables to S3. It is desired to have the capability to query the exported data directly from Hive / Presto / Athena. However, the exported data has the DynamoDB data type along with actual data.
For example, you have the following attributes in the DynamoDB table:
year, Number
last_updated_at, Number
body_class, String
created_at, Number
make_model_code, String
countries, String Set
fuel_type, String
attributes_confirmed, String
After you export the table to S3 using Data Pipeline, the exported data would look like the following. Each attribute is represented by a JSON object with both the data type and the actual data.
{"year":{"n":"2013"},"last_updated_at":{"n":"1564607918006"},"body_class":{"s":"Sedan"},"created_at":{"n":"1564607918006"},"make_model_code":{"s":"XXX"},"countries":{"sS":["CA","US"]},"fuel_type":{"s":"gas"},"attributes_confirmed":{"s":"could_not_confirm"}}
{"year":{"n":"2014"},"last_updated_at":{"n":"1564607918007"},"body_class":{"s":"Carola"},"created_at":{"n":"1564607918007"},"make_model_code":{"s":"XX"},"countries":{"sS":["AU","CN"]},"fuel_type":{"s":"gas"},"attributes_confirmed":{"s":"confirmed"}}
If you use Glue to crawl the data, you will get the following table schema. Each column now becomes a struct, and the DynamoDB data type becomes an element in the struct.
  year struct<n:bigint>, 
  last_updated_at struct<n:bigint>, 
  body_class struct<s:string>, 
  created_at struct<n:bigint>, 
  make_model_code struct<s:string>, 
  countries struct<ss:array<string>>, 
  fuel_type struct<s:string>, 
  attributes_confirmed struct<s:string>
If you want to query by the value of year in DynamoDB, you will need to query by year.n in Hive. For example:
SELECT COUNT(*) FROM t WHERE year.n = 2013;
It is desired to be able to query the table by year instead of year.n. There exists no SerDe that can achieve this directly. However, you can achieve the same effect using a view. For example:
CREATE EXTERNAL TABLE ddb_export_table(
  year struct<n:bigint>, 
  last_updated_at struct<n:bigint>, 
  body_class struct<s:string>, 
  created_at struct<n:bigint>, 
  make_model_code struct<s:string>, 
  countries struct<ss:array<string>>, 
  fuel_type struct<s:string>, 
  attributes_confirmed struct<s:string>)
ROW FORMAT SERDE  'org.openx.data.jsonserde.JsonSerDe' 
LOCATION 's3://your-bucket-name/dynamodb-export-data/';
CREATE VIEW ddb_view AS SELECT
  year.n AS year,
  last_updated_at.n AS last_updated,
  body_class.s AS body_class,
  created_at.n AS created_at,
  make_model_code.s AS make_model_code,
  countries.ss AS countries,
  fuel_type.s AS fuel_type,
  attributes_confirmed.s AS attributes_confirmed
FROM ddb_export_table;
SELECT * FROM ddb_view WHERE year=2013;
SELECT COUNT(*) FROM ddb_view WHERE year=2013;