TPC-DS Tests
(1) Generate Test Data
First, download the TPC-DS tools and extract the archive into a folder (TPC-DS). Then compile the tools and generate some test data:
cd tools
make
./dsdgen
By default, the dsdgen utility generates approximately 1 GB of data. The -scale option sets the target size in GB: for example, a scale of 5 generates roughly 5 GB of data, and a scale of 10 generates roughly 10 GB.
./dsdgen -scale 5 -force
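For larger scale factors, single-threaded generation can take hours. Recent TPC-DS kits let dsdgen split the work into chunks via the -parallel and -child options; a minimal sketch, assuming your dsdgen build supports these flags and that /data/tpcds is a placeholder output path:
#!/bin/bash
# Generate a scale-100 (roughly 100 GB) data set in 4 parallel chunks.
# Assumes this dsdgen build supports the -dir, -parallel, and -child options.
SCALE=100
CHUNKS=4
mkdir -p /data/tpcds
for i in $(seq 1 $CHUNKS); do
  ./dsdgen -scale $SCALE -dir /data/tpcds -parallel $CHUNKS -child $i -force &
done
wait   # block until every chunk has finished writing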
(2) Loading Data Into MySQL
Create a database in your MySQL server, say "tpcds", then import the table definitions:
mysql -h hostname -u username -p tpcds < tpcds.sql
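If the tpcds database does not exist yet, create it first; for example:
mysql -h hostname -u username -p -e "CREATE DATABASE tpcds"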
Use the following SQL script (bulk_load.sql) to import the data:
LOAD DATA LOCAL INFILE 'call_center.dat' INTO TABLE call_center FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'catalog_page.dat' INTO TABLE catalog_page FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'catalog_returns.dat' INTO TABLE catalog_returns FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'catalog_sales.dat' INTO TABLE catalog_sales FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'customer_address.dat' INTO TABLE customer_address FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'customer.dat' INTO TABLE customer FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'customer_demographics.dat' INTO TABLE customer_demographics FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'date_dim.dat' INTO TABLE date_dim FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'dbgen_version.dat' INTO TABLE dbgen_version FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'household_demographics.dat' INTO TABLE household_demographics FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'income_band.dat' INTO TABLE income_band FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'inventory.dat' INTO TABLE inventory FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'item.dat' INTO TABLE item FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'promotion.dat' INTO TABLE promotion FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'reason.dat' INTO TABLE reason FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'ship_mode.dat' INTO TABLE ship_mode FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'store.dat' INTO TABLE store FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'store_returns.dat' INTO TABLE store_returns FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'store_sales.dat' INTO TABLE store_sales FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'time_dim.dat' INTO TABLE time_dim FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'warehouse.dat' INTO TABLE warehouse FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'web_page.dat' INTO TABLE web_page FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'web_returns.dat' INTO TABLE web_returns FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'web_sales.dat' INTO TABLE web_sales FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'web_site.dat' INTO TABLE web_site FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
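Alternatively, rather than maintaining this script by hand, you can generate it from the .dat files themselves, in the same spirit as the Redshift script in the next section. A minimal sketch, assuming it runs from the directory that contains the .dat files:
#!/bin/bash
# Emit one LOAD DATA statement per .dat file; the table name is simply
# the file name without its extension.
for f in *.dat; do
  t=${f%.dat}
  echo "LOAD DATA LOCAL INFILE '$f' INTO TABLE $t FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';"
done > bulk_load.sql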
Use the following commands to perform the bulk load, then apply the referential-integrity constraints (tpcds_ri.sql). If the server rejects LOAD DATA LOCAL INFILE, local_infile needs to be enabled on both the server and the client (hence the --local-infile=1 flag below):
mysql --local-infile=1 -h hostname -u username -p tpcds < bulk_load.sql
mysql -h hostname -u username -p tpcds < tpcds_ri.sql
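After the load completes, a quick row-count check helps confirm that every table was populated. Note that table_rows is only an estimate for InnoDB tables, so treat this as a rough sanity check:
mysql -h hostname -u username -p -e "SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema = 'tpcds';"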
(3) Loading Data Into Redshift
Redshift can bulk-load data directly from S3 with the COPY command, so the generated *.dat files first need to be uploaded to S3.
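A minimal upload sketch, assuming the AWS CLI is configured and that the bucket name and prefix are placeholders for your own:
# Upload only the generated .dat files from the tools directory.
aws s3 sync /path/to/TPC-DS/tools/ s3://my-bucket/folder/ --exclude '*' --include '*.dat'
With the .dat files in S3, the following bash script generates the COPY statements that perform the bulk load: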
#!/bin/bash
#
# Location of the input files (*.dat) on S3
#
s3='s3://my-bucket/folder/'
#
# IAM role for the import
#
iam='arn:aws:iam::xxxxxxxxxxxx:role/role_name'
#
# AWS Region
#
region='ap-southeast-2'
#
# Emits one statement per file, in the form:
# COPY call_center FROM 's3_url' iam_role 'iam_role' region 'ap-southeast-2';
#
for s in $(find . -name '*.dat'); do
  a=${s#*/}    # strip the leading "./" that find prepends
  b=${a%.*}    # strip the ".dat" extension to get the table name
  # Keep the ".dat" suffix in the S3 URL: COPY treats the path as a prefix,
  # so a bare "store" would also match store_returns.dat and store_sales.dat.
  # The trailing semicolon lets psql execute each statement.
  echo "COPY $b FROM '$s3$a' iam_role '$iam' region '$region';"
done
Save the command output as redshift_bulk_load.sql. Create a database in Redshift, then use the following commands to create the test tables, load the data, and apply the referential-integrity constraints:
psql -h redshift_cluster -U username -d tpcds_db -p 5439 < tpcds.sql
psql -h redshift_cluster -U username -d tpcds_db -p 5439 < redshift_bulk_load.sql
psql -h redshift_cluster -U username -d tpcds_db -p 5439 < tpcds_ri.sql
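As a sanity check after the COPY commands finish, Redshift's SVV_TABLE_INFO system view reports per-table row counts (it only lists tables that contain data):
psql -h redshift_cluster -U username -d tpcds_db -p 5439 -c 'SELECT "table", tbl_rows FROM svv_table_info ORDER BY "table";'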
(4) Generate Test Queries
Run dsqgen from the tools directory to turn the query templates into executable SQL. The netezza dialect produces PostgreSQL-flavored SQL, generally the closest fit among the bundled dialects for Redshift:
./dsqgen -DIRECTORY ../query_templates -INPUT ../query_templates/templates.lst -VERBOSE Y -QUALIFY Y -SCALE 1 -DIALECT netezza -OUTPUT_DIR /tmp
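By default, dsqgen writes the generated stream to query_0.sql in the OUTPUT_DIR. A minimal sketch for timing a full run against the Redshift cluster, assuming that default file name:
# Time the complete query stream; result sets go to a log file for inspection.
time psql -h redshift_cluster -U username -d tpcds_db -p 5439 -f /tmp/query_0.sql > query_results.log 2>&1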