# Ingestion deployment runbook
- Create the file `/usr/hdp/current/spark-client/conf/hive-site.xml` on all nodes of the Hadoop cluster with the following contents (one-time setup on the Hadoop cluster / VM); one way to push the file to every node is sketched after the snippet:
```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://sandbox.hortonworks.com:9083</value>
  </property>
</configuration>
```
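  Since the same file must exist on every node, a simple loop can distribute it. This is a minimal sketch, assuming passwordless SSH as root and a hypothetical `nodes.txt` listing the cluster hostnames:

  ```bash
  # Push hive-site.xml to every node listed in nodes.txt (hypothetical node list).
  while read -r node; do
    scp /usr/hdp/current/spark-client/conf/hive-site.xml \
        "root@${node}:/usr/hdp/current/spark-client/conf/hive-site.xml"
  done < nodes.txt
  ```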
- Create two folders on a shared network drive that is mounted on all nodes of the cluster (one-time setup on the Hadoop cluster / VM). The paths themselves can be anything, but the same paths must be referenced in the `spark-submit` command:
  - `/guzzle/conf`: the Guzzle config directory, accessible from all nodes. All environments, logical and physical endpoints, and job configuration files are placed under this folder.
  - `/guzzle/libs`: shared library dependencies, accessible from all nodes.

  Set the environment variable `GUZZLE_HOME=/guzzle` (example commands below).
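  A minimal sketch of this setup, assuming the shared drive is mounted at `/guzzle`; the `profile.d` location for the environment variable is one common choice, not something the runbook mandates:

  ```bash
  mkdir -p /guzzle/conf /guzzle/libs

  # Make GUZZLE_HOME available to login shells on this node.
  echo 'export GUZZLE_HOME=/guzzle' > /etc/profile.d/guzzle.sh
  ```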
- Build the archive of the required external dependency JAR files, copy it to the shared library folder, and extract it there (an extraction sketch follows):

```
./gradlew archiveDependencies
scp -r -P 2222 common/dependencies/guzzle-libs.zip [email protected]:/guzzle/libs/
```
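  The archive still needs to be unpacked on the cluster side; a sketch, assuming `unzip` is installed:

  ```bash
  cd /guzzle/libs && unzip -o guzzle-libs.zip
  ```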
- Download the MySQL JDBC driver into the `/guzzle/libs/` directory:

```
wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.46/mysql-connector-java-5.1.46.jar
```
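  Note that `central.maven.org` has since been retired; the same artifact is served from `repo1.maven.org`:

  ```bash
  # Equivalent download from the current Maven Central host.
  wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.46/mysql-connector-java-5.1.46.jar
  ```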
- Create the `$GUZZLE_HOME/bin` directory (`mkdir -p /guzzle/bin`).
- Build the Guzzle framework JAR files and copy them to the Hortonworks cluster edge node from which the Spark job will be submitted (a loop form is sketched below):

```
./gradlew :common:build :orchestration:build :ingestion:build :processing:build :recon:build :dq:build :housekeeping:build
scp -P 2222 common/build/libs/common.jar [email protected]:/guzzle/bin/
scp -P 2222 orchestration/build/libs/orchestration.jar [email protected]:/guzzle/bin/
scp -P 2222 ingestion/build/libs/ingestion.jar [email protected]:/guzzle/bin/
scp -P 2222 processing/build/libs/processing.jar [email protected]:/guzzle/bin/
scp -P 2222 recon/build/libs/recon.jar [email protected]:/guzzle/bin/
scp -P 2222 dq/build/libs/dq.jar [email protected]:/guzzle/bin/
scp -P 2222 housekeeping/build/libs/housekeeping.jar [email protected]:/guzzle/bin/
```
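  The seven `scp` invocations follow the same pattern, so a loop is equivalent; a sketch, reusing the same (redacted) scp target as above:

  ```bash
  # Copy each module's JAR to the edge node; target host redacted as in the runbook.
  for m in common orchestration ingestion processing recon dq housekeeping; do
    scp -P 2222 "$m/build/libs/$m.jar" "[email protected]:/guzzle/bin/"
  done
  ```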
- In the Ambari interface (logged in as admin), go to the HBase advanced configuration and add the following property in the Custom hbase-site section:

```
phoenix.schema.isNamespaceMappingEnabled=true
```
- Save the configuration and restart the HBase services.
- Create the HBase schema `guzzle` using Phoenix sqlline (`phoenix-sqlline localhost:2181:/hbase-unsecure`); a verification sketch follows:

```
CREATE SCHEMA IF NOT EXISTS guzzle;
```
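  To confirm the schema exists, you can query Phoenix's catalog from the shell; a sketch, assuming the same ZooKeeper quorum (`phoenix-sqlline` accepts an optional SQL file as its second argument):

  ```bash
  # List all Phoenix schemas; 'guzzle' should appear in the output.
  echo "SELECT DISTINCT TABLE_SCHEM FROM SYSTEM.CATALOG;" > /tmp/list-schemas.sql
  phoenix-sqlline localhost:2181:/hbase-unsecure /tmp/list-schemas.sql
  ```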
- Copy the Guzzle config folder to the shared network folder created for the Guzzle config:

```
scp -r -P 2222 samples/test-config/* [email protected]:/guzzle/conf/
```
- Generate the tables related to batches and job audits in the `guzzle` schema. Run this from `/guzzle/bin`, since the classpath references `./common.jar`:

```
java -cp /usr/hdp/current/phoenix-client/phoenix-client.jar:/guzzle/libs/*:./common.jar com.justanalytics.guzzle.common.DatabaseInitializer
```
- Create the target Hive table for the ingestion job. If the target table lives in a separate Hive schema, create the schema first (a way to run the DDL is sketched after the statements):

```sql
CREATE DATABASE test;

CREATE TABLE test.users (
  id           INT,
  first_name   STRING,
  last_name    STRING,
  age          DECIMAL(2,0),
  created_time TIMESTAMP
)
PARTITIONED BY (instance_id BIGINT, system STRING, location STRING);
```
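  One way to execute this DDL from the edge node, assuming the `hive` CLI is on the PATH (beeline against HiveServer2 works equally well); `create_users.sql` is a hypothetical file holding the statements above:

  ```bash
  hive -e "CREATE DATABASE IF NOT EXISTS test;"
  hive -f create_users.sql   # hypothetical file containing the CREATE TABLE statement
  ```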
- Upload the delimited file and control file to HDFS: copy the sample data files from `samples/test-data` to the HDFS folder `/test-data` (this is the path configured in the jobs; see the sketch below).
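  A minimal sketch, assuming the sample files have already been copied to the edge node:

  ```bash
  hdfs dfs -mkdir -p /test-data
  hdfs dfs -put samples/test-data/* /test-data/
  ```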
- Log in to the Hortonworks VM as the `maria_dev` user and run the following command:

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --driver-memory 512m \
  --executor-memory 512m \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/guzzle/conf/guzzle-log4j.properties" \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --conf "spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar:/guzzle/bin/*:/guzzle/libs/*:/guzzle/libs" \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.yarn.appMasterEnv.GUZZLE_HOME=/guzzle \
  --conf spark.executorEnv.GUZZLE_HOME=/guzzle \
  --conf spark.batchpipeline.threads=4 \
  --driver-class-path /usr/hdp/current/phoenix-client/phoenix-client.jar:/guzzle/libs/*:/guzzle/libs \
  --jars /guzzle/libs/mysql-connector-java-5.1.46.jar,/guzzle/libs/spark-xml_2.11-0.4.1.jar \
  --class com.justanalytics.guzzle.ingestion.Main \
  /guzzle/bin/ingestion.jar \
  environment=test location=IN system=default job_instance_id=302 \
  "business_date=2018-05-01 03:03:00.000" stage_id=123 job_config_name=csv_demo batch_id=123
```
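  In cluster deploy mode the driver runs inside the YARN application master, so progress and logs are checked through YARN rather than the local console; the standard YARN CLI commands for this:

  ```bash
  # Find the running application, then fetch its logs once it finishes.
  yarn application -list -appStates RUNNING
  yarn logs -applicationId <application_id>   # id taken from the list output
  ```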