Low level install instructions for the V1 synchronization service - IKANOW/Aleph2 GitHub Wiki

Overview

All of this will be provided via RPM. This is just an interim/internal page to capture the steps before that happens.

Currently this can only be installed on a single node of a cluster (this will change soon!)

Pre-install requirements

The following install should take place on the API node of a v1.0+ IKANOW cluster, with the following Hadoop distribution installed:

  • Any YARN based distribution (eg CDH5.x or HDP2.x) with the following services:
    • Storm, Zookeeper, Kafka, HDFS, MapReduce v2
    • Note that the only distribution that ships all of the above out of the box is Hortonworks HDP 2.1+
      • For v2 only functionality, only a vanilla install is required
        • Don't forget to download the "site configuration" ZIP from Ambari and copy all the *-site.xml files into the local YARN config directory listed below (/opt/aleph2-home/yarn-config)
      • (For v1 analytics functionality, v1.0+ of the IKANOW platform is required, along with the following additional HDP install steps; otherwise just ensure that hadoop.standalone_mode=true is set in the v1 configuration, eg /opt/infinite-install/config/infinite.configuration.properties)

Configuring the local file system

Create the following directory structure:

  • /opt/aleph2-home
    • /opt/aleph2-home/bin
    • /opt/aleph2-home/libs
    • /opt/aleph2-home/logs
    • /opt/aleph2-home/config
    • /opt/aleph2-home/cached-jars
    • /opt/aleph2-home/yarn-config
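
The whole tree can be created in one command; a minimal sketch (run with sudo/root as needed — the ALEPH2_HOME variable is just a convenience for this snippet):

```shell
# Create the Aleph2 local directory tree in one step.
# ALEPH2_HOME defaults to /opt/aleph2-home as listed above.
ALEPH2_HOME="${ALEPH2_HOME:-/opt/aleph2-home}"
mkdir -p "$ALEPH2_HOME"/{bin,libs,logs,config,cached-jars,yarn-config}
```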

Then populate the directories:

  • Copy the aleph2 JARs into /opt/aleph2-home/libs (see below for how to get them)
  • Copy the configuration file (see below) into /opt/aleph2-home/config
  • Copy all the files from the V1 Hadoop configuration directory into /opt/aleph2-home/yarn-config:
    • cp /opt/hadoop-infinite/mapreduce/hadoop/*.xml /opt/aleph2-home/yarn-config/
      • (If installing on an Infinit.e node running standalone Hadoop, then instead:
        • a) Download the HDFS, YARN, and MRv2 "site configuration" ZIPs from Ambari/HDP, unzip them, and copy the *-site.xml files into /opt/aleph2-home/yarn-config/
          • (or take the XML files directly from /usr/hdp/current/hadoop-yarn-client/etc/hadoop/*-site.xml)
        • b) Run sed -i s/'${hdp.version}'/<HDP_VERSION>/g /opt/aleph2-home/yarn-config/*.xml
          • (where "<HDP_VERSION>" can be obtained by running hadoop fs -ls /hdp/apps/, eg "2.2.4.2-2"))
  • Copy defaults.yaml from the HDP storm configuration directory (eg /usr/hdp/current/storm-client/conf/) into /opt/aleph2-home/yarn-config/storm.yaml (ie renaming it from defaults.yaml to storm.yaml)
  • Copy zoo.cfg from the HDP zookeeper configuration (eg from /usr/hdp/current/zookeeper-client/conf/zoo.cfg) into /opt/aleph2-home/yarn-config/
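
Taken together, the yarn-config population steps above can be sketched as a short script. The HDP paths and the awk extraction of the version string are assumptions — verify them against your install, and note that the sed step only applies to the standalone/HDP case:

```shell
# Populate /opt/aleph2-home/yarn-config (paths assume the layout described above)
cp /opt/hadoop-infinite/mapreduce/hadoop/*.xml /opt/aleph2-home/yarn-config/

# Standalone/HDP case only: discover the HDP version from HDFS (eg "2.2.4.2-2")
# and substitute it for the ${hdp.version} placeholders in the site XML files
HDP_VERSION=$(hadoop fs -ls /hdp/apps/ | awk -F/ '/hdp\/apps/ {print $NF; exit}')
sed -i "s/\${hdp.version}/${HDP_VERSION}/g" /opt/aleph2-home/yarn-config/*.xml

# Storm and Zookeeper configuration (defaults.yaml is renamed to storm.yaml)
cp /usr/hdp/current/storm-client/conf/defaults.yaml /opt/aleph2-home/yarn-config/storm.yaml
cp /usr/hdp/current/zookeeper-client/conf/zoo.cfg /opt/aleph2-home/yarn-config/
```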

"Chown" /opt/aleph2-home recursively to tomcat.tomcat (chown -R tomcat.tomcat /opt/aleph2-home/, using sudo if necessary)

Configuring the distributed file system

Using runuser hdfs -s /bin/sh -c "hadoop fs -mkdir -p <dir>", create the following directory structure:

  • /app
    • /app/aleph2
      • /app/aleph2/library
      • /app/aleph2/data

"Chown" /app/aleph2 recursively to tomcat (runuser hdfs -s /bin/sh -c "hadoop fs -chown -R tomcat /app/aleph2", using sudo if necessary)

Run the synchronization service

Inside /opt/aleph2-home/libs, run runuser tomcat -c "java -classpath '/opt/aleph2-home/config/:./*' com.ikanow.aleph2.data_import_manager.harvest.modules.IkanowV1SynchronizationModule ../config/v1_sync_service.properties"

Configuration file

The following configuration file should be placed into /opt/aleph2-home/config, called v1_sync_service.properties:

# SERVICES
service.CoreDistributedServices.interface=com.ikanow.aleph2.distributed_services.services.ICoreDistributedServices
service.CoreDistributedServices.service=com.ikanow.aleph2.distributed_services.services.CoreDistributedServices
service.StorageService.interface=com.ikanow.aleph2.data_model.interfaces.data_services.IStorageService
service.StorageService.service=com.ikanow.aleph2.storage_service_hdfs.services.HDFSStorageService
service.ManagementDbService.interface=com.ikanow.aleph2.data_model.interfaces.data_services.IManagementDbService
service.ManagementDbService.service=com.ikanow.aleph2.management_db.mongodb.services.MongoDbManagementDbService
service.CoreManagementDbService.interface=com.ikanow.aleph2.data_model.interfaces.data_services.IManagementDbService
service.CoreManagementDbService.service=com.ikanow.aleph2.management_db.services.CoreManagementDbService
service.SearchIndexService.interface=com.ikanow.aleph2.data_model.interfaces.data_services.ISearchIndexService
service.SearchIndexService.service=com.ikanow.aleph2.search_service.elasticsearch.services.ElasticsearchIndexService
# CONFIG

# MANAGEMENT DB:
MongoDbManagementDbService.mongodb_connection=localhost:27017
MongoDbManagementDbService.v1_enabled=true

# CORE DISTRIBUTED SERVICES
CoreDistributedServices.application_name=DataImportManager
CoreDistributedServices.application_port.DataImportManager=2252

# SEARCH INDEX
ElasticsearchCrudService.elasticsearch_connection=localhost:9300
#(use whatever cluster name is running at "elasticsearch_connection")

# DATA IMPORT MANAGER:
DataImportManager.harvest_enabled=true
DataImportManager.streaming_enrichment_enabled=true
DataImportManager.batch_enrichment_enabled=false

Logging

Place a file like the following into /opt/aleph2-home/config/log4j2.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
        <Appenders>
                <Console name="Console" target="SYSTEM_OUT">
                        <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss} [%t] %-5p %c{1}:%L - %msg%n" />
                </Console>
                <RollingFile name="fileWriter"
                             fileName="/opt/aleph2-home/logs/v1_sync_service.log"
                             filePattern="/opt/aleph2-home/logs/v1_sync_service.%d{yyyy-MM-dd}.gz">
                        <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss} [%t] %-5p %c{1}:%L - %msg%n" />
                        <TimeBasedTriggeringPolicy/>
                </RollingFile>
        </Appenders>
        <Loggers>
                <Root level="info">
                        <AppenderRef ref="fileWriter" />
                </Root>
        </Loggers>
</Configuration>

Obtaining the Aleph2 JARs

(NOTE: nightly builds are available here. The build instructions used to generate the nightlies are here)

In each of Aleph2 and Aleph2-contrib, from the top-level directory:

  • mvn -e clean install -Dmaven.test.skip=true [-Daleph2.version=<DESIRED VERSION ID>]
  • mvn -e clean package -Dmaven.test.skip=true -Daleph2.scope=provided [-Daleph2.version=<DESIRED VERSION ID>]

(You will need maven to point to a JDK 8.x - note the command line maven is recommended for "production JAR building" not Eclipse/M2E).

NOTE: there are currently some issues with circular test dependencies in this build. If Aleph2 fails at the management_db_service, build and install Aleph2-contrib (which should work), then repeat the Aleph2 build. To avoid the error altogether, first build only aleph2_data_model and aleph2_core_distributed_services from Aleph2, then everything from Aleph2-contrib, then everything from Aleph2.
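
One way to script that build ordering is sketched below. The repo locations under ~/github are assumptions, and the -pl module list mirrors the two projects named above:

```shell
# Build in the order that avoids the circular test dependency described above.
cd ~/github/Aleph2
mvn -e clean install -Dmaven.test.skip=true -pl aleph2_data_model,aleph2_core_distributed_services
cd ~/github/Aleph2-contrib && mvn -e clean install -Dmaven.test.skip=true
cd ~/github/Aleph2 && mvn -e clean install -Dmaven.test.skip=true
```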

This generates a target directory in each project directory containing a JAR named *-SNAPSHOT-shaded.jar.

These JARs should be copied into the /opt/aleph2-home/libs directory, eg:

  • /cygdrive/c/cygwin/bin/find ~/github/Aleph2 -name "*-SNAPSHOT-shaded.jar" -exec scp '{}' ec2-USER@HOST:/opt/aleph2-home/libs/ \;
  • /cygdrive/c/cygwin/bin/find ~/github/Aleph2-contrib -name "*-SNAPSHOT-shaded.jar" -exec scp '{}' ec2-USER@HOST:/opt/aleph2-home/libs/ \;