User Developer Guide

This guide introduces a developer to some of the basic, usable features of the system rather than the nitty-gritty details. It is broken into 3 sections:

  1. Basic Bucket (source) Creation/Options
  2. Debugging/Logging
  3. Deployment/Maintenance

# 1. Basic Bucket (source) Creation/Options

This section describes some of the basic source creation tasks and options. Sources can be thought of as falling into 1 of 2 similar camps:

  • Harvesting -> Bringing data into the system from external sources (files, web services, etc)
  • Analytics -> Manipulating data (enrichment, cleaning, validation, deduplication, etc)

These 2 camps are blurry, since we allow you to run whatever code you want in either step: you could use a harvester to generate data that you make up in code rather than pulling it in externally, or you could use an analytics engine to pull in external data. For clarity we break sources into these 2 camps, although both work by letting you run arbitrary code.

When creating a new bucket to manipulate data, you can follow this basic flow to determine which elements to pull into your source (this is loosely based on what is available in the card UI: https://github.com/Alex-Ikanow/aleph2_bucket_builder).

Input Data

  • If data is coming from an external source (file/web/etc), I should use one of the harvesters
  • Otherwise the data is already in the system, and I should use one of the analytics engines
  • If internal data is being passed to me via externalEmit, I should set my input to batch
  • If internal data is in another source, I should use one of the data_services to query that data (e.g. the search_index_service; see the sketch after this list)
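
For the last case, here is a rough sketch of querying another bucket's data through the search_index_service. The service lookup via the context is standard Aleph2, but the read-side method names (getDataService, getReadableCrudService, getObjectsBySpec) are assumptions from memory, so verify them against the aleph2_data_model javadocs:

```java
// Sketch only ("context" is the IEnrichmentModuleContext, "other_bucket" a
// DataBucketBean for the source being queried); method names are assumptions.
final ISearchIndexService search_index = context.getServiceContext()
        .getService(ISearchIndexService.class, Optional.empty())
        .orElseThrow(() -> new RuntimeException("search_index_service not configured"));

// (Assumed API) get a read-only CRUD view over the other bucket's data:
final ICrudService<JsonNode> other_bucket_data = search_index
        .getDataService().get()
        .getReadableCrudService(JsonNode.class,
                Arrays.asList(other_bucket), Optional.empty()).get();

// Query it like any other Aleph2 CRUD service:
other_bucket_data.getObjectsBySpec(CrudUtils.allOf())
        .thenAccept(cursor -> cursor.forEach(json -> {
            // ... use the queried records to enrich the current bucket
        }));
```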

Processing/Deduplication

  • If I want to do simple transformations, I can just use a JS enrichment engine to handle that
  • Otherwise I can use custom Java code for more complex operations or for things that need to be more heavily tested (see the skeleton after this list)
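
If you go the Java route, the shape of a batch enrichment module looks roughly like the following. This is a sketch from memory of the aleph2_data_model interfaces (IEnrichmentBatchModule and friends); exact signatures vary by version, so treat the parameter lists as assumptions and check the javadocs:

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

import scala.Tuple2;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Aleph2 imports (from the aleph2_data_model jar) elided; the interface and
// method signatures below are assumptions from memory -- check the javadocs.
public class MyEnricher implements IEnrichmentBatchModule {

    private IEnrichmentModuleContext _context;

    @Override
    public void onStageInitialize(final IEnrichmentModuleContext context,
            final DataBucketBean bucket,
            final EnrichmentControlMetadataBean control,
            final Tuple2<ProcessingStage, ProcessingStage> previous_next,
            final Optional<List<String>> next_grouping_fields) {
        _context = context; // hold onto the context for emitting (and logging, see section 2)
    }

    @Override
    public void onObjectBatch(final Stream<Tuple2<Long, IBatchRecord>> batch,
            final Optional<Integer> batch_size,
            final Optional<JsonNode> grouping_key) {
        batch.forEach(record -> {
            // simple transformation: tag each record, then re-emit it
            final ObjectNode json = (ObjectNode) record._2().getJson();
            json.put("enriched", true);
            _context.emitMutableObject(record._1(), json,
                    Optional.empty(), Optional.empty());
        });
    }

    @Override
    public void onStageComplete(final boolean is_original) {}
}
```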

Output Data

  • If I want data to go to another bucket, I should use externalEmit in my processing block (or add a processing block that emits); see the sketch after this list
  • The data schemas I turn on determine where and how my data is stored for this bucket
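
Inside a module like the skeleton above, the externalEmit case might look like this sketch; the externalEmit signature (bucket, either-JSON-or-map, annotations) is an assumption, so verify it against IEnrichmentModuleContext in your Aleph2 version:

```java
// Sketch: from inside onObjectBatch, route a record to a different bucket.
// BeanTemplateUtils is Aleph2's standard bean builder; the externalEmit
// signature is an assumption.
final DataBucketBean other_bucket = BeanTemplateUtils.build(DataBucketBean.class)
        .with(DataBucketBean::full_name, "/other/bucket/path")
        .done().get();

_context.externalEmit(other_bucket,
        Either.left(record._2().getJson()), // fj.data.Either: JSON or Map form
        Optional.empty());
```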

Harvesters

Processing

Batch Enrichment

Deduplication

https://github.com/IKANOW/Aleph2/wiki/Deduplication-service

Data Services (aka output options)

https://github.com/IKANOW/Aleph2/wiki/Service-schema:-generic-settings

https://github.com/IKANOW/Aleph2/wiki/Service-schema:-technology-overrides#search_index_service

Basic Info (a configuration sketch follows this list):

  • Search Index = Elasticsearch, for searchable (i.e. tokenized) fields; the mapping can be overridden
  • Columnar Schema = Elasticsearch, for efficient column aggregations
  • Temporal Schema = Elasticsearch, for adding temporal elements (dates, aging data out over time)
  • Document Schema = N/A, handles deduplication automatically (overwrite/merge/etc); custom deduplication can be implemented in a processing block (via Java/JS)
  • Storage Schema = HDFS, can store JSON at 3 stages of the lifecycle (raw = as received from the harvester, json = raw converted to JSON, processed = after enrichment)
  • Data Warehouse = N/A, no implementations yet (HBase exists), but intended to store data and allow SQL-like operations
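
In practice you normally toggle these schemas as JSON in the bucket builder UI, but the same structure can be built on the Java side (e.g. for tests) with Aleph2's BeanTemplateUtils builder. The inner bean and field names below (SearchIndexSchemaBean, enabled, etc.) are assumptions inferred from the service-schema pages linked above, so verify them against the aleph2_data_model javadocs:

```java
// Sketch: a data_schema enabling the search index + storage services for a
// bucket. Field/inner-class names are assumptions; see the service-schema wiki.
final DataSchemaBean schema = BeanTemplateUtils.build(DataSchemaBean.class)
        .with(DataSchemaBean::search_index_schema,
                BeanTemplateUtils.build(DataSchemaBean.SearchIndexSchemaBean.class)
                        .with(DataSchemaBean.SearchIndexSchemaBean::enabled, true)
                        .done().get())
        .with(DataSchemaBean::storage_schema,
                BeanTemplateUtils.build(DataSchemaBean.StorageSchemaBean.class)
                        .with(DataSchemaBean.StorageSchemaBean::enabled, true)
                        .done().get())
        .done().get();

final DataBucketBean bucket = BeanTemplateUtils.build(DataBucketBean.class)
        .with(DataBucketBean::full_name, "/my/bucket")
        .with(DataBucketBean::data_schema, schema)
        .done().get();
```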

# 2. Debugging/Logging

This section describes the various methods you can use to debug some of your logic and where you can see various logging elements.

/opt/aleph2-home/logs/v1_sync_service.log - This houses some system logs; most of these have been converted to use the internal ILoggingService described below.

ILoggingService - this is configured like all other Aleph2 services (/opt/aleph2-home/etc/v1_sync_service.properties). The default configuration sends log messages to Elasticsearch, in an index named "aleph2_logging_{full_name}*". Log messages are per bucket (hence the {full_name}), and the verbosity to output at is configured at the bucket level (I believe system messages default to INFO and everything else is off).

Bucket Logging Configuration - to change the default logging level for a bucket using the internal ILoggingService, you need to add 2 cards to your bucket (ManagementService -> LoggingService) and then choose a different default logging level in LoggingService. It can also be configured per service; see https://github.com/IKANOW/Aleph2/wiki/(Detailed-Design)-Aleph2-logging-overview

Viewing log messages - Because ILoggingService sends log messages to Elasticsearch, it is easiest to view them in Kibana, which gives you that platform's filtering/querying capabilities. The Infinite UI has built-in support for managing the query against your bucket (source). To use it, open the V2 Data Viewer widget. Modify your Infinite query to point to the source you want (Sources -> communities -> select your source) and query Infinite once to force the source selection to take place, then set the V2 Data Viewer widget settings (typically "Saved", BucketFilter: ON, "Logs"). This widget can also be used to view data, test data, or test logs by manipulating the settings. Streaming data can also be viewed via the "Live" setting.

Debugging Logstash - The default Logstash harvester turns off logging for production buckets but shows the logs for test runs. The logs are pushed into Elasticsearch via the default ILoggingService and are most easily viewed there. If you need to turn on logging for production runs, you can modify /var/init.d/logstash to force the service to start up with logging pointing somewhere (i.e. override the log destination, which defaults to /dev/null, with something like -l /opt/logstash.log).

Debugging JavaScript - Aleph2 has built-in support for passing log messages from JS to the built-in ILoggingService. Simply use the built-in functions:

  • _a2.log_error("my message here");
  • _a2.log_info("my message here");
  • //etc, etc

Debugging Java - The currently configured ILoggingService is available via the context object passed to your processing code (e.g. in an IEnrichmentBatchModule you receive the IEnrichmentModuleContext during onStageInitialize). You can hold onto this object and use the logger as needed; these log messages will show up in Elasticsearch with everything else by default. Note: the logger uses a class called BasicMessageBean to describe the expected fields to send to the logger. Anything put in the BasicMessageBean.details map is searchable in Kibana (as it gets its own field in Elasticsearch). See the sketch below.
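
Putting that together, here is a minimal sketch of the logging flow inside the enrichment skeleton from section 1. It assumes ILoggingService.getLogger(bucket) hands back a per-bucket IBucketLogger with a log(Level, BasicMessageBean) method, and that BasicMessageBean's constructor takes (date, success, source, command, code, message, details); both are from memory, so check the javadocs:

```java
import java.util.Date;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

import org.apache.logging.log4j.Level;

import scala.Tuple2;

import com.fasterxml.jackson.databind.JsonNode;
import com.google.common.collect.ImmutableMap;

// Aleph2 imports elided; interface/method names below are assumptions.
public class MyLoggingEnricher implements IEnrichmentBatchModule {

    private IBucketLogger _logger; // hold onto the logger across batches

    @Override
    public void onStageInitialize(final IEnrichmentModuleContext context,
            final DataBucketBean bucket,
            final EnrichmentControlMetadataBean control,
            final Tuple2<ProcessingStage, ProcessingStage> previous_next,
            final Optional<List<String>> next_grouping_fields) {
        // Assumption: the logging service is reachable via the service context
        _logger = context.getServiceContext()
                .getService(ILoggingService.class, Optional.empty()).get()
                .getLogger(bucket);
    }

    @Override
    public void onObjectBatch(final Stream<Tuple2<Long, IBatchRecord>> batch,
            final Optional<Integer> batch_size,
            final Optional<JsonNode> grouping_key) {
        // BasicMessageBean(date, success, source, command, code, message, details);
        // everything in the details map gets its own searchable field in Kibana
        _logger.log(Level.INFO, new BasicMessageBean(new Date(), true,
                "MyLoggingEnricher", "onObjectBatch", null, "processed a batch",
                ImmutableMap.of("batch_size", batch_size.orElse(-1))));
    }

    @Override
    public void onStageComplete(final boolean is_original) {}
}
```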

# 3. Deployment/Maintenance

Deploying your own custom processing modules - To deploy your own code (say a custom IEnrichmentBatchModule to run in a BatchEnrichmentPipeline), there is an existing path in Infinite to handle this. First, jar up your file and its dependencies. Upload the file via http://yourserver:port/manager/fileUploader.jsp with a title like /app/aleph2/library/myname.jar; the description should be the full path to your entrypoint. Share it with a community that your users can see (Infinite System if you want everyone to see it). This causes your jar file to be picked up by Aleph2 and registered as a shared library; it will be dumped into HDFS at /app/aleph2/library/myname.jar. It can now be referenced in any bucket config using that full path (/app/aleph2/library/myname.jar).

Deploying custom services - If you have implemented a custom service (such as a new DB to use for one of the DataSchema types), you can deploy a jar to the /opt/aleph2-home/lib/ directory and modify the properties file (/opt/aleph2-home/etc/v1_sync_service.properties) to point to your service's full path rather than the currently configured one. Restart Aleph2 (service aleph2-ikanow restart) and it should be picked up immediately. Check the log file for any errors starting your service.

Updating existing code - To update to new versions (say, the latest build), replace the jars in the /opt/aleph2-home/lib folder and restart Aleph2 (service aleph2-ikanow restart). This file should build the project in the correct order: https://github.com/IKANOW/Aleph2/blob/master/aleph2_uber/build_aleph2.sh (TODO: validate this against what our RPMs were doing).
