Using Your Own Data, Part I
Aggregate Micro Paths provides the following example files:
- a .csv formatted data file,
- an Apache Hive table SQL file for describing your data,
- a configuration file for your data loader and analytic runner,
- an executable script to load your data and run the analytics.
The following sub-sections describe how to adapt and customize these files to load your own data into the system.
The example .csv formatted file is located at:
/srv/software/aggregate-micro-paths/hive-streaming/aisShipData.csv
or
[git repository home]/aggregate-micro-paths/hive-streaming/aisShipData.csv
Your structure, fields, and format do not need to mirror the given sample file, but your data must contain the following Key Data Fields:
[ ID, TIMESTAMP, LATITUDE, LONGITUDE ]
```
'OBJECTAID',25.20564,120.6755,1203220340
'OBJECTBID',24.54659,54.24008,1203220339
...
```
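If your source data is already tabular but uses a different column order or field names, a short throwaway script can rewrite it into the same layout as the sample file. The sketch below is illustrative only and not part of the package; the input file name and column names (vessel_id, lat, lon, ts) are placeholders for whatever your data actually uses.

```python
# Illustrative sketch: rewrite an existing CSV into the sample's
# [id, latitude, longitude, time] column order.
# "my_source_data.csv" and its column names are hypothetical placeholders.
import csv

with open("my_source_data.csv", newline="") as src, \
     open("myExampleData.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Match the sample rows: quoted id, latitude, longitude, raw time value
        writer.writerow(["'{}'".format(row["vessel_id"]),
                         row["lat"], row["lon"], row["ts"]])
```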
The example Apache Hive table SQL file for describing your data is located at:
/srv/software/aggregate-micro-paths/hive-streaming/etl.sql
or
[git repository home]/aggregate-micro-paths/hive-streaming/etl.sql
You need to map your data into a structure similar to the following:
```sql
drop table my_example;
create external table my_example
(
  name string,
  latitude string,
  longitude string,
  time string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
location '/tmp/my_exampleone/';

drop table my_example_final;
create table my_example_final as
select
  trim(name) as name,
  trim(latitude) as latitude,
  trim(longitude) as longitude,
  -- rebuild the raw YYMMDDHHMM time value into a 'YYYY-MM-DD HH:MM:00' string
  concat('20', substr(trim(time),1,2), '-', substr(trim(time),3,2), '-', substr(trim(time),5,2), ' ',
         substr(trim(time),7,2), ':', substr(trim(time),9,2), ':00') as dt
from my_example;
```
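The second statement is where the sample's raw YYMMDDHHMM time values are rebuilt into a full 'YYYY-MM-DD HH:MM:00' timestamp string; if your timestamps use a different layout, adjust the substr() offsets accordingly. As a quick sanity check, the same transformation expressed in Python (a throwaway helper, not part of the package) looks like this:

```python
# Mirrors the Hive concat/substr expression above, assuming the sample's
# YYMMDDHHMM layout (e.g. 1203220340 -> 2012-03-22 03:40).
def to_hive_dt(t):
    t = t.strip()
    return "20{}-{}-{} {}:{}:00".format(t[0:2], t[2:4], t[4:6], t[6:8], t[8:10])

assert to_hive_dt("1203220340") == "2012-03-22 03:40:00"
```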
The example configuration file for your data loader and analytic runner is located at:
/srv/software/aggregate-micro-paths/hive-streaming/conf/ais.ini
or
[git repository home]/aggregate-micro-paths/hive-streaming/conf/ais.ini
The key edits you need to worry about are mapping your table name and field names to the values defined in the previous step:
```ini
[AggregateMicroPath]
table_name:my_example_final
table_schema_id: name
table_schema_dt: dt
table_schema_lat: latitude
table_schema_lon: longitude
# in seconds
time_filter: 86400
# in KM
distance_filter: 1000
# let's just do the whole world
lower_left_lat: -90
lower_left_lon: -180
upper_right_lat: 90
upper_right_lon: 180
trip_name: my_example_final
# MUST be a factor of 10
# 0.1 ~= 10KM, 0.01 ~= 1KM, 0.001 ~= 100M...
resolution_lat: 0.1
resolution_lon: 0.1
temporal_split: hour
```
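Before running the job, it can be worth confirming that your edited configuration parses and that the schema fields line up with the Hive table from the previous step. A minimal sketch using Python's standard configparser (the file name conf/myexample.ini is a placeholder for wherever you saved your copy):

```python
# Hypothetical sanity check, not part of the package: print the table/field
# mappings from your edited configuration file.
from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("conf/myexample.ini")  # placeholder path for your copied config
section = cfg["AggregateMicroPath"]
for key in ("table_name", "table_schema_id", "table_schema_dt",
            "table_schema_lat", "table_schema_lon", "trip_name"):
    print(key, "=", section[key])
```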
The example executable script to load your data and run the analytics is located at:
/srv/software/aggregate-micro-paths/hive-streaming/run_ais.sh
or
[git repository home]/aggregate-micro-paths/hive-streaming/run_ais.sh
The key edits you need to worry about are mapping the directories and file names to the table name values from the previous steps:
```bash
# Prepare data in HDFS and Hive table
hadoop fs -rm /tmp/my_exampleone/*
hadoop fs -rmdir /tmp/my_exampleone
hadoop fs -mkdir /tmp/my_exampleone
gzip -d myExampleData.csv.gz
hadoop fs -put myExampleData.csv /tmp/my_exampleone/
hive -f myexample.sql

# Create output directory, remove old output
mkdir -p output
rm -f output/micro_path_myexample_results.csv

# Run Job
python AggregateMicroPath.py -c myexample.ini

# Get Results
echo -e "latitude\tlongitude\tcount\tdate" > output/micro_path_myexample_results.csv
hive -S -e "select * from micro_path_intersect_counts_my_example_final;" >> output/micro_path_myexample_results.csv
```
Once your data, SQL, configuration, and script files are in place, execute the script from the hive-streaming directory:

```bash
cd /srv/software/aggregate-micro-paths/hive-streaming
./run_myexample.sh
```
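The results file is tab-delimited despite its .csv extension. If you want a quick confirmation that the job produced output, a few lines of Python are enough (a minimal sketch, assuming the header row written by the script above):

```python
# Peek at the first few result rows; not part of the package.
import csv

with open("output/micro_path_myexample_results.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f, delimiter="\t")):
        print(row["latitude"], row["longitude"], row["count"], row["date"])
        if i >= 4:
            break
```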
Note: The AggregateMicroPath.py script assumes you place your new data files, configuration files, scripts, etc. in locations parallel to those of the original example files. Otherwise, you may need to adjust several paths referenced within the file so it can find and operate on your data.
Completing these tasks confirms that your own data file is valid and that you are ready to run the full Track Communities package of analytics, as described in Part II.