Quick Start

Brief

This article discusses how to use the Data Manager UI after you have installed it.

The sample Data Application

Build and deploy

To build the application, SSH to the dm server and run:

ssh <your-host>
eae dmapps
./build.sh generate_trading_samples

Register with Data Manager

The purpose of this step is to let Data Manager know about this data application. To do this, log in to Data Manager, then:

  • Click menu "Application", then click button "Create"
  • set name to Import Trading Data
  • set team to trading
  • set description to Application to generate Trading Samples
  • set Location to hdfs:///beta/etl/apps/generate_trading_samples/1.0.0.0
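
Optionally, you can verify from the dm server that the application files are actually present at this Location (this assumes the build step above deploys to the same path; adjust if your deployment differs):

  hdfs dfs -ls hdfs:///beta/etl/apps/generate_trading_samples/1.0.0.0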

Create a data repo

  • Click the menu "Data Repositories"
  • set name to main
  • set type to Hadoop File System
  • set details to
    {
      "base_url": "hdfs:///beta/data"
    }
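
The base_url is the HDFS directory this repo will use for its data. If that directory does not exist yet, you can create it from the dm server first (a small sketch that assumes your HDFS user may write under /beta; adjust paths and permissions to your cluster):

  hdfs dfs -mkdir -p hdfs:///beta/data
  hdfs dfs -ls hdfs:///beta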
            

Create a pipeline

Now, let's create a pipeline to ingest the trading sample data.

  • Click the menu "Pipelines", then click button "Create"
  • In "Basic Info" tab
    • set name to import-trading-data
    • set team to trading
    • set category to daily-trading
    • set type to simple-flow
  • In "Tasks" tab:
    • Create task begin -- click the button "Add Task"
      • name is begin
      • type is Dummy
    • create task import-trading-data-nasdaq
      • name is import-trading-data-nasdaq
      • type is Application
      • Select "Import Trading Data" application
      • set arguments to
        {
            "action": "import-data", 
            "market": "NASDAQ", 
            "base_location":"/", 
            "repo":"main",
            "dt": "{{dt}}"
        }
        
    • create task import-trading-data-nyse
      • name is import-trading-data-nyse
      • type is Application
      • Select "Import Trading Data" application
      • set arguments to
        {
            "action": "import-data", 
            "market": "NYSE", 
            "base_location":"/", 
            "repo":"main",
            "dt": "{{dt}}"
        }
        
    • create task create-view
      • name is create-view
      • type is Application
      • Select "Import Trading Data" application
      • set arguments to the JSON below (a rendered example appears after the task list)
        {
            "action": "create-view", 
            "dt": "{{dt}}",
            "repo": "main",
            "loader": {
                "name": "union",
                "args": {
                    "dsi_paths": [
                        "{{xcom['import-trading-data-nasdaq'].dsi_path}}",
                        "{{xcom['import-trading-data-nyse'].dsi_path}}"
                    ]
                }
            }
        }
        
    • create task end
      • name is end
      • type is Dummy
  • Set the dependencies, so that we have
    begin --> import-trading-data-nasdaq
    begin --> import-trading-data-nyse
    import-trading-data-nasdaq --> create-view
    import-trading-data-nyse --> create-view
    create-view --> end
    

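The {{dt}} placeholder in the task arguments is filled in from the pipeline's run context, which the scheduler described below supplies, and the {{xcom[...]}} entries refer to the dsi_path values that the two import tasks expose via xcom. As a purely illustrative rendering (the dt value is just the example Start date used later, and the dsi_path entries are placeholders for whatever the import tasks actually report), the create-view arguments end up looking like this:

    {
        "action": "create-view",
        "dt": "2021-03-08",
        "repo": "main",
        "loader": {
            "name": "union",
            "args": {
                "dsi_paths": [
                    "<dsi_path reported by import-trading-data-nasdaq>",
                    "<dsi_path reported by import-trading-data-nyse>"
                ]
            }
        }
    }
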
After the pipeline is created, you can refresh the page to see the Airflow DAG link.

Create a scheduler

Now, we need to create a scheduler so this pipeline runs regularly. Click the menu "Schedulers", then click the button "Create":

  • set name to daily-trading
  • set description to daily trading scheduler
  • set category to daily-trading
  • set context to {"dt": "{{due.strftime('%Y-%m-%d')}}"} (see the rendered example after this list)
  • set team to trading
  • set Interval to 1 DAY
  • set Start to 2021-03-08 00:00:00 (for example), then click the button "Save changes"
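
The context is a template that is evaluated for each scheduled run. Assuming due is the run's scheduled time, the first run (due at the Start time, 2021-03-08 00:00:00) would receive the context below, and that dt value is what the {{dt}} placeholders in the pipeline resolve to:

  {"dt": "2021-03-08"}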

Unpause the pipeline

The pipeline is in "paused" status right after it is created, so let's unpause it.

Click the menu "Pipelines", then click "import-trading-data" and click the button "unpause"

Create a SQL pipeline

Now, let's create another pipeline that does some data transformation with a SQL statement.

  • First, click the menu "Pipelines", then click the button "Create"
  • In "Basic Info" tab
    • set name to get-top-picks
    • set team to trading
    • set category to daily-trading
    • set type to simple-flow
    • set "Required assets" to tradings:1.0:1:/{{dt}}
  • In "Tasks" tab:
    • Create task get-top-picks -- click the button "Add Task"
      • type is Spark-SQL
      • Click button "Add Step" in tab Spark-SQL
        • set name to get-top-picks
        • import asset tradings:1.0:1:/{{dt}} as tradings
        • set SQL statement to below
          SELECT 
              symbol, sum(amount) as volume
          FROM tradings
          GROUP BY symbol
          ORDER BY sum(amount) DESC
          LIMIT 3
          
        • check "Write Output"
        • set Location to hdfs:///beta/data/top_picks/{{dt}}.parquet
        • set Asset Path to top_picks:1.0:1:/{{dt}}
        • set datetime to {{dt}} 00:00:00
        • set type to parquet
        • set Write Mode to overwrite
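
After a run has completed for a given dt (2021-03-08 is used here only as an example), you can optionally confirm that the output was written by listing it from the dm server:

  hdfs dfs -ls hdfs:///beta/data/top_picks/2021-03-08.parquet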