Alcor v1.0 Release Plan - futurewei-cloud/alcor-int GitHub Wiki

Open-Source Plan for v1.0 Release

Tentative date: 01/31/2021

Release: Alcor v1.0

Release link: TBD

Release Goal

  • E2E performance
    • VPC API throughput reaches RPS 500 | Status: Achieved
    • Subnet API throughput reaches RPS 500 | Status: Achieved
    • Port API throughput reaches RPS 1000 (current RPS = 800) | Status: RPS reaches 1300
    • Port API latency cuts down to 200 ms when RPS = 100 and the database is populated with 100K VPCs and 1M ports | Status: Reaches 200~250 ms
    • VM boot throughput | Status: Launched 1500 VMs at one time (success rate = 100%)
  • ACA + NCM
    • Performance testing for Hazelcast, Ignite and ETCD | Status: test report ongoing (ETA: 1/30)
    • Cut latency for programming 1 million OVS rules/flows (local ports + neighbor ports) to a single host from ~60 sec to 20 sec (50K ports/second) | Status: Achieved, reduced to 13 seconds
    • Measure single-host ACA on-demand throughput (initial goal: 100K) and latency (under 100 ms) | Status: Achieved, single-host throughput reaches 300K
    • Measure NCM on-demand throughput and latency with stress gRPC clients | Status: Testing ongoing (ETA: 1/28)
  • Large VPC provisioning | Status: Research ongoing, spans multiple releases
    • Build a large-scale emulation framework for large VPC up to 1M ports per VPC (MiniNet, MaxiNet, DistriNet)
      • MiniNet v2.0 stress test on a single server (set up a custom tree topology, stress test with Ryu controller)
      • Set up a Distrinet cluster on multiple servers and build basic test cases
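
As a quick check on the flow-programming figures above, the average rate implied by each duration can be worked out directly. This is only a minimal arithmetic sketch; the function and constant names are illustrative and not part of Alcor.

```python
def rate_per_second(total_items: int, seconds: float) -> float:
    """Average number of items programmed per second over the given duration."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return total_items / seconds


TOTAL_FLOWS = 1_000_000  # 1 million OVS rules/flows pushed to a single host

baseline = rate_per_second(TOTAL_FLOWS, 60)  # ~60 sec starting point
target = rate_per_second(TOTAL_FLOWS, 20)    # 20 sec goal
achieved = rate_per_second(TOTAL_FLOWS, 13)  # 13 sec reported in the status

print(f"baseline: {baseline:,.0f} per second")
print(f"target:   {target:,.0f} per second")
print(f"achieved: {achieved:,.0f} per second")
```

The 20-second goal corresponds to the 50K per second quoted in the goal, and the reported 13 seconds works out to roughly 77K per second.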

Feature Development and Component Upgrade

  • Microservice Development
    • Goal State v2 E2E
      • DPM v2.1 to support GS v2
        • new gRPC clients based on GS v2
        • new programming path from DPM to NCM
      • ACA to support GS v2 for routing rule update
      • DPM to support L2/L3 neighbor, and L3 routing rule update end-to-end
    • Cache/DB schema redesign to improve latency for large scale
    • VPC/Subnet/Route Manager v2.0 for higher concurrency and throughput
      • VPC manager performance improvement
      • Subnet manager performance improvement
      • Route manager performance improvement
  • Host network configuration optimization
    • Topic I: Topology-aware policy based route reachability detection and lookup
      • Phase I: fundamental lookup data structure of consolidated reachability-adjusted routes
      • Phase II: (enhancement) lookup path trimming and optimization (stretch goal)
    • Topic II: Multi types input policy conflict detection and routes optimization (stretch goal)

Alcor Performance & Scalability

  • Database optimization
    • Ignite version upgraded to v2.10
    • Optimization techniques
      • Prefix query
    • NCM to Ignite profiling and latency data collection for both goal state provisioning and on-demand requests (?)
    • NCM Ignite batch write
  • ACA major refactor & On-demand workflow perf profiling
    • ACA state computation/orchestration layer that processes Goal State and orchestrates programming jobs to data plane
      • evaluation of a logic-agnostic, high-performance C++ framework
      • ACA threading model redesign to support high concurrency
      • migration of existing goal-state-to-flow manipulation to the new mechanism
      • Implementation and perf test with 1 million ports (target: reduce from 50+ seconds to 20 seconds)
    • Alcor benchmarking framework based on CBench
      • Investigate CBench (an OF controller benchmarking tool) and leverage its packet in/out test mechanism for ACA and on-demand
      • Throughput measurement for on-demand ACA standalone (within ACA, local cache + threading model)
        • Throughput measurement for ACA current threading model, find bottleneck
        • Throughput measurement evaluation and comparison, across upper layer threading pool candidates
      • Performance measurement for ACA upper layer (GS communication) and medium layer (state programming)
    • Alcor benchmarking framework based on Dubhe
      • Throughput measurement for on-demand NCM standalone
        • Measure gRPC round-trip latency
        • Measure NCM internal latency
      • Throughput measurement for on-demand E2E (ACA + NCM) (stretch goal)
    • ACA perf profiling for aca/ovs interaction improvement
      • perf profiling for new channel mechanism of ovs connection/flows
    • Driver communication layer that exchanges commands and events
      • lib-fluid library (for connection control) importing and wrapper migration
      • openvswitch library (for flow control) importing and wrapper migration
      • integration with existing ovs control based on vconn
      • support of normal flow operations (add/mod/del etc.)
      • support of advanced feature of bundling (requires OF1.4 and above)
    • Fix ACA crash issue when concurrently processing a large number of GoalStates
  • Build a large-scale emulation framework for large VPC up to 1M ports per VPC (MiniNet, MaxiNet, DistriNet)
    • MiniNet v2.0 stress test on single server
      • Setup a custom tree topology
      • Stress test with Ryu controller (reaches 1K nodes per server reliably)
      • Stress test with Alcor controller
    • Distrinet stress test on multiple servers
      • Deploy Distrinet on multiple VMs (one master + 2 worker nodes)
      • Deploy Distrinet on multiple servers
      • Stress test on a customized tree topology with Alcor controller
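
When planning a tree-topology stress test like the ones above, it helps to know how many switches and hosts a given tree will create before launching it. The sketch below assumes a full tree described by a depth and a fanout, the same two parameters taken by Mininet's built-in `TreeTopo` in `mininet.topolib`; the custom topology used in these tests may be shaped differently, and the helper names here are hypothetical.

```python
from typing import Tuple


def tree_size(depth: int, fanout: int) -> Tuple[int, int]:
    """Return (switches, hosts) for a full tree with the given depth and fanout.

    Every non-leaf level holds switches and the leaves are hosts, so
    depth=1, fanout=2 is one switch with two hosts attached.
    """
    if depth < 1 or fanout < 1:
        raise ValueError("depth and fanout must be at least 1")
    switches = sum(fanout ** level for level in range(depth))
    hosts = fanout ** depth
    return switches, hosts


def smallest_depth(fanout: int, min_hosts: int) -> int:
    """Smallest depth whose full tree has at least min_hosts hosts."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    depth = 1
    while fanout ** depth < min_hosts:
        depth += 1
    return depth


# Example: size a tree that provides about 1K hosts on one server.
depth = smallest_depth(fanout=10, min_hosts=1000)
switches, hosts = tree_size(depth, fanout=10)
print(f"depth={depth}, switches={switches}, hosts={hosts}")
```

Whether the "1K nodes per server" figure counts hosts only or hosts plus switches is not stated here, so the example treats it as a host count purely for illustration.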

Alcor Fundamental

University Collaboration

  • VPC-based implementation for Message Queue scale path (Min Chen/Luyao Luo)
    • 10/13 Status:
      • Min: Submit PR #1 to upgrade GS v1.0 to GS v2.0 by 10/16 (Sat.); submit PR #2 to add Pulsar support to DPM by 10/18 (Monday)
      • Luyao: Submit PR to ACA repo by 10/17 (Sunday)
      • Target: Start integration test by 10/20 (Wed.)
    • 10/20 Status:
      • Min: PR #1 merged (Jenkins passed); PR #2 submitted (#695)
      • Luyao: ACA PR submitted and under review
        • Integration Test Plan:
          • Step 1: Use Postman to test basic API functionality (pub & sub)
          • Step 2: Basic E2E test cases: port creation/update, routing rule etc.
          • Step 3: Run Jenkins jobs (ping Liguang & Prasad on Slack)
  • Scalability test framework for 1M nodes regions and 100K ports VPC (Jiawei Liu/Hanfeng Zhan/Jing Fan)
    • 10/13 Status:
      • Jiawei/Hanfeng: Set up MaxiNet on 3 nodes following the quick setup guide; hit a blocking issue of a missing folder "pox"
      • Target: Prepare a Maxinet demo by next meeting 10/21 (Wed.)
    • 10/20 Status:
      • Jiawei/Hanfeng: MaxiNet worker node registration failure; MaxiNet is not well maintained and works only on Ubuntu 14.04
      • Next Step: Distrinet is a good alternative; try MiniNet to get the max # of containers per host; reimage containers with truncated ACA.
  • ML-based on-demand programming (Yan Yu/Shuang Liang/Chen Min)
    • 10/13 Status:
      • Min: Present a draft design based on online retail recommendation system and leverage VM-VM similarity
      • Shuang: Working on historical data modeling for VM-VM connectivity that covers network neighborhood, routing, security group
      • Yu: Working on modeling historical data as vectors and GoalState recommendation algorithm
      • Target:
        • Start bi-weekly paper presentations beginning next week (10/20)
        • Project ETA: 10/20
    • 10/20 Status:
      • Yu: Present a paper for online retail recommendation system and its ML algorithm (static catalog + customer purchase history)
      • Shuang: Discuss one example of public cloud deployment

Nice to have

  • Alcor Monitoring by Prometheus and Grafana
    • Metrics collector for K8s services, container, and bare metal
    • Set up Prometheus and Grafana

Postpone to Next Release

  • Microservice development
    • UT enhancement for routing rule update in ACA and DPM
  • Host network configuration optimization
    • Topic I: Topology-aware policy based route reachability detection and lookup
      • Phase II: (enhancement) lookup path trimming and optimization
    • Topic II: Multi types input policy conflict detection and routes optimization
  • Database optimization
    • Ignite version upgraded to v2.10
    • Optimization techniques
      • Thin client Java API - async API
      • Thin client continuous query
  • ACA major refactor & On-demand workflow perf profiling
    • ACA perf profiling for aca/ovs interaction improvement
      • perf profiling for new channel mechanism of ovsdb connection/records
      • ACA memory footprint investigation and optimization
    • Coding style alignment across ACA codes
    • On-demand test automation script (stretch goal)