Analytics Workload management - joshid43016/AnalyticsDataEcosystem GitHub Wiki

Workload management

In simple terms, workload management on an analytics data platform is similar to managing traffic on a busy highway. On a SQL based platform, we have requests coming from multiple tools & technologies for data integration and consumption. These requests aka demand for resources have to be processed and results provided back to the requestor within defined SLAs (Service Level agreements)

Below are high level business requirement for workload management.

Tactical requests that require less resources should be prioritized higher on the system and response time are fast [FAST LANE]
Analytical batch processes which run once a week or month or year are prioritized lower but have completion SLAs [SLOW LANE]
From BI tool perspective, ability to run analytical workloads that run long in background [STELTH]
There could be multiple fast lanes to handle high user concurrency and fewer Slow lanes with ability to hold requests in queue

From technical perspective, requests could

Large CPU consuming requests. These requests could have calculations or extensive number of joins
Large IO consuming requests. Such requests are going after large table with no constraints or performing Cartesian product joins

Modern OS which are basis of analytics data platform, handles memory using LRU (least recently used) algorithm. Data is written to memory and swapped out by OS are no longer key factor. Storage on SSD (Solid state devices) in recent years have also reduced reliance on memory.

To meet these needs, different vendors have different strategies

Snowflake recommends using different warehouses for different use cases.
Teradata and AWS Redshift have workload management engine that allows queries to be prioritized based on estimation. Short query bias and demotion after x number of seconds allows tactical workloads to complete quickly.

Concurrency

There are many ways to look at concurrency. We could report this at user level or at query level. From time horizon perspective, this could be by sec, hour, day or month. Incoming query volume, query type and analytics processing platform's processing capability impacts concurrency.

Latency

There are two aspects to network bandwidth - 1. throughput volume 2. number of requests per sec.

More often than not, a few sessions cannot saturate the whole network pipe which these days is large (2Gb - 10Gb), but it has been observed that chatty apps which send too many small requests could be impacted with data migration to cloud. Ground to cloud or Cloud to Ground patterns need to be tested before such projects are initiated.

Throughput

Operational system (OLTP applications) are optimized for shorter and many read and write transactions while Analytic applications are optimized for throughput and are more complex. Analytic databases are optimized for reads and support complex SQL operations like moving sum, cube and rollups, etc