AWS Redshift

Workload management

Workload management (WLM) allows separation of different workload types. The goal is to prioritise important workloads/queries while throttling less important ones. WLM lets you (a routing sketch follows the list below):

  • control the number of concurrent queries
  • allocate cluster resources (compute and memory) across workload types
  • set query timeouts to abort long-running queries
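
A minimal sketch of session-level routing, assuming a manual WLM queue has been configured with the query group `reporting`; the group name, table and columns are hypothetical:

```sql
-- Route this session's queries to the queue associated with query group
-- 'reporting' (hypothetical group); fact_sales is a placeholder table.
SET query_group TO 'reporting';

SELECT region, SUM(sales_amount) AS total_sales
FROM fact_sales
GROUP BY region;

RESET query_group;

-- Abort queries in this session that run longer than 60 seconds (value in ms).
SET statement_timeout TO 60000;
```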

Short Query Acceleration

Short Query Acceleration (SQA) routes eligible short-running queries to a dedicated short query queue so they do not wait behind long-running queries.
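
A hedged way to verify routing, assuming the short query queue is reported as service class 14 in the WLM system tables (as in current Redshift releases):

```sql
-- Recently completed queries handled by the short query queue (service_class 14);
-- queue and execution times are reported in microseconds.
SELECT query, service_class, total_queue_time, total_exec_time
FROM stl_wlm_query
WHERE service_class = 14
ORDER BY query DESC
LIMIT 20;
```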

Concurrency scaling

  • Handles spiky workloads
  • Incremental automated snapshots are pushed to S3
  • A transient secondary cluster, hydrated from those snapshots, is spun up to absorb the extra queries
  • Detection of query queuing, spin-up of the transient cluster, and its termination are all managed automatically by Redshift
  • For every 24 hours of primary cluster usage, 1 hour of the transient secondary cluster is free (see the usage check below)
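
A quick usage check, assuming the SVCS_CONCURRENCY_SCALING_USAGE system view is available on the cluster:

```sql
-- Concurrency scaling periods and seconds consumed, newest first; compare the
-- totals against the free credits accrued (1 hr per 24 hrs of primary usage).
SELECT start_time, end_time, queries, usage_in_seconds
FROM svcs_concurrency_scaling_usage
ORDER BY start_time DESC;
```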

Typical workload

  • Continuous ingestion
  • Reporting and Dashboard queries
  • Heavy data processing, ELT process

WLM Queue attributes

  • % memory allocation
  • Concurrency scaling mode (off/auto)
  • Queue priority
  • User groups
  • Query groups
  • Query monitoring rules
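
To see how these attributes translate into the running configuration, one option is the STV_WLM_SERVICE_CLASS_CONFIG system view; a sketch, assuming user-defined queues map to service classes 6 and above:

```sql
-- Effective WLM queue configuration: slots, working memory, timeout and
-- concurrency scaling setting per user-defined queue.
SELECT service_class, name, num_query_tasks, query_working_mem,
       max_execution_time, concurrency_scaling
FROM stv_wlm_service_class_config
WHERE service_class >= 6;
```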

Query monitoring rule

  • Detect queries with poor execution plans
  • Take action when a rule triggers - log the query, hop it to another queue, change its priority, or abort it
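
Triggered rules are logged, so a query like the following sketch against the STL_WLM_RULE_ACTION system table shows which rules fired and what action was taken:

```sql
-- Most recent query monitoring rule actions (log/hop/change priority/abort).
SELECT recordtime, query, service_class, rule, action
FROM stl_wlm_rule_action
ORDER BY recordtime DESC
LIMIT 50;
```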

For large result sets, UNLOAD to Amazon S3 is better than returning all records to the client through the leader node.
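
A minimal UNLOAD sketch - the bucket, prefix, table and IAM role ARN are placeholders; the compute nodes write the result files to S3 in parallel instead of funnelling rows back through the leader node:

```sql
UNLOAD ('SELECT * FROM fact_sales WHERE sale_date >= ''2023-01-01''')
TO 's3://my-bucket/unload/fact_sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;
```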

Loading Data

Loading Data from an Amazon DynamoDB Table

  • COPY command
  • If the Redshift cluster and the Amazon DynamoDB table are not in the same region, specify the REGION option on the COPY command
  • The COPY command loads data in parallel across the cluster's compute nodes
  • Data transfer incurred by the COPY command counts towards the table's provisioned read throughput
  • The READRATIO option on the COPY command specifies the percentage of the DynamoDB table's provisioned read throughput to use for the transfer
  • The NOLOAD option on the COPY command validates the data without actually loading it into the cluster
  • COPY command format
copy <redshift_tablename> from 'dynamodb://<dynamodb_table_name>'
authorization
readratio '<integer>';
  • Format for authorization - either an IAM role:
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
or access keys:
ACCESS_KEY_ID '<access-key-id>'
SECRET_ACCESS_KEY '<secret-access-key>';
  • The COPY command applies automatic compression whenever the target table is empty and has no compression encodings defined.
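
Putting the pieces above together - a sketch with placeholder table names, role ARN and region, throttled to half of the DynamoDB table's provisioned read throughput:

```sql
COPY users_staging
FROM 'dynamodb://Users'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
READRATIO 50
REGION 'us-west-2';
-- Add the NOLOAD option to validate the data without actually loading it.
```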

Vacuum

  • Why does a VACUUM operation sometimes not reclaim disk space? If other transactions were still active when the DELETE was issued, the rows are logically deleted but not physically removed, so those transactions can still access them. Once those transactions have committed or rolled back, run VACUUM again to reclaim the disk space (see the sketch below).
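
A sketch of the check-then-vacuum sequence (fact_sales is a placeholder table name):

```sql
-- Look for transactions that are still open and may be holding deleted rows.
SELECT xid, pid, txn_owner, txn_start
FROM svv_transactions;

-- Once those transactions have committed or rolled back, vacuum again.
VACUUM DELETE ONLY fact_sales;   -- reclaim space from deleted rows only
-- or
VACUUM FULL fact_sales;          -- reclaim space and re-sort rows
```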

Query scheduler

  • Functionality to schedule queries is built into Redshift Query Editor
  • The IAM user configuring the schedule must be able to assume a role with the following policies:
    • AmazonEventBridgeFullAccess
    • AmazonRedshiftDataFullAccess
  • Schedule can be specified in cron format
  • SNS notifications can be configured to monitor scheduled query runs
  • The Redshift Data API CLI (aws redshift-data) can be used to download the results of a scheduled execution

Cross account access - Amazon Redshift Spectrum / S3, Glue

  • Account A
    • Create an IAM role (roleA) with Redshift as the trusted entity
    • Attach this role to the Amazon Redshift cluster
    • Attach an inline policy to roleA that allows it to assume roleB in Account B
  • Account B
    • Create a policy with list and read permissions for Glue and S3
    • Create an IAM role (roleB) with the trusted entity set to another AWS account - Account A
    • Attach the policy to roleB
    • With this, roleB trusts all principals in Account A
    • To narrow the trust scope, modify the trust relationship so that only roleA in Account A is trusted (see the sketch after this list)
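
On the Redshift side, the two roles are then chained in the IAM_ROLE clause when the external schema is created - a sketch with placeholder account IDs, database and schema names:

```sql
CREATE EXTERNAL SCHEMA spectrum_xacct
FROM DATA CATALOG
DATABASE 'shared_db'
IAM_ROLE 'arn:aws:iam::111111111111:role/roleA,arn:aws:iam::222222222222:role/roleB'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```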

Leader node

  • SQL endpoint
  • Query optimizer
  • Coordinates parallel SQL processing across Compute nodes
  • Not charged - you pay only for compute nodes

Compute node

  • Local, columnar storage
  • Executes queries in parallel
  • Load (COPY), UNLOAD, backup, and restore from S3

Redshift Spectrum node

  • Executes queries directly against data in the S3 data lake
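
A sketch of a Spectrum query, reusing the hypothetical spectrum_xacct schema from the cross-account example above; the table, columns and S3 location are placeholders:

```sql
-- External table over Parquet files in S3; only metadata is stored in Redshift.
CREATE EXTERNAL TABLE spectrum_xacct.clickstream (
    event_time TIMESTAMP,
    user_id    BIGINT,
    url        VARCHAR(2048)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/clickstream/';

-- Spectrum nodes scan the S3 data; the aggregation completes on the cluster.
SELECT DATE_TRUNC('day', event_time) AS day, COUNT(*) AS events
FROM spectrum_xacct.clickstream
GROUP BY 1
ORDER BY 1;
```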