AWS RedShift - keshavbaweja-git/guides GitHub Wiki
Workload Management (WLM) allows separation of different workload types. The goal is to prioritise important workloads/queries while throttling less important ones. WLM allows you to
- control the number of concurrent queries
- allocate cluster resources (compute and memory) across workload types
- set query timeouts to abort long-running queries
Short queries are routed to a dedicated short query queue (Short Query Acceleration).
- Concurrency Scaling (for spiky workloads)
  - Incremental automated snapshots are pushed down to S3
  - A transient secondary cluster is spun up from them
  - Detection of query queuing, spin-up of the transient cluster, and termination of the transient cluster are managed automatically by Redshift
  - For every 24 hrs of primary cluster usage, you accrue 1 hr of free transient secondary cluster usage
- Typical workload types
  - Continuous ingestion
  - Reporting and dashboard queries
  - Heavy data processing (ELT)
- Per-queue WLM settings
  - % memory allocation
  - Concurrency scaling mode (off/auto)
  - Queue priority
  - User groups
  - Query groups
- Query monitoring rules
- Detect queries with poor execution plans
- Take action - alert notification, change query priority, terminate query
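The queue settings and query monitoring rules above map onto the cluster's `wlm_json_configuration` parameter. A minimal manual-WLM sketch (queue names, groups, percentages, and the rule threshold are illustrative placeholders; queue priority applies when automatic WLM is enabled):

```json
[
  {
    "name": "etl",
    "user_group": ["etl_users"],
    "query_group": ["nightly_load"],
    "query_concurrency": 5,
    "memory_percent_to_use": 40,
    "concurrency_scaling": "off",
    "rules": [
      {
        "rule_name": "abort_long_queries",
        "predicate": [
          {"metric_name": "query_execution_time", "operator": ">", "value": 7200}
        ],
        "action": "abort"
      }
    ]
  },
  {
    "name": "dashboards",
    "query_concurrency": 15,
    "memory_percent_to_use": 60,
    "concurrency_scaling": "auto"
  }
]
```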
UNLOAD (which writes results to S3 in parallel from the compute nodes) is better than returning a large result set through the leader node.
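A sketch of the UNLOAD pattern (the table, bucket, and role names are hypothetical):

```sql
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2021-01-01''')
TO 's3://my-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
PARALLEL ON  -- each compute node slice writes its own file part directly to S3
GZIP;
```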
- COPY command
  - If the Redshift cluster and the Amazon DynamoDB table are not in the same region, specify the `REGION` option on the COPY command
  - The COPY command loads data in parallel across cluster nodes
  - Data transfer incurred by the COPY command counts towards the DynamoDB table's provisioned read throughput
  - The `READRATIO` option on the COPY command specifies the percentage of the DynamoDB table's read throughput to use for the data transfer
  - The `NOLOAD` option on the COPY command validates the data without actually loading it into the cluster
  - COPY command format
    ```
    copy <redshift_tablename> from 'dynamodb://<dynamodb_table_name>'
    authorization
    readratio '<integer>';
    ```
  - Format for `authorization` (use either an IAM role or access keys)
    ```
    IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
    ```
    or
    ```
    ACCESS_KEY_ID '<access-key-id>'
    SECRET_ACCESS_KEY '<secret-access-key>';
    ```
- The COPY command applies automatic compression whenever the target table is empty and has no compression encodings specified.
- Why is the VACUUM operation not reclaiming disk space? If transactions were still active when the DELETE was issued, rows are logically deleted but not physically removed, so that those prior transactions can still access them. Once those transactions have completed or rolled back, run VACUUM again to reclaim the disk space.
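Once the blocking transactions have finished, the space can be reclaimed explicitly (the table name is hypothetical):

```sql
-- Re-sort rows and reclaim space left by deleted rows
VACUUM FULL sales;
-- Refresh planner statistics after the vacuum
ANALYZE sales;
```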
- Functionality to schedule queries is built into the Redshift Query Editor
- The IAM user configuring the schedule must be able to assume a role with the following policies
- AmazonEventBridgeFullAccess
- AmazonRedshiftDataFullAccess
- The schedule can be specified in cron format
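The schedule uses the EventBridge cron syntax; an illustrative example:

```
cron(30 6 * * ? *)
```

which runs the query every day at 06:30 UTC (the fields are minutes, hours, day-of-month, month, day-of-week, year).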
- SNS notifications for query monitoring
- The Redshift Data API CLI (`aws redshift-data`) can be used to download the results of a scheduled execution
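A sketch of retrieving results with the Data API CLI (the statement id is a placeholder you obtain from the first command):

```sh
# List recent statements to find the scheduled run
aws redshift-data list-statements --status FINISHED

# Fetch the result set for one execution by its statement id
aws redshift-data get-statement-result --id <statement-id>
```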
- Cross-account access to Glue and S3 (Account A owns the Redshift cluster; Account B owns the Glue Data Catalog and S3 data)
- Account A
- Create an IAM role - roleA, with Redshift as Trusted entity
- Attach this role to Amazon Redshift cluster
- Attach inline policy to roleA to allow it to assume roleB in Account B
- Account B
- Create a policy with list and read permissions for Glue and S3
- Create an IAM roleB with Trusted entity as another AWS account - Account A
- Attach policy to roleB
- With this, roleB trusts all entities in Account A
- To limit the trust scope, modify the trust relationship to allow only roleA in Account A
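The scoped-down trust relationship on roleB would look like this (the account id and role name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-a-id>:role/roleA"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```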
- Leader node
  - SQL endpoint
  - Runs the query optimizer
  - Coordinates parallel SQL processing across compute nodes
  - Not charged for multi-node clusters
- Compute nodes
  - Local, columnar storage
  - Execute queries in parallel
  - Handle load (COPY), UNLOAD, backup, and restore to/from S3
  - Execute queries directly against the S3 data lake (Redshift Spectrum)