AWS Data Pipeline - vedratna/aws-learning GitHub Wiki
AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it's stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.
- TaskRunner : An application that polls AWS Data Pipeline for tasks, performs them, and reports their status back to AWS Data Pipeline.
- DataNode : Defines the location and type of data that a pipeline activity uses as input or output. Available types include SqlDataNode, RedshiftDataNode, S3DataNode, and DynamoDBDataNode.
- Activity : An action that AWS Data Pipeline initiates on your behalf as part of a pipeline.
Built-in Activities
- CopyActivity
- HiveActivity
- EmrActivity
- ShellCommandActivity
- Precondition : A pipeline component containing a conditional statement that must be true before an activity can run.
Built-in Preconditions
- DynamoDBTableExists
- DynamoDBDataExists
- S3KeyExists
- S3PrefixExists
- ShellCommandPrecondition
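
To show how the components above fit together, here is a minimal Python sketch that assembles a pipeline definition as a dictionary and prints it as JSON. It wires an S3KeyExists precondition to an input S3DataNode and copies that input to an output S3DataNode with a CopyActivity. The bucket name, object ids, and file paths are hypothetical, and the `ref`, `build_definition`, and `dangling_refs` helpers are illustrative names, not part of any AWS SDK. The definition is deliberately incomplete: a deployable pipeline would also need settings such as a schedule and IAM roles.

```python
import json


def ref(object_id):
    """Point at another pipeline object by its id."""
    return {"ref": object_id}


def build_definition(bucket):
    """Build a small, incomplete pipeline definition (hypothetical names)."""
    src = f"s3://{bucket}/input/data.csv"
    dst = f"s3://{bucket}/output/data.csv"
    return {
        "objects": [
            # Precondition: the input key must exist before it is used.
            {"id": "InputReady", "type": "S3KeyExists", "s3Key": src},
            # DataNodes: where the activity reads from and writes to.
            {"id": "InputData", "type": "S3DataNode", "filePath": src,
             "precondition": ref("InputReady")},
            {"id": "OutputData", "type": "S3DataNode", "filePath": dst},
            # Resource on which the activity runs.
            {"id": "Worker", "type": "Ec2Resource",
             "terminateAfter": "30 Minutes"},
            # Activity: copy the input node to the output node.
            {"id": "CopyData", "type": "CopyActivity",
             "input": ref("InputData"), "output": ref("OutputData"),
             "runsOn": ref("Worker")},
        ]
    }


def dangling_refs(definition):
    """Return (object id, missing id) pairs for references to undefined objects."""
    ids = {obj["id"] for obj in definition["objects"]}
    missing = []
    for obj in definition["objects"]:
        for value in obj.values():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in ids:
                missing.append((obj["id"], value["ref"]))
    return missing


definition = build_definition("example-bucket")  # hypothetical bucket
print(json.dumps(definition, indent=2))
```

Calling `dangling_refs(definition)` before uploading a definition is a cheap local check that every `ref` points at an object that actually exists in the file.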