AWS Data Pipeline - vedratna/aws-learning GitHub Wiki

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.

Task Runner : An application that polls AWS Data Pipeline for tasks, performs those tasks, and reports their status back to AWS Data Pipeline.
DataNode : Defines the location and type of data that a pipeline activity uses as input or output. Built-in types: SqlDataNode, RedshiftDataNode, S3DataNode, DynamoDBDataNode.
Activity : An action that AWS Data Pipeline initiates on your behalf as part of a pipeline.
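The poll, perform, report cycle of Task Runner can be sketched in a few lines of Python. This is only an illustration: `FakePipelineService`, `run_task_runner`, and the handler names are hypothetical stand-ins, not part of any AWS SDK. The real service exposes this cycle through the `PollForTask` and `SetTaskStatus` API operations.

```python
from collections import deque


class FakePipelineService:
    """Hypothetical in-memory stand-in for AWS Data Pipeline (not a real client)."""

    def __init__(self, tasks):
        self._queue = deque(tasks)
        self.statuses = {}

    def poll_for_task(self):
        # Return the next pending task, or None when nothing is waiting.
        return self._queue.popleft() if self._queue else None

    def set_task_status(self, task_id, status):
        # Record the outcome reported by the worker.
        self.statuses[task_id] = status


def run_task_runner(service, handlers):
    """Poll until no tasks remain, run each task, and report its status."""
    while True:
        task = service.poll_for_task()
        if task is None:
            break
        try:
            handlers[task["type"]](task)
            service.set_task_status(task["id"], "FINISHED")
        except Exception:
            service.set_task_status(task["id"], "FAILED")


def failing_handler(task):
    # Simulates an activity that goes wrong.
    raise RuntimeError("simulated failure")


service = FakePipelineService([
    {"id": "task-1", "type": "copy"},
    {"id": "task-2", "type": "shell"},
])
run_task_runner(service, {"copy": lambda task: None, "shell": failing_handler})
print(service.statuses)
```

A real Task Runner keeps polling instead of exiting when the queue is empty; the sketch stops early so that it terminates.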

Built-in Activities (examples)

CopyActivity

HiveActivity

EmrActivity

ShellCommandActivity
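To show how data nodes and an activity fit together, here is a minimal sketch that builds a pipeline definition as a Python dictionary and prints it as JSON. The helper name `build_copy_pipeline`, the object ids, the bucket paths, and the instance type are assumptions chosen for illustration. A deployable definition would also need a schedule and IAM roles, which are left out here.

```python
import json


def build_copy_pipeline(input_path, output_path):
    """Return a pipeline definition that copies one S3 file to another."""
    return {
        "objects": [
            # DataNodes describe where the data lives.
            {"id": "InputData", "name": "InputData",
             "type": "S3DataNode", "filePath": input_path},
            {"id": "OutputData", "name": "OutputData",
             "type": "S3DataNode", "filePath": output_path},
            # The compute resource that performs the work.
            {"id": "CopyWorker", "name": "CopyWorker",
             "type": "Ec2Resource", "instanceType": "t1.micro"},
            # The activity links input, output, and resource by reference.
            {"id": "CopyFiles", "name": "CopyFiles",
             "type": "CopyActivity",
             "input": {"ref": "InputData"},
             "output": {"ref": "OutputData"},
             "runsOn": {"ref": "CopyWorker"}},
        ]
    }


definition = build_copy_pipeline(
    "s3://example-bucket/input/data.csv",    # hypothetical path
    "s3://example-bucket/output/data.csv",   # hypothetical path
)
print(json.dumps(definition, indent=2))
```

Components point at each other with `{"ref": "<id>"}`, so the activity itself carries no S3 paths.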

Precondition : A pipeline component containing a conditional statement that must be true before an activity can run.

Built-in Preconditions

DynamoDBTableExists

DynamoDBDataExists

S3KeyExists

S3PrefixExists

ShellCommandPrecondition
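A precondition is attached to another component by reference. The sketch below, with assumed ids and a hypothetical S3 path, declares an S3KeyExists precondition and attaches it to an S3DataNode. The helper `unresolved_refs` is not part of AWS Data Pipeline; it is a small illustrative check that every reference points at an object that is actually defined.

```python
import json

objects = [
    # The precondition: passes only if this S3 key exists.
    {"id": "InputReady", "name": "InputReady",
     "type": "S3KeyExists",
     "s3Key": "s3://example-bucket/input/data.csv"},  # hypothetical path
    # The data node refers to the precondition by id.
    {"id": "InputData", "name": "InputData",
     "type": "S3DataNode",
     "filePath": "s3://example-bucket/input/data.csv",
     "precondition": {"ref": "InputReady"}},
]


def unresolved_refs(objects):
    """List (object id, field, target) for every ref with no matching id."""
    ids = {obj["id"] for obj in objects}
    missing = []
    for obj in objects:
        for field, value in obj.items():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in ids:
                missing.append((obj["id"], field, value["ref"]))
    return missing


print(json.dumps({"objects": objects}, indent=2))
print("unresolved references:", unresolved_refs(objects))
```

Running a check like this before uploading a definition catches mistyped ids early.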