AWS Data Pipeline - vedratna/aws-learning GitHub Wiki

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it's stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.

  • TaskRunner : An application that polls AWS Data Pipeline for tasks, performs those tasks, and reports their status back to AWS Data Pipeline.
  • DataNode : Defines the location and type of data that a pipeline activity uses as input or output. Available types: SqlDataNode, RedshiftDataNode, S3DataNode, DynamoDBDataNode.
  • Activity : An action that AWS Data Pipeline initiates on your behalf as part of a pipeline.
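
As a rough illustration of how data nodes and an activity relate, the sketch below builds a pipeline definition as a Python dictionary and prints it as JSON. The object ids and the `s3://example-bucket/...` paths are hypothetical, and `unresolved_refs` is a helper invented here, not part of any AWS tool. The definition is deliberately incomplete: a real pipeline would also need a schedule and a compute resource or worker group for the activity.

```python
import json

# Hypothetical ids and bucket paths, chosen only for illustration.
input_node = {
    "id": "InputData",
    "name": "InputData",
    "type": "S3DataNode",
    "filePath": "s3://example-bucket/input/data.csv",
}
output_node = {
    "id": "OutputData",
    "name": "OutputData",
    "type": "S3DataNode",
    "filePath": "s3://example-bucket/output/data.csv",
}
# The activity points at its data nodes by id through {"ref": ...}.
copy_activity = {
    "id": "CopyInputToOutput",
    "name": "CopyInputToOutput",
    "type": "CopyActivity",
    "input": {"ref": "InputData"},
    "output": {"ref": "OutputData"},
}

definition = {"objects": [input_node, output_node, copy_activity]}


def unresolved_refs(definition):
    """Return the ids referenced via {"ref": ...} that no object defines."""
    known = {obj["id"] for obj in definition["objects"]}
    missing = set()
    for obj in definition["objects"]:
        for value in obj.values():
            if isinstance(value, dict) and "ref" in value:
                if value["ref"] not in known:
                    missing.add(value["ref"])
    return missing


print(json.dumps(definition, indent=2))
print("unresolved refs:", sorted(unresolved_refs(definition)))
```

Checking references locally like this is only a convenience; the service performs its own validation when a definition is uploaded.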

Built-in Activities

  • CopyActivity
  • HiveActivity
  • EmrActivity
  • ShellCommandActivity

  • Precondition : A pipeline component containing conditional statements that must be true before an activity can run.

Built-in Preconditions

  • DynamoDBTableExists
  • DynamoDBDataExists
  • S3KeyExists
  • S3PrefixExists
  • ShellCommandPrecondition
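
To show how a precondition gates an activity, here is a minimal sketch, again as a Python dictionary printed as JSON. The ids, the S3 key, and the shell command are hypothetical. Other fields a real pipeline would need (for example a role, a schedule, and a resource to run on) are omitted for brevity.

```python
import json

# Hypothetical precondition: the activity should wait until this S3 key exists.
input_ready = {
    "id": "InputReady",
    "name": "InputReady",
    "type": "S3KeyExists",
    "s3Key": "s3://example-bucket/input/data.csv",
}
# The activity names its precondition by id through the "precondition" field.
process_input = {
    "id": "ProcessInput",
    "name": "ProcessInput",
    "type": "ShellCommandActivity",
    "command": "echo input is present",
    "precondition": {"ref": "InputReady"},
}

guarded_definition = {"objects": [input_ready, process_input]}
print(json.dumps(guarded_definition, indent=2))
```

The same pattern should apply to the other built-in preconditions: define the precondition as its own object and reference it from the activity it guards.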