AWS Data Analytics - keshavbaweja-git/guides GitHub Wiki
Data Analysis Workflow
- Ingest/Collect
- Store
- Process/Analyze
- Visualize
Analysis is a detailed examination of something in order to understand its nature or determine its essential features.
Analytics is the systematic analysis of data.
Data analytics is the specific analytical process being applied.
Several data analytics processes can be combined to produce a data analysis solution.
Data Sources
- Human generated data (largest form of data)
- Computer generated data (based on human input)
- Application/infrastructure logging (can provide valuable insights)
Big Data Challenges
- Volume
- Variety
- Velocity
- Veracity
  - The degree to which data is accurate, precise, and trusted. Data analytics solutions should be able to identify common flaws in data and fix them before the data is stored. This is known as data cleansing. Data cleansing must be completed within the time requirements of the solution, up to and including real-time processing speeds.
- Value
  - The ability of the data analytics solution to extract meaningful information from the data that has been stored and analyzed. Solutions must be able to produce the right form of analytical results to inform business decision makers and stakeholders of insights through trusted reports and dashboards.
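The data cleansing step described under Veracity can be sketched in a few lines. This is only an illustration: the field names (`user_id`, `amount`, `country`) and the validation rules are assumptions made for the example, not part of any AWS service.

```python
from typing import Optional


def cleanse(record: dict) -> Optional[dict]:
    """Return a cleaned copy of the record, or None if it cannot be repaired."""
    # Hypothetical rule: a record without a user id cannot be trusted.
    user_id = str(record.get("user_id") or "").strip()
    if not user_id:
        return None

    # Hypothetical rule: amounts must parse as non-negative numbers.
    try:
        amount = float(record.get("amount"))
    except (TypeError, ValueError):
        return None
    if amount < 0:
        return None

    # Normalize inconsistent formatting before the record is stored.
    country = str(record.get("country") or "").strip().upper() or "UNKNOWN"
    return {"user_id": user_id, "amount": round(amount, 2), "country": country}


raw_records = [
    {"user_id": " u1 ", "amount": "19.999", "country": "us"},
    {"user_id": "", "amount": "5.00", "country": "DE"},    # missing id, dropped
    {"user_id": "u3", "amount": "n/a", "country": "fr"},   # bad amount, dropped
    {"user_id": "u4", "amount": 7, "country": None},       # missing country, repaired
]

# Only records that pass every check are kept for storage.
clean_records = [c for c in (cleanse(r) for r in raw_records) if c is not None]
```

Because each record is checked independently, the same function could run in a batch job or per event in a streaming pipeline, which is what lets cleansing keep up with the time requirements of the solution.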
Data Sources
- Internal databases and file stores
  - Data is usually highly structured, so it needs less processing within a data analytics solution.
- Streaming data
  - Semi-structured or unstructured data
  - Arrives at high velocity and may require special software for data collection and processing.
- Public data sets
  - May require transformations before they can be used for internal analytics.
Data Lake
A data lake is a centralized repository (single source of truth) that allows you to store structured, semi-structured, and unstructured data at any scale.
Amazon S3 provides a centralized data storage solution that can be used with a variety of data analytics tools without data movement.
Data lakes, such as one built on top of Amazon S3, are cheaper than specialized data storage solutions. Data is loaded into the data lake in raw format, and ETL processes are applied by different data analytics tools against the data lake.
A data lake on AWS can help you do the following:
- Collect and store any type of data, at any scale, and at low cost
- Secure the data and prevent unauthorized access
- Catalog, search, and find the relevant data in the central repository
- Quickly and easily perform new types of data analysis
- Use a broad set of analytic engines for one-time analytics, real-time streaming, predictive analytics, AI, and machine learning
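The idea of loading raw data once and letting different tools transform it at read time can be sketched as follows. This is a minimal illustration under stated assumptions: a temporary local directory stands in for an S3 bucket, and the `raw/events/dt=2024-01-01` prefix, the file name, and the event fields are all hypothetical.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Stand-in for an S3 bucket; a real data lake would use object storage instead.
lake = Path(tempfile.mkdtemp()) / "raw" / "events" / "dt=2024-01-01"
lake.mkdir(parents=True)

# Ingest: records are stored as received, with no schema enforced up front.
raw_events = [
    {"type": "click", "page": "/home", "ms": 120},
    {"type": "purchase", "sku": "A1", "total": "19.99"},
    {"type": "click", "page": "/cart", "ms": 340},
]
with open(lake / "part-0000.jsonl", "w", encoding="utf-8") as f:
    for event in raw_events:
        f.write(json.dumps(event) + "\n")


def read_raw(path: Path):
    """Yield every raw record stored under the given prefix."""
    for file in sorted(path.glob("*.jsonl")):
        with open(file, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)


# Two independent consumers apply their own transformation at read time.
clicks_per_page = Counter(e["page"] for e in read_raw(lake) if e["type"] == "click")
revenue = sum(float(e["total"]) for e in read_raw(lake) if e["type"] == "purchase")
```

Both consumers read the same raw files and neither needs the data to be copied or reshaped in advance, which mirrors the "no data movement" property described above.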
Data Warehouse
A data warehouse is a centralized repository of structured data from many data sources. The data is transformed, aggregated, and prepared before it is loaded into the data warehouse. A data warehouse stores transactional data from many different data sources in a format that supports efficient execution of complex queries, and it provides curated data sets for faster analysis.
Amazon Redshift is a data warehouse service that supports data sizes from gigabytes to petabytes. Amazon Redshift Spectrum allows you to combine data in S3 and Redshift in a single query.
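The transform, aggregate, and load pattern behind a data warehouse can be sketched as follows. This is only an illustration: an in-memory SQLite database stands in for the warehouse, and the source systems, the `daily_sales` table, and its columns are hypothetical. It does not use Amazon Redshift or its API.

```python
import sqlite3

# Transactional rows from two hypothetical source systems: (day, country, amount).
web_orders = [("2024-01-01", "us", "19.99"), ("2024-01-01", "US", "5.01")]
store_orders = [("2024-01-01", "de", "10.00"), ("2024-01-02", "us", "3.50")]

# Transform and aggregate before loading: normalize country codes,
# convert amounts to integer cents, and sum per (day, country).
totals = {}
for day, country, amount in web_orders + store_orders:
    key = (day, country.upper())
    totals[key] = totals.get(key, 0) + round(float(amount) * 100)

# Load the curated result into a structured table (SQLite stands in for the warehouse).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_sales (day TEXT, country TEXT, total_cents INTEGER, "
    "PRIMARY KEY (day, country))"
)
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [(day, country, cents) for (day, country), cents in totals.items()],
)

# Analytical query against the prepared data.
rows = conn.execute(
    "SELECT country, SUM(total_cents) FROM daily_sales GROUP BY country ORDER BY country"
).fetchall()
```

The contrast with a data lake is that the cleanup and aggregation happen before the load, so queries run against a fixed, curated schema.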