Home - joshid43016/AnalyticsDataEcosystem GitHub Wiki
Welcome to the AnalyticsDataEcosystem wiki!
In the modern analytics software stack, the object store is the standard storage layer, and multiple compute engines process the data. There is NO "one system fits all"; the right solution depends on the requirements. Analytics systems fall into these high-level categories:
- Batch/Interactive querying
- Stream processing
- Machine learning
The ability to streamline the data pipeline is important, and we should lean towards reducing data movement. The illustration below shows how data flows from data producers (transactional systems) to consumers, who use analytics to gain insights.

This is a complicated domain with multiple technology stacks. Our success depends on how well we simplify each of these steps. I plan to share my thoughts and ideas on many of these technologies and document best practices that have survived the test of time.
Data Collection, or acquisition, covers the following topics. It is of immense importance, as bad practices can lead to wrong insights:
- Will source systems push the data, or do we have to pull it?
- How long do we have to retain this data in staging?
- Who will need access to this data?
- Will we have privileged service accounts to manage the acquisition process?
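One way to keep these questions from being forgotten is to record the answers per source system in a small structure. The sketch below is only an illustration: the `AcquisitionSpec` class, its field names, and the `orders_db` example are hypothetical and not part of any specific tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class DeliveryMode(Enum):
    PUSH = "push"  # the source system sends data to us
    PULL = "pull"  # we fetch data from the source system


@dataclass(frozen=True)
class AcquisitionSpec:
    """Hypothetical record of the acquisition answers for one source system."""
    source_name: str
    delivery_mode: DeliveryMode
    staging_retention_days: int
    allowed_readers: Tuple[str, ...] = ()
    service_account: Optional[str] = None  # None: no privileged account assigned yet

    def open_questions(self) -> List[str]:
        """Return the checklist items that are still unanswered."""
        missing = []
        if self.staging_retention_days <= 0:
            missing.append("staging retention")
        if not self.allowed_readers:
            missing.append("data access")
        if self.service_account is None:
            missing.append("service account")
        return missing


# Illustrative values only.
orders = AcquisitionSpec(
    source_name="orders_db",
    delivery_mode=DeliveryMode.PULL,
    staging_retention_days=30,
    allowed_readers=("data-engineering",),
)
print(orders.open_questions())  # the service account has not been decided yet
```

Keeping the answers next to the pipeline definition makes gaps visible before the first load runs, rather than after a bad extract has reached consumers.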
The Analytics Data store has a couple of options. An object data store is a necessary component in any modern analytics stack.
The Process/Analyze step involves data integration tasks: cleansing, standardizing (layout and values), conforming/harmonizing, and deriving metrics.
The various analytics use cases are as follows:
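The integration steps named above can be sketched as a chain of small functions. This is a minimal illustration in plain Python: the sample records, the column names, and the `COUNTRY_ALIASES` mapping are all made up for the example, and a real pipeline would likely run on one of the compute engines rather than on in-memory lists.

```python
from collections import defaultdict

# Made-up raw records with an inconsistent layout and inconsistent values.
RAW = [
    {"Cust": " a1 ", "Ctry": "uk", "Amt": "10.50"},
    {"Cust": "b2", "Ctry": "GB", "Amt": "4.50"},
    {"Cust": "c3", "Ctry": "us", "Amt": None},  # unusable: amount is missing
    {"Cust": "d4", "Ctry": "US", "Amt": "20"},
]

# Hypothetical reference mapping used to conform source-specific codes.
COUNTRY_ALIASES = {"UK": "GB"}


def cleanse(rows):
    """Drop records that lack a required value."""
    return [r for r in rows if r.get("Amt") is not None]


def standardize(rows):
    """Standardize layout (field names) and values (case, whitespace, types)."""
    return [
        {
            "customer_id": r["Cust"].strip().upper(),
            "country": r["Ctry"].strip().upper(),
            "amount": float(r["Amt"]),
        }
        for r in rows
    ]


def conform(rows):
    """Map source-specific codes onto the shared reference values."""
    return [
        {**r, "country": COUNTRY_ALIASES.get(r["country"], r["country"])}
        for r in rows
    ]


def derive_revenue_by_country(rows):
    """Derive a metric: total amount per country."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["country"]] += r["amount"]
    return dict(totals)


revenue = derive_revenue_by_country(conform(standardize(cleanse(RAW))))
print(revenue)
```

Each function takes and returns a list of records, so a step can be replaced or reordered without touching the others.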
| Category | Capability | Area |
|---|---|---|
| Data Exploration | Data discovery, profiling, modelling | Operational Intelligence |
| Data Analysis | Reporting, Dashboards w/ KPI, OLAP | Tactical / Operational Intelligence (Embedded Analytics) |
| Visual Data Analysis | Visual discovery, Search, Content Analytics, NLP, Predictive Analytics | Strategic Decision making |
| Data Preparation / Management | Data Extraction, transformation and load | Data Integration, Data Quality / Integrity |
| Data ingestion & curation | Event streaming | Realtime Analytics |
In the Consume step, we look at Business Intelligence tools.
Architectural principles for good analytics ecosystem
To be successful, follow these time-tested principles. In my experience, they have worked on many occasions.
- Reduce data redundancy and movement: Data silos are harder to manage and discourage collaboration. Share data where possible and avoid moving it where the move adds no value. Each data hop introduces a point of failure that can cause defects, so fewer processes between data and insight is better than more.
- Simplicity: Use the right tool for the job. Apply the KISS principle and consider the following in the selection process:
  - Data structure
  - Latency
  - Throughput
  - Access patterns
- Loose coupling: The ability to change each component of the architecture independently helps as technology changes. A decoupled architecture is easier to manage than a monolithic application. A microservices-based architecture, in which different components serve requests based on their input, will be sustainable in the long run.
- Leverage managed and serverless services: Cloud is the foundation for the next generation of analytics, and it enables managed services. In a managed environment, the vendor supports and secures the environment.
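The point about data hops in the first principle can be made concrete with a little arithmetic. As a rough sketch, assume every hop succeeds independently with the same probability; then a pipeline of n hops succeeds end to end with that probability raised to the power n. The 0.99 figure below is an illustrative assumption, not a measurement, and the function name is hypothetical.

```python
def end_to_end_success(per_hop_success: float, hops: int) -> float:
    """Probability that all hops succeed, assuming independent, identical hops."""
    if not 0.0 <= per_hop_success <= 1.0:
        raise ValueError("per_hop_success must be between 0 and 1")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return per_hop_success ** hops


# Compare pipelines of different lengths at an assumed 99% per-hop success rate.
for hops in (1, 3, 5, 10):
    print(hops, round(end_to_end_success(0.99, hops), 4))
```

Under these assumptions, ten hops at 99% each complete cleanly only about 90% of the time, which is why removing a hop that adds no value improves reliability as well as cost.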
Full scope of data platforms and characteristics

There is a plethora of tools in the marketplace, and each has different strengths. Best of breed versus best of suite is a commonly debated strategy.