Data pipelines
A data pipeline is needed when your data goes through several processing steps before reaching the final output.
raw data -> data cleaning -> ml
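As a rough illustration, here is a minimal sketch of such a pipeline in Python; the CSV input and the `text` column are assumptions made for the example, not part of the notes.

```python
# Minimal sketch of a multi-step pipeline: raw data -> cleaning -> ML input.
import pandas as pd


def load_raw_data(path: str) -> pd.DataFrame:
    """Step 1: read the raw data exactly as it was collected."""
    return pd.read_csv(path)


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Step 2: data cleaning (drop missing rows, normalize a text column)."""
    df = df.dropna()
    df["text"] = df["text"].str.lower().str.strip()
    return df


def run_pipeline(path: str) -> pd.DataFrame:
    """Chain the steps so the ML stage always sees the same preprocessing."""
    raw = load_raw_data(path)
    cleaned = clean_data(raw)
    return cleaned  # this is what gets fed to the ML model
```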
POC and production phases
POC:
- Goal is to decide whether the application is workable and worth deploying
- Focus on getting the prototype to work
- It's OK if the data pre-processing is manual, but take extensive notes/comments
Production phase:
- After the project's utility is established, use more sophisticated tools to make sure the data pipeline is replicable
- Ex: TensorFlow Transform, Apache Beam, Airflow (see the sketch below)
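As one hedged example of what "replicable" can look like, here is a sketch of the same cleaning and training steps wired into an Airflow DAG; the task names, daily schedule, and DAG id are assumptions, not from the notes.

```python
# Sketch of a replicable pipeline as an Airflow DAG: the same steps run the
# same way on every scheduled run, instead of being done manually.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def clean_data():
    ...  # same cleaning logic as in the POC, now versioned and scheduled


def train_model():
    ...  # training step consumes the cleaned output


with DAG(
    dag_id="data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    clean = PythonOperator(task_id="clean_data", python_callable=clean_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    clean >> train  # cleaning must finish before training starts
```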
Data pipeline example:
We need to keep track of data provenance (where the data came from) and lineage (the sequence of steps the data went through to reach the end of the pipeline).
Extensive documentation is good for tracking provenance and lineage, but to build robust and maintainable systems there are more sophisticated tools that help keep track of what happens in the pipeline without too much unnecessary complexity. TensorFlow Transform is one example.
TIP: Make extensive use of metadata. Metadata is data about data, and it can be very useful.
Useful for:
- Error analysis: spotting unexpected effects
- Keeping track of data provenance (a small sketch follows below)
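One possible way to record this metadata, sketched under the assumption of a simple file-based log (the field names and `run_step` helper are illustrative, not from the notes):

```python
# Record a metadata entry every time a pipeline step runs, so provenance
# (where the data came from) and lineage (which steps produced it) can be
# reconstructed later for error analysis.
import json
import time


def run_step(name, fn, data, source, log_path="pipeline_metadata.jsonl"):
    """Run one pipeline step and append a metadata record for it."""
    start = time.time()
    result = fn(data)
    record = {
        "step": name,          # lineage: which step ran
        "source": source,      # provenance: where the input came from
        "rows_in": len(data),
        "rows_out": len(result),
        "duration_s": round(time.time() - start, 3),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return result
```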
Scoping
Define Project
- What project should we work on?
- What are the metrics for success?
- What are the resources needed? (Data, time, people)
Scoping process
- Brainstorm business problems (not AI problems) -> identification of the problems
  Ex: What are the top 3 things you wish were working better?
  - Increase conversion
  - Increase margin
  - etc.
- Brainstorm AI solutions -> identification of the solutions (note: not everything can be solved with AI)
- Assess the feasibility and the value of potential solutions (double-check anything that you wish were true)
- Determine the milestones
- Budget for resources
Is this project technically feasible? Use an external benchmark (literature, other companies, competitors): has someone been able to build something similar?
Are we able to solve the problem? Is it at HLP (human-level performance)? If humans can do it, we have a better chance of being able to build it.
Why use HLP as a benchmark? People are very good at unstructured data tasks.
Criteria: Can a human, given the same data, perform the task? Do we have features that are predictive?
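A toy sketch of using an HLP estimate as a feasibility benchmark; all labels below are invented placeholders, not real data.

```python
# Compare model accuracy against a human-level performance (HLP) estimate
# obtained by scoring human labels against the ground truth.
def accuracy(predictions, ground_truth):
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


ground_truth = ["cat", "dog", "dog", "cat", "dog"]
human_labels = ["cat", "dog", "cat", "cat", "dog"]   # one human error -> HLP = 0.8
model_preds  = ["cat", "dog", "dog", "dog", "dog"]   # one model error -> 0.8

hlp = accuracy(human_labels, ground_truth)
model_acc = accuracy(model_preds, ground_truth)
print(f"HLP estimate: {hlp:.2f}, model: {model_acc:.2f}")
```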
Diligence on value
MLE metrics, ex: word-level accuracy, query-level accuracy, search result quality -> user engagement.
Business metrics: revenue.
Both are important. Have the technical and business teams try to agree on metrics that both are comfortable with.
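To make the difference between the MLE metrics concrete, here is a toy sketch of word-level vs. query-level accuracy; the queries are invented examples.

```python
# Word-level accuracy: fraction of individual words that are correct.
# Query-level accuracy: fraction of whole queries with no errors at all.
def word_level_accuracy(pred_queries, true_queries):
    correct = total = 0
    for pred, true in zip(pred_queries, true_queries):
        for p, t in zip(pred.split(), true.split()):
            correct += (p == t)
            total += 1
    return correct / total


def query_level_accuracy(pred_queries, true_queries):
    exact = sum(p == t for p, t in zip(pred_queries, true_queries))
    return exact / len(true_queries)


true_qs = ["weather in paris", "best pizza near me"]
pred_qs = ["weather in paris", "best pasta near me"]  # one wrong word

print(word_level_accuracy(pred_qs, true_qs))   # 6/7 words correct ~ 0.86
print(query_level_accuracy(pred_qs, true_qs))  # 1/2 queries fully correct = 0.5
```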
Milestone & Resourcing
Key specifications:
- ML metrics (accuracy, precision, recall) - see the sketch at the end of this page
- Software metrics (latency, throughput, etc., given compute resources)
- Business metrics (revenue, etc.)
- Resources needed (data, people, help from other teams)
- Timeline
If unsure, consider a benchmarking exercise or building a POC first.
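As a rough illustration of measuring two of the key specifications above (ML metrics and software metrics), here is a sketch with invented helper functions; `precision_recall` and `measure_latency` are assumptions for this example, not from the notes.

```python
# Compute precision/recall for a set of predictions and measure per-example latency.
import time


def precision_recall(preds, labels, positive=1):
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def measure_latency(predict_fn, inputs):
    """Approximate median per-example latency in milliseconds."""
    times = []
    for x in inputs:
        start = time.perf_counter()
        predict_fn(x)
        times.append((time.perf_counter() - start) * 1000)
    return sorted(times)[len(times) // 2]
```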