Data pipelines

A data pipeline is needed when your data goes through several processing steps before reaching the final output.

raw data -> data cleaning -> ml
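A minimal sketch of that flow in Python (the raw rows, the cleaning rules, and the trivial "model" are all illustrative assumptions):

```python
# Sketch of the raw data -> data cleaning -> ML flow.
# Data and cleaning rules here are illustrative assumptions.

def load_raw_data():
    # In practice this would read from files, a database, or an API.
    return [" Alice , 34 ", "Bob, -1", "Carol, 29", ""]

def clean(rows):
    cleaned = []
    for row in rows:
        if not row.strip():          # drop empty rows
            continue
        name, age = [f.strip() for f in row.split(",")]
        age = int(age)
        if age < 0:                  # drop invalid ages
            continue
        cleaned.append((name, age))
    return cleaned

def train_model(examples):
    # Placeholder for the ML step: here we just compute a mean age.
    ages = [age for _, age in examples]
    return sum(ages) / len(ages)

raw = load_raw_data()
data = clean(raw)
model = train_model(data)
print(model)  # mean age of the valid rows -> 31.5
```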

POC and production phases

POC (proof of concept):

  • Goal is to decide whether the application is workable and worth deploying
  • Focus on getting the prototype to work
  • It's OK if the data pre-processing is manual, but take extensive notes/comments so the steps can be replicated later

Production phase:

  • After the project's utility is established, use more sophisticated tools to make sure the data pipeline is replicable
  • Ex: TensorFlow Transform, Apache Beam, Airflow
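Before reaching for those tools, the core idea they provide — every step declared once and rerun in the same order — can be sketched in plain Python (the step names and transforms below are illustrative assumptions):

```python
# Tool-agnostic sketch of a replicable pipeline: the steps are declared
# once, in order, so anyone can rerun the exact same sequence.
# Step names and transforms are illustrative assumptions.

PIPELINE = [
    ("lowercase", lambda rows: [r.lower() for r in rows]),
    ("strip_whitespace", lambda rows: [r.strip() for r in rows]),
    ("drop_empty", lambda rows: [r for r in rows if r]),
]

def run_pipeline(rows, steps=PIPELINE):
    log = []                      # record which steps ran, for lineage
    for name, fn in steps:
        rows = fn(rows)
        log.append(name)
    return rows, log

out, lineage = run_pipeline(["  Hello ", "", "WORLD"])
print(out)      # ['hello', 'world']
print(lineage)  # ['lowercase', 'strip_whitespace', 'drop_empty']
```

Tools like Airflow or Apache Beam provide the production-grade version of this: scheduling, retries, and distributed execution on top of the same declared-steps idea.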

Data pipeline example: *(screenshot)*

We need to keep track of data provenance (where the data came from) and lineage (the sequence of steps from raw data to the end of the pipeline).

Extensive documentation helps track provenance and lineage, but to build robust and maintainable systems there are sophisticated tools that keep track of what happens in the pipeline without too much unnecessary complexity. TensorFlow Transform is one example.

TIP: Make extensive use of metadata. Metadata is data about data, and it can be very useful.

Useful for:

  • Error analysis: spotting unexpected effects
  • Keeping track of data provenance
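Both uses can be sketched as metadata attached to every artifact a pipeline step produces (the schema below — source, step, parents — is an illustrative assumption, not a standard):

```python
import datetime

# Sketch of tracking provenance and lineage via metadata attached
# to each artifact. The schema is an illustrative assumption.

def make_artifact(data, source, step, parents=()):
    return {
        "data": data,
        "meta": {
            "source": source,          # provenance: where the data came from
            "step": step,              # lineage: which step produced it
            "parents": list(parents),  # upstream steps, for error analysis
            "created": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        },
    }

raw = make_artifact([3, -1, 5], source="sensor_dump_v1", step="ingest")
clean = make_artifact(
    [x for x in raw["data"] if x >= 0],
    source="sensor_dump_v1",
    step="drop_negative",
    parents=["ingest"],
)
print(clean["data"])             # [3, 5]
print(clean["meta"]["parents"])  # ['ingest']
```

When an unexpected effect shows up downstream, the `parents` chain tells you which steps to re-examine, and `source` tells you which raw data to re-pull.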

Scoping

Define Project

  • What project should we work on?
  • What are the metrics for success?
  • What are the resources needed? (Data, time, people)

Scoping process

  1. Brainstorm business problems (not AI problems) -> identification of the problems

Ex:

  • What are the top 3 things you wish were working better?
    • Increase Conversion
    • Increase margin
    • etc.
  2. Brainstorm AI solutions -> identification of the solutions (note: not everything can be solved with AI)

  3. Assess the feasibility and the value of potential solutions (double-check anything you wish were true)

  4. Determine the milestones

  5. Budget for resources

Is this project technically feasible? Use external benchmarks (literature, other companies, competitors). Has someone been able to make something similar?

Are we able to solve the problem? Is the task at HLP (human-level performance), i.e. something humans can do? If yes, we have a better chance of being able to build it.


Why use HLP as a benchmark? People are very good at unstructured data tasks.

Criteria: Can a human, given the same data, perform the task? Do we have features that are predictive?

Diligence on value

ML metrics (ex: word-level accuracy -> query-level accuracy -> search result quality -> user engagement). Business metrics (ex: revenue).

Both are important. Have the technical and business teams try to agree on metrics that both are comfortable with.
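As a toy illustration of computing the ML-side metrics (the labels below are made up):

```python
# Toy precision/recall computation; the labels are made-up numbers.
y_true = [1, 0, 1, 1, 0, 1]  # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 0]  # model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many we caught
print(precision, recall)    # 1.0 0.5
```

Numbers like these are what the technical team brings to the table; the business team then has to connect them to engagement and revenue.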

Milestone & Resourcing

Key specifications:

  • ML metrics (accuracy, precision, recall)
  • Software metrics (latency, throughput, etc., given compute resources)
  • Business Metrics (revenue, etc.)
  • Resources needed (data, people, help from other teams)
  • Timeline

If unsure, consider benchmark exercise or building a POC first.