Modeling - Data Iteration

Data-centric AI development

Model-centric view -> Take the data that you have and develop a model that does as well as possible on it

  • Hold the data fixed and iteratively improve the code/model

Data-centric view -> The quality of the data is paramount. Use tools to improve data quality; this will allow multiple models to do well. If your data is good enough, the model will do just fine.

  • Hold the code fixed and iteratively improve the data

How can we make our learning algorithm even better?


  • Data quality: how can I improve data quality?
  • One option is data augmentation.

Data augmentation

Data augmentation can be a great way to get more data, especially when working with unstructured data such as images, audio, and text.

Data augmentation creates new examples of data that your model can learn from.

Goal:

  • Create realistic examples that the algorithm currently does poorly on, but that humans do well on.

Checklist:

  • Does it sound realistic?
  • Is the x -> y mapping clear? (e.g., can humans still recognize the speech?)
  • Is the algorithm currently doing poorly on it?
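
As a concrete illustration, here is a minimal sketch of one common augmentation for speech data: mixing background noise into a clean clip at a chosen signal-to-noise ratio. It assumes NumPy arrays of raw audio samples; the function name and default SNR are my own choices, not from the lecture.

```python
import numpy as np

def augment_with_noise(clip: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Mix `noise` into `clip` at the given signal-to-noise ratio (in dB)."""
    # Trim or tile the noise so it matches the clip length.
    if len(noise) < len(clip):
        noise = np.tile(noise, int(np.ceil(len(clip) / len(noise))))
    noise = noise[: len(clip)]

    # Scale the noise so the mixture reaches the requested SNR.
    clip_power = np.mean(clip ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = clip_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)

    return clip + noise
```

Each augmented clip should still pass the checklist above: it sounds realistic, a human can still recognize the speech, and the current model struggles with it.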

Data iteration loop (diagram).
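
The sketch below is a runnable toy version of such a loop, assuming scikit-learn and synthetic data: the model code is held fixed while the data is improved each round. Here the "improve the data" step simply duplicates training examples of the class with the worst recall, standing in for real data collection or augmentation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic, imbalanced binary classification data standing in for a real problem.
X, y = make_classification(n_samples=2000, n_informative=5, weights=[0.9, 0.1], random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_dev, y_dev = X[1000:], y[1000:]

for round_ in range(3):
    # Hold the code fixed: the same model and hyperparameters every round.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Error analysis: find the class the model does worst on (lowest recall).
    cm = confusion_matrix(y_dev, model.predict(X_dev))
    per_class_recall = cm.diagonal() / cm.sum(axis=1)
    worst_class = int(np.argmin(per_class_recall))
    print(f"round {round_}: per-class recall = {np.round(per_class_recall, 3)}")

    # Improve the data: duplicate training examples of the worst class,
    # a stand-in for collecting or augmenting more data for that slice.
    mask = y_train == worst_class
    X_train = np.vstack([X_train, X_train[mask]])
    y_train = np.concatenate([y_train, y_train[mask]])
```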

Can adding data hurt performance?

For unstructured data problems, if:

  • The model is large, and
  • The mapping from x -> y is clear,

then adding data rarely hurts accuracy.

Structured Data Problems

For structured data it is hard to use data augmentation or to collect more data, which is why adding features to the existing examples is a good way to try to improve performance.

Adding features

  • Based on looking at past data, both x and y

What added features could help the algorithm make a better decision?

Error analysis can be harder on structured data, especially if there is no good baseline (such as HLP, human-level performance) to compare against.
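
For illustration, here is a minimal sketch of adding hand-designed features to a tabular dataset with pandas. The column names and derived features are hypothetical; the point is that the new columns are computed from past x and y rather than from newly collected data.

```python
import pandas as pd

# Hypothetical interaction data (users ordering from restaurants).
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "restaurant_id": [10, 11, 10, 12],
    "n_meat_orders": [0, 0, 5, 7],
    "n_total_orders": [4, 3, 6, 9],
    "menu_vegetarian_items": [2, 0, 2, 1],
    "menu_total_items": [10, 8, 10, 12],
})

# Added features based on past data: does this user mostly avoid meat,
# and does this restaurant actually offer vegetarian options?
df["user_pct_meat_orders"] = df["n_meat_orders"] / df["n_total_orders"]
df["restaurant_pct_vegetarian_menu"] = df["menu_vegetarian_items"] / df["menu_total_items"]
```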

Experiment tracking

What to track?

  • Algorithm/code versioning
  • Dataset used
  • Hyperparameters
  • Results

Tracking tools

  • Text files
  • Spreadsheet
  • Experiment tracking system (e.g., SageMaker Studio)

Desirable features

  • Information needed to replicate results
  • Experiment results, ideally with summary metrics/analysis
  • Resource monitoring, visualization, model error analysis
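
As a minimal sketch of the simplest approach, the snippet below appends one JSON record per experiment to a text file, capturing the four items under "What to track". The file name and fields are my own choices; a real experiment tracking system would add the monitoring, visualization, and error-analysis features listed above.

```python
import datetime
import json
import subprocess

def log_experiment(dataset: str, hyperparameters: dict, results: dict,
                   path: str = "experiments.jsonl") -> None:
    record = {
        "timestamp": datetime.datetime.utcnow().isoformat(),
        # Algorithm/code versioning: record the current git commit hash
        # (empty string if this is not run inside a git repository).
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        "dataset": dataset,
        "hyperparameters": hyperparameters,
        "results": results,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_experiment(
    dataset="speech_v3_train.csv",          # hypothetical dataset name
    hyperparameters={"learning_rate": 1e-3, "epochs": 10},
    results={"dev_accuracy": 0.91},
)
```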

From Big Data to Good Data

High-performance and reliable ML model -> good data

Good data:

  • Good coverage of input x
  • Defined consistently
  • Feedback from production data (track concept drift and data drift)
  • Sized appropriately
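
As one way to get that feedback from production data, here is a minimal sketch of a data drift check on a single numeric feature: compare its training-time distribution against recent production values with a two-sample Kolmogorov-Smirnov test. SciPy is assumed; the synthetic data and the 0.05 threshold are illustrative choices, not a recommendation.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # distribution at training time
prod_feature = rng.normal(loc=0.3, scale=1.0, size=1000)   # recent production data, slightly shifted

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.05:
    print(f"Possible data drift detected (KS statistic={stat:.3f}, p={p_value:.4f})")
```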