Data definition - liniribeiro/machine_learning GitHub Wiki

Define data and establish baseline

Data definition questions:

What is the input x?
- Lighting, contrast? Resolution?
- What features need to be included?
What is the target label y?
- How can we ensure that labelers give consistent labels?

Major types of data problems

Best practices for structured and unstructured data are quite different. Data augmentation does not work on structured data: we cannot invent records that do not exist.

Small data -> clean labels are critical
Big data -> emphasis on data processes (how to collect and store the data)

Unstructured vs structured data

Unstructured data:

May or may not have a huge collection of unlabeled examples x
Humans can label more data
Data augmentation is more likely to be helpful
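As a rough illustration of augmentation on unstructured data, the sketch below varies lighting and contrast on a tiny grayscale image and flips it horizontally. The `adjust` and `flip_horizontal` helpers and the pixel values are hypothetical, and it assumes intensities in the range 0 to 255. Whether a given transformation preserves the label depends on the task.

```python
def adjust(image, brightness=0, contrast=1.0):
    """Return a new grayscale image with shifted brightness and scaled contrast.

    image is a list of rows of pixel intensities in [0, 255].
    Contrast is scaled around the mid-gray value 128.
    """
    def clamp(value):
        return max(0, min(255, int(round(value))))

    return [
        [clamp((pixel - 128) * contrast + 128 + brightness) for pixel in row]
        for row in image
    ]


def flip_horizontal(image):
    """Return a mirrored copy of the image (each row reversed)."""
    return [list(reversed(row)) for row in image]


# Hypothetical 2x3 grayscale image.
image = [[0, 64, 128], [128, 192, 250]]

# Each original example yields several new training examples x.
augmented = [
    adjust(image, brightness=30),   # brighter lighting
    adjust(image, contrast=0.5),    # lower contrast
    flip_horizontal(image),         # mirrored
]
for variant in augmented:
    print(variant)
```

The same idea does not carry over to structured data: perturbing a row of a transactions table would invent a record that never existed.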

Structured data:

May be more difficult to obtain more data
Human labeling may not be possible (with some exceptions)

Small data vs big data

Small data:

Clean labels are critical
Can manually look through the dataset and fix labels
Can get all the labelers to talk to each other

Big data:

Emphasis on data processes

Improving label consistency

Have multiple labelers label the same examples
When there's disagreement, have the labelers agree on a labeling convention and document it
If labelers believe that x doesn't contain enough information, consider changing x
Iterate until it's hard to significantly increase agreement
Standardize labels
Merge classes
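The steps above can be sketched as a small agreement check. This is a minimal illustration, not a prescribed tool: `agreement_report`, `merge_classes`, the defect labels and the image ids are all hypothetical. It measures the fraction of labeler pairs that agree, lists the examples to discuss, and shows how merging ambiguous classes changes the agreement rate.

```python
from collections import Counter


def agreement_report(labels_by_example):
    """Return (pairwise agreement rate, ids of examples with any disagreement).

    labels_by_example maps an example id to the list of labels that
    different labelers assigned to that same example.
    """
    disagreements = []
    agreeing_pairs = 0
    total_pairs = 0
    for example_id, labels in labels_by_example.items():
        counts = Counter(labels)
        n = len(labels)
        # Count labeler pairs that gave the same label for this example.
        agreeing_pairs += sum(c * (c - 1) // 2 for c in counts.values())
        total_pairs += n * (n - 1) // 2
        if len(counts) > 1:
            disagreements.append(example_id)
    rate = agreeing_pairs / total_pairs if total_pairs else 1.0
    return rate, disagreements


def merge_classes(labels_by_example, mapping):
    """Apply a documented convention that maps several labels to one class."""
    return {
        example_id: [mapping.get(label, label) for label in labels]
        for example_id, labels in labels_by_example.items()
    }


# Hypothetical labels from three labelers on four images.
raw = {
    "img_1": ["scratch", "scratch", "scratch"],
    "img_2": ["deep scratch", "scratch", "scratch"],
    "img_3": ["dent", "dent", "dent"],
    "img_4": ["shallow scratch", "deep scratch", "scratch"],
}

rate_before, to_review = agreement_report(raw)
merged = merge_classes(
    raw, {"deep scratch": "scratch", "shallow scratch": "scratch"}
)
rate_after, still_disagree = agreement_report(merged)

print(f"before merging: {rate_before:.2f}, review {to_review}")
print(f"after merging:  {rate_after:.2f}, review {still_disagree}")
```

Raw pairwise agreement is used here for simplicity. A chance-corrected statistic such as Cohen's kappa is a common alternative when class frequencies are very unbalanced.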

Why measure HLP (human-level performance)?

Estimate Bayes error/irreducible error to help with error analysis and prioritization
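One way to use HLP for prioritization is to compare model accuracy with HLP per category and look at where the gap is largest. The sketch below assumes that approach. The category names and accuracy numbers are invented for illustration, and `gap_to_hlp` is a hypothetical helper.

```python
def gap_to_hlp(results):
    """Rank categories by how far the model falls below human-level performance.

    results maps a category name to (model_accuracy, hlp_accuracy).
    Returns (category, gap) pairs, largest gap first.
    """
    gaps = {
        category: hlp - model
        for category, (model, hlp) in results.items()
    }
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-category results: (model accuracy, HLP).
results = {
    "clean speech": (0.94, 0.95),
    "car noise": (0.89, 0.93),
    "low bandwidth": (0.70, 0.70),
}

ranked = gap_to_hlp(results)
for category, gap in ranked:
    print(f"{category}: gap to HLP = {gap:.2f}")
```

A category where the model already matches HLP (gap near zero) probably has little room for improvement, even if its absolute accuracy is low.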

Uses of HLP:

In academia, establish and beat a respectable benchmark to support publication.
Business or PM asks for 99% accuracy; measuring HLP helps establish a reasonable target.
Prove the ML system is superior to humans doing the job, and thus the PO (product owner) should adopt it. (Use this with caution.)

Raising HLP

When the ground truth is externally defined, HLP gives an estimate of the Bayes error/irreducible error. When the labels themselves come from human labelers, improving label consistency will raise HLP.
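A minimal sketch of this effect, assuming HLP is measured as the fraction of examples where one human's label matches a reference label that was itself written by a human. The labels and the `standardize` convention are hypothetical.

```python
def hlp(reference_labels, human_labels):
    """Fraction of examples where the human label matches the reference label."""
    if len(reference_labels) != len(human_labels):
        raise ValueError("both label lists must cover the same examples")
    matches = sum(r == h for r, h in zip(reference_labels, human_labels))
    return matches / len(reference_labels)


def standardize(label):
    """Hypothetical convention: trim whitespace, lowercase, drop trailing punctuation."""
    return label.strip().lower().rstrip(".!?")


# Hypothetical labels: the "ground truth" here is just another human's label.
reference = ["Scratch", "dent", "no defect", "scratch."]
human = ["scratch", "Dent ", "no defect", "dent"]

hlp_raw = hlp(reference, human)
hlp_standardized = hlp(
    [standardize(label) for label in reference],
    [standardize(label) for label in human],
)

print(f"HLP with raw labels:          {hlp_raw:.2f}")
print(f"HLP with standardized labels: {hlp_standardized:.2f}")
```

Only the last example is a real disagreement; the others differ in formatting alone, so a low raw HLP here says more about the labeling instructions than about the difficulty of the task.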

HLP on structured data

Structured data problems are less likely to involve human labelers, thus HLP is less frequently used.

Some exceptions:

User id merging
Based on network traffic, is the computer hacked?
Is the transaction fraudulent?
Spam account? Bot?
From GPS, what is the mode of transportation?

Obtaining Data

How long should you spend obtaining data?

Get into the iteration loop (train a model, run error analysis, improve the data) as quickly as possible
Instead of asking: How long would it take to obtain m examples? Ask: How much data can we obtain in k days?
Exception: If you have worked on the problem before and from experience you know you need m examples.
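The "how much data in k days" question can be turned into a quick back-of-the-envelope estimate. The sketch below is illustrative only: the source names, setup times and daily throughputs are made-up assumptions, and `examples_in_k_days` is a hypothetical helper.

```python
def examples_in_k_days(sources, k_days):
    """Estimate how many examples each source yields within k_days.

    sources maps a source name to (setup_days, examples_per_day).
    Days spent on setup produce no examples.
    """
    estimate = {}
    for name, (setup_days, per_day) in sources.items():
        productive_days = max(0, k_days - setup_days)
        estimate[name] = productive_days * per_day
    return estimate


# Hypothetical sources: (setup time in days, examples per day once running).
sources = {
    "already owned": (0, 200),
    "crowdsourced": (2, 1000),
    "purchased": (5, 5000),
}

for k in (3, 7):
    print(k, examples_in_k_days(sources, k))
```

With a short budget the fastest source to start wins even if its throughput is low, which is the point of fixing k first.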

Labeling data

Options: in-house vs. outsourced vs. crowdsourced
Having MLEs label data is expensive. But doing this for a few days is usually fine.
Who is qualified to label? SME (subject matter expert)
Don't increase your data by more than 10x at a time: it is very unpredictable what happens to your model, and what problems come with it, when the data size grows by more than 10x.
Over-investing in collecting a large amount of data is not always the best decision.
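The 10x guideline can be read as a growth schedule: step the dataset up in stages and retrain and run error analysis between stages. A minimal sketch, assuming that interpretation. The `growth_schedule` helper and the sizes are hypothetical.

```python
def growth_schedule(current_size, target_size, max_factor=10):
    """Dataset sizes to step through, each at most max_factor times the previous one."""
    if current_size <= 0:
        raise ValueError("current_size must be positive")
    if max_factor <= 1:
        raise ValueError("max_factor must be greater than 1")
    sizes = [current_size]
    while sizes[-1] < target_size:
        # Grow by at most max_factor, without overshooting the target.
        sizes.append(min(sizes[-1] * max_factor, target_size))
    return sizes


# Hypothetical: 1,000 labeled examples today, 250,000 wanted eventually.
print(growth_schedule(1_000, 250_000))
```

Each intermediate size is a checkpoint to decide whether collecting more data is still the best use of effort.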