5. Machine Learning System Design - ZYL-Harry/Machine_Learning_study GitHub Wiki

Prioritizing what to work on

How should you spend your time to give a learning system (here, an email spam classifier) low error? Options:
1. Collect lots of data.
2. Develop sophisticated features based on email routing information.
3. Develop sophisticated features for the message body.
4. Develop a sophisticated algorithm to detect misspellings.

Error analysis

Recommended approach:
1. Start with a simple algorithm that can be implemented quickly; implement it and test it on the cross-validation data.
2. Plot learning curves to decide what is likely to help improve the algorithm.
3. Error analysis: manually examine the examples the algorithm got wrong and look for systematic trends in the types of examples it misclassifies.
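The error-analysis step can be sketched as a simple tally. This is a minimal illustration, not a prescribed procedure: the example messages, the hand-assigned categories, and the names `tally_errors` and `naive_predict` are all hypothetical.

```python
from collections import Counter


def tally_errors(examples, predict):
    """Count misclassified examples per hand-assigned category."""
    errors = Counter()
    for text, label, category in examples:
        if predict(text) != label:
            errors[category] += 1
    # Most frequent error categories first.
    return errors.most_common()


# Hypothetical cross-validation examples: (text, true label, manual category).
cv_examples = [
    ("cheap med1cine here", 1, "deliberate misspelling"),
    ("w4tches for sale", 1, "deliberate misspelling"),
    ("verify your password now", 1, "phishing"),
    ("buy cheap watches", 1, "product spam"),
    ("lunch at noon?", 0, "not spam"),
]


def naive_predict(text):
    # Toy classifier (an assumption): flags a message only if it contains "buy".
    return 1 if "buy" in text else 0


error_counts = tally_errors(cv_examples, naive_predict)
print(error_counts)
```

The category with the most errors is a candidate for where extra features or effort might pay off first.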

Tip:
A single real-number evaluation metric (numerical evaluation) makes it easy to see whether the error goes up or down when you try a new idea.

Error metrics for skewed classes

Problem:
With skewed classes, it is particularly tricky to come up with an appropriate error metric for the learning algorithm.
Skewed classes:
There are many more examples from one class than from the other, so an algorithm that just predicts "y=0" (or "y=1") all the time can achieve a very low classification error. Classification error is therefore a poor evaluation metric in this setting.
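Method: precision / recall (with y=1 as the rare class):
Precision = true positives / (true positives + false positives): of all examples predicted y=1, the fraction that actually are y=1.
Recall = true positives / (true positives + false negatives): of all examples that actually are y=1, the fraction that were predicted y=1.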
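As a minimal sketch of why precision and recall are more informative than accuracy on skewed classes, the snippet below scores an "always predict y=0" classifier. The data is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not something the notes specify.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall), treating y=1 as the rare positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(y_true, y_pred):
    """Fraction of examples classified correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical skewed data: 2 positives among 100 examples.
y_true = [1] * 2 + [0] * 98
always_zero = [0] * 100

acc = accuracy(y_true, always_zero)
prec, rec = precision_recall(y_true, always_zero)
print(acc, prec, rec)
```

The trivial classifier gets high accuracy, yet its recall is zero because it never detects a positive example.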

Trading off precision and recall

How to decide which precision/recall trade-off is better?
F_1 score (F score): the harmonic mean of precision P and recall R, F_1 = 2PR / (P + R)

Method: compute precision and recall for many thresholds, compute the F_1 score for each one, and pick the threshold with the highest F_1 score
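The threshold search can be sketched as follows. This is an illustration under assumptions: the labels, predicted scores, candidate thresholds, and the name `best_threshold` are hypothetical, and in practice the scores would come from the cross-validation set.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, thresholds):
    """Return (best F1, threshold), predicting y=1 when score >= threshold."""
    best_f1, best_th = 0.0, None
    for th in thresholds:
        y_pred = [1 if s >= th else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = f1_score(precision, recall)
        if f1 > best_f1:
            best_f1, best_th = f1, th
    return best_f1, best_th


# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

best_f1, best_th = best_threshold(y_true, scores, [0.3, 0.5, 0.8])
print(best_th, best_f1)
```

A higher threshold tends to raise precision and lower recall; the F_1 score rewards thresholds where neither is very low.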

How much data to train on

It's not who has the best algorithm that wins, it's who has the most data

First, ask whether a human expert could look at the features x and confidently predict y. If so, x carries enough information to predict y accurately.
More parameters -> low bias -> small cost function J_train(θ)
More data (training set) -> low variance -> J_test(θ) ≈ J_train(θ)
Together, these give a small cost function J_test(θ)

How to get more data

Artificial data synthesis:
1. Create a new dataset from scratch.
2. Expand the dataset by applying distortions to the original examples.
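Expanding a dataset by distortion might look like the sketch below. It assumes small grayscale images stored as nested lists and uses one-pixel shifts as the distortion; the helper names and the toy image are hypothetical. The distortions chosen should resemble variation you expect to see in real data.

```python
def shift_right(image, fill=0.0):
    """Shift every row one pixel to the right, padding with `fill`."""
    return [[fill] + row[:-1] for row in image]


def shift_left(image, fill=0.0):
    """Shift every row one pixel to the left, padding with `fill`."""
    return [row[1:] + [fill] for row in image]


def augment(dataset):
    """Return the original examples plus shifted copies with the same labels."""
    augmented = list(dataset)
    for image, label in dataset:
        augmented.append((shift_right(image), label))
        augmented.append((shift_left(image), label))
    return augmented


# Hypothetical 3x3 image of a vertical stroke, labelled "I".
original = [([[0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0]], "I")]

expanded = augment(original)
print(len(expanded))
```

Each original example yields two extra labelled examples, so the dataset triples in size without any new labelling work.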

Ceiling Analysis

Purpose: find out which component of a machine learning pipeline is most worth working on
Method: give each component the correct (ground-truth) data in turn, measure how much the overall system accuracy improves, and prioritize the component with the largest gain
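A ceiling analysis can be tabulated with a few lines of code. This sketch assumes the components are made perfect cumulatively, in pipeline order; the component names and accuracy figures (in percent) are hypothetical, as is the name `ceiling_analysis`.

```python
def ceiling_analysis(baseline, cumulative_accuracy):
    """Return (component, gain) pairs sorted by gain, largest first.

    `cumulative_accuracy` lists, in pipeline order, the overall accuracy
    measured when that component and all earlier ones are given
    ground-truth output (an assumed measurement protocol).
    """
    gains = []
    previous = baseline
    for component, acc in cumulative_accuracy:
        gains.append((component, acc - previous))
        previous = acc
    return sorted(gains, key=lambda g: g[1], reverse=True)


# Hypothetical three-stage pipeline; accuracies are whole percentages.
baseline_accuracy = 72
measurements = [
    ("text detection", 89),
    ("character segmentation", 90),
    ("character recognition", 100),
]

gains = ceiling_analysis(baseline_accuracy, measurements)
for component, gain in gains:
    print(f"{component}: +{gain} points")
```

With these made-up numbers, perfecting the first stage yields the largest gain, so that stage would be the first candidate for further work.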