5. Machine Learning System Design - ZYL-Harry/Machine_Learning_study GitHub Wiki

Prioritizing what to work on

  • How should you spend your time to make an email classifier (such as a spam filter) have low error?
    1. collect lots of data
    2. develop sophisticated features based on email routing information
    3. develop sophisticated features for the message body
    4. develop a sophisticated algorithm to detect misspellings

Error analysis

  • Recommended approach:
    1. start with a simple algorithm that can be implemented quickly; implement it and test it on the cross-validation data
    2. plot learning curves to decide what is likely to improve the algorithm
    3. error analysis: manually examine the examples the algorithm got wrong and look for systematic trends in the types of examples it misclassifies
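Step 3 can be made concrete by hand-labelling each misclassified cross-validation example with a category and counting the categories. The following is only a minimal sketch: the category names and the list of labels are hypothetical placeholders, not data from these notes.

```python
from collections import Counter

# Hypothetical hand-assigned categories for ten misclassified
# cross-validation emails (illustrative values only).
error_categories = [
    "pharma", "phishing", "replica", "phishing", "other",
    "phishing", "pharma", "phishing", "other", "phishing",
]

def tally_errors(categories):
    """Return (category, count, share) tuples, most frequent first."""
    counts = Counter(categories)
    total = len(categories)
    return [(name, n, n / total) for name, n in counts.most_common()]

summary = tally_errors(error_categories)
for name, n, share in summary:
    print(f"{name:10s} {n:3d}  {share:.0%}")
```

The category at the top of the tally is a reasonable first candidate for new features, since fixing it would remove the largest share of the observed errors.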

Tip:
A single real-number evaluation metric (numerical evaluation) makes it easy to see whether the error goes up or down when you try a new idea

Error metrics for skewed classes

  • Problem:
    it is particularly tricky to choose an appropriate error metric for the learning algorithm when the classes are skewed
  • Skewed classes:
    there are far more examples of one class than of the other, so an algorithm that always predicts "y=0" (or always "y=1") already achieves a low classification error; classification error is therefore a poor evaluation metric
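  • Method: precision / recall (with y=1 as the rare class):
    precision = true positives / (true positives + false positives): of the examples predicted y=1, the fraction that really are y=1
    recall = true positives / (true positives + false negatives): of the examples that really are y=1, the fraction predicted y=1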
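To see why classification accuracy misleads on skewed classes, here is a minimal sketch that compares accuracy with precision and recall for a classifier that always predicts y=0. The dataset (1 positive in 100 examples) and the function names are assumptions made for illustration; returning 0 when a denominator is zero is one common convention, not the only one.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (y = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention assumed here: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def accuracy(y_true, y_pred):
    """Fraction of examples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical skewed dataset: 1 positive out of 100 examples.
y_true = [1] + [0] * 99
always_zero = [0] * 100  # a "classifier" that ignores its input

acc = accuracy(y_true, always_zero)
prec, rec = precision_recall(y_true, always_zero)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

The always-zero predictor scores 99% accuracy here yet has zero recall, which is exactly the failure that precision and recall expose.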

Trading off precision and recall

image

  • How to compare different precision/recall pairs?
    F_1 score (F score): the harmonic mean of precision P and recall R
    F_1 = 2PR / (P + R)
    Method: compute precision and recall for many thresholds, compute the F_1 score for each, and pick the threshold with the highest F_1 score
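The threshold search can be sketched as follows. This is a hedged illustration: the labels, scores, and candidate thresholds are hypothetical, the helper names are made up for this example, and in practice the comparison would normally be run on the cross-validation set.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_at(y_true, scores, threshold):
    """Predict y = 1 when score >= threshold, then compute precision and recall."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_threshold(y_true, scores, thresholds):
    """Return (threshold, F_1) for the candidate threshold with the highest F_1."""
    scored = [(f1_score(*precision_recall_at(y_true, scores, t)), t) for t in thresholds]
    best_f1, best_t = max(scored)
    return best_t, best_f1

# Hypothetical labels and predicted probabilities (illustrative only).
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

best_t, best_f1 = best_threshold(y_true, scores, [0.3, 0.5, 0.7])
print(best_t, round(best_f1, 3))
```

A high threshold favours precision and a low one favours recall; the F_1 score collapses the two into one number so the candidates can be ranked.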

How much data to train on

It's not who has the best algorithm that wins, it's who has the most data

  • First check: could a human expert look at the features x and confidently predict y? If so, x contains enough information to predict y accurately
  • More parameters → low bias → small training cost J_train(θ)
  • More data (a larger training set) → low variance → J_test(θ)≈J_train(θ)
    Together, these give a small test cost J_test(θ)

How to get more data

  • Artificial data synthesis:
    1. create new data from scratch
    2. expand the dataset by applying distortions to the original examples
    image
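Expanding a dataset by distortion can be sketched in a few lines. This is an assumption-laden illustration, not the method used in these notes: the 3x3 "image", the two distortions (a one-pixel shift and small uniform noise), and all function names are hypothetical.

```python
import random

def shift_right(image, fill=0.0):
    """Shift every row one pixel to the right, padding the left edge with `fill`."""
    return [[fill] + row[:-1] for row in image]

def add_noise(image, scale, rng):
    """Add uniform noise in [-scale, scale] to each pixel, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.uniform(-scale, scale))) for px in row]
            for row in image]

def augment(dataset, rng, noise_scale=0.05):
    """Return each original (image, label) pair plus two distorted copies of it."""
    out = []
    for image, label in dataset:
        out.append((image, label))
        out.append((shift_right(image), label))                  # label is unchanged
        out.append((add_noise(image, noise_scale, rng), label))  # label is unchanged
    return out

# Hypothetical 3x3 grayscale "image" of a vertical stroke, labelled 1.
tiny = [[0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0]]

augmented = augment([(tiny, 1)], random.Random(0))
print(len(augmented))
```

Distortions tend to help most when they resemble variation expected in real inputs; the shift and noise above are only stand-ins for such domain-specific choices.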

Ceiling Analysis

  • Purpose: figure out which component of a machine learning pipeline is most worth working on
  • Method: give each component the correct (ground-truth) output in turn, measure how much the overall system accuracy improves, and prioritize the component with the largest gain
    image
    image
    image
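The bookkeeping behind a ceiling analysis can be sketched as below. The stage names and accuracy figures are hypothetical, and the sketch assumes stages are made perfect cumulatively in pipeline order, so each gain is measured against the previous row.

```python
def ceiling_analysis(baseline, accuracy_with_perfect_stages):
    """Compute the accuracy gain attributable to perfecting each stage.

    accuracy_with_perfect_stages is a list of (stage, accuracy) pairs in
    pipeline order, where accuracy is the overall system accuracy when this
    stage and all earlier stages are fed ground-truth outputs.
    """
    gains = []
    previous = baseline
    for stage, acc in accuracy_with_perfect_stages:
        gains.append((stage, acc - previous))
        previous = acc
    return gains

# Hypothetical overall accuracies in percent (illustrative only).
baseline = 72
measured = [("stage_a", 89), ("stage_b", 90), ("stage_c", 100)]

gains = ceiling_analysis(baseline, measured)
most_valuable = max(gains, key=lambda item: item[1])[0]
for stage, gain in gains:
    print(f"{stage}: +{gain} points")
print("work on:", most_valuable)
```

With these made-up numbers, a perfect stage_b would add only one point, so effort spent there has a low ceiling compared with stage_a.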