# Cost Sensitive One Against All (csoaa) multi-class example
## Overview
CSOAA stands for "Cost Sensitive One Against All", a multi-class predictive modeling reduction in VW.
## Purpose

The option `--csoaa <K>`, where `<K>` is the number of distinct classes, directs vw to perform cost-sensitive multi-class (as opposed to binary) classification. It extends `--oaa <K>` to support multiple labels per input example, with a cost associated with each label.
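To make the contrast concrete, here is a rough sketch of the same example written for each reduction (the feature names are made up for illustration). With `--oaa 3`, a training line carries a single label:

```
2 | feature_a feature_b
```

With `--csoaa 3`, the same example can list several labels, each with a cost (label 2 has the lowest cost, so it is the preferred one):

```
1:2.0 2:1.0 3:3.0 | feature_a feature_b
```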
## Notes

- Data-set labels must be in the natural-number set {1 .. <K>}.
- `<K>` is the maximum label value, and must be passed as an argument to `--csoaa`.
- The input/training format for `--csoaa <K>` is different than the traditional VW format (see the annotated sketch after this list):
  - It supports multiple labels on the same line.
  - Each label has a trailing cost. Cost syntax looks just like weight syntax: a colon followed by a floating-point number. For example, `4:3.2` means the class-label 4 with a cost of 3.2.
  - It is critical to note that costs are not weights; they pull in the opposite direction. A label with a lower cost is preferred over a label with a higher cost on the same line. That's why they are called 'costs'.
  - Another difference from the traditional vw input format is that every line (both in training and testing) must include all the allowed labels at the beginning (before the 1st `|` char).
- The reduction with `--csoaa` is to a regression problem (i.e. conditional mean estimation), so forcing the loss function to logistic does not make much sense. Generally, when using multi-class, you should leave `--loss_function` alone and let the algorithm use the built-in default.
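As a quick illustration of the format notes above, here is a single hypothetical training line for `--csoaa 4` (the tag and feature names are invented for this sketch):

```
1:1.0 2:0.5 4:2.0 some_tag| feat_x feat_y
```

All the labels allowed for this example (1, 2 and 4; label 3 is excluded here) appear before the `|`, each with its cost; label 2 has the lowest cost, so it is the label we want the learner to prefer.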
## Example
Assume we have a 3-class classification problem. We label our 3 classes {1,2,3}.
Our data set `csoaa.dat` is:
```
1:1.0 a1_expect_1| a
2:1.0 b1_expect_2| b
3:1.0 c1_expect_3| c
1:2.0 2:1.0 ab1_expect_2| a b
2:1.0 3:3.0 bc1_expect_2| b c
1:3.0 3:1.0 ac1_expect_3| a c
2:3.0 d1_expect_2| d
```
Notes:

- The first 3 examples (lines) have only one label (with a cost) each, and the next 3 examples have multiple labels on the same line. Any number of class-labels in {1 .. <K>} (1..3 in this case) is allowed on each line.
- We assign a lower cost to the label we want to be preferred, e.g. in line 4 (tagged `ab1_expect_2`) we have a cost of 1.0 for class-label 2, and a higher cost of 2.0 for class-label 1.
- The input feature section following the `|` is the same as in traditional VW: you may have multiple name-spaces, numeric features, and optional weights for features and/or name-spaces; a sketch follows this list. (Note that in this section the weights are weights, not costs, so they are positively correlated with chosen labels.)
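For instance, a line combining these elements might look like the following sketch (the name-spaces `words` and `counts` and their features are hypothetical, not part of `csoaa.dat` above):

```
1:2.0 2:1.0 some_tag|words quick brown |counts:0.5 len:4 caps:2
```

Here `words` is a name-space of boolean features, while `counts` is a name-space scaled by the weight 0.5 and containing the numeric features `len` and `caps`.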
We train:

```
vw --csoaa 3 csoaa.dat -f csoaa.model
```
Which gives us this progress output:
```
final_regressor = csoaa.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading from csoaa.dat
num sources = 1
average    since       example  example  current  current  current
loss       last        counter  weight   label    predict  features
0.000000   0.000000    3        3.0      known    3        2
0.833333   1.666667    6        6.0      known    1        3
finished run
number of examples = 7
weighted example sum = 7
weighted label sum = 0
average loss = 0.7143
best constant = 0
total feature number = 17
```
Now we can predict, loading the model `csoaa.model` and using the same data-set `csoaa.dat` as our test-set, writing the predictions to `csoaa.predict`:

```
vw -t -i csoaa.model csoaa.dat -p csoaa.predict
```
This is similar to what we do in vanilla classification or regression.
The resulting `csoaa.predict` file has contents:

```
1.000000 a1_expect_1
2.000000 b1_expect_2
3.000000 c1_expect_3
2.000000 ab1_expect_2
2.000000 bc1_expect_2
3.000000 ac1_expect_3
2.000000 d1_expect_2
```
Which is a perfect classification (a quick mechanical check is sketched after this list):

- all the `expect_1` lines have a predicted class of 1,
- all the `expect_2` lines have a predicted class of 2,
- and all the `expect_3` lines have a predicted class of 3.
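One way to check that claim mechanically, assuming (as in `csoaa.dat`) that each tag ends in an `expect_N` suffix encoding the expected class:

```
awk '{split($2, t, "_"); if (int($1) == t[3]) correct++} END {print correct "/" NR " correct"}' csoaa.predict
```

For the predictions above this should print `7/7 correct`.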
QED
## Difference from other VW formats
Test examples are different from standard VW test examples because you have to tell VW which labels are allowed. For example, assuming 4 possible labels (1,2,3,4), this is what a test line could look like:
```
1 2 3 4 | b d e
```
And here's another, where only labels (1,4) are allowed:
```
1 4 | b d e
```
At training time, if there's an example with label 2 that you know (for whatever reason) will never be label 4, you could specify it as:
```
1:1 2:0 3:1 | example...
```
This means that labels 1 and 3 have a cost of 1, label 2 has a cost of zero, and no other labels are allowed. You can do the same at test time:
```
1 2 3 | example...
```
VW will never predict anything other than the provided "possible" labels.
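For instance, assuming a hypothetical model file `csoaa4.model` trained with `--csoaa 4` (not the 3-class model from the example above), a restricted test line can be piped straight into vw:

```
# only labels 1 and 4 are listed, so the prediction must be one of them
echo "1 4 | b d e" | vw -t -i csoaa4.model -p /dev/stdout
```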
## Credits
- Thanks to Ciemo for the example and for asking the right Qs on the mailing list.
- Thanks to Stephane for patiently answering Ciemo's Qs.
- See also Hal's doc at: http://www.umiacs.umd.edu/~hal/tmp/multiclassVW.html