It's well known that the best constant value printed by VW in its work summary is correct only for squared loss. I have also pointed out that the best constant's loss is never printed for label pairs other than {0,1}, and John Langford suggested that anyone could try to fix this, otherwise he might remove the feature altogether.
So, I decided to try. Below is a fairly scrupulous description of what I did in the code and why, so anyone can easily verify my conclusions. I am neither an ML nor a Maple expert, so such verification really makes sense.
First of all, let's check how it is done in the current VW sources:
1. There are two squared loss implementations in loss_functions.cc: squaredloss and classic_squaredloss. Both classes' getLoss() functions return (prediction - label)^2. The difference is that the first one returns a modified result when the prediction falls outside the min/max label boundaries (for faster convergence of gd, I suppose). I will treat them as a single loss with the classic loss function.
2. As can be seen from main.cc, the best constant is calculated in the following way (simplified):
best_constant = (weighted_labels - initial_t) / (weighted_examples - weighted_unlabeled_examples)
Let's check what this means. For further discussion I'll use the following definitions:
N - count of examples used for training.
N' - count of examples processed.
c - class labels ( c1 - 1st class label, c2 - second one; c1 < c2).
n - count of c1 examples observed.
m - count of c2 examples observed.
u - count of unlabeled examples observed.
N' = n + m + u = N + u.
w - vector of example weights.
From simple_label.cc we can find:
weighted_labels = sum(c*w), and by definition sum(c*w) = n*c1*w + m*c2*w
weighted_examples = sum(w), i.e. (n + m + u)*w
I treat example weights in vw as a multiplier for some observations, so training on an example with weight 10 is equivalent to training on 10 copies of that example. In most cases weights are 1, and in some algorithms implemented in vw they are even hardcoded to 1.
Given that, we can omit the weight values, as they are already accounted for in the global variables above. Thus we will treat our variables as:
weighted_labels = n*c1 +m*c2
weighted_examples = n+m+u = N'
From parse_args.cc:
weighted_unlabeled_examples is initialized with the initial_t parameter.
From simple_label.cc:
weighted_unlabeled_examples += ld->label == FLT_MAX ? ld->weight : 0;
That means
weighted_unlabeled_examples = initial_t + sum(w |c=undefined)
i.e. weighted_unlabeled_examples = initial_t + u
Putting all of the above together:
best_constant = ((n*c1+m*c2) - initial_t) / (N' - initial_t - u)
= ((n*c1+m*c2) - initial_t) / (N - initial_t)
= ((n*c1+m*c2) - initial_t) / (n+m - initial_t)
As the best value is valid for squared loss only, we can expect that the author assumed c1 is always 0 and c2 is 1. So he expected to get the following:
best_constant = (m - initial_t) / (m+n - initial_t)
3. Let's make sure we understand where this formula comes from.
The simple squared loss function is l(x) = (c - x)^2, where c is the class label.
And the sum of loss functions over all examples (n with label c1 and m with label c2) will be:
L(x) = n*(c1 - x)^2 + m*(c2 - x)^2
where x would be a vector of predictions, but as we are looking for the best constant value this vector contains only one constant value x. I won't consider the averaged L() value, as the optimal x is the same for both functions and the first one is simpler.
Using Maple to find the x that minimizes L(x) (i.e. solving L'(x) = 0) gives x = (n*c1 + m*c2) / (n + m).
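For readers without Maple at hand, the same minimum follows from a one-line derivative check (my own sketch, not part of the original Maple output):

\[
L'(x) = -2n(c_1 - x) - 2m(c_2 - x) = 0
\;\Longrightarrow\;
(n + m)\,x = n\,c_1 + m\,c_2
\;\Longrightarrow\;
x = \frac{n\,c_1 + m\,c_2}{n + m}.
\]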
If we assume c1 = 0 and c2 = 1, we get the same best_constant formula, but without the initial_t smoothing.
Well, now I think we better understand how the author calculated the best constant for squared loss, and we should be able to reproduce the same calculation for other loss functions.
The only thing that is unclear to me is why the numerator and denominator are both decreased by initial_t. I couldn't find a place in the code where initial_t could get mixed into weighted_labels. I suspect that's a mistake (at least for the current version of the source code), and I won't use that approach until someone corrects me. In my opinion, best_constant should be calculated in the following way:
best_constant = weighted_labels / (weighted_examples - weighted_unlabeled_examples + initial_t)
Here initial_t is added to compensate for the extra value that weighted_unlabeled_examples receives at initialization in parse_args.cc. For example, with initial_t = 1, u = 2 unlabeled and N = 10 labeled examples we get weighted_examples = 12 and weighted_unlabeled_examples = 3, so the denominator is 12 - 3 + 1 = 10 = N, as intended.
4. Counting examples.
I think all the manipulations with weighted examples and labels described above are just a way to obtain the counts of examples seen for each class; those counts are what the best constant estimation needs. There are no variables in vw that count examples per class, so the authors were forced to derive them from implicitly related counters. They use weighted_examples minus weighted_unlabeled_examples to get the total number of labeled examples, and weighted_labels to get the count of examples from class 2. Naturally, that approach won't work when the label of class 1 is not 0.
I don't like the idea of adding new variables to shared data or other global objects either, and I would like to make minimal changes to other people's code. So we'll stick with estimating the example counts from the weighted variables.
Still, we need to find a way to do that for any pair of labels, as the other loss functions require the pair {-1,1} or even accept an arbitrary pair of values.
So what do we know now?
We can calculate N = weighted_examples - weighted_unlabeled_examples + initial_t = n+m
and we know that weighted_labels (let's call it W) equals n*c1 + m*c2.
Let's solve this system of equations:
N = n + m
W = n*c1+ m*c2
Substituting m = N - n into the second equation gives W = n*(c1 - c2) + c2*N, so the result is: n = (W - c2*N) / (c1 - c2), m = N - n
For example, for classes {0,1}: n = N - W, m = W,
and for {-1,1}: n = (N - W) / 2, m = (N + W) / 2.
Seems legit. Let's add code for obtaining the n and m values for any pair of class labels from the weighted sums:
float weighted_labeled_examples = (float)(all.sd->weighted_examples - all.sd->weighted_unlabeled_examples + all.initial_t);
label1_cnt = (all.sd->weighted_labels - label2*weighted_labeled_examples)/(label1 - label2);
label2_cnt = weighted_labeled_examples - label1_cnt;
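As a quick sanity check of these two lines outside VW, here is a standalone snippet with made-up counts (n = 3, m = 7) and the {-1,1} label pair; everything except the two formula lines is my own scaffolding:

// Standalone sanity check (not VW code): recover per-label counts
// from the weighted sums for hypothetical counts n = 3, m = 7.
#include <cstdio>

int main()
{
    const float label1 = -1.f, label2 = 1.f;            // c1 < c2
    const float n = 3.f, m = 7.f;                        // true counts (made up)
    const float weighted_labeled_examples = n + m;       // N
    const float weighted_labels = n*label1 + m*label2;   // W
    // the same two lines as above:
    float label1_cnt = (weighted_labels - label2*weighted_labeled_examples) / (label1 - label2);
    float label2_cnt = weighted_labeled_examples - label1_cnt;
    std::printf("n = %.1f, m = %.1f\n", label1_cnt, label2_cnt);  // prints n = 3.0, m = 7.0
    return 0;
}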
5. Defining the best constant for each loss function
OK, so now we have a function for getting the n and m values, as well as a general understanding of how the best constant can be calculated for an arbitrary loss function. Let's define the best constant value for all loss functions supported in vw.
We already found it for both squared losses: best_constant = (n*c1 + m*c2) / (n + m).
I prefer to leave it with generic label values {c1, c2}, because the current squared loss implementation can work with {0,1} as well as with {-1,1}, and perhaps with any other pair.
The next one is hinge loss.
It's a piecewise function, and in loss_functions.cc it is defined only for the pair {-1,1}, so let's keep it as simple as possible: l(x) = max(0, 1 - c*x), and therefore L(x) = n*max(0, 1 + x) + m*max(0, 1 - x).
My Maple skills didn't allow me to find its minimum automatically, so let's do it analytically.
There are 2 obvious cases:
if n is 0, then only labels 1 were observed and the best constant is any x >= 1;
if m is 0, then only labels -1 were observed and the best constant is any x <= -1.
In the case n*m > 0, where n and m are positive integers, we face the following family of functions:
[Plots of L(x) for the three cases m = n, m < n and m > n; the y-axis uses a logarithmic scale, and the minimum is well above zero.]
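The behaviour in the plots can also be read off directly from the piecewise form of L(x); a short sketch of the argument (my own addition, using the hinge loss definition given above):

\[
L(x) = n\max(0,\,1+x) + m\max(0,\,1-x) =
\begin{cases}
m(1-x), & x \le -1,\\
(n+m) + (n-m)\,x, & -1 \le x \le 1,\\
n(1+x), & x \ge 1.
\end{cases}
\]

The left branch is decreasing and the right branch is increasing, so the minimum lies on [-1, 1], where the slope is n - m: for m < n the minimum is at x = -1, for m > n it is at x = 1, and for m = n every point of the segment is optimal (so ties may be sent to either end).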
So I'll define the best constant for hinge loss as -1 when m <= n, and 1 otherwise.
Log loss is simpler. It's also defined for {-1,1} only; vw will throw an exception otherwise. So we'll find the optimal x for {-1,1}.
l(x) = log(1 + exp(-c*x))
L(x) = n*log(1 + exp(-c1*x)) + m*log(1 + exp(-c2*x)), and solving L'(x) = 0 for c1 = -1, c2 = 1 gives x = log(m/n).
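A sketch of that solution, added here for completeness (sigma denotes the logistic function, sigma(z) = 1/(1 + exp(-z))):

\[
L(x) = n\log(1+e^{x}) + m\log(1+e^{-x}),\qquad
L'(x) = n\,\sigma(x) - m\,\sigma(-x) = 0
\;\Longrightarrow\;
\frac{\sigma(x)}{1-\sigma(x)} = \frac{m}{n}
\;\Longrightarrow\;
x = \ln\frac{m}{n}.
\]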
If m or n equals 0, the best constant is -1 or 1, respectively.
And the last one is quantile loss: l(x) = t*(c - x) when x < c, and (1 - t)*(x - c) when x >= c. I couldn't solve it in Maple, so let's check this piecewise function manually. Keep in mind that c1 < c2 (or the problem can always be transformed to such a formulation), and that 0 < t < 1, since t represents the quantile.
Consider the following cases:
n = 0, m > 0: the best constant is obviously c2 itself;
m = 0, n > 0: the best constant is c1.
For n > 0, m > 0 we have to consider 3 cases:
a). x < c1 < c2
L(x) = n*t*(c1 - x) + m*t*(c2 - x)
L(x)= t*(c1*n + c2*m) - x*(n*t + m*t)
As n > 0, m > 0 and t > 0, we should maximize x to minimize L(x). Unfortunately x is bounded above by c1 in this case, so x -> c1 - eps with eps -> 0 is our best constant.
b). c1< c2 < x
L(x) = n*( -(1 - t)(c1 - x) ) + m*( -(1 - t)(c2 - x) )
L(x) = (t - 1)*(c1*n + c2*m) + x*(m + n - t*(m + n))
As t < 1, the coefficient (m + n - t*(m + n)) is always positive, so we should minimize x to minimize L(x). As x is bounded below by c2 here, the best constant is x -> c2 + eps, eps -> 0.
c). c1 < x < c2.
L(x) = n*(1 - t)*(x - c1) + m*t*(c2 - x) = t*(c1*n + c2*m) - c1*n + x*(n - t*(m + n))
We need to minimize x*(n - t*(m + n)). There are 2 cases:
n >= t*(m + n) : the coefficient is non-negative, so x = c1 + eps;
n < t*(m + n) : the coefficient is negative, so x = c2 - eps.
That's all. Let's combine these findings into a more or less compact system:
n = 0: best constant = c2;
m = 0: best constant = c1;
n > 0, m > 0: best constant -> c1 when n >= t*(m + n), and -> c2 when n < t*(m + n).
Or, recalling that 0 < t < 1, we can simplify it to just: best constant = c2 if t*(m + n) > n, else c1 (the n = 0 and m = 0 cases are covered by this rule as well).
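To double-check this rule without Maple, here is a small standalone program (my own scaffolding with made-up counts, not part of the patch) comparing the closed-form rule against a brute-force scan of the summed loss:

// Standalone check: compare the closed-form quantile rule against a brute-force
// scan of the summed pinball loss. Counts and labels below are made up.
#include <cstdio>

// summed loss for n examples at c1 and m examples at c2, prediction x
static float total_loss(float x, float n, float m, float c1, float c2, float t)
{
    float l1 = (x < c1) ? t*(c1 - x) : (1 - t)*(x - c1);
    float l2 = (x < c2) ? t*(c2 - x) : (1 - t)*(x - c2);
    return n*l1 + m*l2;
}

int main()
{
    const float c1 = -1.f, c2 = 1.f, n = 3.f, m = 7.f, t = 0.25f;   // hypothetical
    float closed_form = (t*(n + m) > n) ? c2 : c1;                  // the rule derived above
    float best_x = c1, best_l = total_loss(c1, n, m, c1, c2, t);
    for (float x = c1 - 1.f; x <= c2 + 1.f; x += 0.001f)            // brute-force scan
    {
        float l = total_loss(x, n, m, c1, c2, t);
        if (l < best_l) { best_l = l; best_x = x; }
    }
    std::printf("closed form: %.3f, brute force: %.3f\n", closed_form, best_x);
    return 0;
}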
6. Best constant loss
OK, now we have a function for getting example counts per label and 'best constant' estimation methods for all loss functions currently implemented in VW. I hope they are correct. The only thing left is a method that gives us the loss of our best constant. The current implementation for squared loss is:
best_constant_loss = best_constant*(1.0f - best_constant)*(1.0f - best_constant) + (1.0f - best_constant)*best_constant*best_constant
which could be interpreted as
x*(c2 - x)^2 + (1 - x)*(c1 - x)^2
I would calculate it as (m*(c2-x)^2+n*(c1-x)^2 ) / (n+m).
Let's see if it's the same. We already found that for the label pair {0,1} the best constant of squared loss is x = m/(n+m). And is (1 - x) = n/(n+m)? Seems so:
n = N - m, so n / (n + m) = (N - m) / (n + m) = N / (n + m) - x = 1 - x
So it's just another form of the obvious generic formula:
Best constant loss = (n*Loss(x,c1) + m*Loss(x,c2) ) / (n+m)
And as we are going to use it with loss functions where x may be defined differently than for squared loss, we'll stick with this generic formula. Moreover, we can reuse the current loss function via all->loss->getLoss() for that.
7. Implementation
First of all, we add the function's prototype above main() in main.cc, just after the namespace declaration:
bool get_best_constant(vw& all, float& best_constant, float& best_constant_loss);
In main() we need to remove these lines:
float weighted_labeled_examples = (float)(all->sd->weighted_examples - all->sd->weighted_unlabeled_examples);
float best_constant = (float)((all->sd->weighted_labels) / (weighted_labeled_examples+all->initial_t));
float constant_loss = (best_constant*(1.0f - best_constant)*(1.0f - best_constant)
+ (1.0f - best_constant)*best_constant*best_constant);
and replace these lines:
cerr << endl << "best constant = " << best_constant;
if (all->sd->min_label == 0. && all->sd->max_label == 1. && best_constant < 1. && best_constant > 0.)
cerr << endl << "best constant's loss = " << constant_loss;
with the following:
float best_constant;
float best_constant_loss;
if (get_best_constant(*all, best_constant, best_constant_loss))
{
    cerr << endl << "best constant = " << best_constant;
    cerr << endl << "best constant's loss = " << best_constant_loss;
}
This function shall be added below main():
bool get_best_constant(vw& all, float& best_constant, float& best_constant_loss)
{
    if ((all.loss == NULL) || (all.sd == NULL)) return false;

    float label1 = all.sd->min_label;
    float label2 = all.sd->max_label;
    float label1_cnt;   // n - number of examples with label1
    float label2_cnt;   // m - number of examples with label2

    if (label1 != label2)
    {   // recover the per-label counts from the weighted sums (section 4)
        float weighted_labeled_examples = (float)(all.sd->weighted_examples - all.sd->weighted_unlabeled_examples + all.initial_t);
        label1_cnt = (all.sd->weighted_labels - label2*weighted_labeled_examples)/(label1 - label2);
        label2_cnt = weighted_labeled_examples - label1_cnt;
    } else
        return false;

    if ((label1_cnt + label2_cnt) <= 0.) return false;

    // re-parse the command line to find out which loss function is in use
    po::parsed_options pos = po::command_line_parser(all.args).
        style(po::command_line_style::default_style ^ po::command_line_style::allow_guessing).
        options(all.opts).allow_unregistered().run();
    po::variables_map vm = po::variables_map();
    po::store(pos, vm);
    po::notify(vm);

    string funcName;
    if (vm.count("loss_function"))
        funcName = vm["loss_function"].as<string>();
    else
        funcName = "squaredloss";
if(funcName.compare("squared") == 0 || funcName.compare("Huber") == 0 || funcName.compare("classic") == 0)
{
best_constant = (label1*label1_cnt + label2*label2_cnt) / (label1_cnt + label2_cnt);
} else if(funcName.compare("hinge") == 0) {
best_constant = label2_cnt <= label1_cnt ? -1.: 1.;
} else if(funcName.compare("logistic") == 0) {
label1 = -1.; //override {-50, 50} to get proper loss
label2 = 1.;
if (label1_cnt <= 0) best_constant = 1.;
else
if (label2_cnt <= 0) best_constant = -1.;
else
best_constant = log(label2_cnt/label1_cnt);
} else if(funcName.compare("quantile") == 0 || funcName.compare("pinball") == 0 || funcName.compare("absolute") == 0) {
float tau = 0.5;
if(vm.count("quantile_tau"))
tau = vm["quantile_tau"].as<float>();
float q = tau*(label1_cnt + label2_cnt);
if (q < label2_cnt) best_constant = label2;
else best_constant = label1;
} else
return false;
    // generic best constant loss: (n*Loss(x,c1) + m*Loss(x,c2)) / (n+m)
    best_constant_loss = ( all.loss->getLoss(all.sd, best_constant, label1) * label1_cnt +
                           all.loss->getLoss(all.sd, best_constant, label2) * label2_cnt )
                         / (label1_cnt + label2_cnt);
    return true;
}
Initially I wanted to create 3 functions: one for getting n and m, one for getting the best constant, and one for getting its loss, but too many parameters would have to be passed between them, including a log loss flag (since in that case min_label == -50 and getLoss() throws an exception). So all functions were merged into one.