Saturday, May 16, 2015

Math behind eval_count_of_generated_ft() in VW

In this post I would like to explain the math behind the eval_count_of_generated_ft() function, proposed as part of the new interactions generation mechanism for the VW 7.10.1 master branch.



First of all, let's recall how new features are generated:

Imagine an interaction of namespaces 'abbc' with 10 features in each. In this case feature generation is done with the following loops:


for x1 = 1:10 in a
   for x2 = 1:(10-1) in b
      for x3 = (x2+1):10 in b
         for x4 = 1:10 in c
            new_feature = x1*x2*x3*x4

This is pseudocode; in the real code an FNV hash is used instead of the * operation.

As you can see, the loops over b are modified to make sure only simple combinations of features from b are used. E.g. if you have generated the feature b1b2 you shouldn't generate b2b1, as it makes little sense. You also shouldn't generate b1b1 or any other bxbx.

Now let's make one enhancement. Features in VW may have weights, and as Martin Popel pointed out, the interaction b1b1 does make sense if b1's weight is != 1.0. It may help to model non-linear behavior. So our enhanced loops will look like:



for x1 = 1:10 in a
   for x2 = 1:(10-1) in b
      for x3 = (x2 + ((weight(x2) != 1.0) ? 0 : 1)):10 in b
         for x4 = 1:10 in c
            new_feature = x1*x2*x3*x4


This is the idea behind the feature interactions generator currently proposed for VW. It generates far fewer new features for cases like -q aa or --cubic bbb (because most features generated by the old algorithm in such cases differ only in the order of the interacting features).
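To get a feel for the size difference, here is a small counting sketch (plain Python over index tuples; the real generator of course works on hashed features, not indexes):

```python
from itertools import combinations, product

n = 10  # features in namespace b

# old algorithm for --cubic bbb: every ordered triple of features
old_count = len(list(product(range(n), repeat=3)))

# new algorithm: simple combinations only (order and repeats removed)
new_count = len(list(combinations(range(n), 3)))

print(old_count, new_count)  # 1000 vs 120
```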


The problem.


There are 2 values that should be updated after new feature generation: num_features and total_sum_feat_sq. Both are fields of the example class.
The first is the total number of features in the example. The second is the sum of squares of its feature weights. These values are used later in the GD algorithm and must be calculated as fast as possible.


With the old feature interactions algorithm (which generates permutations of features instead of simple combinations) it was very easy to adjust these values.
It was just:

num_features +=  size(a)*size(b)*size(b)*size(c)
total_sum_feat_sq +=
sum_feat_sq(a)*sum_feat_sq(b)*sum_feat_sq(b)*sum_feat_sq(c)

where sum_feat_sq is the sum of squared weights of all features in a namespace, precalculated at the example parsing stage.

I'm not going to prove that these calculations were right - it's easy to do.
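For the skeptical reader, here is a quick brute-force check of both formulas on toy data (a sketch; the namespace contents are made up):

```python
from itertools import product

# hypothetical feature weights for namespaces a, b, c
a = [0.5, 2.0]
b = [1.0, 3.0, 0.25]
c = [4.0]

def sum_feat_sq(ns):
    return sum(w * w for w in ns)

# brute force over the old permutation-style generation for 'abbc'
weights = [x1 * x2 * x3 * x4 for x1, x2, x3, x4 in product(a, b, b, c)]

assert len(weights) == len(a) * len(b) * len(b) * len(c)
assert abs(sum(w * w for w in weights)
           - sum_feat_sq(a) * sum_feat_sq(b) * sum_feat_sq(b) * sum_feat_sq(c)) < 1e-9
print("both old-algo formulas hold")
```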


Now we need to work out new formulas for num_features and
total_sum_feat_sq for the currently proposed feature interactions generator.
There are 2 ways of doing that.

The first is to go through our loops and just increment the values inside them. It's easy to implement and you can be sure these values reflect the true feature generator behaviour. But it may be slower than the analytic approach used in the old algorithm.

The second is to work out analytic formulas that are faster to compute.

The first approach was added to interactions.cc and commented out, as I found the formulas below after writing it. But in case I made a mistake it can easily be uncommented and used instead. So it serves as a backup plan. It can also be used to verify the formulas - their results have to be the same.

As for the formulas - they are not obvious, and thus I decided to explain them in this post.



Let's split our task into 4 subtasks:


1. Find a new formula for num_features without generation of feature interactions with themselves.

2. Extend the new formula for num_features to the case when features with weight != 1.0 are allowed to interact with themselves.

3. Find a new formula for total_sum_feat_sq without generation of feature interactions with themselves.

4. Extend the new formula for total_sum_feat_sq to the case when features with weight != 1.0 are allowed to interact with themselves.



I. Let's simplify our case.


I assert that the formulas for the interaction 'abbc' will have the form:

num_features +=  size(a)*size(b^b)*size(c)

total_sum_feat_sq +=
sum_feat_sq(a)*sum_feat_sq(b^b)*sum_feat_sq(c)
 where * is ordinary multiplication and ^ is a namespace interaction.

That means our new algorithm behaves exactly like the old one when the namespaces are different. All we need is to learn how to calculate the number of features and their sum_feat_sq in the parts where namespaces interact with themselves. Only in these parts does the algorithm's logic differ.

For example, if we consider the interaction 'abbbcddefff' (namespaces are always sorted and thus grouped) we can think of it as 'aBcDeF', where B, D and F are new compound namespaces, and the old formulas become applicable. We only need to learn how to calculate num_features and
total_sum_feat_sq for B, D and F, which are 'bbb', 'dd', 'fff'.

From this point on we will talk only about calculating the new values for interactions of any length between features of a single namespace. Let it be 'AA...A'.



II. num_features for interaction in case of simple combinations


Consider namespace A with n features {a1,a2,...,an} and its interaction of length k: 'AA...A' k-times.

Of course, k >= 2 and n >= k. If n < k you won't be able to make even one combination without repetition of elements.

In this case the number of generated simple combinations is simply C(n,k) = n!/(n-k)!/k!, which is easy to calculate.
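A one-line sanity check of this count (Python's math.comb computes the binomial coefficient C(n,k)):

```python
from itertools import combinations
from math import comb

n, k = 10, 3
# enumerate all simple combinations and compare with C(n,k)
assert len(list(combinations(range(n), k))) == comb(n, k) == 120
```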


III. num_features for interaction in case of simple combinations with allowed self interactions.


Let's consider a more difficult case. Let's allow interactions of a feature with itself when its weight is != 1.0. These won't be simple combinations anymore, and they still won't be permutations. The number of new features will be somewhere in between.

Also, our requirement that n >= k is no longer needed if we have at least one feature with weight != 1.0, because that feature can be repeated, and you can make an interaction of any length with it.

It's easy to find the number of features for k = 2. It's just C(n,k) + the number of features with weight != 1.0, because we will just get additional <ai,ai> pairs - one per feature with weight != 1.0.

But for k > 2 it won't be so easy. Imagine k = 3, n = 5. We'll get the following interactions:

123
124 134        234
125 135 145 235 245 345

Here 123 stands for a1*a2*a3.


And if weight(a1) != 1.0 and weight(a2) != 1.0 this turns into:

111
112 122 222
113 123 223
114 124 134 224 234
115 125 135 145 225 235 245 345

Runs of the same repeated index of different lengths will play different roles below (in the original post they were marked with different colors).

So we get the following new interactions:

111
112 122 222
113        223
114        224
115        225
This can also be written a bit shorter: besides 111 and 222, each new feature is 11x or 22x for some other index x:

111 222
112 122
113 223
114 224
115 225
Will it look the same if we choose a4 and a5 instead? Actually, yes:


144 155
244 255
344 355
444 455
445 555

I claim that the particular indexes of the features with weights != 1.0 don't matter. Only their count matters.

Let's take a look at k = 4.

Regular simple combinations:

1234
1235 1245 1345 2345

In case weight(a1) != 1.0 and weight(a2) != 1.0 we'll get:

1111
1112 1122 1222 2222
1113 1123 1223 2223
1114 1124 1134 1224 1234 2224 2234
1115 1125 1135 1145 1225 1235 1245 1345 2225 2235 2245 2345

New features appeared:

1111
1112 1122 1222 2222
1113 1123 1223 2223
1114 1124 1134 1224 2224 2234
1115 1125 1135 1145 1225 2225 2235 2245

Let's rewrite this using the following redefinitions.
Let 11 = a, 111 = b, 1111 = c, 22 = d, 222 = e, 2222 = f,
and reorder it in the following way:

Group 1: b2 b3 b4 b5, e1 e3 e4 e5,
         a23 a24 a25 a34 a35 a45,
         d13 d14 d15 d34 d35 d45
Group 2: ad
Group 3: c f

What will happen if we increase m to 3?
We'll additionally get:
 
1133, 1233, 1333, 1334, 1335, 2233, 2333, 2334, 2335, 3333,3334, 3335, 3345

which is

Group 1: 1333 2333 3334 3335,
         1233 1334 1335 2334 2335 3345
Group 2: 1133 2233
Group 3: 3333
Note that we consider the interaction a1*a2*a3*a4 equal to a2*a4*a1*a3 etc. Order doesn't matter. Even though their hashes would differ, we agreed to treat them as equal in the new feature generation mechanism, since distinguishing them makes no sense from the model training point of view. Thus 1334 = 1433 = 3314.


Let's use m to indicate how many features with weight != 1.0 we have.

As for Group 3: it's easy to calculate the number of features in it. It's equal to m, where 0 <= m <= n.

Consider Group 1:
It can be seen that the numbers interacting with b and e form simple combinations of the sets {2,3,4,5} and {1,3,4,5} with k = 1, and the numbers interacting with a and d form simple combinations of the same sets with k = 2.

I dare say that the number of features in Group 1 can be calculated in the following way:
* calculate m
* the number of features in Group 1 is
  m * sum{j=1..k-2} ( C(n-1,j) )

For our example (n = 5, k = 4, m = 2) this gives 2*(C(4,1)+C(4,2)) = 20.

Consider Group 2:

This group consists of interactions that contain only features with weight != 1.0,
like 1122, 1133, 2233. Looks like C(3,2).

But consider Group 2 for n = 5, k = 5, m = 3. It'll be:

11111 22222 33333
11122 11133 11222 11233 11223 11333 12233 22233 22333
11224 11334 22334
11225 11335 22335

Looks familiar, and looks like C(m,2)*C(n,1). But it's not. Things become even worse with cases like 111112224556789.

I can generalize all these cases and propose a formula. My idea is the following:
the single-occurrence indexes (of weight-1.0 features) and the repeatable indexes (of weight != 1.0 features) are generated by two different combinatoric processes, so to speak.

  • Indexes of weight-1.0 features always form a simple combination and give us C(n-m, k-l), where l is the number of slots occupied by indexes of weight != 1.0 features.
  • Indexes of weight != 1.0 features form a combination with repetition. It can be treated as "filling l slots with elements from m groups", and the number of ways to do this is C(l+m-1, l) = C(l+m-1, m-1). I refer to this material for a detailed explanation.

For example, in 111112224556789:
11111222455 - these l = 11 slots are filled from the m categories (here the index 4 occurs only once, but it is assumed to be one of the m weight != 1.0 features);
6789 - a simple combination of length k - l = 15 - 11 = 4 from the (n-m) remaining elements.

In other words - let's find all elements with weight != 1.0 and split the set of elements of a given namespace into two sets: a set of m elements with this weight and a set of n-m elements with weight == 1.0. The two sets behave differently because the feature generator's logic uses the weight.
Then let's count all sequences that can be generated when different numbers of slots are given to these sets.
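The "filling l slots with elements from m groups" count is easy to check, since itertools.combinations_with_replacement enumerates exactly these multisets (a sketch with made-up m and l):

```python
from itertools import combinations_with_replacement
from math import comb

m, l = 3, 4  # m weight != 1.0 features, l slots given to them
multisets = list(combinations_with_replacement(range(m), l))
assert len(multisets) == comb(l + m - 1, l) == 15
```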


Example:

n = 5, k = 4, m = 2

let l be the number of slots allocated to combinations with repetition from the m features.

l = 0, Num features: C(1,0)*C(3,4) = 1*0 = 0 [C(l+m-1,l)*C(n-m, k-l)]


l = 1, Num features: C(2,1)*C(3,3) = 2*1
         
1345  2345        


l = 2, Num features: C(3,2)*C(3,2) = 3*3



1134 1234 2234
1135 1145 1235 1245 2235 2245

l = 3, Num features: C(4,3)*C(3,1) = 4*3

1113 1123 1223 2223
1114 1124 1224 2224
1115 1125 1225 2225

l = 4, Num features: C(5,4)*C(3,0) = 5*1

1111
1112 1122 1222 2222
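Putting the pieces together, the full count is the sum over l of C(l+m-1,l)*C(n-m,k-l). Here is a sketch of that formula (the function names are mine, not VW's), checked against a brute-force enumeration that mimics the generator's loop rule:

```python
from itertools import combinations_with_replacement
from math import comb

def num_features_formula(n, m, k):
    # l = number of slots given to the m features with weight != 1.0
    return sum((comb(l + m - 1, l) if l else 1) * comb(n - m, k - l)
               for l in range(k + 1))

def num_features_bruteforce(n, m, k):
    # features 0..m-1 have weight != 1.0; only they may repeat
    count = 0
    for seq in combinations_with_replacement(range(n), k):
        if all(seq[i] != seq[i + 1] or seq[i] < m for i in range(k - 1)):
            count += 1
    return count

for n, m, k in [(5, 2, 3), (5, 2, 4), (5, 3, 5), (6, 0, 3), (7, 4, 2)]:
    assert num_features_formula(n, m, k) == num_features_bruteforce(n, m, k)
print("formula matches brute force")
```

For the example above (n = 5, k = 4, m = 2) both give 0 + 2 + 9 + 12 + 5 = 28, i.e. the 5 original simple combinations plus the 23 new features.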



 IV. Find a new formula for total_sum_feat_sq without generation of feature interactions with itself.

Ok, let's try to find a fast way to compute total_sum_feat_sq.
How was it done in the old code?

The weight of a newly generated feature is equal to the product of the weights of the features used to generate it. In the case of the old permutations algorithm the sum of squares of all new weights was calculated just by multiplying the precomputed sum_feat_sq of each namespace. Let's see how it works:

Consider the interaction 'abb' with 2 features in each namespace. The old algorithm will generate the following new feature weights:

(a1b1b1)^2, (a1b1b2)^2, (a1b2b1)^2, (a1b2b2)^2, (a2b1b1)^2, (a2b1b2)^2, (a2b2b1)^2, (a2b2b2)^2

and we need their sum. 

We have sum_feat_sq for a, which is (a1^2+a2^2), and for b, which is (b1^2+b2^2). Let's multiply them:

(a1^2+a2^2)*(b1^2+b2^2)*(b1^2+b2^2) = [(a1b1)^2 + (a1b2)^2 + (a2b1)^2 + (a2b2)^2]*(b1^2+b2^2) = (a1b1b1)^2 + (a1b2b1)^2 + (a2b1b1)^2 + (a2b2b1)^2 + (a1b1b2)^2 + (a1b2b2)^2 + (a2b1b2)^2 + (a2b2b2)^2

This is exactly what we're looking for.

Ok, how do we compute the same thing for our new algorithm?

I'm going to make the same assertion I did for the number of features.

a. sum_ft_sq('abbc') =  sum_ft_sq('a')*sum_ft_sq('bb')*sum_ft_sq('c')
And we only need to learn how to find out sum_ft_sq() of block of namespace interacting with itself.

Other assumptions I want to make are the following:

b. I won't write ()^2 for the features. Due to the properties of ^2 I can always redefine a^2 = a' and then redefine it back. So I won't write it.
c. As we're working with one namespace, I'll write only indexes, e.g. a1a2a3 = 123. There is a real multiplication operation between the indexed values, so 123 = 321 and 123+124 = 12(3+4). But of course you can't evaluate it, because these are just indexes of unknown variables. So 123+124 != 247.


Now let's think about what we would mean by a good formula for our case. The idea is to calculate the sum of the generated weights (of their squares, in practice). We don't want to iterate over indexes backward, and we don't want to keep a history of the feature weights we have already seen. We would like to build this sum seeing one new feature weight at a time and not storing it anywhere.
Why? Because there may be thousands of features, and because such a method could be implemented even in parser.cc, where each feature of an example is read and forgotten in one step.

Consider the interaction 'aa' with
n = 4, k = 2.

It will generate following weights:
12
13 23
14 24 34

That's a very useful form of representing the generated features, as on every new line you see exactly one new feature. I call it the normal form. And we can group the indexes:

1*2
(1+2)*3
(1+2+3)*4

Wow. Looks like we can build our algorithm easily. Let's call it R(2), where 2 is k.
R2 will be:

 sum = 0; result = 0;
 for each next feature f
      {
        result += sum*f;
        sum += f;
       }


At every step it will have the following result and sum:
0: result = 0, sum = 0
1: result = 0*1=0, sum = 0+1 = 1
2: result = 0+sum*2 = 0+1*2, sum = 1+2
3: result = 0 + 1*2 + (1+2)*3, sum = 1+2+3
4: result = 0+1*2+(1+2)*3+(1+2+3)*4, sum = 1+2+3+4

etc.
It's obvious that result will store what we need.
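Here is the same R2 idea as a small Python sketch (the names are mine), checked against a direct sum over all pairs:

```python
from itertools import combinations

def r2(features):
    """Streaming sum of products over all unordered pairs (k = 2)."""
    result = 0.0
    running_sum = 0.0
    for f in features:
        result += running_sum * f  # pair f with every feature seen before it
        running_sum += f
    return result

feats = [0.5, 2.0, 3.0, 1.5]
expected = sum(x * y for x, y in combinations(feats, 2))
assert abs(r2(feats) - expected) < 1e-9  # both are 16.75
```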

Let's name the value generated on the i-th iteration and added to result R(2,i). Then the normal form can be represented like this:

R2
R(2,0) 0
R(2,1) 0*1
R(2,2) (0+1)*2
R(2,3) (0+1+2)*3
R(2,4) (0+1+2+3)*4



Ok, but is everything that simple?
Let's consider 'aaa' with
n = 5, k = 3.
Its features will be:

123
124 134        234
125 135 145 235 245 345

In normal form:

12*3
(12+13+23)*4
(12+13+14+23+24+34)*5

which is

R3
R(3,3) 12*3
R(3,4) (12+(1+2)*3)*4
R(3,5) (12+(1+2)*3+(1+2+3)*4)*5

Ok, let's look at
n = 5, k = 4.

It generates

1234         
1235 1245 1345 2345

Which is

R(4,4) 123*4
R(4,5) (12*3+(12+(1+2)3)4)*5

And the last example:
n=5, k = 5
generates:
12345

R5
R(5,5) 1234*5


Wanna see R(5,6)? Then we need a 6th feature, i.e. n = 6.
The generated feature weights will be:

12345
12346 12356 12456 13456 23456

R(5,6) (123*4+(12*3+(12+(1+2)3)4)5)*6


I spent a lot of time trying to find a rule that allows calculating R(k,i) from some function of R(k,1:i-1). There is none.
Luckily, I once wrote the different R(k) normal forms close enough to each other to notice that R(k,i) depends on R(k-1,1:i-1).

In fact, the recurrent formula is:

R(k,i>=k) = sum(R(k-1,1:i-1))*i,
R(k,i<k) = 0

Let me remind you of the R2 results:

R(2,2) (0+1)*2
R(2,3) (0+1+2)*3
R(2,4) (0+1+2+3)*4

Let's take a look at R3 again:

R3
R(3,3) 12*3
R(3,4) (12+(1+2)*3)*4
R(3,5) (12+(1+2)*3+(1+2+3)*4)*5

It could be written as:

R3
R(3,3) R(2,2)*3
R(3,4) (R(2,2)+R(2,3))*4
R(3,5) (R(2,2)+R(2,3)+R(2,4))*5

R4 will be

R(4,4) 123*4
R(4,5) (12*3+(12+(1+2)3)4)*5

R(4,4) R(3,3)*4
R(4,5) (R(3,3)+R(3,4))*5

And the monstrous R5 is:

R(5,5) 1234*5
R(5,6) (123*4+(12*3+(12+(1+2)3)4)5)*6

R(5,5) R(4,4)*5
R(5,6) (R(4,4) + R(4,5))*6

This formula can be written as the following algorithm:

for each new feature i do:
{
  for (j = k; j >= 2; j--)
    sR[j] += sR[j-1]*i;

  sR[1] += i;
}
Result = sR[k]

where sR[2:k] is an array of sums of R(j) results initialized with zeros, and sR[1] is just the sum of feature weights.
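In Python the algorithm might look like this (a sketch with my own names; this is the classic streaming computation of elementary symmetric polynomials). Feeding it squared weights yields the total_sum_feat_sq contribution:

```python
from itertools import combinations
from math import prod

def sum_of_combination_products(features, k):
    """Sum over all k-element simple combinations of the product of values,
    seeing one feature at a time and using only O(k) memory."""
    sR = [0.0] * (k + 1)  # sR[j] = current sum of all j-element products
    for f in features:
        # update longer products first so f is used at most once per combination
        for j in range(k, 1, -1):
            sR[j] += sR[j - 1] * f
        sR[1] += f
    return sR[k]

feats = [0.5, 2.0, 3.0, 1.5, 4.0]
for k in (2, 3, 4, 5):
    expected = sum(prod(c) for c in combinations(feats, k))
    assert abs(sum_of_combination_products(feats, k) - expected) < 1e-9
print("streaming result matches brute force for k = 2..5")
```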

For example, for k = 5, n = 5 the state after each feature i will be:

i=1: sR1 = 1
i=2: sR2 = 12; sR1 = 1+2
i=3: sR3 = 123; sR2 = 12+(1+2)3; sR1 = 1+2+3
i=4: sR4 = 1234; sR3 = 123+(12+(1+2)3)4; sR2 = 12+(1+2)3+(1+2+3)4; sR1 = 1+2+3+4
i=5: sR5 = 12345; sR4 = 1234+(123+(12+(1+2)3)4)5; sR3 = 123+(12+(1+2)3)4+(12+(1+2)3+(1+2+3)4)5; sR2 = 12+(1+2)3+(1+2+3)4+(1+2+3+4)5; sR1 = 1+2+3+4+5

This recurrent formula isn't the best I could wish for. If you have an interaction of length 6 you have to allocate an array of length 6 and store all the sums of R(1..6). But it's a feasible amount of memory.

V. Extend the new formula for total_sum_feat_sq to the case when features with weight != 1.0 are allowed to interact with themselves.


Ok, but by default we allow feature self-interactions if weight != 1.0. How will that affect our recurrent formula?
Let's see.
I will mark R() terms in which features with weight != 1.0 were encountered as R'().

n = 4, k = 2, m = 2

R2

11
12 22
13 23
14 24 34

R'(2,1) (0+1)*1
R'(2,2) (0+1+2)*2
R(2,3) (0+1+2)*3
R(2,4) (0+1+2+3)*4

What if the weights != 1.0 are not at indexes 1 and 2, but at 3 and 4?

R(2,1) (0)*1
R(2,2) (0+1)*2
R'(2,3) (0+1+2+3)*3
R'(2,4) (0+1+2+3 +4)*4

What if m = 1, and the index is 3?

R(2,1) (0)*1
R(2,2) (0+1)*2
R'(2,3) (0+1+2+3)*3
R(2,4) (0+1+2+3)*4

As you can see, if an index's weight is != 1.0, it is added to the running sum before being multiplied by itself, which produces the self-interaction. Afterward the running sum is the same as in the regular case, so the weight affects only the R(2,i) in which this index appears. That's very convenient - you can forget about the feature as usual after using it.
But what about R3, R4 etc.?

Let's build R3 for
n = 5, k = 3, m = 2 with the same indexes m = {1,2}:

111
112 122 222
113 123 223
114 124 134 224 234
115 125 135 145 225 235 245 345

The newly appeared features are those containing a repeated index (they were highlighted in the original post).

Let's build a new normal form:

R'(3,1) (1*1)1
R'(3,2) (11+(1+2)2)2
R(3,3) (11+(1+2)2)3
R(3,4) (11+(1+2)2+(1+2)3)4
R(3,5) (11+(1+2)2+(1+2)3 + (1+2+3)4)5

It's

R'(3,1) (R'(2,1))*1
R'(3,2) (R'(2,1)+R'(2,2))*2
R(3,3) (R'(2,1)+R'(2,2))*3
R(3,4) (R'(2,1)+R'(2,2)+R(2,3))*4
R(3,5) (R'(2,1)+R'(2,2)+R(2,3)+R(2,4))*5


For m = {3,4}

123 133        233        333
124 134 144 234 244 334 344 444    
125 135 145 235 245 335 345 445

R'(3,3) (12+(1+2+3)3)3
R'(3,4) (12+(1+2+3)3+(1+2+3+4)4)4
R(3,5) (12+(1+2+3)3+(1+2+3+4)4)5

R'(3,3) (R(2,2)+R'(2,3))3
R'(3,4) (R(2,2)+R'(2,3)+R'(2,4))4
R(3,5) (R(2,2)+R'(2,3)+R'(2,4))5

Let's increase n to 6:

126 136 146 156 236 246 256 336 346 356 446 456

R(3,6) (12+(1+2+3)3+(1+2+3+4)4+(1+2+3+4)5)6

R(3,6) (R(2,2)+R'(2,3)+R'(2,4)+R(2,5))*6


Ok, now let's take a look at R4 with m = {1,2}:

1111
1112 1122 1222 2222
1113 1123 1223 2223
1114 1124 1134 1224 1234 2224 2234
1115 1125 1135 1145 1225 1235 1245 1345 2225 2235 2245 2345

R'(4,1) 111*1
R'(4,2) (111+(11+(1+2)2)2)2
R(4,3) (111+(11+(1+2)2)2)3
R(4,4) (111+(11+(1+2)2)2+(11+(1+2)2)3)4
R(4,5) (111+(11+(1+2)2)2+(11+(1+2)2)3+(11+(1+2)2+(1+2)3)4)5

R'(4,1) R'(3,1)*1
R'(4,2) (R'(3,1)+R'(3,2))2
R(4,3)  (R'(3,1)+R'(3,2))3
R(4,4)  (R'(3,1)+R'(3,2)+R(3,3))4
R(4,5)  (R'(3,1)+R'(3,2)+R(3,3)+R(3,4))5

Well, now it's clearer.

Our enhanced algorithm can be written as:

for each new feature i do:
if (weight(i) == 1.0)
{
  for (j = k; j >= 2; j--)
    sR[j] += sR[j-1]*i;

  sR[1] += i;
}
else
{
  sR[1] += i;

  for (j = 2; j <= k; j++)
    sR[j] += sR[j-1]*i;
}
Result = sR[k]

It looks like the weight(i) != 1.0 case is an inversion of the regular case.

In practice it was found that the recurrent calculation may differ by about 1e-5 from directly summing the squared weights, due to floating point error.
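A Python sketch of the enhanced two-branch algorithm (names are mine), checked against a brute-force generator that follows the same loop rule — an index may repeat only when its weight != 1.0:

```python
from math import prod

def sum_with_self_interactions(features, repeatable, k):
    """Streaming sum of products over all length-k interactions, where
    feature i may interact with itself iff repeatable[i] (weight != 1.0)."""
    sR = [0.0] * (k + 1)
    for f, rep in zip(features, repeatable):
        if not rep:
            for j in range(k, 1, -1):  # regular case: descending update
                sR[j] += sR[j - 1] * f
            sR[1] += f
        else:
            sR[1] += f                 # inverted case: f can pair with itself
            for j in range(2, k + 1):  # ascending update
                sR[j] += sR[j - 1] * f
    return sR[k]

def brute_force(features, repeatable, k):
    """Enumerate the generator's nondecreasing index sequences directly."""
    total = 0.0
    def rec(start, picked):
        nonlocal total
        if len(picked) == k:
            total += prod(features[i] for i in picked)
            return
        for i in range(start, len(features)):
            # the next slot may reuse index i only if feature i is repeatable
            rec(i if repeatable[i] else i + 1, picked + [i])
    rec(0, [])
    return total

feats = [0.5, 2.0, 3.0, 1.5, 4.0]
rep = [True, True, False, False, False]  # first two features: weight != 1.0
for k in (2, 3, 4):
    assert abs(sum_with_self_interactions(feats, rep, k)
               - brute_force(feats, rep, k)) < 1e-9
print("enhanced streaming algorithm matches brute force")
```

Passing squared feature weights as `features` gives the total_sum_feat_sq contribution of the self-interacting block.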





