If you remember my last post about a self-driving car, I did a basic multiply-and-sum operation on two arrays, the weights (W) and an example (X):
W: 1 | -3 | 1 | 4 | -2 | 6 | -3 | 5 | 0 | -1 | 0 | -1 | 2 | 0 | 1 | -7 |
X: 00 | 00 | 12 | 11 | 00 | 15 | 13 | 05 | 07 | 00 | 00 | 00 | 12 | 00 | 10 | 14 |
[ w1*x1 + w2*x2 + ... + w15*x15 + w16*x16 ]
[ 1*00 + (-3)*00 + ... + 1*10 + (-7)*14 ] = 68
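To make the multiply-and-sum concrete, here is a minimal sketch of it as an explicit loop in plain Python. It also tallies how many multiplications and additions the loop performs. The function name `multiply_and_sum` is just a hypothetical helper for this illustration, not something from the earlier post.

```python
# Weights (W) and one example (X), copied from the arrays above.
W = [1, -3, 1, 4, -2, 6, -3, 5, 0, -1, 0, -1, 2, 0, 1, -7]
X = [0, 0, 12, 11, 0, 15, 13, 5, 7, 0, 0, 0, 12, 0, 10, 14]


def multiply_and_sum(weights, example):
    """Return (total, multiplications, additions) for an explicit loop."""
    total = weights[0] * example[0]  # first product, nothing to add it to yet
    multiplications, additions = 1, 0
    for w, x in zip(weights[1:], example[1:]):
        total += w * x  # one multiplication and one addition per step
        multiplications += 1
        additions += 1
    return total, multiplications, additions


total, n_mul, n_add = multiply_and_sum(W, X)
print(total, n_mul, n_add)  # 68 16 15
```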
If we break down this equation, we have to do 16 multiplications and 15 additions, 31 CPU operations in total, whereas on a GPU it takes only 3 steps. Let's see:
- First, think of the two arrays above as matrices of shape (1, 16).
- Second, transpose the example matrix, which then has shape (16, 1).
- Third, multiply the weight matrix by the transposed example matrix; the result is the sum.
W: 1 | -3 | 1 | 4 | -2 | 6 | -3 | 5 | 0 | -1 | 0 | -1 | 2 | 0 | 1 | -7 |
X: 00 | 00 | 12 | 11 | 00 | 15 | 13 | 05 | 07 | 00 | 00 | 00 | 12 | 00 | 10 | 14 |
W × Xᵀ = 68
Shapes: [ 1 , 16 ] × [ 16 , 1 ] = [ 1 , 1 ]
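The three steps can be sketched with NumPy, which I am assuming here purely for illustration. NumPy itself runs on the CPU, so this only shows the shapes and the result; running the same product on a GPU would need a GPU array library, which this sketch does not cover.

```python
import numpy as np

# Step 1: treat both arrays as matrices of shape (1, 16).
W = np.array([[1, -3, 1, 4, -2, 6, -3, 5, 0, -1, 0, -1, 2, 0, 1, -7]])
X = np.array([[0, 0, 12, 11, 0, 15, 13, 5, 7, 0, 0, 0, 12, 0, 10, 14]])

# Step 2: transpose the example matrix, (1, 16) -> (16, 1).
X_T = X.T

# Step 3: matrix multiplication, (1, 16) x (16, 1) -> (1, 1).
result = W @ X_T

print(result.shape, result[0, 0])  # (1, 1) 68
```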
Nifty, right? Just 3 steps. Let's do this for 6 examples, stacked into rows of a single table:
00 00 05 04 00 10 14 03 02 12 11 02 04 04 01 00
00 01 04 04 02 11 12 02 03 14 10 00 04 05 00 00
00 08 08 00 00 12 12 00 00 10 10 00 00 08 08 00
00 00 06 14 00 06 12 10 00 15 20 11 06 11 05 00
00 10 10 00 00 11 11 00 00 12 12 00 00 11 11 00
10 08 10 00 12 20 12 00 00 15 10 06 00 08 06 00
⇓
W:
1 | -3 | 1 | 4 | -2 | 6 | -3 | 5 | 0 | -1 | 0 | -1 | 2 | 0 | 1 | -7 |
X (one example per row):
00 | 00 | 05 | 04 | 00 | 10 | 14 | 03 | 02 | 12 | 11 | 02 | 04 | 04 | 01 | 00 |
00 | 01 | 04 | 04 | 02 | 11 | 12 | 02 | 03 | 14 | 10 | 00 | 04 | 05 | 00 | 00 |
00 | 08 | 08 | 00 | 00 | 12 | 12 | 00 | 00 | 10 | 10 | 00 | 00 | 08 | 08 | 00 |
00 | 00 | 06 | 14 | 00 | 06 | 12 | 10 | 00 | 15 | 20 | 11 | 06 | 11 | 05 | 00 |
00 | 10 | 10 | 00 | 00 | 11 | 11 | 00 | 00 | 12 | 12 | 00 | 00 | 11 | 11 | 00 |
10 | 08 | 10 | 00 | 12 | 20 | 12 | 00 | 00 | 15 | 10 | 06 | 00 | 08 | 06 | 00 |
W × Xᵀ =
49 | 47 | 18 | 103 | 12 | 41 |
Shapes: [ 1 , 16 ] × [ 16 , 6 ] = [ 1 , 6 ]
Still 3 steps, whatever the number of examples. If you do the same thing on a CPU, it takes 186 operations, since one example takes 31 operations (16 multiplications and 15 additions) and there are 6 of them.
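As a rough sketch of the batched version, again assuming NumPy only to illustrate the shapes (it runs on the CPU), the six examples can be stacked into one matrix and scored with a single matrix product. The names `scores` and `per_example` are my own, picked for this illustration.

```python
import numpy as np

W = np.array([[1, -3, 1, 4, -2, 6, -3, 5, 0, -1, 0, -1, 2, 0, 1, -7]])  # (1, 16)

# Six examples stacked as rows of one matrix, shape (6, 16).
X = np.array([
    [0, 0, 5, 4, 0, 10, 14, 3, 2, 12, 11, 2, 4, 4, 1, 0],
    [0, 1, 4, 4, 2, 11, 12, 2, 3, 14, 10, 0, 4, 5, 0, 0],
    [0, 8, 8, 0, 0, 12, 12, 0, 0, 10, 10, 0, 0, 8, 8, 0],
    [0, 0, 6, 14, 0, 6, 12, 10, 0, 15, 20, 11, 6, 11, 5, 0],
    [0, 10, 10, 0, 0, 11, 11, 0, 0, 12, 12, 0, 0, 11, 11, 0],
    [10, 8, 10, 0, 12, 20, 12, 0, 0, 15, 10, 6, 0, 8, 6, 0],
])

# One matrix product: (1, 16) x (16, 6) -> (1, 6), one score per example.
scores = W @ X.T
print(scores.shape)  # (1, 6)
print(scores)

# The same scores computed one example at a time, for comparison.
per_example = [int(W[0] @ row) for row in X]
print(per_example == scores[0].tolist())  # True
```

The matrix call stays the same however many rows X has; only the second dimension of the result grows.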