One morning I see this tweet in my timeline. It exclaims about the abilities of transformer models:
NN learns how to learn linear regression, decision trees, 2-layer ReLU nets 😲 furthermore: outperforms XGBoost, does Lasso in one-pass, seems not to rely on nearest-neighbor.
It refers to this work. I look carefully through the article. The example looks simple, and I want to play with linear approximation and find its limitations. Good thing they published the model and the training scripts.
At work we recently deployed POS (point of sale) software written in Python. Web server, DB connector, abstract classes, function decorators. It is great. Python is great. But when I read the implementation of an ML algorithm from this paper, I start to hate Python.
Prerequisites
To go through the evaluation process you need to run a Jupyter notebook.
It will be handy if CUDA 11.3 is already installed; I wrote about it earlier. Then, to install PyTorch, run
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
Also the following modules will be required:
- transformers
- sklearn
- numpy - vectors and matrices (the @ operator, which is short for matmul)
- xgboost
- munch - config reading
- tqdm - fancy progress bar for the terminal
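Once the modules are installed, a quick sanity check that the CUDA build of PyTorch was actually picked up (the printed values will of course depend on your machine):
import torch

print(torch.__version__)          # should end with +cu113 for this wheel
print(torch.cuda.is_available())  # True means the GPU is visible to PyTorch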
Understand Models in PyTorch
The core of any neural network model (and apparently the Transformer as well) is nn.Module.
What can you say about this trivial example?
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # 1 input channel -> 20 output channels, 5x5 kernel
        self.conv1 = nn.Conv2d(1, 20, 5)
        # 20 input channels -> 20 output channels, 5x5 kernel
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))
Clearly it has something to do with 2D convolutions: the first layer processes 1 channel, the second one 20 channels. Channels? Easy peasy. The kernel size is 5. The signal between layers is passed through a rectified linear unit (ReLU).
Still nothing makes sense? Okay, let me tell this one more time. Given two-dimensional data (in the case of a stock price graph it's stock price vs. closing time) we want to catch patterns and correlations and apply them in the future. This data is fed to the first layer; the second layer then works on the 20 feature maps the first layer produced.
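To make the channels and the kernel size concrete, here is a small shape experiment with the Model class above (the 28x28 input is an arbitrary size I picked just to have something to feed in):
import torch

model = Model()
x = torch.randn(1, 1, 28, 28)   # batch of 1, 1 channel, 28x28 "image"
out = model(x)
# each 5x5 convolution without padding trims 4 pixels per spatial dimension:
# 28x28 -> 24x24 -> 20x20, while the channel count goes 1 -> 20 -> 20
print(out.shape)                # torch.Size([1, 20, 20, 20])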
Tensor operations
Extract specific elements from one tensor when another tensor provides the indices (it works a bit like a mask):
>>> import torch
>>> q = torch.tensor([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
>>> q
tensor([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]])
>>> a = torch.tensor([[0,1,0],[1,0,0],[0,0,1],[0,1,0]])
>>> a
tensor([[0, 1, 0],
[1, 0, 0],
[0, 0, 1],
[0, 1, 0]])
>>> q.gather(1,a)
tensor([[ 1, 2, 1],
[ 5, 4, 4],
[ 7, 7, 8],
[10, 11, 10]])
Now, if each row of the tensor a has a 1 in the place that indicates which index we need to use in the tensor q (something like one-hot encoding), then
>>> b = a.max(1).indices.view(4,1)
>>> b
tensor([[1],
[0],
[2],
[1]])
>>> q.gather(1,b)
tensor([[ 2],
[ 4],
[ 9],
[11]])
If we need to push in rows with new values and drop old rows (similar to a shift operation on arrays, but here with respect to a specific axis):
>>> a = torch.tensor([[1,2],[3,4],[5,6]])
>>> a
tensor([[1, 2],
[3, 4],
[5, 6]])
>>> b = torch.roll(a,-1,0)
>>> b
tensor([[3, 4],
[5, 6],
[1, 2]])
>>> b[2,:] = torch.tensor([7,8])
>>> b
tensor([[3, 4],
[5, 6],
[7, 8]])
If a sequence of events is stored in one array, and we have several such sequences in one tensor, and we want to take subintervals from these sequences and feed them in batches where these subintervals form 2D tensors:
>>> a = torch.tensor([[1,2,3,4],[5,6,7,8]])
>>> a
tensor([[1, 2, 3, 4],
[5, 6, 7, 8]])
>>> a.view(2,2,2)
tensor([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
>>> a.view(2,2,2).permute(1,0,2)
tensor([[[1, 2],
[5, 6]],
[[3, 4],
[7, 8]]])
Two vectors make a matrix (an outer product)
# ( 1 )             [ 1 0 ]
# ( 0 ) * (1 0)  =  [ 0 0 ]
>>> a = torch.tensor([1,0]).unsqueeze(1)
>>> torch.mm(a, a.T)
tensor([[1, 0],
[0, 0]])
When, for any reason, one row (or column) must be copied several times to create a matrix that can then be processed in one go, keep in mind that these two are equivalent:
>>> torch.stack([torch.tensor([1,2,3]) for i in range(2)])
tensor([[1, 2, 3],
[1, 2, 3]])
>>> torch.tensor([1,2,3]).unsqueeze(0).repeat(2,1)
tensor([[1, 2, 3],
[1, 2, 3]])
The same idea as these Python basics:
>>> [2 for i in range(5)]
[2, 2, 2, 2, 2]
>>> [2]*5
[2, 2, 2, 2, 2]
What is the difference between (vector * matrix) and (matrix * [vector]T)? Here * is PyTorch's element-wise multiplication with broadcasting, not a matrix product.
In the case of (vector * matrix), the vector is broadcast as a row, so its length must match the number of columns in the matrix. The result has the same shape as the matrix.
In the case of (matrix * [vector]T), the vector is broadcast as a column, so its length must match the number of rows in the matrix. The result again has the same shape as the matrix.
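A quick sketch of both cases (m, v, and w are names I made up for this example):
>>> m = torch.tensor([[1, 2, 3], [4, 5, 6]])
>>> v = torch.tensor([10, 20, 30])          # row: length matches the columns of m
>>> v * m
tensor([[ 10,  40,  90],
        [ 40, 100, 180]])
>>> w = torch.tensor([[10], [20]])          # column: length matches the rows of m
>>> m * w
tensor([[ 10,  20,  30],
        [ 80, 100, 120]])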