The smallest Transformer - N | Robots Science Gamedev

Published: June 03, 2023Updated: May 26, 2024

Table of contents

Questions
Reference
Input data and goal
Code
Homework

Questions

Are embeddings required for any type of data? Why sin/cos?
What stacks add to the whole picture? How are they different?
Why 2 parts are called encoder and encoder if they both take input data similarly and process data with almost exact layers? (Only if the first layer of the decoder consisting of the masked multi-head attention is converting encoder into decoder. Or because data from encoder is forwarded into the decoder?). I think that names of these two parts should reflect time-relative context.

The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation.

Neural Machine Translation by Jointly Learning to Align and Translate

Based on this tutorial

I am trying to understand the difference in training data between sequence2sequence and bidirectional Transformers (like BERT, code)

Reference

White papers that we are going to use

Original Transformer with code
Illustrated Transformer
The Annotated Transformer version 1 version 2 (PyTorch)
Embedding is just a way of converting sentences into vectors. More in my notes about embeddings.
Multi-head attention is implemented in TensorFlow: docs, code. I checked in code how that layer includes separate weights for query, value, and key. For multiplication of tensors they use Einstein summation - einsum (three dots, ellipsis, to support any higher dimensions).

Input data and goal

Let's write a minimal Transformer model implementation in Rust writing feedforward layers and backpropagation, and decoder-encoder and multi-head attention from scratch. For training we will use a simple sequence 1 0 1 0 1 0....

Code

So let's start coding our simple version. In its core the transformer has encoder-decoder structure.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


class Transformer(nn.Module):
    def __init__(self, input_size, hidden_size, num_heads, num_layers):
        super(Transformer, self).__init__()
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.encoder = Encoder(hidden_size, num_heads, num_layers)
        self.decoder = Decoder(hidden_size, num_heads, num_layers)
        self.output_layer = nn.Linear(hidden_size, input_size)

    def forward(self, src, trg):
        src = self.embedding(src)
        trg = self.embedding(trg)
        enc_output = self.encoder(src)
        dec_output = self.decoder(trg, enc_output)
        output = self.output_layer(dec_output)
        return output

We start with the Encoder which is

class Encoder(nn.Module):
    def __init__(self, hidden_size, num_heads, num_layers):
        super(Encoder, self).__init__()
        # use nn.ModuleList instead of nn.Sequential because `num_layers` is an input parameter
        self.layers = nn.ModuleList([
            EncoderLayer(hidden_size, num_heads) for _ in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Define the Encoder Layer
class EncoderLayer(nn.Module):
    def __init__(self, hidden_size, num_heads):
        super(EncoderLayer, self).__init__()
        self.self_attention = MultiHeadAttention(hidden_size, num_heads)
        self.feedforward = FeedForward(hidden_size)

    def forward(self, x):
        att_output = self.self_attention(x, x, x)
        x = att_output + x
        x = self.feedforward(x)
        return x

# Define the Multi-Head Attention mechanism
class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_size, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.head_size = hidden_size // num_heads
        self.q_linear = nn.Linear(hidden_size, hidden_size)
        self.k_linear = nn.Linear(hidden_size, hidden_size)
        self.v_linear = nn.Linear(hidden_size, hidden_size)
        self.out_linear = nn.Linear(hidden_size, hidden_size)

    def forward(self, query, key, value):
        batch_size = query.shape[0]

        Q = self.q_linear(query).view(batch_size, -1, self.num_heads, self.head_size)
        K = self.k_linear(key).view(batch_size, -1, self.num_heads, self.head_size)
        V = self.v_linear(value).view(batch_size, -1, self.num_heads, self.head_size)

        Q = Q.permute(0, 2, 1, 3)
        K = K.permute(0, 2, 1, 3)
        V = V.permute(0, 2, 1, 3)

        att_scores = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.head_size**0.5
        att_scores = F.softmax(att_scores, dim=-1)

        att_output = torch.matmul(att_scores, V)
        att_output = att_output.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.num_heads * self.head_size)

        att_output = self.out_linear(att_output)
        return att_output

# Define the Feed-Forward Layer
class FeedForward(nn.Module):
    def __init__(self, hidden_size):
        super(FeedForward, self).__init__()
        self.fc1 = nn.Linear(hidden_size, 4 * hidden_size)
        self.fc2 = nn.Linear(4 * hidden_size, hidden_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Initialize the model and define a simple training loop
input_size = 2  # 0 and 1 in the sequence
hidden_size = 256
num_heads = 4
num_layers = 2
seq_length = 10

model = Transformer(input_size, hidden_size, num_heads, num_layers)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Generate a simple input sequence (1 0 1 0 1 0...)
input_seq = torch.tensor([1, 0] * (seq_length // 2), dtype=torch.long)
target_seq = input_seq[1:]  # Shift the target by one time step

# Training loop
for epoch in range(1000):
    optimizer.zero_grad()
    output = model(input_seq.unsqueeze(0), target_seq.unsqueeze(0))
    loss = criterion(output.view(-1, input_size), target_seq)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch + 1}/1000], Loss: {loss.item():.4f}')

# Test the model
with torch.no_grad():
    test_input = torch.tensor([1, 0] * (seq_length // 2), dtype=torch.long)
    predicted_output = model(test_input.unsqueeze(0), test_input[:seq_length - 1].unsqueeze(0))
    predicted_sequence = torch.argmax(predicted_output, dim=-1).squeeze().tolist()
    print("Input Sequence:", test_input.tolist())
    print("Predicted Sequence:", predicted_sequence)

Homework

What is a usual window size for attention in current models? See here why it's theoretically infinite
How Transformers solve problem with overfitting usually caused by softmax?