You can find this in many scientific papers, common phrases indicating that the proper explanation is skipped, here sorted in the growing order of complexity:

- Rather trivial
- Then clearly
- Simple (small) algebraic transformation
- Tedious exercise leave up to the reader

## Positional encoding

From: Attention is all you need this reprint with comments

where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

Jupyter notebook to illustrate it

### What?

Maybe it's just me but I don't understand how sin/cos functions can save positional relations. I can understand convolutional operations how they transform data based on their positions.

Why $sin$ for even indices and $cos$ for odd ones?

### Explanation

From: Neural machine translation with a Transformer

Since the model doesn't contain any recurrent or convolutional layers, it needs some way to identify word order, otherwise it would see the input sequence as a *bag of words* instance, `how are you`

, `how you are`

, `you how are`

would be indistinguishable.

The position encoding function vibrates along the position axis at different frequencies depending on their location along the depth of the embedding vector.