

BAM (Bidirectional Associative Memory) can link together data of different types. Associations...

Pattern pairs

On the one hand, the model requires bipolar patterns: arrays of -1 and 1. But I need to store words and sentences. How do I encode them into that format? I looked into byte pair encoding (BPE), which is used in GPT-2. It basically tokenizes text into subwords.
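A minimal sketch of one possible encoding (my own assumption, not an established scheme): take the word's UTF-8 bytes, unpack them into bits, and map each bit to -1 or 1.

```python
# Sketch of an assumed word-to-bipolar encoder (not an established scheme):
# unpack the word's UTF-8 bytes into bits and map 0 -> -1, 1 -> +1.

def word_to_bipolar(word: str, max_bytes: int = 8) -> list[int]:
    data = word.encode("utf-8")[:max_bytes]
    bits = []
    for byte in data:
        # most significant bit first, 8 bits per byte
        bits.extend((byte >> i) & 1 for i in range(7, -1, -1))
    # pad shorter words so every pattern has the same length
    bits.extend([0] * (max_bytes * 8 - len(bits)))
    return [1 if b else -1 for b in bits]

print(word_to_bipolar("hi"))  # a 64-dimensional vector of -1 and 1
```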

How are letters and words stored in neurons? Should I look into psychological research or just make up my own encoder?

  • We cannot achieve clusters and hierarchy with just one layer in BAM (see the sketch after this list for what that single layer looks like).
  • Sadly, words are triggered together with other neuron activity; even with electrodes placed on the brain (the technique is officially called electrocorticography), we can't learn much about how words are represented.
  • Every letter and every word is stored in a separate neuron.
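For reference, here is a minimal sketch of that single layer: the classic Kosko-style BAM stores pattern pairs as a sum of outer products and recalls them with a thresholded matrix product. The patterns below are toy examples.

```python
import numpy as np

# Minimal one-layer BAM sketch (Kosko-style); the patterns are toy examples.
X = np.array([[ 1, -1,  1, -1],
              [-1, -1,  1,  1]])   # two input patterns
Y = np.array([[ 1,  1, -1],
              [-1,  1,  1]])       # their associated output patterns

# Hebbian storage: W is the sum of outer products of the stored pairs
W = X.T @ Y

def recall_y(x):
    # forward pass: threshold x W to a bipolar output
    return np.sign(x @ W)

def recall_x(y):
    # backward pass: threshold y W^T to a bipolar input
    return np.sign(y @ W.T)

print(recall_y(X[0]))   # reproduces Y[0]
print(recall_x(Y[1]))   # reproduces X[1]
```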

GPT has a vocabulary size of 40,478 since they have 478 base characters and chose to stop training after 40,000 merges.

GPT-2 uses bytes as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that every base character is included in the vocabulary. GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.

From https://huggingface.co/docs/transformers/tokenizer_summary
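To make "merges" concrete, here is a toy sketch of a single BPE merge step (my simplified illustration, not the actual GPT-2 training code): count adjacent symbol pairs across the corpus and fuse the most frequent one. GPT-2 repeats this 50,000 times.

```python
from collections import Counter

# Toy illustration of one BPE merge step (not the real GPT-2 code).
# The corpus is represented as words split into symbols, with counts.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}

def most_frequent_pair(corpus):
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:      # fuse the chosen pair
                out.append("".join(pair)); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(corpus)   # ('l', 'o'), tied with ('o', 'w') at 7
print(pair, merge(corpus, pair))    # 'lo' is now a single symbol
```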

Here is what the BPE vocabulary looks like. You can think of it as the dictionary used by GPT. It contains subwords and single letters, but the number of subwords is capped by the researchers.

BPE encoding

As a result, they have a flat vocabulary. By "flat" I mean that all symbols are independent. The strange prefix \u0120 (sometimes printed as Ġ, [1], [2], [3]) denotes a leading space, which means that nothing can be prepended to symbols containing it.
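To see the Ġ marker in practice, the Hugging Face GPT-2 tokenizer prints it directly (assuming the transformers package is installed; the output in the comment is indicative):

```python
from transformers import GPT2Tokenizer

# Tokenize a short phrase with the pretrained GPT-2 BPE vocabulary.
tok = GPT2Tokenizer.from_pretrained("gpt2")
print(tok.tokenize("hello world"))
# should print something like ['hello', 'Ġworld'];
# the Ġ marks that 'world' was preceded by a space
```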

But instead of a flat vocabulary, we need a hierarchical network that starts with single letters and combines them into subwords and words.

Next, the firing of inputs is not "flat" either: we will not present the word "hello" as a simultaneous input to the first-layer neurons.
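Purely as an illustration of that (reusing the same assumed byte-level bipolar encoding as above), the word would arrive one letter per time step rather than as one wide simultaneous pattern:

```python
# Illustrative only: present "hello" letter by letter, one bipolar
# pattern per time step, instead of one simultaneous word-wide input.

def letter_to_bipolar(ch: str) -> list[int]:
    # 8 bits of the character code mapped to -1/1 (an assumed encoding)
    code = ord(ch)
    return [1 if (code >> i) & 1 else -1 for i in range(7, -1, -1)]

for t, ch in enumerate("hello"):
    print(f"t={t}: present {ch!r} as {letter_to_bipolar(ch)}")
```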

Papers
