SentencePiece

import lorem

from ai_notes.tokenisation.bpe import BPE, NaiveBPE
naive_bpe = NaiveBPE(vocab_size=300)
naive_bpe.train(lorem.text())
encoded_text = naive_bpe.encode("hello world")
print(encoded_text)
[104, 101, 108, 108, 111, 32, 119, 258, 108, 100]

Dummy whitespace

In our byte_pair_encoding_from_scratch.ipynb we used a regex to split text into chunks, so merges can never take place across those hard word boundaries.

SentencePiece runs on the entire input as a single character stream and lets merges happen freely across any adjacent characters.

It replaces every space character with the special U+2581 character (▁), which resembles an underscore but is rare enough in normal text that it won't collide with real underscores.

Because ▁ can merge with any other character, word boundary information is carried inside the token itself. We end up with merges like ▁h + e -> ▁he.
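To see that ▁ takes part in merges just like any other character, we can count adjacent pairs in a preprocessed string (a toy illustration, not the library's code — the text and counts here are made up for the example):

```python
from collections import Counter

SPACE = "\u2581"  # the ▁ marker described above
text = " the theme then".replace(" ", SPACE)  # ▁the▁theme▁then

# Count every adjacent character pair, exactly as a BPE merge step would.
pairs = Counter(zip(text, text[1:]))
print(pairs.most_common(3))
# ('▁', 't') appears three times, tied for most frequent, so the very
# first merge could produce the boundary-carrying token ▁t.
```

Because ▁ is just another character in the stream, nothing stops it from being the left half of a merge, which is how word-boundary information ends up inside tokens.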

We prepend ▁ to the start of an input as well. This helps the model avoid encoding a word differently just because it happened to appear at the start of the input.

It’s pretty easy to implement this in Python. We can take our NaiveBPE implementation and just tweak the get_tokens and decode methods.

We add these two lines to the start of the get_tokens method:

text = f" {text}"
text = text.replace(" ", self.space_token)

And this line to the end of the decode method:

text = text.replace(self.space_token, " ").strip()
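Putting the two tweaks together, here is a minimal self-contained sketch. CharTokenizer is a stand-in for NaiveBPE (the merge logic is omitted to keep the sketch short; the real class also maps tokens to integer ids), but the two overridden methods contain exactly the lines above:

```python
class CharTokenizer:
    # Stand-in base class for NaiveBPE: splits into character tokens
    # and joins them back. Merges and id mapping are omitted here.
    def get_tokens(self, text):
        return list(text)

    def decode(self, tokens):
        return "".join(tokens)


class SentencePieceTokenizer(CharTokenizer):
    space_token = "\u2581"  # ▁

    def get_tokens(self, text):
        text = f" {text}"                           # prepend dummy whitespace
        text = text.replace(" ", self.space_token)  # every space becomes ▁
        return super().get_tokens(text)

    def decode(self, tokens):
        text = super().decode(tokens)
        # Undo the preprocessing: ▁ back to spaces, drop the leading one.
        return text.replace(self.space_token, " ").strip()


tok = SentencePieceTokenizer()
tokens = tok.get_tokens("hello world")
print(tokens[0])            # ▁ — the prepended boundary marker
print(tok.decode(tokens))   # hello world — the round trip is lossless
```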

SentencePieceBPE


SentencePieceBPE(vocab_size: int)

sentencepiece_bpe = SentencePieceBPE(vocab_size=300)
sentencepiece_bpe.train(lorem.text())
encoded_text = sentencepiece_bpe.encode("hello world")
encoded_text
[257, 104, 101, 108, 108, 111, 257, 119, 259, 108, 100]
decoded_text = sentencepiece_bpe.decode(encoded_text)
decoded_text
'hello world'

Byte fallback
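Byte fallback means a character the learned vocabulary doesn't cover is decomposed into its raw UTF-8 bytes, each of which has a reserved token, so no input is ever unencodable. A minimal sketch of the idea (the vocab layout and function name here are hypothetical, not from our implementation):

```python
def encode_with_byte_fallback(text, vocab):
    # vocab maps known tokens to ids. Hypothetical layout: ids 0-255 are
    # reserved for raw bytes, learned tokens start above that.
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            # Unknown character: fall back to its UTF-8 bytes, one
            # reserved token per byte, instead of emitting <unk>.
            ids.extend(ch.encode("utf-8"))
    return ids


vocab = {"h": 260, "i": 261}
print(encode_with_byte_fallback("hi✓", vocab))  # [260, 261, 226, 156, 147]
# ✓ (U+2713) is not in the vocab, so it becomes its three UTF-8 bytes.
```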

TODO:

  • Character coverage
  • Special tokens
  • Unicode normalization
  • Subword regularization / stochastic segmentation
  • Unigram language model