# SentencePiece


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

``` python
import lorem

from ai_notes.tokenisation.bpe import BPE, NaiveBPE
```

``` python
naive_bpe = NaiveBPE(vocab_size=300)
naive_bpe.train(lorem.text())
encoded_text = naive_bpe.encode("hello world")
```

``` python
print(encoded_text)
```

    [104, 101, 108, 108, 111, 32, 119, 258, 108, 100]

## Dummy whitespace

In our
[byte_pair_encoding_from_scratch.ipynb](./byte_pair_encoding_from_scratch.ipynb)
we used a regex to split text into chunks. Merges can never take place
across those hard word boundaries.
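For illustration, here is a toy pre-tokeniser in that spirit. The pattern below is a hypothetical simplification, not the regex used in the notebook:

``` python
import re

# Hypothetical, simplified chunking pattern; real tokenisers use a more
# elaborate GPT-style regex.
pattern = re.compile(r" ?\w+| ?[^\w\s]+|\s+")

chunks = pattern.findall("hello world")
print(chunks)  # ['hello', ' world']
# BPE merges are then learned inside each chunk only, so no token can
# ever span the boundary between 'hello' and ' world'.
```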

SentencePiece runs on the entire input as a single character stream and
lets merges happen freely across any adjacent characters.

It replaces every space character with the special `\u2581` (▁)
character (which resembles an underscore but is much rarer so it won’t
collide with real underscores).

Because ▁ can merge with any other character, word boundary information
is carried inside the token itself. We end up with merges like `▁` + `h`
-\> `▁he`.

We prepend ▁ to the start of an input as well. This helps the model
avoid encoding a word differently just because it happened to appear at
the start of the input.
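Concretely, the transform looks like this (a sketch; the function name is illustrative, and `▁` is just `\u2581`):

``` python
SPACE = "\u2581"  # ▁

def add_dummy_prefix(text: str) -> str:
    # Prepend a space so the first word gets a ▁ too, then swap every
    # space for the ▁ marker.
    return f" {text}".replace(" ", SPACE)

print(add_dummy_prefix("hello world"))  # ▁hello▁world
```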

It’s pretty easy to implement this in Python. We can take our NaiveBPE
implementation and just tweak the `get_tokens` and `decode` methods.

We add these two lines to the start of the `get_tokens` method:

``` python
text = f" {text}"
text = text.replace(" ", self.space_token)
```

And this line to the end of the `decode` method:

``` python
text = text.replace(self.space_token, " ").strip()
```
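Putting the two tweaks together, the whitespace handling round-trips as sketched below (the function names are illustrative, not the actual NaiveBPE methods):

``` python
SPACE = "\u2581"  # ▁

def pre_process(text: str) -> str:
    # What the tweaked get_tokens sees before tokenisation.
    return f" {text}".replace(" ", SPACE)

def post_process(text: str) -> str:
    # What the tweaked decode does after reassembling the text.
    return text.replace(SPACE, " ").strip()

round_trip = post_process(pre_process("hello world"))
print(round_trip)  # hello world
```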

------------------------------------------------------------------------

### SentencePieceBPE

``` python
def SentencePieceBPE(
    vocab_size: int
):
```

*Initialize self. See help(type(self)) for accurate signature.*

``` python
sentencepiece_bpe = SentencePieceBPE(vocab_size=300)
sentencepiece_bpe.train(lorem.text())
encoded_text = sentencepiece_bpe.encode("hello world")
```

``` python
encoded_text
```

    [257, 104, 101, 108, 108, 111, 257, 119, 259, 108, 100]

``` python
decoded_text = sentencepiece_bpe.decode(encoded_text)
decoded_text
```

    'hello world'

## Byte fallback

## TODO

- Character coverage
- Special tokens
- Unicode normalization
- Subword regularization / stochastic segmentation
- Unigram language model
