tokenization
decode_vocab(tokenizer, byte2str_fallback='tokenizer')
Convert tokenizer vocabulary into byte and string representations.
Warning
The byte representation is the canonical form. Each element in byte_vocab is a Token object that contains both the token_id and byte_string. The string representation is provided for convenience but may not decode properly for all tokens, especially those containing invalid UTF-8 sequences.
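To see why the string form can fail for individual tokens, here is a minimal pure-Python sketch. It does not call `decode_vocab`; the two byte pieces are hypothetical tokens chosen so that, joined, they form the UTF-8 encoding of "é". The helper `try_decode` is likewise an illustrative name, not part of the library.

```python
# Two hypothetical byte-level tokens that together encode "é" (U+00E9).
pieces = [b"\xc3", b"\xa9"]


def try_decode(b: bytes):
    """Return the UTF-8 string for b, or None if b is not valid UTF-8."""
    try:
        return b.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Neither piece is valid UTF-8 on its own.
individual = [try_decode(p) for p in pieces]

# The concatenated bytes are valid UTF-8.
joined = try_decode(b"".join(pieces))
```

Because a token's bytes may only be meaningful after concatenation with its neighbours, joining byte strings first and decoding afterwards is the safer order of operations.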
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | | A Hugging Face tokenizer instance | required |
| byte2str_fallback | str | Strategy for converting invalid UTF-8 bytes to strings. Options: | 'tokenizer' |
Returns:
| Type | Description |
|---|---|
| tuple | (byte_vocab, str_vocab) where byte_vocab is a list of Token objects and str_vocab is a list of strings |
Source code in genlm/backend/tokenization/vocab.py
Token
Bases: bytes
A vocabulary token carrying both a token ID and its byte representation.
Subclasses bytes so that existing code using byte operations (b"".join,
len, indexing, .decode()) continues to work. Equality and hashing
between Token objects use token_id, not byte content.
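The behaviour described above can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the code in genlm/backend/tokenization/token.py: the name `SketchToken` is hypothetical, the constructor argument order simply mirrors the documented parameters, and comparison against plain `bytes` is deliberately left to Python's default handling because the description only covers Token-to-Token comparison.

```python
class SketchToken(bytes):
    """Hypothetical stand-in for a bytes subclass that carries a token ID."""

    def __new__(cls, token_id: int, byte_string: bytes):
        # bytes is immutable, so the content is set in __new__.
        obj = super().__new__(cls, byte_string)
        obj.token_id = token_id
        return obj

    @property
    def byte_string(self) -> bytes:
        # Return the content as a plain bytes object.
        return bytes(self)

    def __eq__(self, other):
        if isinstance(other, SketchToken):
            return self.token_id == other.token_id
        # Other types: defer to Python's default comparison machinery.
        return NotImplemented

    def __ne__(self, other):
        # Keep != consistent with the overridden ==.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash(self.token_id)


a = SketchToken(5, b"ab")
b = SketchToken(5, b"zz")  # same ID as a, different bytes
c = SketchToken(6, b"ab")  # same bytes as a, different ID

same_id_equal = a == b
same_bytes_equal = a == c
joined = b"".join([a, c])   # ordinary byte operations still work
distinct = len({a, b, c})   # a and b collapse into one set entry
```

In this sketch two tokens with identical bytes but different IDs stay distinct in sets and as dictionary keys, which matters when a vocabulary contains duplicate byte strings.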
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| token_id | int | The unique identifier for this token in the vocabulary. | required |
| byte_string | bytes | The byte representation of this token. | required |
Source code in genlm/backend/tokenization/token.py
byte_string (property)
The byte representation of this token (as plain bytes).