vocab
Functions to get and check Hugging Face tokenizer vocabularies.
decode_vocab(tokenizer, byte2str_fallback='tokenizer')
Convert tokenizer vocabulary into byte and string representations.
Warning
The byte representation is the canonical form. Each element in byte_vocab is a Token object that contains both the token_id and byte_string. The string representation is provided for convenience but may not decode properly for all tokens, especially those containing invalid UTF-8 sequences.
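The sketch below illustrates why a token's bytes may not decode on their own. It uses only the Python standard library. The byte strings are hypothetical examples, not taken from any real tokenizer, and `is_valid_utf8` is a helper introduced here for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical token byte strings: a byte-level tokenizer may split the
# two-byte UTF-8 encoding of "é" (0xC3 0xA9) across two tokens.
pieces = [b"caf", b"\xc3", b"\xa9"]

# Each half of the split character is invalid UTF-8 in isolation.
validity = [is_valid_utf8(p) for p in pieces]

# Concatenating the bytes first and decoding afterwards recovers the text.
joined = b"".join(pieces).decode("utf-8")
```

Here only the first piece decodes by itself, while the concatenated bytes decode cleanly. Operating on the byte form avoids depending on how each fragment happens to be rendered as a string.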
Parameters:
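| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | | A Hugging Face tokenizer instance | required |
| byte2str_fallback | str | Strategy for converting invalid UTF-8 bytes to strings. Options: 'tokenizer' (use the tokenizer's convert_ids_to_tokens), 'latin1' (decode using latin1 encoding), 'replace' (use the Unicode replacement character '�') | 'tokenizer' |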
Returns:
| Type | Description |
|---|---|
| tuple | (byte_vocab, str_vocab), where byte_vocab is a list of Token objects and str_vocab is a list of strings |
Source code in genlm/backend/tokenization/vocab.py
bytes_to_strs(tokenizer, byte_vocab, byte2str_fallback)
Convert byte representations to UTF-8 strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | | A Hugging Face tokenizer instance | required |
| byte_vocab | list[bytes] | List of byte representations of tokens | required |
| byte2str_fallback | str | Strategy for converting invalid UTF-8 bytes to strings. Options: 'tokenizer' (use the tokenizer's convert_ids_to_tokens; default), 'latin1' (decode using latin1 encoding), 'replace' (use the Unicode replacement character '�') | required |
Returns:
| Type | Description |
|---|---|
| list[str] | List of string representations of tokens |
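As a rough illustration of the 'latin1' and 'replace' strategies, the sketch below applies them with Python's built-in codecs. `decode_with_fallback` is a hypothetical helper written for this example. It is not the library's implementation, and it assumes the fallback is applied only when strict UTF-8 decoding fails. The 'tokenizer' strategy is omitted because it needs a real tokenizer instance.

```python
def decode_with_fallback(data: bytes, fallback: str) -> str:
    """Decode as UTF-8, applying `fallback` only when that fails.

    Hypothetical helper for illustration; not the library's code.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        if fallback == "latin1":
            # latin1 maps every byte 0x00-0xFF to exactly one code point,
            # so this branch cannot fail.
            return data.decode("latin1")
        if fallback == "replace":
            # Undecodable bytes become U+FFFD, the replacement character.
            return data.decode("utf-8", errors="replace")
        raise ValueError(f"unsupported fallback: {fallback!r}")


valid = decode_with_fallback(b"hello", "latin1")      # valid UTF-8, no fallback
as_latin1 = decode_with_fallback(b"\xc3", "latin1")   # "Ã" (U+00C3)
as_replaced = decode_with_fallback(b"\xc3", "replace")  # "\ufffd"
```

The trade-off in this sketch is that 'latin1' keeps distinct bytes distinct but yields characters unrelated to the intended text, whereas 'replace' signals the failure clearly but discards the original byte values.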
Note
May produce duplicate strings for different token IDs. A warning is issued if duplicates are found.
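One way such duplicates can arise is sketched below, assuming a 'replace'-style decoding in which every undecodable token collapses to the replacement character. The byte vocabulary is hypothetical, and the list index stands in for the token ID.

```python
from collections import defaultdict

# Hypothetical byte vocabulary; the list index plays the role of token ID.
byte_vocab = [b"a", b"\xc3", b"\xa9", b"b"]

# Decode each token, substituting U+FFFD for undecodable bytes.
str_vocab = [b.decode("utf-8", errors="replace") for b in byte_vocab]

# Group token IDs by their string form to find collisions.
ids_by_string = defaultdict(list)
for token_id, s in enumerate(str_vocab):
    ids_by_string[s].append(token_id)

collisions = {s: ids for s, ids in ids_by_string.items() if len(ids) > 1}
```

In this example token IDs 1 and 2 have distinct bytes but share the string '�', so a lookup keyed on strings cannot tell them apart, while one keyed on bytes can.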