vocab
Functions to get and check Hugging Face tokenizer vocabularies.
decode_vocab(tokenizer, byte2str_fallback='tokenizer')
Convert tokenizer vocabulary into byte and string representations.
Warning
The byte representation is the canonical form. The string representation is provided for convenience but may not decode properly for all tokens, especially those containing invalid UTF-8 sequences.
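To illustrate why the byte form is treated as canonical, here is a minimal sketch in plain Python. It does not call this module. The two "tokens" are hypothetical byte pieces obtained by splitting the UTF-8 encoding of "€", as a byte-level tokenizer might do.

```python
# Hypothetical example: two byte-level tokens that together encode "€" (U+20AC).
euro = "€".encode("utf-8")  # three bytes: e2 82 ac
piece_a, piece_b = euro[:2], euro[2:]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Neither piece is valid UTF-8 on its own.
valid_a = is_valid_utf8(piece_a)
valid_b = is_valid_utf8(piece_b)

# Joining the bytes first and decoding once recovers the text exactly.
joined_bytes = (piece_a + piece_b).decode("utf-8")

# Decoding each piece separately (with replacement) and joining the strings
# loses the original character.
joined_strs = (
    piece_a.decode("utf-8", errors="replace")
    + piece_b.decode("utf-8", errors="replace")
)

print(valid_a, valid_b)
print(joined_bytes == "€", joined_strs == "€")
```

Concatenating byte representations and decoding once is lossless, while concatenating per-token strings may not be.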
Parameters:
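Name | Type | Description | Default |
---|---|---|---|
tokenizer | | A Hugging Face tokenizer instance | required |
byte2str_fallback | str | Strategy for converting invalid UTF-8 bytes to strings. Options: 'tokenizer' (use the tokenizer's convert_ids_to_tokens); 'latin1' (decode using latin1 encoding); 'replace' (use the Unicode replacement character '�') | 'tokenizer' |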
Returns:
Type | Description |
---|---|
tuple | (byte_vocab, str_vocab) |
Source code in genlm/backend/tokenization/vocab.py
bytes_to_strs(tokenizer, byte_vocab, byte2str_fallback)
Convert byte representations to UTF-8 strings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer | | A Hugging Face tokenizer instance | required |
byte_vocab | list[bytes] | List of byte representations of tokens | required |
byte2str_fallback | str | Strategy for converting invalid UTF-8 bytes to strings: 'tokenizer' (use the tokenizer's convert_ids_to_tokens; the default in decode_vocab); 'latin1' (decode using latin1 encoding); 'replace' (use the Unicode replacement character '�') | required |
Returns:
Type | Description |
---|---|
list[str] | List of string representations of tokens |
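The 'latin1' and 'replace' strategies can be sketched with Python's built-in codecs. This is an illustration only, not the library's implementation: `bytes_to_str_sketch` is a hypothetical helper, it assumes valid UTF-8 is decoded directly and the fallback applies only on failure, and it omits the 'tokenizer' strategy because that needs a real tokenizer instance.

```python
def bytes_to_str_sketch(token_bytes: bytes, fallback: str) -> str:
    """Hypothetical helper: decode as UTF-8, using `fallback` if that fails."""
    try:
        return token_bytes.decode("utf-8")
    except UnicodeDecodeError:
        if fallback == "latin1":
            # latin1 maps every byte to exactly one code point, so this never fails.
            return token_bytes.decode("latin1")
        if fallback == "replace":
            # Undecodable bytes become the replacement character U+FFFD.
            return token_bytes.decode("utf-8", errors="replace")
        raise ValueError(f"fallback not covered by this sketch: {fallback!r}")


# A toy byte vocabulary: one valid token and two invalid UTF-8 sequences.
byte_vocab = [b"hello", b"\xe2\x82", b"\xff"]

latin1_strs = [bytes_to_str_sketch(b, "latin1") for b in byte_vocab]
replace_strs = [bytes_to_str_sketch(b, "replace") for b in byte_vocab]

print(latin1_strs)
print(replace_strs)
```

In this sketch the 'latin1' strings can be encoded back to the original bytes, whereas 'replace' discards the invalid bytes, so distinct tokens can end up with the same string.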
Note
May produce duplicate strings for different token IDs. A warning is issued if duplicates are found.
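A duplicate check of the kind described in the note might look like the following sketch. `find_duplicate_strs` is a hypothetical helper written for illustration, and the warning text is invented; the library's actual check and message may differ.

```python
import warnings
from collections import defaultdict


def find_duplicate_strs(str_vocab):
    """Hypothetical helper: map each repeated string to the token IDs that share it."""
    ids_by_str = defaultdict(list)
    for token_id, token_str in enumerate(str_vocab):
        ids_by_str[token_str].append(token_id)
    return {s: ids for s, ids in ids_by_str.items() if len(ids) > 1}


# Toy string vocabulary in which token IDs 1 and 3 collapsed to the same string.
str_vocab = ["a", "\ufffd", "b", "\ufffd"]

dupes = find_duplicate_strs(str_vocab)
if dupes:
    warnings.warn(f"{len(dupes)} string(s) map to more than one token ID")
```

When token identity matters, indexing by token ID or by the byte representation avoids this ambiguity.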