## bytes

Functions to get the byte vocabulary from a Hugging Face tokenizer.
### get_byte_vocab(tokenizer)
Extract byte vocabulary from a tokenizer using various methods.
This function attempts to extract the byte representation of each token in the vocabulary using multiple methods, trying each in sequence until one succeeds:
- If the tokenizer has a byte_decoder attribute, attempt to use that directly
- If the tokenizer has an sp_model (SentencePiece) attribute, use that
- Try encoding the token strings directly
- Fall back to using the default GPT2 byte decoder
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenizer` | | A Hugging Face tokenizer instance. | *required* |
Returns:

| Type | Description |
| --- | --- |
| `list[bytes]` | List of byte representations of tokens. |
Raises:

| Type | Description |
| --- | --- |
| `ByteVocabError` | If vocabulary cannot be decoded using any of the available methods. |
Source code in genlm/backend/tokenization/bytes.py
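Below is a minimal usage sketch. It assumes the `transformers` package is installed and the GPT-2 checkpoint is available; the values in the comments reflect GPT-2's byte-level BPE vocabulary.

```python
from transformers import AutoTokenizer
from genlm.backend.tokenization.bytes import get_byte_vocab

# Any Hugging Face tokenizer works; GPT-2 is used here for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
byte_vocab = get_byte_vocab(tokenizer)

print(len(byte_vocab))  # vocabulary size (50257 for GPT-2)
print(byte_vocab[0])    # b'!' -- each entry is the raw bytes of one token
```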
### get_byte_tokens_from_byte_decoder(tokenizer, byte_decoder)
Convert tokens to bytes using a byte decoder mapping.
Special tokens are handled by directly encoding their string representation.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenizer` | | A Hugging Face tokenizer instance. | *required* |
| `byte_decoder` | `dict` | Dictionary mapping characters to bytes. | *required* |
Returns:

| Name | Type | Description |
| --- | --- | --- |
| `byte_tokens` | `list[bytes]` | List of byte representations for each token. |
Source code in genlm/backend/tokenization/bytes.py
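A sketch of calling this helper directly, assuming the slow `GPT2Tokenizer` class, which exposes the `byte_decoder` attribute that `get_byte_vocab` would otherwise pick up automatically:

```python
from transformers import GPT2Tokenizer
from genlm.backend.tokenization.bytes import get_byte_tokens_from_byte_decoder

# The slow GPT-2 tokenizer carries a byte_decoder: a dict mapping each
# character that appears in token strings back to the byte it encodes,
# e.g. tokenizer.byte_decoder['Ġ'] == 32 (the space byte).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

byte_tokens = get_byte_tokens_from_byte_decoder(tokenizer, tokenizer.byte_decoder)
print(byte_tokens[0])  # b'!' for GPT-2
```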
### get_byte_tokens_from_sp(tokenizer)
Convert tokens to their byte representations using a SentencePiece model.
Uses the SentencePiece model's `id_to_piece` method to get the raw byte representation of each token, handling special tokens separately. Converts any hex-encoded bytes (in `<0xXX>` format) to their actual byte values and replaces the SentencePiece prefix-space marker (`▁`) with a regular space.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenizer` | | A Hugging Face tokenizer instance with a SentencePiece model. | *required* |
Returns:

| Name | Type | Description |
| --- | --- | --- |
| `byte_tokens` | `list[bytes]` | List of byte representations for each token in the vocabulary. |
Note
Special tokens are handled by directly encoding their string representation, while normal tokens go through the SentencePiece conversion process.
Source code in genlm/backend/tokenization/bytes.py
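A hedged sketch with a SentencePiece-backed tokenizer. The checkpoint name below is only illustrative (substitute any SentencePiece-based model you have access to), and `use_fast=False` is needed so the tokenizer exposes its underlying `sp_model`:

```python
from transformers import AutoTokenizer
from genlm.backend.tokenization.bytes import get_byte_tokens_from_sp

# Illustrative checkpoint; any SentencePiece-based tokenizer should work.
tokenizer = AutoTokenizer.from_pretrained(
    "hf-internal-testing/llama-tokenizer", use_fast=False
)
assert hasattr(tokenizer, "sp_model")

byte_tokens = get_byte_tokens_from_sp(tokenizer)
# The '▁' prefix-space marker comes back as an actual space byte:
print(byte_tokens[tokenizer.convert_tokens_to_ids("▁the")])  # expected: b' the'
```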
### check_byte_decoder(tokenizer, byte_decoder)
Verify that a byte decoder can properly handle all tokens.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenizer` | | A Hugging Face tokenizer instance. | *required* |
| `byte_decoder` | `dict` | Dictionary mapping characters to bytes. | *required* |
Raises:

| Type | Description |
| --- | --- |
| `ByteDecoderError` | If byte decoder fails validation checks. |
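A sketch of the validation round trip, again with GPT-2's byte decoder. The deliberately broken mapping is hypothetical and only meant to exercise the failure path; depending on where the gap is hit, the surfaced error may be a `ByteDecoderError` or a lower-level `KeyError`:

```python
from transformers import GPT2Tokenizer
from genlm.backend.tokenization.bytes import check_byte_decoder

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# A complete decoder should pass the checks without raising.
check_byte_decoder(tokenizer, tokenizer.byte_decoder)

# Drop the space byte ('Ġ' -> 32) to simulate an incomplete decoder.
broken = {c: b for c, b in tokenizer.byte_decoder.items() if b != 32}
try:
    check_byte_decoder(tokenizer, broken)
except Exception as err:  # ByteDecoderError in the documented case
    print(f"validation failed as expected: {type(err).__name__}")
```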