token
Token class representing a vocabulary token with its ID and byte representation.
`Token` subclasses `bytes` for backwards compatibility so that `b"".join(tokens)`
works. Equality and hashing between `Token` objects are based on `token_id` (not
byte content), because multiple tokens can share the same byte string.
Token
Bases: bytes
A vocabulary token carrying both a token ID and its byte representation.
Subclasses `bytes` so that existing code using byte operations (`b"".join`,
`len`, indexing, `.decode()`) continues to work. Equality and hashing
between `Token` objects use `token_id`, not byte content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `token_id` | `int` | The unique identifier for this token in the vocabulary. | required |
| `byte_string` | `bytes` | The byte representation of this token. | required |
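The behavior described above can be illustrated with a minimal sketch. This is a hypothetical stand-in written for this page, not the code in genlm/backend/tokenization/token.py: the constructor argument order follows the parameter table, while the `__ne__` override and the fallback to byte comparison for non-`Token` operands are assumptions made for the example.

```python
class Token(bytes):
    """Illustrative stand-in for a bytes subclass that carries a token ID."""

    def __new__(cls, token_id: int, byte_string: bytes):
        # bytes is immutable, so the content is fixed in __new__.
        obj = super().__new__(cls, byte_string)
        obj.token_id = token_id
        return obj

    @property
    def byte_string(self) -> bytes:
        # Return a plain bytes object rather than a Token.
        return bytes(self)

    def __eq__(self, other):
        if isinstance(other, Token):
            return self.token_id == other.token_id
        # Assumption: non-Token operands are compared by byte content.
        return bytes.__eq__(self, other)

    def __ne__(self, other):
        # bytes defines its own __ne__, so it must be overridden as well;
        # otherwise != would still compare byte content.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash(self.token_id)


# Two vocabulary entries that happen to share the same byte string.
a1 = Token(1, b"a")
a2 = Token(2, b"a")

joined = b"".join([Token(10, b"he"), Token(11, b"llo")])  # plain bytes
distinct = {a1, a2}  # both entries survive because their IDs differ
```

In this sketch a `Token` and an equal plain `bytes` value hash differently, so mixing the two as keys in one set or dict is best avoided.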
Source code in genlm/backend/tokenization/token.py
byte_string
property
The byte representation of this token (as plain bytes).