set
SetSampler
Bases: ABC
Base class for set samplers.
A set sampler samples a weighted set of tokens from the vocabulary of a target potential.
Given a context of tokens \(x_1, \ldots, x_{n-1}\) in the target potential's vocabulary and a sampled set of tokens \(S \subseteq \textsf{target.vocab_eos}\), the log-weight associated with each token \(x_n\) must correspond to:
\[\textsf{target.logw_next}(x_n \mid x_1, \ldots, x_{n-1}) - \log \Pr(x_n \in S)\]
where \(\Pr(x_n \in S)\) is the probability the token was included in a sampled set.
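The weighting rule can be checked on a toy example. The sketch below assumes, purely for illustration, that each token enters the set independently with a known inclusion probability; the token names and numbers are made up and nothing here is taken from the library's implementation. It suggests why the correction matters: subtracting the log inclusion probability makes the summed weights of the sampled set match the total target weight in expectation.

```python
import itertools
import math

# Hypothetical target log-weights and inclusion probabilities (illustrative only).
target_logw = {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}
incl_prob = {"a": 1.0, "b": 0.5, "c": 0.25}


def set_log_weights(sampled_set):
    """Log-weight of each sampled token: target log-weight minus log Pr(token in S)."""
    return {t: target_logw[t] - math.log(incl_prob[t]) for t in sampled_set}


def expected_total_weight():
    """Exact expectation of the summed set weights under independent inclusion."""
    tokens = list(target_logw)
    total = 0.0
    for mask in itertools.product([False, True], repeat=len(tokens)):
        # Probability of drawing exactly this subset.
        p_set = math.prod(
            incl_prob[t] if keep else 1.0 - incl_prob[t]
            for t, keep in zip(tokens, mask)
        )
        subset = [t for t, keep in zip(tokens, mask) if keep]
        total += p_set * sum(math.exp(w) for w in set_log_weights(subset).values())
    return total


true_total = sum(math.exp(w) for w in target_logw.values())
estimate = expected_total_weight()
```

Comparing `estimate` with `true_total` shows the two agree up to floating-point error for this toy setup.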
Attributes:
Name | Type | Description |
---|---|---|
target | Potential | The target potential with respect to which the set's weights are computed. |
Source code in genlm/control/sampler/set.py
sample_set(context)
abstractmethod
async
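To illustrate the shape of an abstract, asynchronous `sample_set`, here is a minimal sketch built only on the standard library. `ToySetSampler` and `FullSetSampler` are hypothetical stand-ins, not the library's classes, and the target is assumed to be a plain dict of log-weights rather than a Potential.

```python
import asyncio
from abc import ABC, abstractmethod


class ToySetSampler(ABC):
    """Stand-in for a set sampler base class (not the library's SetSampler)."""

    def __init__(self, target):
        self.target = target  # assumed here: a dict of token -> log-weight

    @abstractmethod
    async def sample_set(self, context):
        """Return (log-weights of the sampled tokens, log-probability of the set)."""


class FullSetSampler(ToySetSampler):
    """Always 'samples' the whole vocabulary, so every inclusion probability is 1."""

    async def sample_set(self, context):
        # Inclusion probability 1 means log-weight == target log-weight,
        # and the set itself is drawn with probability 1 (log-probability 0).
        return dict(self.target), 0.0


weights, logp = asyncio.run(FullSetSampler({"a": -0.7, "b": -1.2}).sample_set([]))
```

Because `sample_set` is abstract, the base class cannot be instantiated directly; only subclasses that implement it can.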
TrieSetSampler
Bases: SetSampler
TrieSetSampler is a specialized set sampler that uses a trie data structure to efficiently sample a weighted set of tokens.
This sampler is designed to work with two potentials:
- a potential over a vocabulary of iterables (iter_potential), and
- a potential over a vocabulary of items, which are the elements of the iterables (item_potential).
For example, if iter_potential is a potential over byte sequences, then item_potential is a potential over bytes.
The target potential is the product of iter_potential and item_potential, with the latter coerced to operate on the token type of iter_potential. Thus, a TrieSetSampler samples tokens from iter_potential's vocabulary.
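The relationship between the two vocabularies can be sketched with plain dictionaries. Everything below is an assumption made for illustration: the potentials are static tables rather than context-dependent objects, and the coercion is modeled as summing item log-weights over the bytes of a token. It is not the library's coercion mechanism.

```python
import math

# Hypothetical iter_potential: log-weights over a tiny vocabulary of byte strings.
iter_logw = {b"a": math.log(0.4), b"ab": math.log(0.4), b"b": math.log(0.2)}

# Hypothetical item_potential: log-weights over single bytes (ints in Python).
item_logw = {ord("a"): 0.0, ord("b"): math.log(0.5)}


def coerced_item_logw(token):
    """Score a whole token with the item-level potential, one byte at a time."""
    return sum(item_logw.get(byte, -math.inf) for byte in token)


# Product of the two potentials, expressed as a sum of log-weights per token.
target_logw = {tok: w + coerced_item_logw(tok) for tok, w in iter_logw.items()}
```

The keys of `target_logw` are still byte strings, which mirrors the statement that sampled tokens come from the iterable-level vocabulary.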
__init__(iter_potential, item_potential)
Initialize the TrieSetSampler.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
iter_potential | Potential | The potential defined over a vocabulary of iterables. | required |
item_potential | Potential | The potential defined over a vocabulary of items. | required |
Raises:
Type | Description |
---|---|
ValueError | If the token type of |
sample_set(context)
async
Sample a weighted set of tokens given a context.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
context | list | The sequence to condition on. | required |
Returns:
Type | Description |
---|---|
(LazyWeights, float) | A weighted set of tokens and the log-probability of the sampled set. |
Raises:
Type | Description |
---|---|
NotImplementedError | If the method is not implemented in subclasses. |
cleanup()
async
Clean up the TrieSetSampler. It is recommended to call this method when the sampler is no longer needed.
EagerSetSampler
Bases: TrieSetSampler
A trie-based set sampler that implements an eager sampling strategy for generating a set of tokens.
An EagerSetSampler samples tokens by incrementally sampling items from the item-wise product of iter_potential and item_potential.
The sampled set is the set of sequences of items that correspond to valid tokens in iter_potential's vocabulary.
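The eager strategy described above can be sketched as a walk down the prefix tree of the token vocabulary. This is a toy under stated assumptions, not the library's implementation: tokens are strings, `iter_w` is a static dict of positive weights, `item_w` is a hypothetical function giving the item-level weight of extending a prefix, and the function name `sample_eager_set` is invented for the example. Each token completed along the sampled path receives its target weight divided by the probability of reaching it.

```python
import math
import random


def sample_eager_set(iter_w, item_w, rng):
    """Walk the prefix tree of iter_w's tokens one item at a time.

    iter_w: dict mapping tokens (strings) to positive weights.
    item_w: function (prefix, item) -> non-negative weight of that extension.
    Returns (log-weights of tokens on the sampled path, log-prob of the path).
    """
    prefix, logp, item_logw = "", 0.0, 0.0
    logws = {}
    while True:
        if prefix in iter_w:
            # Token completed on the path: target log-weight minus the
            # log-probability of having reached this prefix.
            logws[prefix] = math.log(iter_w[prefix]) + item_logw - logp
        # Mass of tokens strictly extending the prefix, grouped by next item.
        mass = {}
        for tok, w in iter_w.items():
            if tok.startswith(prefix) and len(tok) > len(prefix):
                nxt = tok[len(prefix)]
                mass[nxt] = mass.get(nxt, 0.0) + w
        scores = {c: m * item_w(prefix, c) for c, m in mass.items()}
        scores = {c: s for c, s in scores.items() if s > 0}
        if not scores:
            return logws, logp
        total = sum(scores.values())
        items = sorted(scores)
        choice = rng.choices(items, weights=[scores[c] for c in items])[0]
        logp += math.log(scores[choice] / total)
        item_logw += math.log(item_w(prefix, choice))
        prefix += choice


logws, logp = sample_eager_set(
    {"a": 0.4, "ab": 0.4, "b": 0.2}, lambda prefix, item: 1.0, random.Random(0)
)
```

In this example the walk either passes through "a" and "ab" or stops at "b", so the returned set always consists of tokens that lie on a single path of the tree.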
sample_set(context, draw=None)
async
Sample a set of tokens given a context.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
context | list | A sequence of tokens in the | required |
Returns:
Type | Description |
---|---|
(LazyWeights, float) | A weighted set of tokens and the log-probability of the sampled set. |
TopKSetSampler
Bases: TrieSetSampler
A trie-based set sampler that lazily enumerates the top K tokens by weight in the target, and samples an additional "wildcard" token to ensure absolute continuity.
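The role of the wildcard can be illustrated with a small sketch. It assumes a static dict of target weights and, as a simplification chosen for this example only, draws the wildcard from the remaining tokens in proportion to their target weight; the function name `topk_with_wildcard` is hypothetical and the library may sample the wildcard differently. The point is that every token keeps a nonzero chance of appearing, and the inclusion-probability correction keeps the summed weights equal to the total target weight.

```python
import heapq
import math
import random


def topk_with_wildcard(target_w, k, rng):
    """Keep the k heaviest tokens and add one sampled 'wildcard' from the rest.

    target_w: dict mapping tokens to positive target weights.
    k: number of top tokens to keep; None keeps every token.
    Returns a dict of token -> log-weight for the sampled set.
    """
    if k is None:
        k = len(target_w)
    top = heapq.nlargest(k, target_w, key=target_w.get)
    # Top-k tokens are always included (probability 1), so they keep their weight.
    logws = {t: math.log(target_w[t]) for t in top}
    rest = sorted(t for t in target_w if t not in logws)
    if rest:
        rest_total = sum(target_w[t] for t in rest)
        wildcard = rng.choices(rest, weights=[target_w[t] for t in rest])[0]
        # Correct the wildcard's weight by its inclusion probability.
        incl = target_w[wildcard] / rest_total
        logws[wildcard] = math.log(target_w[wildcard]) - math.log(incl)
    return logws


toy_w = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
sampled = topk_with_wildcard(toy_w, 2, random.Random(0))
```

With this particular proposal the wildcard's corrected weight equals the total weight of the non-top-K tokens, whichever wildcard is drawn.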
Warning
This sampler is only guaranteed to be correct if the item_potential's prefix weights decrease monotonically with the length of the context.
That is, it requires \(\textsf{item_potential.prefix}(xy) \leq \textsf{item_potential.prefix}(x)\) for all sequences of items \(x, y\).
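One way to see why monotonicity matters is a best-first search over token prefixes, sketched below. This is not the library's implementation: `lazy_top_k` is a hypothetical helper, tokens are assumed to be strings, and `prefix_w` is an assumed prefix-weight function that also serves as the weight of a complete token. The search can stop after K tokens only because a prefix's weight bounds the weight of everything beneath it; without that property a heavy token could sit behind a light prefix and be missed.

```python
import heapq


def lazy_top_k(tokens, prefix_w, k):
    """Best-first search over token prefixes, returning the k heaviest tokens.

    tokens: set of strings (the vocabulary).
    prefix_w: function prefix -> weight, assumed non-increasing under extension.
    """
    # Heap entries: (negated weight, 0 for a complete token / 1 for a prefix, string).
    heap = [(-prefix_w(""), 1, "")]
    found = []
    while heap and len(found) < k:
        neg_w, kind, s = heapq.heappop(heap)
        if kind == 0:
            # Nothing left on the heap can outweigh this token.
            found.append((s, -neg_w))
            continue
        if s in tokens:
            heapq.heappush(heap, (neg_w, 0, s))
        # Expand only prefixes that lead to at least one token.
        children = {
            t[: len(s) + 1] for t in tokens if t.startswith(s) and len(t) > len(s)
        }
        for child in children:
            heapq.heappush(heap, (-prefix_w(child), 1, child))
    return found


def toy_prefix_w(s):
    """Hypothetical monotone prefix weight: each item multiplies by a factor below 1."""
    return 0.8 ** s.count("a") * 0.5 ** s.count("b")


top2 = lazy_top_k({"a", "ab", "b", "bb"}, toy_prefix_w, 2)
```

In this run the prefixes "ab" and "bb" are never expanded, which is the laziness that the monotonicity requirement makes safe.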
__init__(iter_potential, item_potential, K)
Initialize the TopKSetSampler.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
iter_potential | Potential | The potential defined over a vocabulary of iterables. | required |
item_potential | Potential | The potential defined over a vocabulary of items. | required |
K | int \| None | The number of top tokens to enumerate. If None, all tokens are enumerated. | required |
sample_set(context, draw=None)
async
Sample a set of tokens given a context.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
context | list | A sequence of tokens in the | required |
Returns:
Type | Description |
---|---|
(LazyWeights, float) | A weighted set of tokens and the log-probability of the sampled set. |