lm
LM
Language model base class that defines a probability distribution over strings.
A language model p: V* -> [0,1] defines a probability distribution over strings from a vocabulary V of tokens. Every language model admits a left-to-right factorization:
p(x_1 x_2 ... x_T) = p(x_1|ε) p(x_2|x_1) ... p(x_T|x_1...x_{T-1}) p(EOS|x_1...x_T)
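The factorization can be checked by hand on a toy model. The sketch below is illustrative only: the three-symbol vocabulary, the `EOS` marker string, the made-up probabilities, and the helper `string_prob` are all assumptions and not part of `genlm`.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker for this toy example


def p_next(prefix):
    """Toy conditional distribution p(. | prefix) over {'a', 'b', EOS}.

    The numbers are made up: the empty prefix can never stop, and any
    non-empty prefix stops with probability 0.5.
    """
    if len(prefix) == 0:
        return {"a": 0.5, "b": 0.5, EOS: 0.0}
    return {"a": 0.25, "b": 0.25, EOS: 0.5}


def string_prob(tokens):
    """Multiply the conditionals left to right, then the final EOS factor."""
    prob = 1.0
    for t in range(len(tokens)):
        prob *= p_next(tokens[:t])[tokens[t]]
    return prob * p_next(tokens)[EOS]


# p(a b) = p(a | ε) * p(b | a) * p(EOS | a b) = 0.5 * 0.25 * 0.5
print(string_prob(("a", "b")))  # 0.0625
```

The final `EOS` factor is what makes the probabilities of all finite strings sum to one; without it the product would only score a prefix.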
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `V` | | Vocabulary of symbols | required |
| `eos` | | Distinguished end-of-sequence symbol | required |
Attributes:
| Name | Type | Description |
|---|---|---|
| `V` | | Vocabulary set |
| `eos` | | End-of-sequence symbol |
Notes
Subclasses must implement p_next(context), which returns the next-token distribution p(·|x_1...x_T) for the prefix context = x_1...x_T.
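A minimal sketch of the subclassing pattern follows. To keep it runnable on its own, it uses a stand-in base class, `BaseLM`, which is hypothetical and only mirrors the documented constructor and the `NotImplementedError` contract; with the package installed one would presumably subclass `LM` instead. Returning a plain `dict` and including `eos` in `V` are both assumptions, since this page does not specify either.

```python
class BaseLM:
    """Stand-in for the documented base class (hypothetical, not genlm's LM)."""

    def __init__(self, V, eos):
        self.V = V
        self.eos = eos

    def p_next(self, context):
        raise NotImplementedError("subclasses must implement p_next")


class UniformLM(BaseLM):
    """Toy subclass: every symbol in V is equally likely after any prefix."""

    def p_next(self, context):
        # Returning a dict is an assumption; the documented return type is unspecified.
        return {token: 1.0 / len(self.V) for token in self.V}


lm = UniformLM(V={"a", "b", "<eos>"}, eos="<eos>")
dist = lm.p_next(("a",))
print(sorted(dist))                            # ['<eos>', 'a', 'b']
print(abs(sum(dist.values()) - 1.0) < 1e-12)   # True
```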
Source code in genlm/grammar/lm.py
__call__(context)
Compute the probability of a complete string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context` | | Sequence of tokens ending with eos token | required |

Returns:

| Name | Type | Description |
|---|---|---|
| | `float` | Probability of the sequence |

Raises:

| Type | Description |
|---|---|
| `AssertionError` | If context doesn't end with eos or contains invalid tokens |
__init__(V, eos)
Initialize language model with vocabulary and end-of-sequence token.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `V` | | Vocabulary set of tokens | required |
| `eos` | | End-of-sequence token | required |
clear_cache()
logp(context)
Compute the log probability of a complete string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context` | | Sequence of tokens ending with eos token | required |

Returns:

| Type | Description |
|---|---|
| `float` | Log probability of the sequence |

Raises:

| Type | Description |
|---|---|
| `AssertionError` | If context doesn't end with eos |
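One reason to work with log probabilities is numerical: a product of many conditionals underflows, while the sum of their logs stays finite. The sketch below illustrates this with a toy conditional; `p_next`, the `EOS` string, and the numbers are assumptions, and the function is a plain-Python illustration rather than the library's implementation.

```python
import math

EOS = "<eos>"  # hypothetical end-of-sequence token


def p_next(prefix):
    """Toy conditional with made-up numbers, independent of the prefix."""
    return {"a": 0.9, EOS: 0.1}


def logp(context):
    """Sum the log conditionals over an eos-terminated sequence."""
    assert context[-1] == EOS, "context must end with eos"
    return sum(
        math.log(p_next(context[:t])[context[t]]) for t in range(len(context))
    )


short = ("a", "a", EOS)
print(math.isclose(math.exp(logp(short)), 0.9 * 0.9 * 0.1))  # True

# For a long sequence the plain product underflows to 0.0,
# while the log probability remains a finite negative number.
long_seq = ("a",) * 8000 + (EOS,)
print(0.9 ** 8000 * 0.1)   # 0.0
print(logp(long_seq))
```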
logp_next(context)
Compute the log conditional distribution over the next token given the prefix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context` | | Sequence of tokens representing the prefix | required |

Returns:

| Type | Description |
|---|---|
| | Log probabilities for each possible next token |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | Must be implemented by subclasses |
p_next(context)
Compute the conditional distribution over the next token given the prefix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context` | | Sequence of tokens representing the prefix | required |

Returns:

| Type | Description |
|---|---|
| | Probabilities for each possible next token |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | Must be implemented by subclasses |
p_next_async(context)
async
Asynchronously compute the conditional distribution over the next token.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context` | | Sequence of tokens representing the prefix | required |

Returns:

| Type | Description |
|---|---|
| | Probabilities for each possible next token |
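An asynchronous variant is mainly useful when several prefixes can be queried concurrently. The sketch below shows that calling pattern with `asyncio.gather`; the coroutine here is a hypothetical stand-in with the same name and a placeholder delay, not the library's method, and its dict return value is an assumption.

```python
import asyncio


async def p_next_async(context):
    """Stand-in coroutine (hypothetical): pretends to await a slow model call."""
    await asyncio.sleep(0.01)  # placeholder for real asynchronous work
    return {"a": 0.5, "<eos>": 0.5}


async def main():
    prefixes = [(), ("a",), ("a", "a")]
    # Await all three conditionals concurrently instead of one after another.
    return await asyncio.gather(*(p_next_async(p) for p in prefixes))


dists = asyncio.run(main())
print(len(dists))  # 3
```

`asyncio.gather` returns results in the order the awaitables were passed, so `dists[i]` corresponds to `prefixes[i]`.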
p_next_seq(context, extension)
Compute probability of an extension sequence given a context.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context` | | Sequence of tokens representing the prefix | required |
| `extension` | | Sequence of tokens to compute probability for | required |

Returns:

| Type | Description |
|---|---|
| `float` | Probability of the extension sequence given the context |

Raises:

| Type | Description |
|---|---|
| `AssertionError` | If extension is empty |
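By the chain rule, the probability of an extension given a context is a product of one-step conditionals, each conditioned on the context plus the extension tokens seen so far. The following is a sketch under that reading, with a made-up toy `p_next`; it is not taken from the library's source, which may compute the quantity differently.

```python
def p_next(prefix):
    """Toy conditional (made-up numbers): 'b' is likelier right after 'a'."""
    if prefix and prefix[-1] == "a":
        return {"a": 0.25, "b": 0.5, "<eos>": 0.25}
    return {"a": 0.5, "b": 0.25, "<eos>": 0.25}


def p_next_seq(context, extension):
    """p(extension | context) as a product of one-step conditionals."""
    assert len(extension) > 0, "extension must be non-empty"
    prob = 1.0
    for token in extension:
        prob *= p_next(context)[token]
        context = context + (token,)  # grow the prefix by the token just scored
    return prob


# p(b a | a) = p(b | a) * p(a | a b) = 0.5 * 0.5
print(p_next_seq(("a",), ("b", "a")))  # 0.25
```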
sample(ys=(), draw=sample_dict, prob=True, verbose=0, max_tokens=np.inf, join=lambda ys, y: ys + (y,))
Sample a sequence from the language model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ys` | | Initial sequence of tokens (default: empty tuple) | `()` |
| `draw` | | Function to sample from probability distribution (default: sample_dict) | `sample_dict` |
| `prob` | | Whether to return probability along with sequence (default: True) | `True` |
| `verbose` | | Verbosity level for printing tokens (default: 0) | `0` |
| `max_tokens` | | Maximum number of tokens to generate (default: infinity) | `inf` |
| `join` | | Function to join new token with existing sequence (default: tuple concatenation) | `lambda ys, y: ys + (y,)` |

Returns:

| Type | Description |
|---|---|
| | If prob=True: Tuple of (generated sequence, probability) |
| | If prob=False: Generated sequence |
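Sampling from a left-to-right model can be sketched as a loop that repeatedly draws one token from the next-token distribution and appends it to the prefix. The code below is a self-contained illustration of that idea, not the library's `sample`: the toy `p_next`, the `rng` parameter, the use of `random.choices` in place of `draw`, keeping the eos token in the returned sequence, and counting `max_tokens` over newly generated tokens are all assumptions.

```python
import random

EOS = "<eos>"  # hypothetical end-of-sequence token


def p_next(prefix):
    """Toy conditional with made-up numbers, independent of the prefix."""
    return {"a": 0.5, "b": 0.25, EOS: 0.25}


def sample(ys=(), prob=True, max_tokens=float("inf"), rng=random):
    """Draw tokens one at a time until eos or max_tokens new tokens."""
    P = 1.0      # probability of the tokens drawn so far
    drawn = 0    # number of newly generated tokens
    while True:
        dist = p_next(ys)
        tokens = list(dist)
        y = rng.choices(tokens, weights=[dist[tok] for tok in tokens])[0]
        P *= dist[y]
        ys = ys + (y,)  # same effect as the documented default `join`
        drawn += 1
        if y == EOS or drawn >= max_tokens:
            break
    return (ys, P) if prob else ys


seq, P = sample(rng=random.Random(0), max_tokens=20)
print(seq, P)
```

Passing a seeded `random.Random` instance keeps runs reproducible without touching the global random state.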