GenLM Grammar Documentation

GenLM Grammar is a Python library for working with weighted context-free grammars (WCFGs) and finite-state automata (FSAs). It implements several parsing algorithms and provides grammar-based language models built on them.

Core Components

Grammar Types

  • CFG: Context-free grammar implementation with support for:
    • Grammar normalization and transformation
    • Conversion to a character-level grammar

Language Models

  • LM: Base language model class
  • BoolCFGLM: Boolean-weighted CFG language model using Earley or CKY parsing
  • CKYLM: Language model for weighted CFGs based on CKY parsing
  • EarleyLM: Language model for weighted CFGs based on Earley parsing

Parsing Algorithms

  • Earley Parser: Earley parsing algorithm with rescaling for numerical stability
  • IncrementalCKY: Incremental version of CKY with chart caching
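
To make the chart-based idea behind CKY concrete, here is a minimal sketch of weighted CKY for a grammar in Chomsky normal form. It is an independent illustration, not the library's IncrementalCKY or its API: the function name cky_weight and the dictionary-based grammar encoding are assumptions made for this example, and it uses ordinary real-valued weights with no caching or rescaling.

```python
from collections import defaultdict


def cky_weight(tokens, lexical, binary, start="S"):
    """Total weight of all parses of `tokens` under a CNF grammar.

    lexical: {(head, terminal): weight}     for rules head -> terminal
    binary:  {(head, left, right): weight}  for rules head -> left right
    """
    n = len(tokens)
    # chart[i, k][X] = total weight of X deriving tokens[i:k]
    chart = defaultdict(lambda: defaultdict(float))

    # Width-1 spans come from lexical rules.
    for i, tok in enumerate(tokens):
        for (head, terminal), w in lexical.items():
            if terminal == tok:
                chart[i, i + 1][head] += w

    # Wider spans combine two adjacent narrower spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                left, right = chart[i, j], chart[j, k]
                for (head, b, c), w in binary.items():
                    if b in left and c in right:
                        chart[i, k][head] += w * left[b] * right[c]

    return chart[0, n][start]


# Toy grammar (hypothetical): S -> S S with weight 0.4, S -> a with weight 0.6
lexical = {("S", "a"): 0.6}
binary = {("S", "S", "S"): 0.4}

print(cky_weight(["a", "a"], lexical, binary))
print(cky_weight(["a", "a", "a"], lexical, binary))
```

For this toy grammar the string a a has a single parse with weight 0.4 × 0.6 × 0.6, while a a a has two parses whose weights are summed in the chart.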

Finite State Machines

  • FST: Weighted finite-state transducer implementation
  • WFSA: Weighted finite-state automaton base class
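
As a rough picture of what a weighted finite-state automaton computes, the sketch below sums the weights of all accepting paths that spell a given string (the forward algorithm). It does not use the library's WFSA class: the function string_weight and the dictionary encoding of start weights, arcs, and stop weights are assumptions for illustration, and epsilon arcs are not handled.

```python
def string_weight(string, start, arcs, stop):
    """Sum the weights of all accepting paths that spell `string`.

    start: {state: initial weight}
    arcs:  {(state, symbol): [(next_state, weight), ...]}
    stop:  {state: final weight}
    """
    # forward[q] = total weight of paths that reach q after the symbols read so far
    forward = dict(start)
    for symbol in string:
        nxt = {}
        for q, w in forward.items():
            for r, arc_w in arcs.get((q, symbol), []):
                nxt[r] = nxt.get(r, 0.0) + w * arc_w
        forward = nxt
    return sum(w * stop.get(q, 0.0) for q, w in forward.items())


# Toy automaton (hypothetical): state 0 loops on 'a' and moves to state 1 on 'b'.
start = {0: 1.0}
arcs = {
    (0, "a"): [(0, 0.5)],
    (0, "b"): [(1, 0.5)],
}
stop = {1: 1.0}

print(string_weight("aab", start, arcs, stop))  # 0.5 * 0.5 * 0.5
print(string_weight("ba", start, arcs, stop))   # no accepting path
```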

Mathematical Components

  • Semiring: Abstract semiring interface, with implementations including:
    • Boolean
    • Float
    • Log
    • Expectation
  • Chart: Weighted chart data structure with semiring operations
  • WeightedGraph: Graph implementation for solving algebraic path problems
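
The role of the semiring abstraction can be sketched in a few lines: the same "sum over paths of the product of weights" computation yields a yes/no answer, a probability, or a log-probability depending on which operations are plugged in. The names below (SemiringOps, BOOLEAN, REAL, LOG, total_weight) are hypothetical and are not the library's Semiring classes; REAL plays the role of the Float semiring listed above, and the Expectation semiring is omitted.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class SemiringOps:
    zero: Any        # identity for plus
    one: Any         # identity for times
    plus: Callable   # combines alternative derivations
    times: Callable  # combines steps within one derivation


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


BOOLEAN = SemiringOps(False, True, lambda a, b: a or b, lambda a, b: a and b)
REAL = SemiringOps(0.0, 1.0, lambda a, b: a + b, lambda a, b: a * b)
LOG = SemiringOps(-math.inf, 0.0, logaddexp, lambda a, b: a + b)


def total_weight(paths, sr):
    """Semiring sum over paths of the semiring product of each path's weights."""
    total = sr.zero
    for path in paths:
        w = sr.one
        for x in path:
            w = sr.times(w, x)
        total = sr.plus(total, w)
    return total


paths = [[0.5, 0.5], [0.25]]
print(total_weight(paths, REAL))
print(total_weight([[math.log(x) for x in p] for p in paths], LOG))
print(total_weight([[True, False], [True]], BOOLEAN))
```

Working in the log semiring gives the logarithm of the real-semiring result while avoiding underflow on long products, which is the usual reason to offer both.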

Utilities

  • LarkStuff: Interface for converting Lark grammars to genlm-cfg format
  • format_table: Utility functions for formatting and displaying tables

Key Features

  • Support for weighted context-free grammars and weighted finite-state automata and transducers
  • Multiple parsing algorithm implementations
  • Efficient chart caching and incremental parsing
  • Composition operations between FSTs and CFGs
  • Semiring abstractions for different weight types
  • Visualization capabilities for debugging and analysis

Common Operations

Creating a Grammar

from genlm.grammar.cfg import CFG
from genlm.grammar.semiring import Float

# Create a grammar from its string representation
# (grammar_string is a placeholder for your grammar definition)
cfg = CFG.from_string(grammar_string, semiring=Float)

Using a Language Model

from genlm.grammar.cfglm import BoolCFGLM

# Create a language model from the grammar `cfg`
lm = BoolCFGLM(cfg, alg='earley')  # or alg='cky'

# Get next-token weights for a context
# (context is a placeholder for the sequence of tokens seen so far)
probs = lm.p_next(context)
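
Conceptually, a Boolean-weighted grammar language model answers the question "which tokens can follow this context so that the string can still be completed into a member of the language?" The toy sketch below illustrates that idea for a finite language given as an explicit set of strings. It does not call the library and makes no claim about the return type of p_next; the function next_token_mask and the "<EOS>" marker are assumptions for illustration only.

```python
def next_token_mask(language, context, eos="<EOS>"):
    """Return the set of symbols that may follow `context`.

    A symbol is allowed if some string in `language` has `context` plus that
    symbol as a prefix; `eos` is allowed if `context` itself is in the language.
    """
    allowed = set()
    n = len(context)
    for s in language:
        if s[:n] == context:
            if len(s) == n:
                allowed.add(eos)   # context is already a complete string
            else:
                allowed.add(s[n])  # next symbol of a viable continuation
    return allowed


# Toy finite language (hypothetical) over single-character tokens
language = {"ab", "abc", "b"}

print(next_token_mask(language, "a"))
print(next_token_mask(language, "ab"))
print(next_token_mask(language, "c"))
```

A grammar-based model performs the same kind of prefix check, but uses a parser such as Earley or CKY instead of enumerating the language, which is what makes it applicable to infinite languages.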