Molecular Synthesis

This example shows how to evaluate a genlm.control model on the molecular synthesis domain.

Task: Produce drug-like compounds using the SMILES notation (Weininger, 1988).
Data: Few-shot prompts created by repeatedly selecting 20 random samples from the GDB-17 database (Ruddigkeit et al., 2012).
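The sampling step described above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the implementation behind MolecularSynthesisDataset.from_smiles: the helper name sample_few_shot_sets, its seed argument, and the stand-in molecule pool are all assumptions made for the example.

```python
import random


def sample_few_shot_sets(smiles, n_molecules, n_instances, seed=0):
    """Draw n_instances few-shot sets, each with n_molecules distinct SMILES strings.

    Hypothetical helper for illustration; not part of genlm.eval.
    """
    rng = random.Random(seed)
    # Drop blank lines and surrounding whitespace.
    pool = [s.strip() for s in smiles if s.strip()]
    if n_molecules > len(pool):
        raise ValueError("n_molecules exceeds the number of available molecules")
    # Each set is sampled without replacement; different sets may share molecules.
    return [rng.sample(pool, n_molecules) for _ in range(n_instances)]


# Stand-in pool of simple SMILES strings (not taken from GDB-17).
pool = ["CCO", "CCN", "CCC", "C=O", "C#N", "CCCl", "c1ccccc1", "CC(=O)O"]
few_shot_sets = sample_few_shot_sets(pool, n_molecules=3, n_instances=2)
for i, molecules in enumerate(few_shot_sets):
    print(i, molecules)
```

Fixing the seed keeps the few-shot sets reproducible across runs, which is useful when comparing models on the same prompts.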

Setup

First, install the dependencies for this domain. In the root directory, run:

pip install -e ".[molecules]"

Second, download the GDB17_sample.txt file, which contains 30 molecules.

This file is a sample of the GDB-17 dataset, which can be downloaded from https://gdb.unibe.ch/downloads/. For a full evaluation, download the GDB-17-Set (50 million) file.

Usage

Initialize the dataset and evaluator

In [1]:

from genlm.eval.domains.molecular_synthesis import (
    MolecularSynthesisDataset,
    MolecularSynthesisEvaluator,
)

In [2]:

# Sample 5 instances each with 5 molecules to use as few-shot examples
dataset = MolecularSynthesisDataset.from_smiles(
    "../../../assets/molecular_synthesis/GDB17_sample.txt", n_molecules=5, n_instances=5
)

evaluator = MolecularSynthesisEvaluator()

Define a model adaptor

A model adaptor is an async callable that takes a dataset instance (here, a molecular synthesis instance holding the few-shot molecules), along with an output directory and a replicate index, and returns a ModelOutput. For this example, we'll use a genlm.control.PromptedLLM whose output is constrained to valid SMILES by the PartialSMILES potential.

In [3]:

from genlm.control import PromptedLLM, AWRS
from genlm.eval import ModelOutput, ModelResponse
from genlm.eval.domains.molecular_synthesis import (
    default_prompt_formatter,
    PartialSMILES,
)

# Load an LLM
LLM = PromptedLLM.from_name("gpt2", eos_tokens=[b"\n", b"\n\n"])


async def model(instance, output_dir, replicate):
    # Set the prompt for the LLM.
    LLM.prompt_ids = default_prompt_formatter(
        LLM.model.tokenizer, instance, use_chat_format=False
    )

    # Define a potential that ensures the generated molecules are valid SMILES
    potential = PartialSMILES().coerce(LLM, f=b"".join)

    # Define an adaptive weighted rejection sampler to sample tokens from the constrained model.
    sampler = AWRS(LLM, potential)

    # Run SMC to sample sequences from the constrained model.
    sequences = await sampler.smc(
        n_particles=5,
        ess_threshold=0.5,
        max_tokens=100,
    )

    # Return the sampled sequences and their probabilities as a ModelOutput.
    return ModelOutput(
        responses=[
            ModelResponse(response=sequence, weight=prob)
            for sequence, prob in sequences.decoded_posterior.items()
        ],
    )

/opt/homebrew/Caskroom/miniconda/base/envs/genlm/lib/python3.11/site-packages/genlm/backend/tokenization/vocab.py:98: UserWarning: Duplicate tokens found in string vocabulary. This may lead to downstream issues with the string vocabulary; we recommend using the byte vocabulary.
  warnings.warn(
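The ess_threshold=0.5 argument passed to smc above controls when particles are resampled. The sketch below shows the standard effective sample size (ESS) rule from the SMC literature, under the assumption that resampling is triggered when the ESS falls below ess_threshold times the number of particles. It is not taken from the genlm.control source, and the function names are hypothetical.

```python
import math


def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum w^2, computed from log weights for numerical stability."""
    m = max(log_weights)
    if m == -math.inf:
        # All particles have zero weight.
        return 0.0
    # Subtracting the max leaves the ESS unchanged because it is scale-invariant.
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(x * x for x in w)


def should_resample(log_weights, ess_threshold=0.5):
    """Assumed rule: resample when ESS < ess_threshold * number of particles."""
    return effective_sample_size(log_weights) < ess_threshold * len(log_weights)


uniform = [0.0] * 5  # five equally weighted particles
skewed = [0.0, -10.0, -10.0, -10.0, -10.0]  # one particle dominates
print(effective_sample_size(uniform), should_resample(uniform))
print(effective_sample_size(skewed), should_resample(skewed))
```

With five equally weighted particles the ESS equals 5 and no resampling is needed. When one particle carries almost all of the weight, the ESS approaches 1 and falls below the threshold of 2.5.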

Run the evaluation

In [4]:

from genlm.eval import run_evaluation

results = await run_evaluation(
    dataset=dataset,
    model=model,
    evaluator=evaluator,
    max_instances=2,
    n_replicates=1,
    verbosity=1,
    # output_dir="molecular_synthesis_results",  # optionally save the results to a directory
)

Instance instance_id=0 molecules=['BrC1=C2C3CC33C(NCS3(=O)=O)C2=CC=C1\n', 'BrC1=C2C3C4COC(=NCC2=NSC1=O)C34\n', 'BrC1=C2C3=C4C(CC3CCC2=O)C(=N)NC4=N1\n', 'BrC1=C2C3C4C3N(CC4C#C)C2=NC(=O)S1\n', 'BrC1=C2C3C4CC(C3CC2=NC(=N)O1)C(=O)O4\n']
Mean weighted accuracy (instance): 0.6121801531912207
Mean weighted accuracy (total): 0.6121801531912207

Instance instance_id=1 molecules=['BrC1=C2C3CC3C=CCC#CC1=CSC2=N', 'BrC1=C2C3CC3C3=C(C=NS3)N2C(=N)C=N1\n', 'BrC1=C2C3C4NC4C(C3C#C)C2=NSC1=O\n', 'BrC1=C2C3C4CC4C(C3C=O)C2=NNS1(=O)=O\n', 'BrC1=C2C3C4NC=NC4C3OC2=CSC1=N\n']
Mean weighted accuracy (instance): 0.0
Mean weighted accuracy (total): 0.3060900765956103
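The log above prints two numbers per instance. The "total" value appears to be the running mean of the per-instance values seen so far: 0.612 after the first instance, and (0.612 + 0.0) / 2 = 0.306 after the second. The sketch below reproduces that arithmetic. It also shows one plausible definition of a weighted accuracy, namely each response's score weighted by its normalized weight. That definition is an assumption about the evaluator, not something confirmed by this page, and both function names are hypothetical.

```python
import math


def weighted_accuracy(responses):
    """Weighted mean of scores over (weight, score) pairs.

    Assumed definition for illustration; weights are normalized to sum to 1.
    """
    total = sum(w for w, _ in responses)
    if total == 0:
        return 0.0
    return sum(w * s for w, s in responses) / total


def running_means(values):
    """Mean of the first k values, for k = 1..len(values)."""
    means, acc = [], 0.0
    for k, v in enumerate(values, start=1):
        acc += v
        means.append(acc / k)
    return means


# Per-instance values copied from the log above.
instance_accuracies = [0.6121801531912207, 0.0]
print(running_means(instance_accuracies))

# Hypothetical responses as (weight, score) pairs, for illustration only.
print(weighted_accuracy([(0.5, 1.0), (0.25, 0.0), (0.25, 1.0)]))
```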

In [5]:

results.keys()

Out[5]:

dict_keys(['average_weighted_accuracy', 'n_instances', 'all_instance_results', 'all_instance_outputs'])

References

Lars Ruddigkeit, Ruud van Deursen, Lorenz C. Blum, and Jean-Louis Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 52(11):2864–2875, 2012. URL https://pubs.acs.org/doi/pdf/10.1021/ci300415d.

David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.