Molecular Synthesis
This example shows how to evaluate a genlm.control model on the molecular synthesis domain.
- Task: Produce drug-like compounds using the SMILES notation (Weininger, 1988).
- Data: Few-shot prompts created by repeatedly selecting 20 random samples from the GDB-17 database (Ruddigkeit et al., 2012).
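The sampling scheme described in the Data item can be sketched with the standard library alone. This is a rough illustration, not the library's implementation: `make_few_shot_instances`, its `seed` parameter, and the toy molecule pool are hypothetical, and whether the real dataset allows the same molecule to appear in several instances is an assumption.

```python
import random


def make_few_shot_instances(molecules, n_molecules=20, n_instances=5, seed=0):
    """Hypothetical helper: draw n_instances few-shot sets of n_molecules each."""
    rng = random.Random(seed)
    # Within one instance, molecules are sampled without replacement.
    # Instances are drawn independently, so they may share molecules (assumption).
    return [rng.sample(molecules, n_molecules) for _ in range(n_instances)]


# Toy SMILES strings standing in for GDB-17 entries.
pool = ["CCO", "CCN", "C1CCCCC1", "c1ccccc1", "CC(=O)O", "CCC", "CO", "CN"]
instances = make_few_shot_instances(pool, n_molecules=3, n_instances=2)
```

Each element of `instances` is one set of few-shot examples that would be formatted into a prompt.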
Setup
First, install the dependencies for this domain. In the root directory, run:
pip install -e .[molecules]
Second, download the GDB17_sample.txt file, which contains 30 molecules.
This file is a sample from the GDB-17 dataset, which can be downloaded from https://gdb.unibe.ch/downloads/. For a full evaluation, download the GDB-17-Set (50 million) file instead.
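To sanity-check a downloaded file before running the evaluation, something like the following may help. It is a sketch under the assumption that the file stores one SMILES string per line; `load_smiles` is a hypothetical helper, not part of genlm.eval, and stripping whitespace is a choice of this sketch rather than documented library behavior.

```python
import tempfile
from pathlib import Path


def load_smiles(path):
    """Hypothetical loader: one SMILES string per line, blank lines skipped."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


# Demo on a throwaway file so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("CCO\nc1ccccc1\n\nCC(=O)O\n")
    molecules = load_smiles(sample)
```

Pointing `load_smiles` at GDB17_sample.txt and checking `len(...)` against the expected 30 molecules is a quick way to confirm the download.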
Usage
Initialize the dataset and evaluator
from genlm.eval.domains.molecular_synthesis import (
    MolecularSynthesisDataset,
    MolecularSynthesisEvaluator,
)
# Sample 5 instances, each with 5 molecules, to use as few-shot examples
dataset = MolecularSynthesisDataset.from_smiles(
    "../../../assets/molecular_synthesis/GDB17_sample.txt", n_molecules=5, n_instances=5
)
evaluator = MolecularSynthesisEvaluator()
Define a model adaptor
A model adaptor is an async callable that takes a dataset instance (here, one produced by MolecularSynthesisDataset), an output directory, and a replicate index, and returns a ModelOutput. For this example, we'll generate responses with a genlm.control.PromptedLLM constrained to produce valid SMILES via the PartialSMILES potential.
from genlm.control import PromptedLLM, AWRS
from genlm.eval import ModelOutput, ModelResponse
from genlm.eval.domains.molecular_synthesis import (
    default_prompt_formatter,
    PartialSMILES,
)
# Load an LLM
LLM = PromptedLLM.from_name("gpt2", eos_tokens=[b"\n", b"\n\n"])
async def model(instance, output_dir, replicate):
    # Set the prompt for the LLM.
    LLM.prompt_ids = default_prompt_formatter(
        LLM.model.tokenizer, instance, use_chat_format=False
    )
    # Define a potential that ensures the generated molecules are valid SMILES.
    potential = PartialSMILES().coerce(LLM, f=b"".join)
    # Define an adaptive weighted rejection sampler to sample tokens from the constrained model.
    sampler = AWRS(LLM, potential)
    # Run SMC to sample sequences from the constrained model.
    sequences = await sampler.smc(
        n_particles=5,
        ess_threshold=0.5,
        max_tokens=100,
    )
    # Return the sampled sequences and their probabilities as a ModelOutput.
    return ModelOutput(
        responses=[
            ModelResponse(response=sequence, weight=prob)
            for sequence, prob in sequences.decoded_posterior.items()
        ],
    )
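The `ess_threshold=0.5` argument above controls when SMC resamples its particles. As a generic illustration of the standard effective sample size (ESS) criterion, and not a copy of genlm.control's internals, the check can be sketched as follows. The helper names are hypothetical, and reading the threshold as a fraction of the particle count is an assumption.

```python
def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2 for non-negative particle weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def should_resample(weights, ess_threshold=0.5):
    # Resample when the ESS drops below the given fraction of the particle count
    # (assumed interpretation of the threshold).
    return effective_sample_size(weights) < ess_threshold * len(weights)


balanced = [0.2] * 5                       # all particles equally weighted
skewed = [0.96, 0.01, 0.01, 0.01, 0.01]    # one particle dominates

balanced_needs_resampling = should_resample(balanced)
skewed_needs_resampling = should_resample(skewed)
```

With five equally weighted particles the ESS is about 5, so no resampling is triggered. When one particle carries almost all the weight, the ESS falls close to 1 and resampling is triggered.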
Run the evaluation
from genlm.eval import run_evaluation
results = await run_evaluation(
    dataset=dataset,
    model=model,
    evaluator=evaluator,
    max_instances=2,
    n_replicates=1,
    verbosity=1,
    # output_dir="molecular_synthesis_results",  # optionally save the results to a directory
)
Instance instance_id=0 molecules=['BrC1=C2C3CC33C(NCS3(=O)=O)C2=CC=C1\n', 'BrC1=C2C3C4COC(=NCC2=NSC1=O)C34\n', 'BrC1=C2C3=C4C(CC3CCC2=O)C(=N)NC4=N1\n', 'BrC1=C2C3C4C3N(CC4C#C)C2=NC(=O)S1\n', 'BrC1=C2C3C4CC(C3CC2=NC(=N)O1)C(=O)O4\n']
Mean weighted accuracy (instance): 0.6121801531912207
Mean weighted accuracy (total): 0.6121801531912207
Instance instance_id=1 molecules=['BrC1=C2C3CC3C=CCC#CC1=CSC2=N', 'BrC1=C2C3CC3C3=C(C=NS3)N2C(=N)C=N1\n', 'BrC1=C2C3C4NC4C(C3C#C)C2=NSC1=O\n', 'BrC1=C2C3C4CC4C(C3C=O)C2=NNS1(=O)=O\n', 'BrC1=C2C3C4NC=NC4C3OC2=CSC1=N\n']
Mean weighted accuracy (instance): 0.0
Mean weighted accuracy (total): 0.3060900765956103
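The printed numbers are consistent with the total being the mean of the per-instance scores seen so far. The sketch below reproduces that arithmetic. It is an assumption-laden illustration: `weighted_accuracy` and `running_mean` are hypothetical helpers, the `is_correct` flag stands in for whatever criterion the evaluator applies to each response, and the toy weights are made up.

```python
def weighted_accuracy(responses):
    """responses: list of (weight, is_correct) pairs; weights assumed to sum to 1."""
    return sum(weight for weight, is_correct in responses if is_correct)


def running_mean(scores):
    return sum(scores) / len(scores)


# Toy posterior: two accepted responses and one rejected response.
toy_score = weighted_accuracy([(0.5, True), (0.25, True), (0.25, False)])

# Per-instance scores copied from the output above.
instance_scores = [0.6121801531912207, 0.0]
total = running_mean(instance_scores)
```

Here `total` comes out at roughly 0.306, matching the final "Mean weighted accuracy (total)" line.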
results.keys()
dict_keys(['average_weighted_accuracy', 'n_instances', 'all_instance_results', 'all_instance_outputs'])
References
Lars Ruddigkeit, Ruud Van Deursen, Lorenz C. Blum, and Jean-Louis Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 52(11):2864–2875, 2012. URL https://pubs.acs.org/doi/pdf/10.1021/ci300415d.
David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.