How can we ensure that the output from a language model is actually good? Good generative programmers don’t leave this up to chance — instead, they use pre-conditions to ensure that inputs to the LLM are as expected and then check post-conditions to ensure that the LLM’s outputs are fit-for-purpose. Suppose that in this case we want to ensure that the email has a salutation and contains only lower-case letters. We can capture these post-conditions by specifying requirements on the m.instruct call:
# file: https://github.com/generative-computing/mellea/blob/main/docs/examples/tutorial/simple_email.py#L33-L53
import mellea

def write_email_with_requirements(m: mellea.MelleaSession, name: str, notes: str) -> str:
    email = m.instruct(
        "Write an email to {{name}} using the notes following: {{notes}}.",
        requirements=[
            "The email should have a salutation",
            "Use only lower-case letters",
        ],
        user_variables={"name": name, "notes": notes},
    )
    return str(email)

m = mellea.start_session()
print(write_email_with_requirements(
    m,
    name="Olivia",
    notes="Olivia helped the lab over the last few weeks by organizing intern events, advertising the speaker series, and handling issues with snack delivery.",
))
We just added two requirements to the instruction; these are appended to the model request. But we do not yet check whether the requirements are actually satisfied. Let's add a strategy for validating them:
# file: https://github.com/generative-computing/mellea/blob/main/docs/examples/tutorial/simple_email.py#L57-L84
import mellea
from mellea.stdlib.sampling import RejectionSamplingStrategy

def write_email_with_strategy(m: mellea.MelleaSession, name: str, notes: str) -> str:
    email_candidate = m.instruct(
        "Write an email to {{name}} using the notes following: {{notes}}.",
        requirements=[
            "The email should have a salutation",
            "Use only lower-case letters",
        ],
        strategy=RejectionSamplingStrategy(loop_budget=5),
        user_variables={"name": name, "notes": notes},
        return_sampling_results=True,
    )
    if email_candidate.success:
        return str(email_candidate.result)
    else:
        print("Expect sub-par result.")
        return email_candidate.sample_generations[0].value

m = mellea.start_session()
print(
    write_email_with_strategy(
        m,
        "Olivia",
        "Olivia helped the lab over the last few weeks by organizing intern events, advertising the speaker series, and handling issues with snack delivery.",
    )
)
A couple of things happened here. First, we added a sampling strategy to the instruction. This strategy (RejectionSamplingStrategy()) checks whether all requirements are met. If any requirement fails, the sampling strategy samples a new email from the LLM. This process repeats until all requirements are met or the loop_budget of retries is exhausted. Even with retries, sampling might not produce a result that fulfills all requirements (email_candidate.success == False). Mellea forces you to think about what it means for an LLM call to fail; in this case, we handle the failure by simply returning the first sample as the final result.
When using the return_sampling_results=True parameter, the instruct() function returns a SamplingResult object (not a ModelOutputThunk) which carries the full history of sampling and validation results for each sample.
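Because the SamplingResult keeps the full history, you can also inspect what the strategy tried before settling on a result. The sketch below relies only on the attributes already used above (success, result, sample_generations); the sample_validations field is an assumption about where the per-sample validation outcomes live, so it is accessed defensively.
# A small helper for inspecting a SamplingResult (returned when
# return_sampling_results=True). `sample_validations` is an ASSUMED attribute
# name for the per-sample validation history; verify it against the class.
def describe_sampling(email_candidate) -> None:
    print(f"overall success: {email_candidate.success}")
    for i, generation in enumerate(email_candidate.sample_generations):
        print(f"--- sample {i} ---")
        print(str(generation.value)[:120])  # preview of the sampled email
    validations = getattr(email_candidate, "sample_validations", None)
    if validations is not None:
        print(f"validation history: {validations}")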

Validating Requirements

Now that we have defined requirements and a sampling strategy, we should look at how requirements are validated. The default validation strategy is LLM-as-a-judge. Let's see how we can customize requirement definitions:
# file: https://github.com/generative-computing/mellea/blob/main/docs/examples/tutorial/instruct_validate_repair.py#L1-L10
from mellea.stdlib.requirement import req, check, simple_validate

requirements = [
    req("The email should have a salutation"),  # == r1
    req("Use only lower-case letters", validation_fn=simple_validate(lambda x: x.lower() == x)),  # == r2
    check("Do not mention purple elephants.")  # == r3
]
Here, the first requirement (r1) will be validated by LLM-as-a-judge on the output (last turn) of the instruction. This is the default behavior when nothing else is specified. The second requirement (r2) uses a function that takes the output of a sampling step and returns a boolean indicating whether validation succeeded. While the validation_fn parameter expects a function that runs validation over the full session context (see Context Management), Mellea provides a wrapper, simple_validate(fn: Callable[[str], bool]), for simpler validators that take only the output string and return a boolean, as seen here. The third requirement is a check(). Checks are only used for validation, not for generation. Checks aim to avoid the “do not think about B” effect that often primes models (and humans) to do the opposite and “think” about B.
LLM-as-a-judge is not presumptively robust. Whenever possible, implement requirement validation using plain old Python code. When a model is necessary, it is often a good idea to train a calibrated model specifically for your validation problem. Adapters explains how to use Mellea’s m tune subcommand to train your own LoRAs for requirement checking (and for other types of Mellea components as well).
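For example, the salutation requirement from above can be validated with plain Python rather than LLM-as-a-judge. This sketch reuses the req and simple_validate helpers shown earlier; the regex is just one illustrative heuristic for spotting a salutation.
import re

from mellea.stdlib.requirement import req, simple_validate

# Plain-Python validator: no LLM call is needed to check for a salutation.
# The regex is a heuristic that looks for greeting words at the start of a line.
def has_salutation(text: str) -> bool:
    return re.search(r"^\s*(hi|hello|dear)\b", text, re.IGNORECASE | re.MULTILINE) is not None

salutation_req = req(
    "The email should have a salutation",
    validation_fn=simple_validate(has_salutation),
)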

Instruct - Validate - Repair

Now, we bring it all together into a first generative program using the instruct-validate-repair pattern:
# file: https://github.com/generative-computing/mellea/blob/main/docs/examples/tutorial/instruct_validate_repair.py#L13-L37
import mellea
from mellea.stdlib.requirement import req, check, simple_validate
from mellea.stdlib.sampling import RejectionSamplingStrategy

def write_email(m: mellea.MelleaSession, name: str, notes: str) -> str:
    email_candidate = m.instruct(
        "Write an email to {{name}} using the notes following: {{notes}}.",
        requirements=[
            req("The email should have a salutation"),  # == r1
            req(
                "Use only lower-case letters",
                validation_fn=simple_validate(lambda x: x.lower() == x),
            ),  # == r2
            check("Do not mention purple elephants."),  # == r3
        ],
        strategy=RejectionSamplingStrategy(loop_budget=5),
        user_variables={"name": name, "notes": notes},
        return_sampling_results=True,
    )
    if email_candidate.success:
        return str(email_candidate.result)
    else:
        return email_candidate.sample_generations[0].value


m = mellea.start_session()
print(write_email(m, "Olivia",
                  "Olivia helped the lab over the last few weeks by organizing intern events, advertising the speaker series, and handling issues with snack delivery."))
The instruct() method is a convenience function that creates and then generates from an Instruction Component; req() similarly wraps the Requirement Component; and so on. Quickstart will take us one level deeper into understanding what happens under the hood when you call m.instruct().

ModelOptions

Most LLM APIs allow you to specify options that modify the request: temperature, max_tokens, seed, and so on. Mellea supports specifying these options during backend initialization and when calling session-level functions via the model_options parameter. Mellea supports many different types of inference engines (Ollama, OpenAI-compatible vLLM, Hugging Face, etc.). These inference engines, which we call Backends, provide different and sometimes inconsistent dict keysets for specifying model options. For the most common options among model providers, Mellea provides engine-agnostic options, which you can discover by typing ModelOption.<TAB> in your favorite IDE; for example, temperature can be specified as {ModelOption.TEMPERATURE: 0} and this will “just work” across all inference engines. You can add any key-value pair supported by the backend to the model_options dictionary, and those options will be passed along to the inference engine even if a Mellea-specific ModelOption.<KEY> is defined for that option. This means you can safely copy model option parameters from existing codebases as-is:
# file: https://github.com/generative-computing/mellea/blob/main/docs/examples/tutorial/model_options_example.py#L1-L16
import mellea
from mellea.backends.types import ModelOption
from mellea.backends.ollama import OllamaModelBackend
from mellea.backends import model_ids

m = mellea.MelleaSession(backend=OllamaModelBackend(
    model_id=model_ids.IBM_GRANITE_3_2_8B,
    model_options={ModelOption.SEED: 42}
))

answer = m.instruct(
    "What is 2x2?",
    model_options={
        "temperature": 0.5,
        "num_predict": 5,
    },
)

print(str(answer))
You can always update the model options of a given backend; however, Mellea offers a few additional approaches to changing the specified options.
  1. Specifying options during m.* calls. Options specified here update the previously configured model options for that call only. If you specify a key that already exists (using either the ModelOption.OPTION version or the native name for that option in the given API), the new value wins. If you specify the same key in two different ways (e.g., ModelOption.TEMPERATURE and temperature), the ModelOption.OPTION key takes precedence.
from mellea.backends.types import ModelOption

# options passed during backend initialization
backend_model_options = {
    "seed": "1",
    ModelOption.MAX_NEW_TOKENS: 1,
    "temperature": 1,
}

# options passed during m.*
instruct_model_options = {
    "seed": "2",
    ModelOption.SEED: "3",
    "num_predict": 2,
}

# effective options passed to the model provider API after merging
final_options = {
    "temperature": 1,
    "seed": 3,
    "num_predict": 2
}
  2. Pushing and popping model state. Sessions offer the ability to push and pop model state. This means you can temporarily change the model_options for a series of calls by pushing a new set of model_options and then reverting those changes with a pop, as sketched below.
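A minimal sketch of the push/pop pattern follows. The method names push_model_state() and pop_model_state() are assumptions based on the description above; consult the MelleaSession API for the exact interface.
# HYPOTHETICAL sketch: push_model_state / pop_model_state are assumed method
# names for the push/pop behavior described above; check the session API.
import mellea
from mellea.backends.types import ModelOption

m = mellea.start_session()

# Temporarily force deterministic generations for a few calls...
m.push_model_state(model_options={ModelOption.TEMPERATURE: 0})
answer = m.instruct("What is 2x2?")
# ...then restore whatever model options were in effect before the push.
m.pop_model_state()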

Conclusion

We have now worked up from a simple “Hello, World” example to our first generative programming design pattern: Instruct - Validate - Repair (IVR). When LLMs work well, the software developer experiences the LLM as a sort of oracle that can handle almost any input and produce a sufficiently desirable output. When LLMs do not work at all, the software developer experiences the LLM as a naive Markov chain that produces junk. In both cases, the LLM is just sampling from a distribution. The crux of generative programming is that most applications find themselves somewhere in between these two extremes: the LLM mostly works, enough to demo a tantalizing MVP, but failure modes are common enough and severe enough that complete automation is beyond the developer’s grasp. Traditional software deals with failure modes by carefully describing what can go wrong and then providing precise error-handling logic. When working with LLMs, however, this approach suffers a Sisyphean curse: there is always one more failure mode, one more special case, one more new feature request. In the next chapter, we will explore how to build generative programs that are compositional and that grow gracefully.