Testing Experts

Experts are probabilistic — the same query can produce different results. This guide covers strategies for testing effectively.

Not all tests produce equal confidence. The Hard Signal Framework distinguishes two kinds of verification:

  • Soft signal test: Run the Expert, have an LLM evaluate the output, check if it “looks right.” The result depends on evaluator judgment and may vary across runs.
  • Hard signal test: Run the Expert, execute the output artifact — compile it, run its tests, take a screenshot and diff it. The result is deterministic and independent of LLM judgment.
| Strategy | Signal type | Why |
| --- | --- | --- |
| Manual observation (perstack start) | Soft | Human judgment evaluates the output |
| LLM-as-judge evaluation | Soft | Evaluator judgment is as fallible as the generator’s |
| Soft review gate (requirements alignment) | Soft (valuable) | Catches semantic drift before expensive hard verification |
| Checkpoint replay | Hard (runtime) | Deterministic state replay: same input → same output |
| Mock testing (deterministic tool assertions) | Hard (runtime) | Verifies the tool execution sequence deterministically |
| Built-in verifier delegate | Hard (full) | A separate Expert runs checks with exec; no shared context with the generator |
| E2E with artifact execution (compile, test, run) | Hard (full) | Verifies the actual artifact with a deterministic procedure |

Aim for hard signals wherever possible. When you must use soft signals — for example, evaluating natural language quality, checking requirements alignment, or reviewing semantic correctness — place them before the hard signal checks, not after. A soft review gate catches semantic drift early; the hard verifier provides the final pass/fail. See combining soft and hard signals for the full pattern.
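
For a code-generating Expert, a hard signal check can be as small as executing the artifact and reading exit codes. A minimal sketch, assuming the artifact is a TypeScript project and that tsc and vitest are the project’s own tooling; the commands and artifactDir path are placeholders:

```ts
// Minimal hard-signal check: execute the artifact instead of judging it.
import { execFileSync } from "node:child_process"

function hardVerify(artifactDir: string): boolean {
  try {
    // Exit codes are the signal; no LLM judgment is involved.
    execFileSync("npx", ["tsc", "--noEmit"], { cwd: artifactDir })
    execFileSync("npx", ["vitest", "run"], { cwd: artifactDir })
    return true
  } catch {
    return false
  }
}
```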

Run your Expert locally before publishing:

```sh
npx perstack start my-expert "test query"
```

Cover these scenarios; a scripted version follows the list:
  • Happy path — expected inputs and workflows
  • Edge cases — unusual inputs, empty data, large files
  • Error handling — missing files, invalid formats, network failures
  • Delegation — if your Expert delegates, test the full chain
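
Scripting the scenarios keeps them repeatable across runs. A sketch assuming vitest as the test runner and the run() API shown later in this guide; the params shape passed to run() is an assumption, so consult your runtime version for the real signature:

```ts
import { describe, it, expect } from "vitest"
import { run } from "@perstack/runtime"

const scenarios = [
  { name: "happy path", query: "summarize README.md" },
  { name: "edge case: empty query", query: "" },
  { name: "error handling: missing file", query: "summarize does-not-exist.md" },
]

describe("my-expert", () => {
  for (const { name, query } of scenarios) {
    it(name, async () => {
      // Soft expectation only: the run completes without throwing.
      const result = await run({ expert: "my-expert", query })
      expect(result).toBeDefined()
    })
  }
})
```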

Use JSON output to see exactly what happened:

```sh
npx perstack run my-expert "query"
```

Each event shows:

  • Tool calls and results
  • Checkpoint state
  • Timing information
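
Capturing that output in a script lets you filter for specific events. A sketch that assumes the CLI writes one JSON event per line; the type and toolCalls fields mirror the eventListener example later in this guide, and everything else is an assumption:

```ts
import { execFileSync } from "node:child_process"

const raw = execFileSync(
  "npx",
  ["perstack", "run", "my-expert", "query"],
  { encoding: "utf8" },
)

for (const line of raw.split("\n").filter(Boolean)) {
  try {
    const event = JSON.parse(line)
    if (event.type === "callTools") {
      // Print the tool names the Expert decided to call
      console.log(event.toolCalls.map((t: any) => t.toolName))
    }
  } catch {
    // Skip any non-JSON lines (progress output, logs)
  }
}
```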

Checkpoints enable deterministic replay of the runtime portion. Checkpoints are stored in perstack/jobs/{jobId}/runs/{runId}/ — see Runtime for the full directory structure.

Continue a paused run:

```sh
npx perstack run my-expert --continue
```

This resumes from the last checkpoint — useful for:

  • Debugging a specific step
  • Testing recovery behavior
  • Iterating on long-running tasks
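
Recovery behavior can also be exercised from an automated test by shelling out to the documented --continue command. A sketch assuming vitest, and assuming the CLI exits non-zero on failure (which makes execFileSync throw):

```ts
import { execFileSync } from "node:child_process"
import { test, expect } from "vitest"

test("resumes from the last checkpoint", () => {
  // The resumed run should complete cleanly after an interruption
  const resume = () =>
    execFileSync("npx", ["perstack", "run", "my-expert", "--continue"])
  expect(resume).not.toThrow()
})
```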

Examine checkpoints to understand what the Expert “saw” at each step:

  • Message history
  • Tool call decisions
  • Intermediate state
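
A small helper can dump a run’s checkpoint files for inspection. Only the directory layout (perstack/jobs/{jobId}/runs/{runId}/) comes from this guide; the file names and their contents are assumptions, so this sketch just prints whatever is there:

```ts
import { readdirSync, readFileSync } from "node:fs"
import { join } from "node:path"

function inspectRun(jobId: string, runId: string): void {
  const dir = join("perstack", "jobs", jobId, "runs", runId)
  for (const file of readdirSync(dir)) {
    console.log(`--- ${file} ---`)
    console.log(readFileSync(join(dir, file), "utf8"))
  }
}

inspectRun("job-123", "run-456") // hypothetical IDs
```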

For automated testing, mock the LLM layer and assert on runtime events to get deterministic behavior:

```ts
import { run } from "@perstack/runtime"
import { expect } from "vitest" // or your test framework of choice

// params comes from your test setup
const result = await run(params, {
  eventListener: (event) => {
    // Assert on the deterministic tool-call sequence
    if (event.type === "callTools") {
      expect(event.toolCalls[0].toolName).toBe("expectedTool")
    }
  },
})
```

The runtime is deterministic — only LLM responses are probabilistic. Mock the LLM layer for unit tests; use real LLMs for integration tests.

Before publishing:

  • Works with typical queries
  • Handles edge cases gracefully
  • Delegates correctly (if applicable)
  • Skills work as expected
  • Error messages are helpful
  • Description accurately reflects behavior
  • At least one hard signal test exists (compiler, e2e, screenshot diff)
  • Verification is independent of the LLM that generated the output
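
The last two items reduce to one end-to-end shape: the Expert generates, and an independent deterministic procedure verifies. A sketch assuming vitest, with a hypothetical output directory:

```ts
import { execFileSync } from "node:child_process"
import { test, expect } from "vitest"

test("generated artifact passes hard verification", () => {
  // Generate (probabilistic step): the Expert produces the artifact
  execFileSync("npx", ["perstack", "run", "my-expert", "scaffold a CLI tool"])

  // Verify (deterministic step): the compiler is the judge, not an LLM
  const verify = () =>
    execFileSync("npx", ["tsc", "--noEmit"], { cwd: "output/cli-tool" })
  expect(verify).not.toThrow()
})
```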