Testing Experts

Experts are probabilistic — the same query can produce different results. This guide covers strategies for testing effectively.

Not all tests produce equal confidence. The Hard Signal Framework distinguishes two kinds of verification:

  • Soft signal test: Run the Expert, have an LLM evaluate the output, check if it “looks right.” The result depends on evaluator judgment and may vary across runs.
  • Hard signal test: Run the Expert, execute the output artifact — compile it, run its tests, take a screenshot and diff it. The result is deterministic and independent of LLM judgment.
| Strategy | Signal type | Why |
| --- | --- | --- |
| Manual observation (perstack start) | Soft | Human judgment evaluates the output |
| LLM-as-judge evaluation | Soft | Evaluator judgment is as fallible as the generator’s |
| Soft review gate (requirements alignment) | Soft (valuable) | Catches semantic drift before expensive hard verification |
| Checkpoint replay | Hard (runtime) | Deterministic state replay: same input → same output |
| Mock testing (deterministic tool assertions) | Hard (runtime) | Verifies the tool execution sequence deterministically |
| Built-in verifier delegate | Hard (full) | A separate Expert runs checks with exec; no shared context with the generator |
| E2E with artifact execution (compile, test, run) | Hard (full) | Verifies the actual artifact with a deterministic procedure |

Aim for hard signals wherever possible. When you must use soft signals — for example, evaluating natural language quality, checking requirements alignment, or reviewing semantic correctness — place them before the hard signal checks, not after. A soft review gate catches semantic drift early; the hard verifier provides the final pass/fail. See combining soft and hard signals for the full pattern.
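
For a code-generating Expert, a hard signal check can be as small as executing the artifact and reading exit codes. A minimal sketch, assuming the artifact is a TypeScript project and that tsc and vitest are the project’s own tooling; the commands and artifactDir path are placeholders:

```ts
// Minimal hard-signal check: execute the artifact instead of judging it.
import { execFileSync } from "node:child_process"

function hardVerify(artifactDir: string): boolean {
  try {
    // Exit codes are the signal; no LLM judgment is involved.
    execFileSync("npx", ["tsc", "--noEmit"], { cwd: artifactDir })
    execFileSync("npx", ["vitest", "run"], { cwd: artifactDir })
    return true
  } catch {
    return false
  }
}
```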

Run your Expert locally before publishing:

```sh
npx perstack start my-expert "test query"
```

Cover these scenarios; a scripted version follows the list:
  • Happy path — expected inputs and workflows
  • Edge cases — unusual inputs, empty data, large files
  • Error handling — missing files, invalid formats, network failures
  • Delegation — if your Expert delegates, test the full chain
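
Scripting the scenarios keeps them repeatable across runs. A sketch assuming vitest as the test runner and the run() API shown later in this guide; the params shape passed to run() is an assumption, so consult your runtime version for the real signature:

```ts
import { describe, it, expect } from "vitest"
import { run } from "@perstack/runtime"

const scenarios = [
  { name: "happy path", query: "summarize README.md" },
  { name: "edge case: empty query", query: "" },
  { name: "error handling: missing file", query: "summarize does-not-exist.md" },
]

describe("my-expert", () => {
  for (const { name, query } of scenarios) {
    it(name, async () => {
      // Soft expectation only: the run completes without throwing.
      const result = await run({ expert: "my-expert", query })
      expect(result).toBeDefined()
    })
  }
})
```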

Use JSON output to see exactly what happened:

```sh
npx perstack run my-expert "query"
```

Each event shows:

  • Tool calls and results
  • Checkpoint state
  • Timing information
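
Capturing that output in a script lets you filter for specific events. A sketch that assumes the CLI writes one JSON event per line; the type and toolCalls fields mirror the eventListener example later in this guide, and everything else is an assumption:

```ts
import { execFileSync } from "node:child_process"

const raw = execFileSync(
  "npx",
  ["perstack", "run", "my-expert", "query"],
  { encoding: "utf8" },
)

for (const line of raw.split("\n").filter(Boolean)) {
  try {
    const event = JSON.parse(line)
    if (event.type === "callTools") {
      // Print the tool names the Expert decided to call
      console.log(event.toolCalls.map((t: any) => t.toolName))
    }
  } catch {
    // Skip any non-JSON lines (progress output, logs)
  }
}
```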

Checkpoints enable deterministic replay of the runtime portion. Checkpoints are stored in perstack/jobs/{jobId}/runs/{runId}/ — see Runtime for the full directory structure.

Continue a paused run:

```sh
npx perstack run my-expert --continue
```

This resumes from the last checkpoint — useful for:

  • Debugging a specific step
  • Testing recovery behavior
  • Iterating on long-running tasks
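
Recovery behavior can also be exercised from an automated test by shelling out to the documented --continue command. A sketch assuming vitest, and assuming the CLI exits non-zero on failure (which makes execFileSync throw):

```ts
import { execFileSync } from "node:child_process"
import { test, expect } from "vitest"

test("resumes from the last checkpoint", () => {
  // The resumed run should complete cleanly after an interruption
  const resume = () =>
    execFileSync("npx", ["perstack", "run", "my-expert", "--continue"])
  expect(resume).not.toThrow()
})
```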

Examine checkpoints to understand what the Expert “saw” at each step:

  • Message history
  • Tool call decisions
  • Intermediate state
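
A small helper can dump a run’s checkpoint files for inspection. Only the directory layout (perstack/jobs/{jobId}/runs/{runId}/) comes from this guide; the file names and their contents are assumptions, so this sketch just prints whatever is there:

```ts
import { readdirSync, readFileSync } from "node:fs"
import { join } from "node:path"

function inspectRun(jobId: string, runId: string): void {
  const dir = join("perstack", "jobs", jobId, "runs", runId)
  for (const file of readdirSync(dir)) {
    console.log(`--- ${file} ---`)
    console.log(readFileSync(join(dir, file), "utf8"))
  }
}

inspectRun("job-123", "run-456") // hypothetical IDs
```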

For automated testing, mock the LLM layer and assert on runtime events to get deterministic behavior:

```ts
import { run } from "@perstack/runtime"
import { expect } from "vitest" // or your test framework of choice

// params comes from your test setup
const result = await run(params, {
  eventListener: (event) => {
    // Assert on the deterministic tool-call sequence
    if (event.type === "callTools") {
      expect(event.toolCalls[0].toolName).toBe("expectedTool")
    }
  },
})
```

The runtime is deterministic — only LLM responses are probabilistic. Mock the LLM layer for unit tests; use real LLMs for integration tests.

Before publishing:

  • Works with typical queries
  • Handles edge cases gracefully
  • Delegates correctly (if applicable)
  • Skills work as expected
  • Error messages are helpful
  • Description accurately reflects behavior
  • At least one hard signal test exists (compiler, e2e, screenshot diff)
  • Verification is independent of the LLM that generated the output
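
The last two items reduce to one end-to-end shape: the Expert generates, and an independent deterministic procedure verifies. A sketch assuming vitest, with a hypothetical output directory:

```ts
import { execFileSync } from "node:child_process"
import { test, expect } from "vitest"

test("generated artifact passes hard verification", () => {
  // Generate (probabilistic step): the Expert produces the artifact
  execFileSync("npx", ["perstack", "run", "my-expert", "scaffold a CLI tool"])

  // Verify (deterministic step): the compiler is the judge, not an LLM
  const verify = () =>
    execFileSync("npx", ["tsc", "--noEmit"], { cwd: "output/cli-tool" })
  expect(verify).not.toThrow()
})
```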