# Testing Experts
Experts are probabilistic — the same query can produce different results. This guide covers strategies for testing effectively.
## Signal quality

Not all tests produce equal confidence. The Hard Signal Framework distinguishes two kinds of verification:
- Soft signal test: Run the Expert, have an LLM evaluate the output, check if it “looks right.” The result depends on evaluator judgment and may vary across runs.
- Hard signal test: Run the Expert, execute the output artifact — compile it, run its tests, take a screenshot and diff it. The result is deterministic and independent of LLM judgment.
| Strategy | Signal type | Why |
|---|---|---|
| Manual observation (`perstack start`) | Soft | Human judgment evaluates output |
| LLM-as-judge evaluation | Soft | Evaluator judgment is as fallible as the generator’s |
| Soft review gate (requirements alignment) | Soft (valuable) | Catches semantic drift before expensive hard verification |
| Checkpoint replay | Hard (runtime) | Deterministic state replay, same input → same output |
| Mock testing (deterministic tool assertions) | Hard (runtime) | Verifies tool execution sequence deterministically |
| Built-in verifier delegate | Hard (full) | Separate Expert runs checks with exec, no shared context with generator |
| E2E with artifact execution (compile, test, run) | Hard (full) | Verifies the actual artifact with a deterministic procedure |
Aim for hard signals wherever possible. When you must use soft signals — for example, evaluating natural language quality, checking requirements alignment, or reviewing semantic correctness — place them before the hard signal checks, not after. A soft review gate catches semantic drift early; the hard verifier provides the final pass/fail. See combining soft and hard signals for the full pattern.
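A hard signal check can be as simple as executing the artifact and reading the exit code. Here is a minimal sketch in Node; which command to run and what counts as a pass are assumptions for illustration, not part of Perstack:

```typescript
import { spawnSync } from "node:child_process"

// Hard signal: run the generated artifact and read a deterministic result.
// "node <artifact>" and "exit code 0 means pass" are illustrative choices --
// substitute your compiler, test runner, or screenshot-diff tool.
export function hardVerify(artifactPath: string): boolean {
  const result = spawnSync("node", [artifactPath], { encoding: "utf8" })
  // The verdict is the process exit code: no LLM judgment involved.
  return result.status === 0
}
```

The same shape works for `tsc`, a test suite, or an image diff: any deterministic command whose exit code is the verdict.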
## Local testing

Run your Expert locally before publishing:

```shell
npx perstack start my-expert "test query"
```

## Test different scenarios

- Happy path — expected inputs and workflows
- Edge cases — unusual inputs, empty data, large files
- Error handling — missing files, invalid formats, network failures
- Delegation — if your Expert delegates, test the full chain
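These scenarios fit naturally into a table-driven test. The `runExpert` function below is a stand-in stub, not the Perstack API; in a real test you would swap in a call to the runtime or a child process running `npx perstack run`:

```typescript
// Table-driven scenarios: one list, one loop, easy to extend.
type Scenario = { name: string; query: string; check: (output: string) => boolean }

// Placeholder for a real Expert invocation -- replace with the actual runtime call.
async function runExpert(query: string): Promise<string> {
  return query.includes("does-not-exist") ? "error: file not found" : "done"
}

const scenarios: Scenario[] = [
  { name: "happy path", query: "summarize README.md", check: (o) => o.includes("done") },
  { name: "missing file", query: "summarize does-not-exist.md", check: (o) => o.includes("not found") },
]

for (const s of scenarios) {
  const output = await runExpert(s.query)
  console.log(`${s.name}: ${s.check(output) ? "pass" : "fail"}`)
}
```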
## Inspect execution

Use JSON output to see exactly what happened:

```shell
npx perstack run my-expert "query"
```

Each event shows:
- Tool calls and results
- Checkpoint state
- Timing information
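If the output is newline-delimited JSON events, filtering it takes a few lines. The event shape here (`type`, `toolName`) is an assumption for illustration; check your actual output for the real schema:

```typescript
// Pull tool-call events out of a newline-delimited JSON stream.
// The field names below are assumed, not the documented Perstack schema.
interface RunEvent { type: string; toolName?: string }

function toolCallEvents(ndjson: string): RunEvent[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as RunEvent)
    .filter((event) => event.type === "toolCall")
}
```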
## Checkpoint-based testing

Checkpoints enable deterministic replay of the runtime portion. Checkpoints are stored in `perstack/jobs/{jobId}/runs/{runId}/` — see Runtime for the full directory structure.
## Resume from checkpoint

Continue a paused run:

```shell
npx perstack run my-expert --continue
```

This resumes from the last checkpoint — useful for:
- Debugging a specific step
- Testing recovery behavior
- Iterating on long-running tasks
## Replay for debugging

Examine checkpoints to understand what the Expert “saw” at each step:
- Message history
- Tool call decisions
- Intermediate state
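A checkpoint is a file on disk, so inspecting it can be a small script. The `messages` shape below is an assumption about the checkpoint schema; verify it against a real file before relying on it:

```typescript
import { readFileSync } from "node:fs"

// Assumed checkpoint shape -- confirm against a real file under
// perstack/jobs/{jobId}/runs/{runId}/ before relying on it.
interface Checkpoint { messages: { role: string; content: string }[] }

function lastAssistantMessage(path: string): string | undefined {
  const checkpoint = JSON.parse(readFileSync(path, "utf8")) as Checkpoint
  return checkpoint.messages.filter((m) => m.role === "assistant").at(-1)?.content
}
```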
## Testing with mocks

For automated testing, mock the LLM to get deterministic behavior:

```typescript
import { run } from "@perstack/runtime"

const result = await run(params, {
  // Mock eventListener for assertions
  eventListener: (event) => {
    if (event.type === "callTools") {
      expect(event.toolCalls[0].toolName).toBe("expectedTool")
    }
  },
})
```

The runtime is deterministic — only LLM responses are probabilistic. Mock the LLM layer for unit tests; use real LLMs for integration tests.
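One way to mock the LLM layer is a scripted stub that replays canned responses in order, so every run is identical. The function shape here is illustrative, not the `@perstack/runtime` interface:

```typescript
// A scripted fake LLM: replays canned responses in order, then repeats the
// last one. Deterministic by construction -- same script, same run.
function makeScriptedLlm(script: string[]) {
  let index = 0
  return async (_prompt: string): Promise<string> => {
    const response = script[Math.min(index, script.length - 1)]
    index += 1
    return response
  }
}
```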
## Testing checklist

Before publishing:
- Works with typical queries
- Handles edge cases gracefully
- Delegates correctly (if applicable)
- Skills work as expected
- Error messages are helpful
- Description accurately reflects behavior
- At least one hard signal test exists (compiler, e2e, screenshot diff)
- Verification is independent of the LLM that generated the output
## What’s next

- Hard Signals — the framework behind signal quality
- Best Practices — design guidelines