
Hard Signals

Agentic AI output quality is determined by the strength of verification signals in the loop.

When an agent generates output, something needs to check whether that output is correct. The nature of that check — the signal — determines whether the agent can self-correct and converge on a good result, or oscillate indefinitely.

There are two kinds of verification signals:

Soft signals depend on LLM judgment. Code review by another LLM. Self-evaluation (“does this look right?”). Ad hoc manual testing where a human eyeballs the output. These signals are valuable when the verification requires semantic understanding that no deterministic check can provide — but they share a fundamental limitation: the verifier’s judgment can vary across runs.

Hard signals are independent of LLM judgment. Compiler errors. E2E test pass/fail. Screenshot diffs against a baseline. These signals are binary, deterministic, and unambiguous. The agent either produced working code or it didn’t.

The practical difference:

|  | Soft signal | Hard signal |
| --- | --- | --- |
| Example | LLM reviews generated code | Compiler rejects generated code |
| Depends on | Evaluator’s judgment | Artifact’s actual behavior |
| Same input, different runs | May produce different verdicts | Always produces the same verdict |
| Agent behavior | Oscillates — “looks fine” one iteration, “has issues” the next | Converges — broken means broken, fixed means fixed |

When you build a loop with soft signals, the agent appears to work during demos but fails unpredictably in production. When you build a loop with hard signals, the agent detects breakage reliably and self-corrects.

A signal is hard when three conditions all hold: ground truth, context separation, and determinism. They form a logical AND — if any one is missing, the signal degrades to soft.

Ground truth: verify the actual artifact, not a proxy.

| Ground truth | Proxy |
| --- | --- |
| Screenshot of the rendered page | DOM tree structure |
| Full user flow from login to checkout | Return value of a single function |
| Application starts and serves requests | Compilation passes |
| E2E test exercises the real system | Unit test with mocked dependencies |

Proxies are cheaper to check, but they can pass while the actual artifact is broken. A function can return the right value while the UI that calls it is unusable. Code can compile while the application crashes on startup.

Ground truth means going to the source: does the thing actually work?
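A minimal sketch of the gap between a proxy and ground truth, assuming `node` is installed (the file name and module name are invented for illustration). The syntax check is the proxy; actually starting the program is the ground truth:

```shell
# Proxy vs. ground truth: a syntactically valid program that crashes on startup.
cat > app.js <<'EOF'
require('this-module-does-not-exist');   // syntactically valid
console.log('serving');
EOF

node --check app.js && echo "proxy: syntax OK"            # proxy passes
node app.js 2>/dev/null || echo "ground truth: startup FAILED"
```

The proxy check passes while the artifact is unusable — exactly the failure mode proxies invite.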

Context separation: the generator and the verifier must be structurally separated. The verifier sees only the artifact — not the generator’s intent, reasoning, or intermediate state.

Why this matters: when the same context window generates output and then evaluates it, the evaluation validates against intent (“I meant to do X, and this does X”) rather than against the artifact (“does this actually work?”). The generator’s reasoning biases the verification.

Context separation breaks this loop. The verifier has no access to why the artifact was created — it can only judge what the artifact is.

In practice:

  • A compiler checking generated code has no knowledge of the LLM’s reasoning
  • An E2E test running against a deployed app doesn’t know what the agent intended
  • A screenshot diff compares pixels, not intentions
  • A delegated Expert receiving only a query (not the parent’s message history) evaluates independently
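One way to picture structural separation is a verifier that runs in a clean directory containing only the artifact. This is a hypothetical sketch (file names invented), not Perstack’s mechanism:

```shell
# Context separation: the verifier's working directory holds the artifact only.
echo 'console.log("ok");' > artifact.js      # the generator's output
echo 'internal reasoning' > scratchpad.txt   # generator context (must not leak)

mkdir -p verify
cp artifact.js verify/                        # copy the artifact, nothing else
( cd verify && node --check artifact.js ) \
  && echo "verdict: PASS" || echo "verdict: FAIL"
```

The verifier cannot be biased by the scratchpad because it never sees it.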

Determinism: the verification procedure must be fixed. Same artifact in, same verdict out. Every time.

If the verification itself varies — different results on different runs, or results that depend on evaluator mood — then it produces noise, not signal. An agent cannot correct course based on noise.

Deterministic verification:

  • Compiler: same code always produces the same errors (or none)
  • Test suite: same implementation always passes or fails the same tests
  • Screenshot diff: same render always produces the same pixel comparison

Non-deterministic verification:

  • LLM evaluation: same code may get “looks good” or “has issues” depending on the run
  • Manual review: depends on the reviewer’s attention and context
  • Flaky tests: sometimes pass, sometimes fail on the same code

When verification is deterministic, the agent gets a stable signal it can act on. When verification is non-deterministic, the agent cannot distinguish real problems from noise.

Perstack’s architecture is designed to maximize hard signal opportunities. Each architectural decision maps to one of the three conditions:

Delegation → Context separation

When Expert A delegates to Expert B, the delegate runs in a completely separate context — empty message history, its own instruction, no access to the parent’s reasoning. When Expert A receives the result, it evaluates only the returned artifact.

This is context separation by construction. The runtime enforces it — you cannot accidentally leak the generator’s context to the verifier.

Runtime → Determinism

The runtime draws a clear boundary between probabilistic (LLM reasoning) and deterministic (state management). Events are recorded deterministically. Checkpoints capture complete state. Replaying from a checkpoint produces identical results.

This means any verification process built on the runtime’s state — event stream analysis, checkpoint comparison, artifact diffing — inherits determinism automatically.

Workspace & sandbox → Ground truth

Experts write artifacts to the workspace — files, code, configurations. The sandbox isolates the execution environment. Together, they create a controlled space where artifacts can be built and verified against ground truth.

An Expert that generates code can write it to the workspace. Another Expert (or an external process) can compile it, run its tests, or start the application — verifying the actual artifact, not a proxy.

Observability → Deterministic audit trail


The full event stream is a deterministic record of everything that happened. Same execution always produces the same events. An external verifier can process these events, compare outputs against baselines, and produce a verdict — without any LLM involvement.
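A hypothetical sketch of such an external verifier (the event format here is invented for illustration): diff the recorded stream against a baseline and emit a verdict with no LLM in the loop.

```shell
# Event-stream verification: baseline comparison, no LLM judgment involved.
printf 'step:start\nstep:write app.js\nstep:end\n' > baseline.log
printf 'step:start\nstep:write app.js\nstep:end\n' > run.log

if diff -u baseline.log run.log > /dev/null; then
  echo "verdict: PASS (events match baseline)"
else
  echo "verdict: FAIL (events diverged)"
fi
```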

The difference between soft and hard signal loops compounds over time:

Soft signal loop: Agent generates output → LLM evaluates (“looks good”) → agent generates more → LLM evaluates (“actually, this part is wrong”) → agent fixes → LLM evaluates (“looks good now, but this other part…”) → oscillation. The agent appears productive but never converges.

Hard signal loop: Agent generates code → compiler rejects it → agent reads the error → agent fixes the specific issue → compiler accepts → tests fail → agent reads the failure → agent fixes → tests pass → done. Each iteration makes measurable progress because the signal is unambiguous.
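The hard-signal loop above can be sketched as a retry loop over a binary check. `check` and `fix` are stand-ins for the real compiler/test run and the agent’s edit; the retry budget of 5 is an assumed parameter:

```shell
# Hard-signal loop: iterate until a deterministic, binary check passes.
state=broken
check() { [ "$state" = fixed ]; }     # binary verdict: pass or fail
fix()   { state=fixed; }              # stands in for the agent acting on the error

attempts=0
until check; do
  attempts=$((attempts + 1))
  [ "$attempts" -ge 5 ] && { echo "budget exhausted"; break; }
  fix                                 # each iteration makes measurable progress
done
check && echo "converged after $attempts fix(es)"
```

Because the verdict is stable, progress is monotone: once a failure is fixed, it stays fixed.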

Hard signals don’t make agents smarter. They make the feedback loop trustworthy. An agent with mediocre reasoning but hard signals in the loop will outperform a brilliant agent with only soft signals — because the first one knows when it’s wrong.

Hard signals establish a quality floor — the minimum bar that the output must clear. The code compiles. The tests pass. The app starts. The screenshot matches the baseline. If any of these fail, the agent keeps iterating. The floor is the harness’s responsibility.

The quality ceiling is determined by domain knowledge — the constraints, rules, and context baked into the Expert’s instruction. Does the generated API follow the team’s naming conventions? Does the game have balanced difficulty progression? Does the customer support agent know the company’s refund policy? No hard signal can verify these. They are the Expert author’s responsibility.

This separation matters because it defines who owns what:

|  | Quality floor | Quality ceiling |
| --- | --- | --- |
| What determines it | Hard signals in the verification loop | Domain knowledge in the instruction |
| Who owns it | The harness (Perstack) | The Expert author (you) |
| How it fails | Artifact doesn’t work | Artifact works but doesn’t solve the right problem |
| How it improves | Better verification signals | Better domain constraints |

Architecture — monolithic or micro-agent — does not determine quality. A monolithic agent with hard signals will produce better output than a micro-agent team without them. What makes quality a system property is the combination: the harness provides the verification floor, and the author provides the knowledge ceiling. Neither alone is sufficient.

Soft signals are not useless — they are essential when the verification requires semantic judgment that no deterministic check can provide.

Some questions only an LLM can answer:

  • “Does this instruction faithfully reflect the domain constraints from the requirements?”
  • “Is this generated content appropriate for the target audience?”
  • “Does this API design follow the conventions of the existing codebase?”

These are inherently qualitative evaluations. Trying to force them into binary checks would lose the nuance that makes them valuable. The key is where you place them in the loop and what you combine them with.

The most effective architecture uses soft signals as an early gate and hard signals as the final authority:

write → review (soft) → test → verify (hard)
  ↑                                │
  └───────────── fix ←─────────────┘

The soft gate catches semantic misalignment early — before the expensive test-verify cycle runs. The hard verifier provides the final pass/fail decision. Neither replaces the other:

  • Without the soft gate: hard signals catch runtime failures but miss semantic drift. The artifact compiles and passes tests, but doesn’t reflect the requirements. You iterate through expensive test cycles to discover what a quick LLM review would have caught.
  • Without the hard verifier: soft reviews confirm alignment but miss actual breakage. The Expert “looks correct” but the generated artifact crashes on startup. The LLM reviewer can’t catch what only execution reveals.

Perstack’s own create-expert uses this exact pattern: review-definition (soft gate) checks whether the generated perstack.toml faithfully reflects plan.md’s domain constraints — a semantic judgment that requires LLM reasoning. Only after review passes does the loop proceed to test-expert → verify-test (hard verification). The soft reviewer has no exec — it reads files and judges alignment. The hard verifier has exec — it runs commands and compares outputs.

When using soft signals:

  1. Place them before hard signals — catch semantic issues early, before investing in expensive execution and verification.
  2. Give the soft reviewer only read access — no exec, no file writes. This keeps its role pure: it judges, it doesn’t act.
  3. Never use soft signals as the final gate — the last check before completion must be hard. A soft “looks good” is not a shipping signal.
  4. Context-separate the reviewer from the generator — just as with hard verification, the soft reviewer should be a separate Expert that sees only the artifacts, not the generator’s reasoning.

When building Experts, ask: what hard signal can verify this Expert’s output?

  • If the Expert generates code → compiler errors, test suite, application startup
  • If the Expert generates configuration → validation schema, dry-run deployment
  • If the Expert generates UI → screenshot diff, accessibility audit
  • If the Expert generates data → schema validation, constraint checks
  • If the Expert generates natural language → this is genuinely hard to verify with hard signals; acknowledge the limitation and supplement with hard signals on adjacent properties (e.g., format validation, length constraints, required keyword presence)

If the only answer is “another LLM reads it,” the verification loop is soft. The system will oscillate rather than converge. Look for a way to make the signal harder — even a partial hard signal (format validation, schema check) is better than none.
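A hypothetical sketch of such partial hard signals on natural-language output (the keyword and the 100-word limit are invented thresholds): the checks verify adjacent properties, not the prose quality itself.

```shell
# Partial hard signals on natural-language output: format, length, keywords.
cat > reply.txt <<'EOF'
Per our refund policy, items can be returned within 30 days of purchase.
EOF

ok=1
words=$(wc -w < reply.txt | tr -d ' ')
[ "$words" -le 100 ] || ok=0                 # length constraint
grep -qi 'refund policy' reply.txt || ok=0   # required keyword present
[ "$ok" -eq 1 ] && echo "floor checks: PASS" || echo "floor checks: FAIL"
```

These checks cannot confirm the reply is good, but they catch a class of breakage deterministically.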

Many verification criteria start as subjective questions: “is this code clean?”, “is this instruction concise?”, “is this output high quality?”. These are soft signals — the LLM always judges its own output favorably.

The conversion strategy: replace subjective evaluation with binary checks that have unambiguous yes/no answers.

| Soft check (LLM opinion) | Hard check (binary) |
| --- | --- |
| “Is the instruction concise?” | `wc -l instruction` ≤ 15 lines |
| “Does the code follow best practices?” | `npx tsc --noEmit` exits 0, `npm test` exits 0 |
| “Is the output well-structured?” | ``grep -c '```'`` = 0 (no code blocks in instructions) |
| “Are all dependencies declared?” | `grep 'delegates'` matches for every expert that references delegates |

Each binary check has a clear pass/fail result, a clear remediation action, and produces the same verdict every time. Subjective checks (“would removing this make the output worse?”) always pass because the LLM cannot judge its own output objectively.
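The first conversion in the table above as a runnable sketch (the 15-line limit and the file contents are assumed for illustration):

```shell
# Converting "is it concise?" into a binary, deterministic check.
printf 'You are a release-notes writer.\nKeep entries under 80 chars.\n' \
  > instruction.txt

lines=$(wc -l < instruction.txt | tr -d ' ')
if [ "$lines" -le 15 ]; then
  echo "PASS: concise ($lines lines)"
else
  echo "FAIL: too long ($lines lines)"
fi
```

The verdict is the same on every run, and the remediation (shorten the file) is unambiguous.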

An Expert doesn’t have to rely on external processes to get hard signals. You can embed the verifier inside the delegation tree — a dedicated verifier Expert that executes hard signal checks, structurally separated from the generator.

coordinator
├── generator — produces the artifact
├── reviewer — checks semantic alignment (soft gate, read-only)
├── executor — runs the artifact (pure execution, no evaluation)
└── verifier — executes hard signal checks against the result

The key design constraints:

  1. Verifier as a direct child of the coordinator — not nested under the generator. This guarantees context separation: the verifier shares no context with the generator.
  2. Verifier needs exec capability — without it, verification degrades to file reading, which is a soft signal. Hard signals require running commands that produce deterministic output.
  3. Executor and verifier are separate — the executor runs the artifact and reports what happened (facts only). The verifier runs checks and reports pass/fail. Combining them leaks execution context into verification.
  4. Reviewer has no exec — the soft gate reads files and judges alignment. Keeping it read-only prevents it from accidentally becoming a verifier.

Perstack’s own create-expert uses this pattern: review-definition (soft gate) checks plan alignment with read-only access, then test-expert executes the generated expert (pure executor, no evaluation), then verify-test runs hard signal checks, re-runs them a second time to confirm reproducibility, and performs structural checks — all deterministic, all independent of LLM judgment.

Confirming determinism: the reproducibility check


A signal is only as hard as its consistency. If a check passes once but fails on re-execution with the same artifact, the signal is non-deterministic — it produces noise, not information.

The practical fix: re-run every verification command a second time and compare results. If the output is identical, the signal is deterministic (hard). If it differs, the signal or the artifact needs fixing before you can trust it.

This is a cheap check that catches a common failure mode: flaky tests, environment-dependent behavior, or time-sensitive assertions that break reproducibility.
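A minimal sketch of the reproducibility check (file name invented, `node` assumed available): run the same verification command twice and compare exit codes and output.

```shell
# Reproducibility check: identical results across two runs suggest a hard signal.
echo 'console.log(1);' > artifact.js

node --check artifact.js > run1.txt 2>&1; status1=$?
node --check artifact.js > run2.txt 2>&1; status2=$?

if [ "$status1" -eq "$status2" ] && diff -q run1.txt run2.txt > /dev/null; then
  echo "reproducible: hard signal"
else
  echo "not reproducible: noise, not signal"
fi
```

Identical output is evidence of determinism, not proof — but divergent output is definitive proof of noise.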

See also:

  • Experts — how context isolation enables context separation
  • Runtime — how deterministic state enables deterministic verification
  • Testing Experts — applying signal quality to your test strategy
  • Best Practices — the “Keep It Verifiable” principle