
Hard Signals

Agentic AI output quality is determined by the strength of verification signals in the loop.

When an agent generates output, something needs to check whether that output is correct. The nature of that check — the signal — determines whether the agent can self-correct and converge on a good result, or oscillate indefinitely.

There are two kinds of verification signals:

Soft signals depend on LLM judgment. Code review by another LLM. Self-evaluation (“does this look right?”). Ad hoc manual testing where a human eyeballs the output. These signals are valuable when the verification requires semantic understanding that no deterministic check can provide — but they share a fundamental limitation: the verifier’s judgment can vary across runs.

Hard signals are independent of LLM judgment. Compiler errors. E2E test pass/fail. Screenshot diffs against a baseline. These signals are binary, deterministic, and unambiguous. The agent either produced working code or it didn’t.

The practical difference:

|  | Soft signal | Hard signal |
| --- | --- | --- |
| Example | LLM reviews generated code | Compiler rejects generated code |
| Depends on | Evaluator’s judgment | Artifact’s actual behavior |
| Same input, different runs | May produce different verdicts | Always produces the same verdict |
| Agent behavior | Oscillates — “looks fine” one iteration, “has issues” the next | Converges — broken means broken, fixed means fixed |

When you build a loop with soft signals, the agent appears to work during demos but fails unpredictably in production. When you build a loop with hard signals, the agent detects breakage reliably and self-corrects.

A signal is hard when three conditions all hold: ground truth, context separation, and determinism. They form a logical AND — if any one is missing, the signal degrades to soft.

Ground truth: verify the actual artifact, not a proxy.

| Ground truth | Proxy |
| --- | --- |
| Screenshot of the rendered page | DOM tree structure |
| Full user flow from login to checkout | Return value of a single function |
| Application starts and serves requests | Compilation passes |
| E2E test exercises the real system | Unit test with mocked dependencies |

Proxies are cheaper to check, but they can pass while the actual artifact is broken. A function can return the right value while the UI that calls it is unusable. Code can compile while the application crashes on startup.

Ground truth means going to the source: does the thing actually work?
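A minimal sketch of the gap between a proxy and ground truth, assuming `node` is installed (the file name and module name are invented for illustration). The syntax check is the proxy; actually starting the program is the ground truth:

```shell
# Proxy vs. ground truth: a syntactically valid program that crashes on startup.
cat > app.js <<'EOF'
require('this-module-does-not-exist');   // syntactically valid
console.log('serving');
EOF

node --check app.js && echo "proxy: syntax OK"            # proxy passes
node app.js 2>/dev/null || echo "ground truth: startup FAILED"
```

The proxy check passes while the artifact is unusable — exactly the failure mode proxies invite.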

Context separation: the generator and the verifier must be structurally separated. The verifier sees only the artifact — not the generator’s intent, reasoning, or intermediate state.

Why this matters: when the same context window generates output and then evaluates it, the evaluation validates against intent (“I meant to do X, and this does X”) rather than against the artifact (“does this actually work?”). The generator’s reasoning biases the verification.

Context separation breaks this loop. The verifier has no access to why the artifact was created — it can only judge what the artifact is.

In practice:

  • A compiler checking generated code has no knowledge of the LLM’s reasoning
  • An E2E test running against a deployed app doesn’t know what the agent intended
  • A screenshot diff compares pixels, not intentions
  • A delegated Expert receiving only a query (not the parent’s message history) evaluates independently
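One way to picture structural separation is a verifier that runs in a clean directory containing only the artifact. This is a hypothetical sketch (file names invented), not Perstack’s mechanism:

```shell
# Context separation: the verifier's working directory holds the artifact only.
echo 'console.log("ok");' > artifact.js      # the generator's output
echo 'internal reasoning' > scratchpad.txt   # generator context (must not leak)

mkdir -p verify
cp artifact.js verify/                        # copy the artifact, nothing else
( cd verify && node --check artifact.js ) \
  && echo "verdict: PASS" || echo "verdict: FAIL"
```

The verifier cannot be biased by the scratchpad because it never sees it.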

Determinism: the verification procedure must be fixed. Same artifact in, same verdict out. Every time.

If the verification itself varies — different results on different runs, or results that depend on evaluator mood — then it produces noise, not signal. An agent cannot correct course based on noise.

Deterministic verification:

  • Compiler: same code always produces the same errors (or none)
  • Test suite: same implementation always passes or fails the same tests
  • Screenshot diff: same render always produces the same pixel comparison

Non-deterministic verification:

  • LLM evaluation: same code may get “looks good” or “has issues” depending on the run
  • Manual review: depends on the reviewer’s attention and context
  • Flaky tests: sometimes pass, sometimes fail on the same code

When verification is deterministic, the agent gets a stable signal it can act on. When verification is non-deterministic, the agent cannot distinguish real problems from noise.

Perstack’s architecture is designed to maximize hard signal opportunities. Each architectural decision maps to one of the three conditions:

Delegation → Context separation

When Expert A delegates to Expert B, the delegate runs in a completely separate context — empty message history, its own instruction, no access to the parent’s reasoning. When Expert A receives the result, it evaluates only the returned artifact.

This is context separation by construction. The runtime enforces it — you cannot accidentally leak the generator’s context to the verifier.

Runtime → Determinism

The runtime draws a clear boundary between probabilistic (LLM reasoning) and deterministic (state management). Events are recorded deterministically. Checkpoints capture complete state. Replaying from a checkpoint produces identical results.

This means any verification process built on the runtime’s state — event stream analysis, checkpoint comparison, artifact diffing — inherits determinism automatically.

Workspace & sandbox → Ground truth

Experts write artifacts to the workspace — files, code, configurations. The sandbox isolates the execution environment. Together, they create a controlled space where artifacts can be built and verified against ground truth.

An Expert that generates code can write it to the workspace. Another Expert (or an external process) can compile it, run its tests, or start the application — verifying the actual artifact, not a proxy.

Observability → Deterministic audit trail


The full event stream is a deterministic record of everything that happened. Same execution always produces the same events. An external verifier can process these events, compare outputs against baselines, and produce a verdict — without any LLM involvement.
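A hypothetical sketch of such an external verifier (the event format here is invented for illustration): diff the recorded stream against a baseline and emit a verdict with no LLM in the loop.

```shell
# Event-stream verification: baseline comparison, no LLM judgment involved.
printf 'step:start\nstep:write app.js\nstep:end\n' > baseline.log
printf 'step:start\nstep:write app.js\nstep:end\n' > run.log

if diff -u baseline.log run.log > /dev/null; then
  echo "verdict: PASS (events match baseline)"
else
  echo "verdict: FAIL (events diverged)"
fi
```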

The difference between soft and hard signal loops compounds over time:

Soft signal loop: Agent generates output → LLM evaluates (“looks good”) → agent generates more → LLM evaluates (“actually, this part is wrong”) → agent fixes → LLM evaluates (“looks good now, but this other part…”) → oscillation. The agent appears productive but never converges.

Hard signal loop: Agent generates code → compiler rejects it → agent reads the error → agent fixes the specific issue → compiler accepts → tests fail → agent reads the failure → agent fixes → tests pass → done. Each iteration makes measurable progress because the signal is unambiguous.
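The hard-signal loop above can be sketched as a retry loop over a binary check. `check` and `fix` are stand-ins for the real compiler/test run and the agent’s edit; the retry budget of 5 is an assumed parameter:

```shell
# Hard-signal loop: iterate until a deterministic, binary check passes.
state=broken
check() { [ "$state" = fixed ]; }     # binary verdict: pass or fail
fix()   { state=fixed; }              # stands in for the agent acting on the error

attempts=0
until check; do
  attempts=$((attempts + 1))
  [ "$attempts" -ge 5 ] && { echo "budget exhausted"; break; }
  fix                                 # each iteration makes measurable progress
done
check && echo "converged after $attempts fix(es)"
```

Because the verdict is stable, progress is monotone: once a failure is fixed, it stays fixed.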

Hard signals don’t make agents smarter. They make the feedback loop trustworthy. An agent with mediocre reasoning but hard signals in the loop will outperform a brilliant agent with only soft signals — because the first one knows when it’s wrong.

Hard signals establish a quality floor — the minimum bar that the output must clear. The code compiles. The tests pass. The app starts. The screenshot matches the baseline. If any of these fail, the agent keeps iterating. The floor is the harness’s responsibility.

The quality ceiling is determined by domain knowledge — the constraints, rules, and context baked into the Expert’s instruction. Does the generated API follow the team’s naming conventions? Does the game have balanced difficulty progression? Does the customer support agent know the company’s refund policy? No hard signal can verify these. They are the Expert author’s responsibility.

This separation matters because it defines who owns what:

|  | Quality floor | Quality ceiling |
| --- | --- | --- |
| What determines it | Hard signals in the verification loop | Domain knowledge in the instruction |
| Who owns it | The harness (Perstack) | The Expert author (you) |
| How it fails | Artifact doesn’t work | Artifact works but doesn’t solve the right problem |
| How it improves | Better verification signals | Better domain constraints |

Architecture — monolithic or micro-agent — does not determine quality. A monolithic agent with hard signals will produce better output than a micro-agent team without them. What makes quality a system property is the combination: the harness provides the verification floor, and the author provides the knowledge ceiling. Neither alone is sufficient.

Soft signals are not useless — they are essential when the verification requires semantic judgment that no deterministic check can provide.

Some questions only an LLM can answer:

  • “Does this instruction faithfully reflect the domain constraints from the requirements?”
  • “Is this generated content appropriate for the target audience?”
  • “Does this API design follow the conventions of the existing codebase?”

These are inherently qualitative evaluations. Trying to force them into binary checks would lose the nuance that makes them valuable. The key is where you place them in the loop and what you combine them with.

The most effective architecture uses soft signals as an early gate and hard signals as the final authority:

write → review (soft) → test → verify (hard)
  ↑                                │
  └───────────── fix ←─────────────┘

The soft gate catches semantic misalignment early — before the expensive test-verify cycle runs. The hard verifier provides the final pass/fail decision. Neither replaces the other:

  • Without the soft gate: hard signals catch runtime failures but miss semantic drift. The artifact compiles and passes tests, but doesn’t reflect the requirements. You iterate through expensive test cycles to discover what a quick LLM review would have caught.
  • Without the hard verifier: soft reviews confirm alignment but miss actual breakage. The Expert “looks correct” but the generated artifact crashes on startup. The LLM reviewer can’t catch what only execution reveals.

Perstack’s own create-expert uses this exact pattern: review-definition (soft gate) checks whether the generated perstack.toml faithfully reflects plan.md’s domain constraints — a semantic judgment that requires LLM reasoning. Only after review passes does the loop proceed to test-expert → verify-test (hard verification). The soft reviewer has no exec — it reads files and judges alignment. The hard verifier has exec — it runs commands and compares outputs.

When using soft signals:

  1. Place them before hard signals — catch semantic issues early, before investing in expensive execution and verification.
  2. Give the soft reviewer only read access — no exec, no file writes. This keeps its role pure: it judges, it doesn’t act.
  3. Never use soft signals as the final gate — the last check before completion must be hard. A soft “looks good” is not a shipping signal.
  4. Context-separate the reviewer from the generator — just as with hard verification, the soft reviewer should be a separate Expert that sees only the artifacts, not the generator’s reasoning.

When building Experts, ask: what hard signal can verify this Expert’s output?

  • If the Expert generates code → compiler errors, test suite, application startup
  • If the Expert generates configuration → validation schema, dry-run deployment
  • If the Expert generates UI → screenshot diff, accessibility audit
  • If the Expert generates data → schema validation, constraint checks
  • If the Expert generates natural language → this is genuinely hard to verify with hard signals; acknowledge the limitation and supplement with hard signals on adjacent properties (e.g., format validation, length constraints, required keyword presence)

If the only answer is “another LLM reads it,” the verification loop is soft. The system will oscillate rather than converge. Look for a way to make the signal harder — even a partial hard signal (format validation, schema check) is better than none.
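A hypothetical sketch of such partial hard signals on natural-language output (the keyword and the 100-word limit are invented thresholds): the checks verify adjacent properties, not the prose quality itself.

```shell
# Partial hard signals on natural-language output: format, length, keywords.
cat > reply.txt <<'EOF'
Per our refund policy, items can be returned within 30 days of purchase.
EOF

ok=1
words=$(wc -w < reply.txt | tr -d ' ')
[ "$words" -le 100 ] || ok=0                 # length constraint
grep -qi 'refund policy' reply.txt || ok=0   # required keyword present
[ "$ok" -eq 1 ] && echo "floor checks: PASS" || echo "floor checks: FAIL"
```

These checks cannot confirm the reply is good, but they catch a class of breakage deterministically.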

Many verification criteria start as subjective questions: “is this code clean?”, “is this instruction concise?”, “is this output high quality?”. These are soft signals — the LLM always judges its own output favorably.

The conversion strategy: replace subjective evaluation with binary checks that have unambiguous yes/no answers.

| Soft check (LLM opinion) | Hard check (binary) |
| --- | --- |
| “Is the instruction concise?” | `wc -l instruction` ≤ 15 lines |
| “Does the code follow best practices?” | `npx tsc --noEmit` exits 0, `npm test` exits 0 |
| “Is the output well-structured?” | ``grep -c '```'`` = 0 (no code blocks in instructions) |
| “Are all dependencies declared?” | `grep 'delegates'` matches for every expert that references delegates |

Each binary check has a clear pass/fail result, a clear remediation action, and produces the same verdict every time. Subjective checks (“would removing this make the output worse?”) always pass because the LLM cannot judge its own output objectively.
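The first conversion in the table above as a runnable sketch (the 15-line limit and the file contents are assumed for illustration):

```shell
# Converting "is it concise?" into a binary, deterministic check.
printf 'You are a release-notes writer.\nKeep entries under 80 chars.\n' \
  > instruction.txt

lines=$(wc -l < instruction.txt | tr -d ' ')
if [ "$lines" -le 15 ]; then
  echo "PASS: concise ($lines lines)"
else
  echo "FAIL: too long ($lines lines)"
fi
```

The verdict is the same on every run, and the remediation (shorten the file) is unambiguous.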

An Expert doesn’t have to rely on external processes to get hard signals. You can embed the verifier inside the delegation tree — a dedicated verifier Expert that executes hard signal checks, structurally separated from the generator.

coordinator
├── generator — produces the artifact
├── reviewer — checks semantic alignment (soft gate, read-only)
├── executor — runs the artifact (pure execution, no evaluation)
└── verifier — executes hard signal checks against the result

The key design constraints:

  1. Verifier as a direct child of the coordinator — not nested under the generator. This guarantees context separation: the verifier shares no context with the generator.
  2. Verifier needs exec capability — without it, verification degrades to file reading, which is a soft signal. Hard signals require running commands that produce deterministic output.
  3. Executor and verifier are separate — the executor runs the artifact and reports what happened (facts only). The verifier runs checks and reports pass/fail. Combining them leaks execution context into verification.
  4. Reviewer has no exec — the soft gate reads files and judges alignment. Keeping it read-only prevents it from accidentally becoming a verifier.

Perstack’s own create-expert uses this pattern: review-definition (soft gate) checks plan alignment with read-only access, then test-expert executes the generated expert (pure executor, no evaluation), then verify-test runs hard signal checks, re-runs them a second time to confirm reproducibility, and performs structural checks — all deterministic, all independent of LLM judgment.

Confirming determinism: the reproducibility check


A signal is only as hard as its consistency. If a check passes once but fails on re-execution with the same artifact, the signal is non-deterministic — it produces noise, not information.

The practical fix: re-run every verification command a second time and compare results. If the output is identical, the signal is deterministic (hard). If it differs, the signal or the artifact needs fixing before you can trust it.

This is a cheap check that catches a common failure mode: flaky tests, environment-dependent behavior, or time-sensitive assertions that break reproducibility.
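A minimal sketch of the reproducibility check (file name invented, `node` assumed available): run the same verification command twice and compare exit codes and output.

```shell
# Reproducibility check: identical results across two runs suggest a hard signal.
echo 'console.log(1);' > artifact.js

node --check artifact.js > run1.txt 2>&1; status1=$?
node --check artifact.js > run2.txt 2>&1; status2=$?

if [ "$status1" -eq "$status2" ] && diff -q run1.txt run2.txt > /dev/null; then
  echo "reproducible: hard signal"
else
  echo "not reproducible: noise, not signal"
fi
```

Identical output is evidence of determinism, not proof — but divergent output is definitive proof of noise.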

See also:

  • Experts — how context isolation enables context separation
  • Runtime — how deterministic state enables deterministic verification
  • Testing Experts — applying signal quality to your test strategy
  • Best Practices — the “Keep It Verifiable” principle