When you build an agent-powered app, the instinct is to start with the app — set up a project, install dependencies, write scaffolding. Then somewhere in the middle, you start figuring out what the agent should actually do.
This is backwards.
Agent-first means starting with the agent. Get the brain working first. Once the agent behaves the way you want, expand outward: add tools, then build the shell around it. The agent is the product — everything else is infrastructure.
This matters because the agent will keep evolving. Prompts change, capabilities expand, behavior gets refined. If the agent is tangled with your application code, every change risks breaking something unrelated. Keep the brain separate from the body, and both can evolve on their own terms.
This guide uses Perstack — a toolkit for agent-first development. In Perstack, agents are called Experts: modular micro-agents defined in plain text (perstack.toml), executed by a runtime that handles model access, tool orchestration, and state management. Perstack supports multiple LLM providers including Anthropic, OpenAI, and Google. You define what the agent should do; the runtime makes it work.
Writing TOML by hand works, but there’s a faster way. create-expert is a CLI that generates Expert definitions from natural language descriptions; it is itself an Expert that builds other Experts.
npx create-expert "A code review assistant that checks for security vulnerabilities, suggests fixes, and explains the reasoning behind each finding"
create-expert takes your description, generates a perstack.toml, test-runs the Expert against sample inputs, and iterates on the definition until behavior stabilizes. You get a working Expert — no code, no setup.
The description doesn’t need to be precise. Start vague:
npx create-expert "Something that helps with onboarding new team members"
create-expert will interpret your intent, make decisions about scope and behavior, and produce a testable Expert. You can always refine from there.
create-expert reads the existing perstack.toml in your current directory. Run it again with a refinement instruction, and it modifies the definition in place:
npx create-expert "Make it more concise. It's too verbose when explaining findings"
npx create-expert "Add a severity rating to each finding: critical, warning, or info"
npx create-expert "Run 10 tests with different code samples and show me the results"
Each iteration refines the definition. The Expert gets better, and you never open an editor.
Prototyping isn’t just about getting the agent to run — it’s about finding where it fails.
Write a test case that your agent should catch. For the code reviewer, create a file with a deliberate vulnerability:
npx create-expert "Read the file test/vulnerable.py and review it. It contains a SQL injection — make sure the reviewer catches it and suggests a parameterized query fix"
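The test file itself can be ordinary application code with one flaw planted in it. Here is a minimal sketch of what test/vulnerable.py might contain; the file name comes from the command above, and the code is purely illustrative:

```python
# test/vulnerable.py -- a deliberately flawed fixture for exercising the reviewer.
import sqlite3


def get_user(db_path: str, username: str):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    # Flaw: user input is concatenated straight into the SQL string.
    # A safe version would pass it as a parameter instead:
    #   cursor.execute("SELECT id, email FROM users WHERE name = ?", (username,))
    query = "SELECT id, email FROM users WHERE name = '" + username + "'"
    cursor.execute(query)
    return cursor.fetchone()
```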
If the reviewer misses it, you’ve found a gap in the instruction. Refine and test again:
npx create-expert "The reviewer missed the SQL injection in the raw query on line 12. Update the instruction to pay closer attention to string concatenation in SQL statements"
This is the feedback loop that matters: write a scenario the agent should handle, test it, fix the instruction when it fails, repeat. By the time you build the app around it, you already know what the agent can and can’t do.
At some point you need feedback beyond your own testing. perstack start makes this easy — hand someone the perstack.toml and they can run the Expert themselves:
npx perstack start reviewer
The interactive UI lets them try their own queries and see how the Expert responds. No app to deploy, no environment to configure beyond the API key.
Every execution is recorded as checkpoints in the local perstack/ directory. After a round of feedback, inspect what happened:
npx perstack log
npx perstack log --tools    # what tools were called
npx perstack log --errors   # what went wrong
You can review specific runs, filter by step, or export as JSON for deeper analysis. See the CLI Reference for the full set of options.
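Once you have a JSON export, even a short script can summarize a round of feedback. A minimal sketch, assuming the export is a list of checkpoint records with fields such as "tool" and "error"; the field names and the export file name are assumptions for illustration, not Perstack's documented schema:

```python
# summarize_log.py -- rough sketch; "perstack-log.json" and the field names
# ("tool", "error") are assumptions, not Perstack's documented export format.
import json
from collections import Counter

with open("perstack-log.json") as f:
    checkpoints = json.load(f)

# Count how often each tool was called and collect the failed steps.
tool_calls = Counter(c["tool"] for c in checkpoints if c.get("tool"))
errors = [c for c in checkpoints if c.get("error")]

print(f"{len(checkpoints)} checkpoints, {len(errors)} errors")
for tool, count in tool_calls.most_common():
    print(f"  {tool}: {count} calls")
```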
This gives you a lightweight evaluation workflow: distribute the TOML, collect usage, analyze the logs, refine the instruction.