L / 002

Unleash the Kimis!

At wild, we're exploring the frontiers of Agentic Software Engineering: not just using AI to write code, but architecting systems of agents that coordinate, specialize, and converge on correctness through structured collaboration. This post documents our investigation into the orchestrator-subagent pattern, implemented in a side project using Kimi Code CLI to understand its mechanics before we consider production adoption.

Size Matters

The typical coding agent operates as a monolith: a single LLM juggling requirements analysis, architecture decisions, implementation, testing, and documentation within one growing context window. As sessions extend and complexity grows, we see what might be called "attention diffusion." The model's capacity to stay focused degrades as token count approaches context limits. (Transformers, it turns out, are not immune to the limits of working memory any more than the rest of us.)

Picture a session at token 180,000 where the agent is trying to remember: the original user requirements, architectural constraints, three half-finished features, type definitions across twelve files, recent debugging adventures, and a mental map of the whole project. As the context window fills up, the chances of drift (where the output subtly wanders from what you actually asked for) go up pretty quickly.

The Orchestrator-Subagent Architecture

Our approach partitions the Kimi Code agent system into specialized components, aka the Kimi agents:

  1. One orchestrator agent that delegates but never implements

  2. Domain-specific subagents with focused prompts and isolated context windows

  3. A structured execution loop that enforces separation of concerns

The orchestrator's system prompt explicitly forbids direct implementation. Its responsibilities reduce to: context gathering, delegation, coordination, and termination decisions. Implementation details, type signatures, and code patterns live exclusively in subagent contexts, which are ephemeral and task-scoped. Unleashed the Kimis are!

Think of it as microservices for LLM prompts: each service handles one domain, and none of them accumulate technical debt in the form of irrelevant context tokens.
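To make that partitioning concrete, here's a minimal sketch of how the agent roster could be expressed as configuration. The interface, field names, and tool names below are our own illustration, not Kimi Code's actual format:

```typescript
// Hypothetical agent definitions; the shape and names are ours, not Kimi Code's.
interface AgentSpec {
  name: string;
  systemPrompt: string;      // focused, domain-specific instructions
  allowedTools: string[];    // tool allow-list (see "Tool-level constraints" below)
  ephemeralContext: boolean; // true: context is discarded when the task ends
}

const agents: AgentSpec[] = [
  {
    name: "orchestrator",
    systemPrompt: "Delegate, coordinate, and decide when to stop. Never write code.",
    allowedTools: ["spawn_subagent", "read_task_board"],
    ephemeralContext: false, // lives for the whole session, but stays lean
  },
  {
    name: "server-engineer",
    systemPrompt: "Implement backend logic following the project's API conventions.",
    allowedTools: ["read_file", "write_file", "run_tests"],
    ephemeralContext: true, // fresh context per task
  },
  {
    name: "reviewer",
    systemPrompt: "Check the work against the checklist. Report findings; never fix.",
    allowedTools: ["read_file", "run_command"],
    ephemeralContext: true,
  },
];
```

The point is that each agent's prompt and tool surface is declared up front, so context isolation is a property of the system rather than a convention the model is asked to follow.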

The Execution Loop

The system implements a strict four-phase cycle (a code sketch of the loop follows the list):

  1. Planning as decomposition. The task planner breaks work into clear, bite-sized tasks with dependencies mapped out before any code is written. We've made yak shaving structurally impossible.

  2. Specialized execution. Domain experts handle their slice: server engineers work on backend logic, game engineers handle Canvas rendering, the changelog maintainer keeps documentation current. Each gets a fresh context window with only the information they need. When tasks don't depend on each other, the orchestrator runs multiple agents at once. Embarrassingly parallel, as the HPC folks would say.

  3. Review as a quality gate. The reviewer runs through a detailed checklist: type strictness, architecture patterns, naming conventions, completeness. It runs validation commands and delivers a verdict. It doesn't fix anything; it just reports what it finds. (The reviewer is a pure function. We're very proud of this.)

  4. Iterative convergence. If the reviewer spots issues, the orchestrator loops back to planning. The system converges through iteration rather than crossing its fingers and hoping for first-try perfection. We've essentially implemented gradient descent for software development, except the loss function is "does the reviewer approve?"
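Here's what that cycle looks like stripped to its skeleton. This is a sketch under our own types (Task, Verdict, Agents), not Kimi Code's API; in practice the loop lives in prompts and tool calls rather than one function:

```typescript
// A minimal sketch of the four-phase cycle. Types and function names are
// our own illustration, not Kimi Code's API.
interface Task { id: string; description: string; dependsOn: string[] }
interface Verdict { approved: boolean; issues: string[] }

interface Agents {
  plan(goal: string, knownIssues: string[]): Promise<Task[]>; // task planner
  execute(task: Task): Promise<void>;                         // domain specialist
  review(goal: string): Promise<Verdict>;                     // read-only reviewer
}

async function runFeature(goal: string, agents: Agents, maxIterations = 12): Promise<Verdict> {
  let issues: string[] = [];

  for (let i = 0; i < maxIterations; i++) {
    // 1. Planning as decomposition: tasks plus an explicit dependency graph.
    const tasks = await agents.plan(goal, issues);

    // 2. Specialized execution: run tasks in dependency order, parallelizing
    //    whatever is independent (assumes the planner produced an acyclic graph).
    const done = new Set<string>();
    while (done.size < tasks.length) {
      const ready = tasks.filter(
        t => !done.has(t.id) && t.dependsOn.every(d => done.has(d))
      );
      if (ready.length === 0) throw new Error("dependency cycle in plan");
      await Promise.all(ready.map(t => agents.execute(t)));
      ready.forEach(t => done.add(t.id));
    }

    // 3. Review as a quality gate: a verdict only, no fixes applied here.
    const verdict = await agents.review(goal);

    // 4. Iterative convergence: feed the findings back into the next plan.
    if (verdict.approved) return verdict;
    issues = verdict.issues;
  }

  return { approved: false, issues: ["max iterations reached; escalate to a human"] };
}
```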

Why This Works

Context efficiency. The orchestrator keeps a minimal record: just task metadata and what got delegated to whom. Even in a session with 50+ iterations, its context stays lean. It remembers that work happened, not every detail of how. If the orchestrator were a manager, it would be the rare kind that trusts the team and doesn't micromanage.
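Concretely, the orchestrator's memory per delegation looks something like this (an illustrative shape, not an actual Kimi Code data structure):

```typescript
// What the orchestrator retains per delegation: task metadata and an outcome
// summary, never the subagent's full transcript or the diff it produced.
interface DelegationRecord {
  taskId: string;
  assignedTo: string;        // e.g. "server-engineer"
  status: "done" | "failed";
  summary: string;           // one or two sentences, not the whole chat log
}
```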

Reduced hallucination. When we give each agent a focused area of expertise and the right reference docs, we see fewer invented APIs. This is prompt engineering through specialization, but applied at the system level instead of cramming everything into one mega-prompt. Software engineers will recognize this pattern.

Systematic error correction. In our testing, the review phase catches >90% of implementation errors. Bugs that sneak through one round get caught in the next. The trade-off: more iterations. The upside: higher quality at the end.

Tool-level constraints. Implementation agents can write files; the orchestrator can't. The planner can read files but not change them. We don't just trust the orchestrator not to write code; we make it structurally impossible. Principle of least privilege, applied to AI agents.
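A sketch of what that enforcement can look like at the dispatch layer, using hypothetical tool names and a guard of our own design rather than anything Kimi Code exposes:

```typescript
// Illustrative tool-level enforcement: check the allow-list before
// dispatching any tool call, instead of relying on the prompt alone.
type ToolCall = { tool: string; args: unknown };

const toolAllowList: Record<string, string[]> = {
  orchestrator: ["spawn_subagent", "read_task_board"], // no file writes
  planner: ["read_file"],                              // read-only access
  "server-engineer": ["read_file", "write_file", "run_tests"],
};

function dispatch(agent: string, call: ToolCall): void {
  const allowed = toolAllowList[agent] ?? [];
  if (!allowed.includes(call.tool)) {
    throw new Error(`${agent} is not permitted to call ${call.tool}`);
  }
  // ...hand off to the actual tool implementation here
}
```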

The Trade-offs

It takes longer. For our side project (~15k LOC TypeScript/React), a typical feature takes 12 to 18 minutes with 8 to 12 iterations. That's slower than a skilled human, but you get consistent quality and complete documentation every time. We're still deciding whether we've built a careful craftsman or an obsessive perfectionist.

It's more complex. The orchestrator pattern adds moving parts: more agents, more prompts, more coordination. The monolithic agent is simpler: one prompt, one model, done. But we've known for decades that monolithic architectures don't scale well. Turns out Conway's Law applies to AI systems too.

Prompts are infrastructure. Each subagent's prompt captures domain knowledge and project conventions. In production, these prompts become critical infrastructure: documentation as executable code, basically. They need to be versioned, reviewed, and maintained just like regular code.
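One way to do that, sketched with hypothetical file paths and a registry of our own design: keep each prompt as a file in the repo and load it through a versioned entry, so prompt changes go through the same review process as source code.

```typescript
// Illustrative prompt registry: prompts live in the repo as files, each with
// an explicit version, and are loaded at runtime like any other config.
import { readFileSync } from "node:fs";
import { join } from "node:path";

interface PromptEntry {
  agent: string;
  version: string; // bumped whenever the prompt's behavior changes
  path: string;    // e.g. "prompts/reviewer.md", reviewed like source code
}

const registry: PromptEntry[] = [
  { agent: "reviewer", version: "1.4.0", path: "prompts/reviewer.md" },
  { agent: "server-engineer", version: "2.1.0", path: "prompts/server-engineer.md" },
];

function loadPrompt(agent: string): string {
  const entry = registry.find(e => e.agent === agent);
  if (!entry) throw new Error(`No prompt registered for ${agent}`);
  return readFileSync(join(process.cwd(), entry.path), "utf8");
}
```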

Open Questions

We're actively exploring some interesting questions: Can the orchestrator learn from its own mistakes? (Meta-learning for agent systems. The Singularity, but for code reviews.) How many review iterations are too many before we escalate to a human? How do we measure when the system has actually "converged" on the right answer? Should subagents talk to each other directly, or does that just create coordination chaos?

For bigger codebases, we might need 15 to 20 specialists instead of our current eight. At what point does having too many agents create more problems than it solves? We're playing with dynamic agent creation in Kimi Code: the orchestrator can spawn new specialists on the fly for novel tasks. It's agents all the way down, potentially.

A Lesson Learned?

Our experiment suggests that treating AI coding assistants as architected systems, not magic generators, is promising. Specialized agents with focused contexts reduced errors, maintained architectural consistency, and improved documentation. This aligns with long-standing software architecture principles: separation of concerns works for AI too. We believe this pattern (orchestrator, planner, specialists, reviewer) can apply to any stack or project type. Using Kimi Code CLI, we're now designing collaborative, iterative AI systems, a significant step beyond just using AI to write code.