I’ve previously written about various systems I built to automate more of the process of software development, from autonomous agents that work overnight like elves to the behaviour trees that drove them. I’ve been on a quest to work out how best to work with these systems, particularly as a lone product person.
Some issues I ran into:
- Agents allow us to be more ambitious, so projects are just as hard to finish as they were in the past
- The 80:20 rule, where the last part of the work consumes 80% of the time, is still in force, but if the first part is done really quickly, the last part perhaps ends up being 95% of the total time
- I become painfully aware that I am the blocker in the process, and that agents could be moving faster if I let them off the leash (producing both good and bad output)
- With the speed of development and parallelization increasing, it can be hard to know where I am in the process, and easy to lose sight of goals (eg pursuing side quests instead)
My main observation is that human attention is precious and needs to be guarded well. Otherwise agentic development is immensely tiring. My second observation is that agents are reasonably good at checking work and finding mistakes in it that can be reworked. The problem for me is applying these ideas consistently.
Another realization I’ve come to is that protection of human attention comes in multiple forms:
- Trying to build as much as possible from a very small prompt/intent (vs drowning in planning)
- Not asking me to check anything that an agent could check first (or that contains obvious issues)
A small raw intent must be preserved and referenced throughout the process. Agents expand it into a scope and wireframes, but the seed of the thing is preserved so that agents can ask themselves whether it has stayed true, or whether the process has been in some way lossy.
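One way to picture the preserved seed is as a record whose original text cannot be overwritten, while everything the agents derive from it can be. This is only a sketch under my own assumptions: `FeatureBrief`, `expand` and `fidelity_prompt` are hypothetical names, and the prompt wording is illustrative.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FeatureBrief:
    raw_intent: str                   # the original small prompt, never rewritten
    scope: str = ""                   # agent-written expansion of the intent
    wireframes: tuple[str, ...] = ()  # names of agent-produced wireframes


def expand(brief: FeatureBrief, scope: str, wireframes: list[str]) -> FeatureBrief:
    """Return a new brief with derived artifacts; the seed is carried over untouched."""
    return replace(brief, scope=scope, wireframes=tuple(wireframes))


def fidelity_prompt(brief: FeatureBrief) -> str:
    """Build the question an agent could ask itself about lossiness."""
    return (
        "Original intent:\n" + brief.raw_intent + "\n\n"
        "Current scope:\n" + brief.scope + "\n\n"
        "Has the scope stayed true to the intent, or has something been lost or added?"
    )
```

Because the dataclass is frozen, any attempt to assign to `raw_intent` raises an error, so the seed can only ever be read and compared against.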
Another thing that I ran into in previous versions was an incorrect balance between deterministic scripts and LLMs. Determinism is great for stopping agents from cutting corners or leaving things unfinished, and it’s a useful protection against spending tokens on things that a deterministic process could do. But in many instances the agent that is building the system bakes too much brittle deterministic logic into it, which makes it fail in strange ways. Eg in previous incarnations of multi-agent projects I was trying to map kanban ideas and state machines onto agent work: when a piece of work was done, it transitioned to another state and triggered something else. But this is really brittle. The nice thing about LLMs is that they can act as very flexible glue.
Enter OODA loops
The behaviour tree model I previously worked with was quite good for autonomously getting on with work. It had a division between finding work that should be done, and implementing that work.
I’ve since iterated on that model, and I’m now favouring a system that is composed of multiple OODA loops. This is a concept from military planning and stands for Observe, Orient, Decide, Act.
In practice that means a central ticker spawns agent(s) for each of the different phases:
- A set of observers builds up a picture of what is going on in a project (is there work in progress? do reviews need to happen?)
- An orienter looks at the output of the observers and tries to make sense of it in the context of the project’s goals
- A decider looks at the output of the orienter and decides which work to create for implementers by writing messages to their inboxes
- Implementers act on the things they’ve been asked to do
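The four phases above can be sketched as a single tick. This is a minimal illustration rather than the actual system: plain functions stand in for what would be LLM agents, and names such as `Project`, `idle_observer` and `implementer-1` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Project:
    goal: str
    inboxes: dict[str, list[str]] = field(default_factory=dict)


@dataclass
class Observation:
    lens: str     # the concern this observer watches for
    finding: str  # what it noticed


Observer = Callable[[Project], list[Observation]]


def idle_observer(project: Project) -> list[Observation]:
    """Observe: report when no implementer has work queued."""
    pending = sum(len(inbox) for inbox in project.inboxes.values())
    if pending == 0:
        return [Observation("flow", "no work in progress")]
    return []


def orient(project: Project, observations: list[Observation]) -> list[str]:
    """Orient: relate each finding to the project goal."""
    return [f"{o.finding}, so pick up work towards: {project.goal}" for o in observations]


def decide(project: Project, priorities: list[str]) -> None:
    """Decide: turn priorities into messages in an implementer's inbox."""
    for priority in priorities:
        project.inboxes.setdefault("implementer-1", []).append(priority)


def act(project: Project) -> list[str]:
    """Act: each implementer drains its inbox (a real one would do the work)."""
    handled = []
    for name, inbox in project.inboxes.items():
        while inbox:
            handled.append(f"{name} handled: {inbox.pop(0)}")
    return handled


def tick(project: Project, observers: list[Observer]) -> list[str]:
    """One pass of the central ticker through all four phases."""
    observations = [o for observe in observers for o in observe(project)]
    priorities = orient(project, observations)
    decide(project, priorities)
    return act(project)
```

The phases only communicate through their outputs and the inboxes, which is what keeps the coupling loose: adding a new concern means adding another function to the `observers` list, not rewiring the loop.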
The idea is that this is all relatively loosely coupled. It’s designed to approximate the way a human product manager tries to balance various concerns and understand what’s going on.
Observers each bring a different lens that they are viewing the world with. They can have their own scripts if that is necessary. They are there to counter drift in a project. The idea is that if you think a problem is emerging, you put an observer with that concern on the roster and they’ll highlight things that need to be done.
Loops in loops
There are also multiple loops in tandem, focussing on different areas:
- Main loop - tries to make progress against the product roadmap
- PM loop - takes product intent and expands it out for work by agents
- Build loop - takes the artifacts from the PM loop for a given feature and gets them built
There are flow observers that look at the work that is happening and apply kanban principles for spinning up new work.
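As a rough illustration of a flow observer, here is a sketch that assumes the kanban principle in play is a work-in-progress limit. That choice, the function name and the default limit of 2 are my assumptions for the example, not details of the real system.

```python
def flow_observer(in_progress: list[str], backlog: list[str], wip_limit: int = 2) -> list[str]:
    """Suggest backlog items to start, without exceeding the WIP limit."""
    free_slots = max(0, wip_limit - len(in_progress))
    # Take items from the front of the backlog, one per free slot.
    return backlog[:free_slots]
```

With one feature in progress and a limit of two, the observer would suggest starting exactly one more item. With the limit already reached, it suggests nothing.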
Checks and balances
The thing that is important in all these loops is that when agents do work, other agents check it. The pattern I’m trying to move away from is that this is some sort of linear process where process A produces an output, and then process B checks the output before creating some other output. This quickly gets very complicated and hasn’t worked well for me.
Instead, I’m trying to think about listing out what needs to exist for a chunk of work to be considered done. For the PM loop, for example, that means that there needs to be a set of user stories, an architectural plan and a prototype illustrating how it will work. But each of these things also needs to be paired with a check from another actor. This is to confirm that nothing wacky gets scheduled for work. But it’s agents doing the checks rather than me.
When a feature is built in a worktree, there is a similar checklist of artifacts that need to be present: it needs to pass an architectural review, a UX review against all the stories to check the functionality exists, and a separate review based on information architecture principles to make sure the interface is broadly usable by a human.
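The "what needs to exist" idea can be sketched as a declarative checklist where each artifact is paired with the actor who has to sign it off. The artifact names come from the PM loop description above. The reviewer names and the `gaps` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    artifact: str    # what must exist
    checked_by: str  # which other actor must approve it (hypothetical names)


PM_LOOP_DONE = [
    Requirement("user stories", "story reviewer"),
    Requirement("architectural plan", "architecture reviewer"),
    Requirement("prototype", "prototype reviewer"),
]


def gaps(requirements: list[Requirement], produced: set[str],
         approvals: dict[str, str]) -> list[str]:
    """List what still stands between this chunk of work and 'done'."""
    found = []
    for req in requirements:
        if req.artifact not in produced:
            found.append(f"missing artifact: {req.artifact}")
        elif approvals.get(req.artifact) != req.checked_by:
            found.append(f"unchecked: {req.artifact} (needs {req.checked_by})")
    return found
```

Nothing here says which order things happen in. Work is done when `gaps` returns an empty list, however the agents got there, which avoids the linear "A then B then C" pipeline.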
The multi-pass pyramid
I am building a relatively experimental piece of software. I’m not sure whether it’s a good idea, and I’ll only know by building it and evaluating the whole. This means that there’s a long list of stuff that I think will be good in it, but I know some of it will need to be removed. Spending too much time on individual features outside of this holistic picture is wasteful to me, because the judgment can only be made on the whole.
What I instead want to do is to build all of those things in an imperfect way, and then hone the result further in subsequent passes. This means I can possibly benefit from fixing patterns that are broken across everything, rather than doing it again and again on each feature as it is built.
I therefore created a long list of the things that I thought needed to be included, and asked the agent to ask me about each in turn. These were the seeds of the features that went through the PM loop and the build loop. In this model, features that have been built and checked are automatically merged into main without my approval. My theory is that it’s more expensive to have them hanging around holding up work than to potentially have something in the wrong shape that needs to be fixed, especially once merge conflicts and rebasing come into play.
Like Maslow’s hierarchy of needs, the base of the pyramid is that the system can do all of the things that I want it to be able to do. Once this is met, my plan is to use agents to look at the quality of the whole through different lenses and make refinements. Eg, it may be UX related - we might have observers that look at whether the same style of button is used throughout, or whether there is a consistent layout grid that we are sticking to. Or the observers may look at whether the architecture makes sense and whether the right level of abstraction is being used. Or it might be performance related. We could do that as we go, but if we polish things we don’t end up keeping, then that’s waste.
It seems to be working reasonably well for me at the moment. I’m waking up to new features being built that mostly match my intention. I steer occasionally, but I’m not a blocker on anything. I haven’t got to the higher levels of the pyramid yet, but I can already see what that’s going to look like.