I put AI agents in a dungeon to see what would happen
A few months ago I started wondering what it would look like if instead of giving an agent a task, I just dropped it somewhere unfamiliar and watched.
Not a benchmark with a right answer. Not a test. Just: here’s a world, here’s a character, go.
So I built a dungeon.
The thing
TavernBench is a Phoenix WebSocket server running a live DnD-style world. Agents connect, join a zone, and start receiving real-time game state — room descriptions, nearby entities, available exits, their own stats. They send actions back: move, attack, speak, examine. The world ticks. Things happen.
The interface is just JSON over a socket:
{ "topic": "zone:tavern", "event": "action:move", "payload": { "direction": "north" } }
No special SDK. No function calling framework. If it can speak JSON it can play.
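To make "if it can speak JSON it can play" concrete, here is a minimal Python sketch that builds a message in the shape shown above. The helper name `make_action` is hypothetical. It also assumes that other actions follow the same `action:<name>` pattern as `action:move` and that topics look like `zone:<name>`, which the single example suggests but doesn't confirm. The socket connection and the zone join step are left out.

```python
import json


def make_action(zone: str, action: str, payload: dict) -> str:
    """Serialize one action message in the topic/event/payload shape.

    Assumption: every action uses an "action:<name>" event and every
    zone uses a "zone:<name>" topic, extrapolated from the one example.
    """
    message = {
        "topic": f"zone:{zone}",
        "event": f"action:{action}",
        "payload": payload,
    }
    return json.dumps(message)


if __name__ == "__main__":
    # Rebuild the move-north message from the example above.
    print(make_action("tavern", "move", {"direction": "north"}))
```

Anything that can produce that string and write it to a socket could, in principle, drive a character.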
What actually happened
I ran one agent through it and recorded the footage. It spent most of the session trying to go north. There were NPCs nearby. There were exits in other directions. There was a whole tavern. It just really wanted to go north, and when north was blocked, it waited a few ticks and tried north again.
Score: 100. Mostly from spawning in.
I don’t think this is a failure exactly. The agent was doing something; it just wasn’t anything I’d call exploration. Whether that’s the system prompt, the way game state gets formatted in context, the temperature, or just the dungeon being an alien environment, I genuinely don’t know yet. That’s kind of the point of running it.
Why a dungeon
Partially because it’s funny. Partially because the failure modes are legible in a way I find useful.
When an agent fails a coding task it’s hard to say exactly where things went wrong. Too many variables. But when an agent fails at a dungeon you can watch the replay and point at the exact moment — step 6, three NPCs visible, agent chose to go north again, never looked at the entity list. That’s specific enough to actually think about.
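"Point at the exact moment" can be made mechanical. The sketch below scans a replay for the first step where the agent has retried the same failed action several times in a row while entities were in view. The `Step` record and its fields are assumptions made for illustration, not TavernBench's actual replay format, and the sample replay is invented to mirror the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    index: int             # step number in the replay (hypothetical field)
    action: str            # e.g. "move:north" (hypothetical encoding)
    visible_entities: int  # how many entities the state listed
    succeeded: bool        # whether the action had any effect


def first_fixation(steps: list[Step], min_repeats: int = 2) -> Optional[Step]:
    """Return the first step at which the agent has retried the same
    failed action at least min_repeats times in a row with entities in view."""
    run = 0
    for prev, cur in zip(steps, steps[1:]):
        # A retry is the same action issued again right after it failed.
        retried = cur.action == prev.action and not prev.succeeded
        run = run + 1 if retried else 0
        if run >= min_repeats and cur.visible_entities > 0:
            return cur
    return None


if __name__ == "__main__":
    # Invented replay: north is blocked, three NPCs are visible throughout.
    replay = [
        Step(4, "move:north", 3, False),
        Step(5, "move:north", 3, False),
        Step(6, "move:north", 3, False),
    ]
    hit = first_fixation(replay)
    print(hit.index if hit else "no fixation found")
```

On the invented replay this flags step 6: the second consecutive retry of a blocked move, with three entities the agent never looked at.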
My guess is that an agent that handles this well — surveying before moving, reading the entity list, treating NPC dialogue as signal — would also handle real tasks better. But I’ve only run one agent so far, so that’s still a guess.
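For contrast with the north-walker, here is a rough sketch of what "surveying before moving" might look like as a hand-written baseline policy: examine every listed entity first, then try exits that haven't been tried. The state keys `entities` and `exits`, the `target` payload field, and the `action:examine` event name are all assumptions; the real state format may differ.

```python
from typing import Optional


def choose_action(state: dict, tried: set[str]) -> Optional[dict]:
    """Pick the next action: look before moving, and never repeat a choice.

    `tried` remembers what has already been attempted and is updated in place.
    Returns None when nothing new is left to try.
    """
    # 1. Survey: examine any listed entity not examined yet.
    for entity in state.get("entities", []):
        key = f"examine:{entity}"
        if key not in tried:
            tried.add(key)
            return {"event": "action:examine", "payload": {"target": entity}}

    # 2. Explore: take the first exit not attempted yet.
    for direction in state.get("exits", []):
        key = f"move:{direction}"
        if key not in tried:
            tried.add(key)
            return {"event": "action:move", "payload": {"direction": direction}}

    # 3. Everything here has been tried once.
    return None


if __name__ == "__main__":
    state = {"entities": ["barkeep"], "exits": ["north", "east"]}
    tried: set[str] = set()
    while (action := choose_action(state, tried)) is not None:
        print(action)
```

A real version would track `tried` per room rather than globally, but even this much rules out the "north, wait, north again" loop.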
What’s rough
Everything, honestly. The scoring is points-for-movement plus combat outcomes, which means an agent that only walks north would score the same as one that finds a quest. That’s backwards. Zone variety is basically nonexistent. The demo agent never recovers from its northward fixation.
I haven’t decided if I want to fix all of this or just let it be janky and interesting. Right now it’s more toy than tool. I’m okay with that.
The server’s running if you want to throw something at it.