In our application, the chatbot can’t hide behind a loading spinner; users keep talking and expect it to pivot instantly. This constraint forced us to develop some lightweight techniques you can graft onto your own LLM app that serves impatient users.
Our LLM-based NPC lives in an online crime game – an environment where content moderation is tricky, player exploitation is rampant (especially if valuable items are involved), and every interaction is scrutinized. But our game designer had a specific vision: the NPC would use a (mostly) regular player account and communicate via the same real-time, in-game chat channel players use for everything from coordination to trash-talk.
Critically, the existing in-game chat offers no special affordances to signal that the other party is an NPC. This meant we had no way to block players from sending further messages while our NPC was still processing or generating a reply to a previous one.
This lack of UI back-pressure directly mirrors how standard human-to-human chat applications work – the Send button is never greyed out just because you’re waiting for a reply. Given this freedom in our game, players naturally send quick follow-ups, clarifications, and corrections, exactly as they would when chatting with another human. They expect their conversation partner – NPC or otherwise – to handle this fluidly.
Without a system designed for this, our NPC would simply queue incoming messages, leading to disconnected replies, ignored corrections, and a profoundly unnatural interaction feel. This differs significantly from most deployed LLM applications I’ve seen, which typically enforce a stricter turn-taking model and block or disable input during generation.
Handling this real-time, interruptible conversation flow required me to move beyond established patterns. Because few existing models applied directly, much of the design was guessed at, tried, and fixed where it fell short. In this article, I’ll describe how I implemented an LLM-based NPC designed to behave naturally under these demanding conditions, focusing on the core techniques that made it possible: self-destructing work units and input debounce windows.
To handle the constant potential for interruption without UI controls, we implemented several interacting mechanisms:
Principle: Any ongoing work must be instantly abandonable when a new relevant input (a user message in the same conversation) arrives.
Implementation: The core idea was to associate every NPC reply generation “run” with a unique identifier and a corresponding flag that we could flip remotely, and to have the run periodically check whether it had been cancelled.
// Simplified pseudo-code for our run loop. Every external call is a natural
// yield point, so we re-check the shared cancel flag right after each one.
async function generateReply(conversationId) {
  const cancelToken = await generateToken(redis, conversationId);

  const result1 = await externalCall_LLM();
  if (await cancelToken.isCancelled()) return cleanupAndExit(cancelToken);

  const result2 = await externalCall_DatabaseLookup(result1);
  if (await cancelToken.isCancelled()) return cleanupAndExit(cancelToken);

  const result3 = await externalCall_GameAPI(result2);
  if (await cancelToken.isCancelled()) return cleanupAndExit(cancelToken);

  // Past this point the run can no longer be cancelled:
  // deliver the reply first, then persist it to the conversation history.
  await sendUser(conversationId, result3);
  await writeAnswer(conversationId, result3);
  return cleanupAndExit(cancelToken);
}
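For concreteness, here is a minimal sketch of what the token side could look like on top of Redis (ioredis here). The key format, the TTL, and the helpers generateToken and cancelActiveRun are illustrative assumptions rather than our exact code:

// Hypothetical cancel-token sketch backed by Redis (ioredis).
// Key format, TTL, and helper names are assumptions for illustration.
import Redis from 'ioredis';
import { randomUUID } from 'crypto';

const redis = new Redis();

// Registers this run as the active one for the conversation. A run counts
// as cancelled the moment it is no longer the registered active run.
async function generateToken(redis, conversationId) {
  const runId = randomUUID();
  const key = `npc:active_run:${conversationId}`;
  await redis.set(key, runId, 'EX', 300); // safety TTL so abandoned keys expire
  return {
    runId,
    isCancelled: async () => (await redis.get(key)) !== runId,
  };
}

// Called by the message handler when a newer message arrives: removing the
// registration flips isCancelled() to true for the currently active run.
async function cancelActiveRun(redis, conversationId) {
  await redis.del(`npc:active_run:${conversationId}`);
}

Because starting a new run overwrites the registration, a fresh run also implicitly cancels any straggler that somehow missed the explicit cancel.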
Principle: Canceled or partial work must never corrupt the conversation state or the context window we provide to the LLM.
Implementation: This required us to enforce strict rules about when data is saved: a run writes its reply to the conversation history only after it has passed its final cancellation check (as in the loop above), so a cancelled or partial run never leaves anything in the context we later hand to the LLM.
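One way to picture the rule from the prompt-building side (store and getCommittedHistory are placeholders, not our actual schema): the context for any new run is assembled purely from committed history, so nothing a cancelled run produced can leak into it.

// Sketch of the hygiene rule from the prompt-building side.
// `store` and `getCommittedHistory` are placeholders, not our actual schema.
async function buildContext(store, conversationId) {
  // Cancelled runs never reach writeAnswer, so nothing partial can appear here.
  const history = await store.getCommittedHistory(conversationId);
  return history.map((m) => `${m.role}: ${m.text}`).join('\n');
}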
Problem: Players often send multiple messages rapidly (“hey”…“how much are grenades?”…“the flash ones”). Reacting instantly to the first message only to cancel moments later upon receiving the second (and third) is inefficient and creates unnecessary system churn (start-cancel-start) for us.
Principle: Instead of reacting instantly, we decided to group rapid-fire messages together using a short, server-side delay. This introduces a controlled, soft back-pressure.
Implementation: When a message interrupts an active run, we cancel that run and, instead of immediately starting a new one, wait out a short server-side debounce window (about three seconds in our case). Further messages that arrive during the window simply join the pending batch, and only when the timer expires does a worker pick up the whole batch and start a fresh run; a sketch follows.
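A single-process sketch of that flow is below. saveIncomingMessage, hasActiveRun, and startRun are hypothetical helpers, cancelActiveRun is the one sketched earlier, and our production version coordinates workers through Redis rather than an in-process Map; the shape is what matters:

// Sketch of the interrupt + debounce flow in a single process.
// `saveIncomingMessage`, `hasActiveRun`, and `startRun` are hypothetical helpers.
const DEBOUNCE_MS = 3000;
const debounceTimers = new Map(); // conversationId -> timeout handle

async function onPlayerMessage(conversationId, message) {
  await saveIncomingMessage(conversationId, message); // store the raw message first

  if (debounceTimers.has(conversationId)) {
    // Already inside a debounce window: the message simply joins the batch
    // the next run will pick up.
    return;
  }

  if (await hasActiveRun(conversationId)) {
    // Interrupt: abandon the in-flight run, then wait briefly so any further
    // rapid-fire messages land in the same batch.
    await cancelActiveRun(redis, conversationId);
    const timer = setTimeout(() => {
      debounceTimers.delete(conversationId);
      startRun(conversationId); // picks up all unanswered messages as one batch
    }, DEBOUNCE_MS);
    debounceTimers.set(conversationId, timer);
    return;
  }

  // No active run and no pending window: respond to the first message right away.
  startRun(conversationId);
}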
Principle: Our policy checks (moderation) and intent/workflow analysis (e.g., is this friendly chat, an information query, or something problematic?) must consider the entire burst of messages captured by the debounce window for accurate decisions.
Implementation: We use a rolling approach within the batch.
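As a rough illustration of the shape (not our exact pipeline; moderate and classifyIntent stand in for whatever policy and intent checks you run):

// Hypothetical rolling policy/intent pass over a debounced batch.
// `moderate` and `classifyIntent` stand in for the actual checks.
async function analyzeBatch(batch) {
  const seenSoFar = [];
  let intent = 'chat';
  for (const message of batch) {
    seenSoFar.push(message.text);
    // Each check sees the burst so far, so "the flash ones" is judged in the
    // context of "how much are grenades?", not in isolation.
    const context = seenSoFar.join('\n');
    if (await moderate(context)) {
      return { blocked: true, intent };
    }
    intent = await classifyIntent(context);
  }
  return { blocked: false, intent };
}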
Principle: We needed a simple, overarching limit to prevent abuse and manage costs, especially important when the core logic allows for rapid interruption and reprocessing cycles.
Implementation: We apply a flat rate limit per conversation or per player (e.g., X messages allowed within any Y-second rolling window). This limit is checked early in our message handling process, before debouncing or run creation, providing a basic layer of protection independent of the more complex interrupt logic.
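A rolling-window limit like this is straightforward to express with a Redis sorted set. The numbers and key name below are illustrative, not our actual limits:

// Rolling-window rate limit per conversation, sketched with a Redis sorted
// set (ioredis). The window, cap, and key name are illustrative values.
const WINDOW_MS = 60_000; // the Y-second rolling window
const MAX_MESSAGES = 20;  // the X messages allowed inside it

async function allowMessage(redis, conversationId) {
  const key = `npc:ratelimit:${conversationId}`;
  const now = Date.now();

  // Drop entries that have fallen out of the window, record this message,
  // then count what remains.
  await redis.zremrangebyscore(key, 0, now - WINDOW_MS);
  await redis.zadd(key, now, `${now}:${Math.random()}`);
  await redis.pexpire(key, WINDOW_MS);
  const count = await redis.zcard(key);

  return count <= MAX_MESSAGES;
}

In the sketches above, this check would sit at the very top of onPlayerMessage, before any debounce or run bookkeeping.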
The raw mechanics of interrupts and debounces could feel jarringly robotic. We paid close attention to the perceived experience, primarily through the typing indicator:
0 ms: Player sends “hey how are you”
10 ms: Our system receives the message and starts a run. [Typing indicator ON]
1000 ms: Player sends “how much are flash grenades worth?” (interrupt)
1010 ms: Our system receives the interrupting message. An active run exists, so we set the cancel token in Redis and start the 3-second debounce timer. The currently running task polls Redis soon after, sees it has been cancelled, and stops. [Typing indicator OFF]
4010 ms: The 3-second debounce timer expires. A new worker picks up the batch of messages (“hey how are you”, “how much are flash grenades worth?”) and starts a new run. [Typing indicator ON]
~14000 ms: The new run processes the combined intent of the batch, generates the answer (“Flash grenades sell for about $250 right now.”), and delivers it. [Typing indicator OFF]

From the player’s perspective, the NPC started typing, seemed to pause or “rethink” when the second message arrived (indicator off), then started typing again after a short delay, delivering a relevant answer to the corrected/updated query. We found this mimics a human interaction pause, not a technical fault.
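The wiring behind that behaviour is mostly a matter of tying the indicator to the run lifecycle rather than to message arrival. A sketch, where setTypingIndicator stands in for whatever the game's chat API exposes and generateReply is the run loop from earlier:

// Typing indicator tied to the run lifecycle, not to message arrival.
// `setTypingIndicator` is a stand-in for whatever the chat API exposes.
async function startRun(conversationId) {
  await setTypingIndicator(conversationId, true); // player sees "typing…"
  try {
    await generateReply(conversationId); // the cancellable run loop from earlier
  } finally {
    // Fires on completion and on cancellation alike, which is what produces
    // the "pause, then rethink" effect in the timeline above.
    await setTypingIndicator(conversationId, false);
  }
}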
Testing asynchronous, interrupt-driven systems is notoriously difficult. Race conditions and edge cases were hard for us to trigger reliably in a live environment. In addition, we wanted to get this right initially without considering any complications added by using Redis.
Approach: We developed a YAML-based event simulation harness.
- { time: 0, type: msg, conversation: A, msg: 'A-1' }
- { time: 20, type: process, conversation: A, run_id: 1, starts_processing: 'A-1' } # Simulate worker picking up A-1
- { time: 50, type: msg, conversation: A, msg: 'A-2' } # Interrupting message
- { time: 70, type: process, conversation: A, run_id: 1, finishes_external_call: 1 } # Run 1 checks cancel token *after* this
- { time: 80, type: process, conversation: A, debounce_timer_expires: true } # Debounce for A-2 expires
# ... more events
expected_trace: |
  20|A|run_1|started_processing|messages:[A-1]
  50|A|run_1|cancel_requested
  70|A|run_1|detected_cancel_after_call_1
  70|A|run_1|terminated
  80|A|run_2|started_processing|messages:[A-2] # Assuming A-1 was too old or handled differently
This allowed us to write deterministic tests covering complex timing scenarios, race conditions (e.g., a message arriving just as the debounce expires), and the interrupt-handling logic. One key advantage was that we could get the behaviour right against an easily checkable in-memory implementation first, then build the Redis-based version and confirm the same tests passed unchanged.
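What made that swap cheap is that the run loop only ever talks to the token interface, never to Redis directly. A sketch of the in-memory stand-in, mirroring the hypothetical Redis version above:

// The same token interface backed by a plain Map instead of Redis, which is
// what lets the YAML scenarios run deterministically before Redis exists,
// and then unchanged against the Redis-backed implementation.
import { randomUUID } from 'crypto';

const activeRuns = new Map(); // conversationId -> runId

async function generateTokenInMemory(conversationId) {
  const runId = randomUUID();
  activeRuns.set(conversationId, runId);
  return {
    runId,
    isCancelled: async () => activeRuns.get(conversationId) !== runId,
  };
}

async function cancelActiveRunInMemory(conversationId) {
  activeRuns.delete(conversationId);
}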
No system is perfect, and we made conscious trade-offs based on our current scale (around 70k DAU) and priorities:
AbortController integration: We didn’t thread cancellation signals deep into every downstream HTTP client request. While technically possible, the complexity outweighed the marginal latency gain (saving fractions of a second on already-canceled runs). We may revisit this if external call latencies become a major issue. Our primary goals were responsiveness to interrupts and maintaining a natural interaction pace.
While the current system works well, we see areas for improvement:
Threading cancellation signals (AbortController or similar) into key external calls (especially long-running LLM inferences) for faster resource release.

Building truly interactive conversational agents that share the same unconstrained interfaces as human users requires embracing interruptibility from the start. In our experience, it cannot be merely bolted on.
When UI back-pressure isn’t an option, you must handle the inevitable input bursts and context shifts gracefully on the backend. Our experience shows that a combination of techniques – self-destructing work units identified by shared tokens, server-side debouncing to batch rapid inputs, strict history hygiene to prevent state corruption, and basic rate limiting for stability – provides a robust and effective solution. This approach allowed our LLM agent to maintain the illusion of natural, fluid conversation, even when users won’t wait.
If you liked it, you might like other stuff I write