When Users Won’t Wait: Engineering Killable LLM Responses


In our application, the chatbot can’t hide behind a loading spinner; users keep talking and expect it to pivot instantly. This constraint forced us to develop some lightweight techniques you can graft onto your own LLM app that serves impatient users.

Setting the Scene: An NPC in the Crossfire

Our LLM-based NPC lives in an online crime game – an environment where content moderation is tricky, player exploitation is rampant (especially if valuable items are involved), and every interaction is scrutinized. But our game designer had a specific vision: the NPC would use a (mostly) regular player account and communicate via the same real-time, in-game chat channel players use for everything from coordination to trash-talk.

Critically, the existing in-game chat that players use offered no special affordances to signal interaction with an NPC. This meant we had no way to block users from sending further messages while our NPC was still processing or generating a reply to a previous one.

The Unblockable Input Problem

This lack of UI back-pressure directly mirrors how standard human-to-human chat applications work – the Send button is never greyed out just because you’re waiting for a reply. Given this freedom in our game, players naturally send quick follow-ups, clarifications, and corrections, exactly as they would when chatting with another human. They expect their conversation partner – NPC or otherwise – to handle this fluidly.

Without a system designed for this, our NPC would simply queue incoming messages, leading to disconnected replies, ignored corrections, and a profoundly unnatural interaction feel. This differs significantly from most deployed LLM applications I’ve seen, which typically enforce a stricter turn-taking model and block or disable input during generation.

Forging a Solution

Handling this real-time, interruptible conversation flow required me to move beyond established patterns. Because few existing approaches applied directly, much of the design was guessed at, tried, and fixed where it fell short. In this article, I’ll describe how I implemented an LLM-based NPC designed to behave naturally under these demanding conditions, focusing on the core techniques that made it possible: self-destructing work units and input debounce windows.

Core Mechanisms: Building an Interruptible & Responsive Backend

To handle the constant potential for interruption without UI controls, we implemented several interacting mechanisms:

1. Making Work Self-Destruct: The Interruptible Run Life-Cycle

Principle: Any ongoing work must be instantly abandonable when a new relevant input (a user message in the same conversation) arrives.

Implementation: The core idea was to associate every NPC reply-generation “run” with a unique identifier and a corresponding flag that we could check and flip remotely, and to have the run periodically check whether it had been cancelled.

  • Cancel Token: When a run starts, it generates a unique token ID linked to the conversation’s ID. We create a corresponding key in a shared, fast key-value store (we use Redis for its speed, its ability to handle frequent checks, and its suitability for sharing state across server instances). Initially, this key marks the run as ‘active’.
  • Polling Check: A run typically involves several steps, often including multiple external calls: generating embeddings, calling the main LLM, looking up data in our database, or calling other game APIs. These calls can take seconds, sometimes up to 10 seconds each, and are generally run sequentially. After each significant external call completes, our worker process checks the status of its token ID in Redis.
// Simplified pseudo-code for our run loop
const cancelToken = await generateToken(redis, conversationId);

// After each multi-second external call, poll the shared token and bail out if cancelled
const result1 = await externalCall_LLM();
if (await cancelToken.isCancelled()) return cleanupAndExit(cancelToken);

const result2 = await externalCall_DatabaseLookup(result1);
if (await cancelToken.isCancelled()) return cleanupAndExit(cancelToken);

const result3 = await externalCall_GameAPI(result2);
if (await cancelToken.isCancelled()) return cleanupAndExit(cancelToken);

// Past the last check the reply is committed: deliver it first, then persist it
await sendUser(conversationId, result3);
await writeAnswer(conversationId, result3);

return cleanupAndExit(cancelToken);
  • Cancellation Trigger: When a new message arrives for a conversation, we look up any tokens linked to that conversation ID and mark them cancelled. Active runs will detect this change upon their next check and terminate themselves politely. A minimal sketch of this token life-cycle follows below.
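
Here is that sketch: one possible Redis-backed shape for generateToken and its companion cancellation trigger. It assumes the ioredis client; the key layout, TTLs, and the cancelRunsFor helper are illustrative rather than our exact implementation.

// Sketch only: one possible Redis-backed cancel token for the run loop above.
// Assumes the ioredis client; key names, TTLs, and cancelRunsFor are illustrative.
import Redis from "ioredis";
import { randomUUID } from "crypto";

interface CancelToken {
  runId: string;
  isCancelled(): Promise<boolean>;
}

// Register a new run under its conversation and return a pollable token.
async function generateToken(redis: Redis, conversationId: string): Promise<CancelToken> {
  const runId = randomUUID();
  const runKey = `run:${runId}:status`;

  // Mark the run as active, remember it under its conversation, and add a TTL as a safety net.
  await redis.set(runKey, "active", "EX", 300);
  await redis.sadd(`conversation:${conversationId}:runs`, runId);
  await redis.expire(`conversation:${conversationId}:runs`, 300);

  return {
    runId,
    // Polled by the run after each external call.
    isCancelled: async () => (await redis.get(runKey)) === "cancelled",
  };
}

// Cancellation trigger: when a new message arrives, flip the flag for every run
// registered against that conversation.
async function cancelRunsFor(redis: Redis, conversationId: string): Promise<void> {
  const runIds = await redis.smembers(`conversation:${conversationId}:runs`);
  for (const runId of runIds) {
    await redis.set(`run:${runId}:status`, "cancelled", "EX", 300);
  }
}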

2. Managing State: History Hygiene for Interrupted Flows

Principle: Canceled or partial work must never corrupt the conversation state or the context window we provide to the LLM.

Implementation: This required us to enforce strict rules about when data is saved.

  • Write-Only-If-Sent: We only persist the full text of the NPC’s reply to the permanent conversation history after it has been successfully generated and delivered to the player. If a run is canceled mid-way, nothing from that partial attempt is saved to the conversation history (see the sketch after this list).
  • Context Management: Separately, we implemented robust context summarization and trimming mechanisms (as you’d need in any LLM application) to keep the prompt within necessary limits, but the key here is that our cancelled runs don’t add noise or partial thoughts to that context.
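
Here is that sketch of the write-only-if-sent rule; deliverToPlayer, persistToHistory, and the CancelToken shape are hypothetical stand-ins for our delivery, persistence, and cancellation layers.

// Sketch of the write-only-if-sent rule; deliverToPlayer and persistToHistory
// are hypothetical stand-ins for our chat delivery and history persistence code.
interface CancelToken {
  isCancelled(): Promise<boolean>;
}

declare function deliverToPlayer(conversationId: string, reply: string): Promise<void>;
declare function persistToHistory(conversationId: string, reply: string): Promise<void>;

async function finishRun(token: CancelToken, conversationId: string, reply: string): Promise<void> {
  // A run cancelled at the last moment must leave no trace in the history.
  if (await token.isCancelled()) return;

  // Deliver first; only a reply the player actually received gets persisted.
  await deliverToPlayer(conversationId, reply);
  await persistToHistory(conversationId, reply);
}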

3. Handling Input Bursts: Soft Back-Pressure via Debouncing

Problem: Players often send multiple messages rapidly (“hey”…“how much are grenades?”…“the flash ones”). Reacting instantly to the first message, only to cancel moments later when the second (and third) arrive, is inefficient and creates unnecessary system churn (start-cancel-start).

Principle: Instead of reacting instantly, we decided to group rapid-fire messages together using a short, server-side delay. This introduces a controlled, soft back-pressure.

Implementation:

  • Debounce Timer: When a message arrives for a conversation:
    • If no run is active and no debounce timer is running, we start a new run immediately with that message.
    • If a run is active, or if a debounce timer is already ticking, we start (or reset) a 3-second timer. Any subsequent messages arriving for that conversation during these 3 seconds simply reset the timer back to 3 seconds.
  • Batch Processing: When the 3-second timer finally elapses without being reset, a worker picks up all the messages that arrived during that debounce window and processes them together as a single batch in the next run (sketched below).
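
Concretely, a minimal in-memory version could look like the following. startRun is a hypothetical helper, and in production the buffered batch and timer state live in shared storage rather than per process, but the shape is the same.

// Minimal in-memory debounce sketch. startRun and activeRuns are hypothetical;
// cancelling the in-flight run (and folding its messages into the batch) is handled elsewhere.
const DEBOUNCE_MS = 3000;

const pendingMessages = new Map<string, string[]>();                       // conversationId -> buffered messages
const debounceTimers = new Map<string, ReturnType<typeof setTimeout>>();   // conversationId -> active timer
const activeRuns = new Set<string>();                                      // conversations with a run in flight

declare function startRun(conversationId: string, messages: string[]): void;

function onMessage(conversationId: string, message: string): void {
  const hasTimer = debounceTimers.has(conversationId);
  const hasRun = activeRuns.has(conversationId);

  // Fast path: nothing in flight and nothing buffered, so start a run immediately.
  if (!hasRun && !hasTimer) {
    startRun(conversationId, [message]);
    return;
  }

  // Otherwise buffer the message and (re)arm the 3-second window.
  const batch = pendingMessages.get(conversationId) ?? [];
  batch.push(message);
  pendingMessages.set(conversationId, batch);

  if (hasTimer) clearTimeout(debounceTimers.get(conversationId)!);
  debounceTimers.set(
    conversationId,
    setTimeout(() => {
      debounceTimers.delete(conversationId);
      const toProcess = pendingMessages.get(conversationId) ?? [];
      pendingMessages.delete(conversationId);
      startRun(conversationId, toProcess); // Process the whole burst as one batch.
    }, DEBOUNCE_MS)
  );
}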

4. Understanding Intent Across Bursts: Rolling Moderation & Workflow Dispatch

Principle: Our policy checks (moderation) and intent/workflow analysis (e.g., is this friendly chat, an information query, or something problematic?) must consider the entire burst of messages captured by the debounce window for accurate decisions.

Implementation: We use a rolling approach within the batch.

  • Cumulative Analysis: For a batch of messages [M1, M2, M3]:
    1. We analyze M1 alone -> Assign category (e.g., “BANTER”).
    2. We analyze M1+M2 together -> Assign combined category (e.g., “QUESTION”).
    3. We analyze M1+M2+M3 together -> Assign combined category (e.g., “ATTEMPTED_JAILBREAK”).
  • Decision Logic: Our system looks at the categories assigned across all cumulative checks. The “most severe” or overriding category determines the outcome. For example, if “ATTEMPTED_JAILBREAK” appears at any step, we might block or flag the entire batch, regardless of other categories like “QUESTION”. If no blocking category is found, the category from the final (full batch) analysis might determine the workflow we use (e.g., use a retrieval-augmented prompt for “QUESTION”, a simple chat prompt for “BANTER”). A sketch of this rolling check follows below.
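
Here is that sketch, using the category names from the example above, with classify standing in for the moderation/intent model call and an illustrative severity ordering.

// Sketch of cumulative batch analysis. classify() stands in for our moderation/intent
// model call; the category set and severity ordering are illustrative.
type Category = "BANTER" | "QUESTION" | "ATTEMPTED_JAILBREAK";

const SEVERITY: Record<Category, number> = {
  BANTER: 0,
  QUESTION: 1,
  ATTEMPTED_JAILBREAK: 2,
};

declare function classify(messages: string[]): Promise<Category>;

// Assumes a non-empty batch, e.g. ["hey", "how much are grenades?", "the flash ones"].
async function analyseBatch(batch: string[]): Promise<{ blocking: boolean; workflow: Category }> {
  const categories: Category[] = [];

  // Classify each growing prefix: [M1], [M1, M2], [M1, M2, M3], ...
  for (let i = 1; i <= batch.length; i++) {
    categories.push(await classify(batch.slice(0, i)));
  }

  // The most severe category seen at any step decides whether to block;
  const mostSevere = categories.reduce((a, b) => (SEVERITY[a] >= SEVERITY[b] ? a : b));

  // otherwise the final (full-batch) category picks the workflow.
  return {
    blocking: mostSevere === "ATTEMPTED_JAILBREAK",
    workflow: categories[categories.length - 1],
  };
}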

5. Ensuring Stability: System Protection via Rate Limiting

Principle: We needed a simple, overarching limit to prevent abuse and manage costs, especially important when the core logic allows for rapid interruption and reprocessing cycles.

Implementation: We apply a flat rate limit per conversation or per player (e.g., X messages allowed within any Y-second rolling window). This limit is checked early in our message handling process, before debouncing or run creation, providing a basic layer of protection independent of the more complex interrupt logic.
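
One common way to implement such a rolling-window check is a Redis sorted set keyed by player. The sketch below assumes the ioredis client; the limit, window, and key naming are illustrative rather than our exact configuration, and the check-then-add is not atomic (a Lua script or MULTI would tighten that).

// Sketch of a rolling-window rate limit using a Redis sorted set (ioredis assumed).
// Limit, window, and key naming are illustrative.
import Redis from "ioredis";

const redis = new Redis();

const MAX_MESSAGES = 10;   // "X messages..."
const WINDOW_MS = 60_000;  // "...within any Y-second rolling window"

async function allowMessage(playerId: string): Promise<boolean> {
  const key = `ratelimit:${playerId}`;
  const now = Date.now();

  // Drop timestamps that have fallen out of the window, then count what is left.
  await redis.zremrangebyscore(key, 0, now - WINDOW_MS);
  const recent = await redis.zcard(key);
  if (recent >= MAX_MESSAGES) return false;

  // Record this message and keep the key from living forever.
  await redis.zadd(key, now, `${now}:${Math.random()}`);
  await redis.pexpire(key, WINDOW_MS);
  return true;
}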

The Player Experience: Crafting the Illusion of Interaction

The raw mechanics of interrupts and debounces could feel jarringly robotic. We paid close attention to the perceived experience, primarily through the typing indicator:

  1. 0 ms: Player sends “hey how are you”
  2. 10 ms: Our system receives message, starts run, [Typing Indicator ON]
  3. 1000 ms: Player sends “how much are flash grenades worth?” (Interrupt)
  4. 1010 ms: Our system receives interrupt message. An active run exists, so we set the cancel token in Redis and start the 3s debounce timer. The currently running task polls Redis soon after, sees ‘canceled’, and stops. [Typing Indicator OFF]
  5. 4010 ms: 3-second debounce timer expires. A new worker picks up the batch of messages (“hey how are you”, “how much are flash grenades worth?”). It starts a new run. [Typing Indicator ON]
  6. ~14000 ms: The new run completes processing the combined intent of the batch, generates the answer (“Flash grenades sell for about $250 right now.”), delivers it. [Typing Indicator OFF]

From the player’s perspective, the NPC started typing, seemed to pause or “rethink” when the second message arrived (indicator off), then started typing again after a short delay, delivering a relevant answer to the corrected/updated query. We found this mimics a human interaction pause, not a technical fault.

Ensuring Correctness: Deterministic Testing

Testing asynchronous, interrupt-driven systems is notoriously difficult. Race conditions and edge cases were hard for us to trigger reliably in a live environment. We also wanted to get the core logic right first, before layering in any complications introduced by Redis.

Approach: We developed a YAML-based event simulation harness.

  • Event Log: We define scenarios as a timed sequence of events:
- { time: 0, type: msg, conversation: A, msg: 'A-1' }
- { time: 20, type: process, conversation: A, run_id: 1, starts_processing: 'A-1' } # Simulate worker picking up A-1
- { time: 50, type: msg, conversation: A, msg: 'A-2' } # Interrupting message
- { time: 70, type: process, conversation: A, run_id: 1, finishes_external_call: 1 } # Run 1 checks cancel token *after* this
- { time: 80, type: process, conversation: A, debounce_timer_expires: true } # Debounce for A-2 expires
# ... more events
  • In-Memory Runner: A test runner consumes this log, simulating time passing and triggering corresponding actions in an in-memory model of our system (message queues, run states, debounce timers) without needing actual Redis calls or LLM inference.
  • Assertion: The runner outputs its own trace of state changes, which we compare against an expected trace defined in the test case.
expected_trace: |
  20|A|run_1|started_processing|messages:[A-1]
  50|A|run_1|cancel_requested
  70|A|run_1|detected_cancel_after_call_1
  70|A|run_1|terminated
  80|A|run_2|started_processing|messages:[A-2] # Assuming A-1 was too old or handled differently

This allows us to write deterministic tests covering complex timing scenarios, race conditions (e.g., a message arriving just as the debounce expires), and interrupt-handling logic. A key advantage was that we could validate the logic against an easily-checkable in-memory implementation first, then confirm that the same tests passed unchanged once we swapped in the Redis-based implementation.
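
To give a flavour of the harness, here is a heavily simplified sketch of the runner loop. The event and trace shapes mirror the YAML above, while loadScenario, createSystem, and the in-memory model itself are hypothetical and elided.

// Heavily simplified sketch of the simulation runner. SimulatedSystem, loadScenario,
// and createSystem are hypothetical; the real in-memory model is elided.
interface SimEvent {
  time: number;
  type: "msg" | "process";
  conversation: string;
  [field: string]: unknown; // run_id, msg, debounce_timer_expires, etc.
}

interface SimulatedSystem {
  // Applies one event at its simulated time and returns any trace lines it produced.
  apply(event: SimEvent): string[];
}

declare function loadScenario(path: string): { events: SimEvent[]; expectedTrace: string[] };
declare function createSystem(): SimulatedSystem;

function runScenario(path: string): void {
  const { events, expectedTrace } = loadScenario(path);
  const system = createSystem();
  const trace: string[] = [];

  // Replay events in timestamp order; no real clocks, Redis, or LLM calls are involved.
  for (const event of [...events].sort((a, b) => a.time - b.time)) {
    trace.push(...system.apply(event));
  }

  // Deterministic comparison against the expected trace from the YAML test case.
  if (trace.join("\n") !== expectedTrace.join("\n")) {
    throw new Error(`Trace mismatch:\n${trace.join("\n")}\n--- expected ---\n${expectedTrace.join("\n")}`);
  }
}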

Pragmatic Choices: Trade-offs & What We Didn’t Build (Yet)

No system is perfect, and we made conscious trade-offs based on our current scale (around 70k DAU) and priorities:

  • Skipped: Deep AbortController Integration: We didn’t thread cancellation signals deep into every downstream HTTP client request. While technically possible, the complexity outweighed the marginal latency gain (saving fractions of a second on already-canceled runs). We may revisit this if external call latencies become a major issue.
  • Skipped: Complex Orchestration/Hot-Shard Avoidance: Standard Redis hash-slotting provided sufficient distribution for our cancel tokens and run state. More advanced orchestration wasn’t necessary at our scale.
  • Skipped: UI-Level Throttling/Cancel Buttons: We explicitly prioritized a seamless player experience mimicking human chat over giving users UI controls that reveal the underlying mechanics. Our server handles interrupts gracefully instead.
  • Skipped: Full Observability Suite: For this relatively low-stakes feature, intensive distributed tracing or monitoring felt like premature optimization to us. Basic logging and metrics suffice for now.

Results: Meeting Performance and Experience Goals

Our primary goals were responsiveness to interrupts and maintaining a natural interaction pace.

  • Cancel Reaction Time:
    • Target: ≤ 1 external-call interval (i.e., detect cancel shortly after the next external call finishes).
    • Achieved: 1–3 seconds. Our runs typically notice cancellation quickly after completing their next multi-second external call.
  • Full Reply Time (End-to-End):
    • Target: ≤ 20 seconds (aiming for a human-like thinking time).
    • Achieved: Typically 8–12 seconds, including debounce delays and multiple external calls for complex queries.
  • Player Feedback:
    • Target: Minimize complaints about “stuck,” “laggy,” or “confused” NPCs.
    • Achieved: Near-zero specific complaints related to the NPC failing to handle interruptions or follow-up messages correctly.

Future Work & Potential Enhancements

While the current system works well, we see areas for improvement:

  1. True Mid-Flight Aborts: We want to wire cancellation signals (e.g., via AbortController or similar) into key external calls (especially long-running LLM inferences) for faster resource release (see the sketch after this list).
  2. Adaptive Debouncing: We plan to experiment with adjusting the 3-second debounce window, perhaps based on the player’s recent messaging cadence or the conversation context.
  3. Observability Hooks: We might add optional, more detailed tracing and monitoring hooks that can be enabled if the feature becomes more critical or our usage scales significantly.
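
For the first of these, the standard building block in Node is AbortController: pass an AbortSignal into each outbound request and abort it when the cancel token flips. A rough sketch, with fetch as the example client, an illustrative polling interval, and the CancelToken shape assumed from earlier:

// Sketch of a mid-flight abort via AbortController (Node 18+ fetch assumed).
// The 250 ms polling interval and the CancelToken shape are illustrative.
interface CancelToken {
  isCancelled(): Promise<boolean>;
}

async function fetchWithCancel(url: string, token: CancelToken): Promise<Response> {
  const controller = new AbortController();

  // Watch the shared cancel flag while the request is in flight and abort if it flips.
  const watcher = setInterval(async () => {
    if (await token.isCancelled()) controller.abort();
  }, 250);

  try {
    // fetch rejects with an AbortError as soon as abort() is called.
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearInterval(watcher);
  }
}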

Conclusion & Key Takeaways

Building truly interactive conversational agents that share the same unconstrained interfaces as human users requires embracing interruptibility from the start. In our experience, it cannot be merely bolted on.

When UI back-pressure isn’t an option, you must handle the inevitable input bursts and context shifts gracefully on the backend. Our experience shows that a combination of techniques – self-destructing work units identified by shared tokens, server-side debouncing to batch rapid inputs, strict history hygiene to prevent state corruption, and basic rate limiting for stability – provides a robust and effective solution. This approach allowed our LLM agent to maintain the illusion of natural, fluid conversation, even when users won’t wait.
