Street-fighting RAG: Chain-of-thought prompting
or, reducing hallucination and making in-generation adjustments to LLM responses
This is the first article from my "street-fighting RAG" series: quick, dirty, and effective tricks for better Retrieval-Augmented Generation apps, based on developing systems for a crime-based online game
I build LLM-based non-player characters (NPCs) for a popular online game. The game exists in a constrained reality: it's based (kind of!) in the real world, but with a limited and fixed set of real-world items you can interact with:
You can travel to Canada in the game, but not, say, Lithuania;
You can acquire throwing weapons like a Ninja Star or a Snowball, but there are no boomerangs;
You have an Endurance stat, but it applies purely to your in-game-career performance, and not at all to your gains at the in-game gym or to fighting other players.
If -- like us -- you want to use an off-the-shelf model, this is a big freakin' problem. The LLM already has a strong, everyday association with how “endurance” should relate to the gym, has strong opinions on whether or not a boomerang is a “throwing weapon”, and is quite convinced that Lithuania is a real place with fun things you can do there. In order to do what we want, we need to explain an awful lot about our constrained reality first!
I'll cover solving these problems more in future articles, but for now let it suffice to say: I have LLM hallucination challenges that would make you grimace.
This article describes the role that chain-of-thought (CoT) prompting plays in helping me keep LLM-generated responses on the straight and narrow.
What is chain-of-thought prompting?
CoT is asking an LLM to produce a "set of intermediate reasoning steps" as part of its generation.
Most of the commonly used LLMs are auto-regressive, which means they generate one token at a time -- choosing somewhat randomly between what's statistically most likely to come next -- with no ability to "undo" their previous choice of token.
This is a bit like being forced to write an essay one word at a time, while also being forced to keep your essay coherent. Once you've selected and written down a word that was wrong, it's really hard to back out of it; each word changes the trajectory of what you're writing. It's easy enough to point out errors if you're asked to review the work once it's written (and I'd highly recommend getting into the habit of asking your favourite LLM "is that real?" before moving forward with its answers), but the majority of text that LLMs are trained on doesn't have a lot of meta self-reflection, so the LLM happily burbles on with whatever path it's committed to.
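To make the "no undo" property concrete, here's a toy Python sketch of the decoding loop. The `next_token_distribution` function is a stand-in for the model itself, not real code from any library: each step samples one token from the distribution over what comes next and appends it, and nothing ever goes back to revise an earlier choice.

```python
import random

def generate(prompt_tokens, next_token_distribution, max_tokens=50):
    """Toy illustration of autoregressive decoding: each new token is sampled
    from a distribution conditioned on everything chosen so far, and there is
    no step that revisits or undoes an earlier choice."""
    output = list(prompt_tokens)
    for _ in range(max_tokens):
        # The model only sees the tokens it has already committed to...
        probs = next_token_distribution(output)             # {token: probability}
        tokens, weights = zip(*probs.items())
        token = random.choices(tokens, weights=weights)[0]  # sample "somewhat randomly"
        if token == "<eos>":
            break
        output.append(token)  # ...and appends; nothing rewrites an earlier token
    return output
```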
CoT prompting asks the LLM to generate self-reflective responses -- both pre- and post-generation -- to explicitly reason about and enumerate ideas, dramatically reducing these "oops, too late now" moments that lead to hallucination.
In practice
In practice, for my use-case, I've been asking the LLM to generate me a markdown document, with a sequence of named sections and what I want to see in each, concluding with a JSON blob. Everything you're about to read is in one single, giant prompt, rather than a series of consecutive prompts.
I'll set the context:
You are an in-game assistant, in dialogue with the user. I'd like you to produce a Markdown document for me based on this dialogue which I'll show you in a moment.
I start producing examples of what I want the LLM to generate for me:
Firstly, identify the key things the user is asking about, and whether you have evidence that they are in the game or not from the information below, under the heading "# INITIAL REPORT". If the user was asking if they can use boxing gloves to improve their attack against a penguin using a Fiat 500, you would produce a list like:
```markdown
# INITIAL REPORT
* boxing gloves: true, this is in the game
* attacking: true, this is in the game
* penguin: false, not in the game
* fiat 500: true, this is in the game
```
I ask the LLM to generate several layers of self-reflection:
Then I want you to go back and add evidence to each point, quoting text you saw in the results to support that, along with a percentage of how sure you are. Don't go on supposition or make guesses; back it up properly, using real mentions -- for example, don't assume that the "Fiat 500" exists simply because cars are mentioned.
Each of these steps is provided with one (or several) examples of how the response should look. I ask for a full initial response to the question, then I ask for reflections on that initial response, then I ask for a second response, and so on.
Finally, the dynamic parts of the prompt are added: search results that will help determine if something is an in-game item, and the prior conversation with the user, as well as their current query.
nb: Naïvely, you might lay this out with all the context first and then the instructions for what you want the LLM to generate. Certainly for me, this feels more natural. However, prompt caching in various forms (where the work of processing a repeated prompt prefix is stored and reused; both OpenAI and Anthropic offer this) means you're rewarded for keeping as much of the start of a long prompt static as you can, both in terms of response latency and cost. Many of my prompts are long -- the static portion can be significantly over 10,000 characters, and having it cached (by delaying dynamic portions of the prompt until the very end) can roughly halve the cost of processing them.
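As a rough sketch of that layout (the function name, variable names, and section headings below are illustrative, not my actual prompt), the assembly looks something like this:

```python
# Hypothetical sketch: static instructions and examples first, dynamic
# material last, so prompt-prefix caching can do its job.

STATIC_INSTRUCTIONS = (
    "You are an in-game assistant, in dialogue with the user...\n"
    "# INITIAL REPORT -- list what the user asked about and whether it's in the game\n"
    "# EVIDENCE -- quote the search results supporting each point, with a confidence %\n"
    "# FIRST ATTEMPT / REFLECTIONS / SECOND ATTEMPT -- drafts and self-review\n"
    "# FINAL JSON -- a fenced JSON block containing the user-facing response\n"
    "...plus the worked examples for every section (this part never changes)\n"
)

def build_prompt(search_results: str, conversation: str, user_query: str) -> str:
    # Only these parts vary per request, so they go at the very end.
    return (
        STATIC_INSTRUCTIONS
        + "\n# SEARCH RESULTS\n" + search_results
        + "\n# CONVERSATION SO FAR\n" + conversation
        + "\n# CURRENT USER QUERY\n" + user_query
    )
```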
Just how much self-review?
This process is very useful for improving accuracy, but it isn't free; I'm getting the LLM to write me a small essay to generate what might end up being a few sentences of user-facing response! These extra tokens are expensive on several axes:
It costs actual cash-money per token generated
Each extra token takes time to generate, increasing the perceived latency of responses for our users
Each token generated in a given time period and added to a context window pushes up against your provider's rate limits
The larger your context window, the more you may be forced into a more expensive model
How you optimize along these axes is going to heavily depend on your use-case. In my use-case, the product doesn't generate any revenue per conversation (this isn't part of a sales cycle, or freeing up a support team), and users are interacting live as part of a game: it's important that responses are fast and cheap. If an NPC gives the user atrocious advice or gets something wrong, I'm also not messing up a sale here -- potentially it even adds character! Therefore, I'm looking to be accurate, but speed and cost are very important constraints I need to work within.
Finally, there are diminishing marginal returns here: if the system is working from faulty data, no amount of self-review is going to help, and if generation has started strongly enough down the wrong path, then asking for further review may end up just reinforcing earlier errors and getting them repeated a lot.
Reviewing and tweaking
Two extra, related benefits of this approach are the ability to tweak the response "mid-generation", and the ability to see the intermediate responses, which are absolutely invaluable for figuring out where the LLM got things wrong.
As mentioned at the start of the article, the game has some ideas that are genuinely confusing for human players, let alone a stochastic parrot; the game is 20 years old, and has picked up a chunk of "lore debt". There are -- for example -- three entirely distinct concepts of a DVD as an item, and the game has three progress measures called Rank, Title, and Level, all of which are separate and not that neatly delineated.
The lengthy CoT prompts allow us firstly to spot areas where the LLM is getting confused, and secondly to insert tweaks for problems that seem otherwise insurmountable, e.g. Rank/Title/Level.
I'm asking for several different attempts at answering the user's question within the single prompt, and for the first attempt, I add in:
Please take a moment to confirm you haven't confused Rank, Title, and Level, by incorrectly referring to a Rank as a Level or vice versa.
and make sure that I give an example of what I mean in the example I provide to the LLM. The second attempt has:
If you mixed up Rank, Level, or Title, explain what's right here instead.
... again, I'm modelling an example of how that should look in the example I provide to the LLM. Finally, I have a reflections section, and would you believe, it includes:
Are you totally sure you didn't confuse Rank, Level, and Title?
This is an extreme case, but I don't think I'd have been able to handle this distinction without a lengthy CoT prompt, and experiments/evals with even the most capable models on the market showed them frequently struggling with this.
nb: Repetition is very helpful for getting cheap and fast models to do what you want, but an error mode I've seen frequently is that if you go overboard on this, it starts leaking into the response itself. If you tell the LLM too many times that Rank/Level/Title aren't the same, and that it's important not to confuse them, you run the risk of it parroting that back out in response to completely unrelated questions, which causes some really confusing responses!
Extracting the response
To stay in character, obviously the NPCs can't return the self-reflective essay to the user directly, so the LLM needs to be prompted to return a parseable final answer.
I ask for a specific JSON object at the end of the response:
Finally, I need you to push out a JSON document, summarizing your previous response, in a very specific shape. It should be an object with a single key "response", and your response to the user as that string
As always, at least one example is provided to the LLM to help guide it to the response I want -- actually, in this case I tend to provide several examples. Failure at this point can be a pain-in-the-ass, and I want to hammer home to the LLM exactly what I want; repetition is a good tool for this.
In each of these examples, I give the LLM the markdown syntax for a code-section:
```json
{ "response": "response goes here" }
```
This makes pulling it out with a regex straightforward, before passing it to a validator to check the response is the shape I'm expecting.
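A minimal sketch of that extraction-and-validation step might look like this (the regex, the helper name, and the error handling are illustrative, not the exact code I run):

```python
import json
import re

# Illustrative: grab the last json code fence, parse it, and check it
# matches the expected shape {"response": "<string>"}.
JSON_FENCE = re.compile(r"```json\s*(\{.*?\})\s*```", re.DOTALL)

def extract_response(markdown_doc: str) -> str:
    matches = JSON_FENCE.findall(markdown_doc)
    if not matches:
        raise ValueError("no JSON code fence found in the generation")
    payload = json.loads(matches[-1])  # use the last fence in the document
    if not isinstance(payload, dict) or not isinstance(payload.get("response"), str):
        raise ValueError("JSON payload isn't the expected shape")
    return payload["response"]
```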
This works most of the time, with an observed failure rate of about 5%. For the times when either the extraction or the validation fails, I have a retry prompt:
You were meant to be helping me output a JSON blob. I had asked you to use a chain-of-thought to summarize some text. You almost got there, but you failed at the final step, which was to output a JSON payload. I need you to push out a JSON document, summarizing your previous response, in a very specific shape.
Here I provide the LLM with further examples of how the JSON output should be formed, along with its previous response. I've yet to see a failure persist past the retry prompt...
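Sketched in code, the flow looks roughly like this, where `call_llm` is a placeholder for whatever client function sends a prompt and returns text, and `extract_response` is the helper sketched above:

```python
def answer_user(prompt: str, retry_prompt: str, call_llm) -> str:
    """Illustrative retry flow: if the final JSON can't be extracted or
    validated, re-prompt once with the retry instructions plus the
    previous (almost-correct) response."""
    first_attempt = call_llm(prompt)
    try:
        return extract_response(first_attempt)
    except ValueError:
        # One retry, with the previous response attached, has been enough so far.
        second_attempt = call_llm(
            retry_prompt + "\n\n# YOUR PREVIOUS RESPONSE\n" + first_attempt
        )
        return extract_response(second_attempt)
```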
Conclusion
Chain-of-thought is a great technique for reducing the inaccuracies that come from auto-regressive LLM outputs. It gives you more accurate and grounded answers, albeit at the cost of generating more tokens, which usually comes with latency/money/rate-limit costs.
Done right, you may find you can drop to a cheaper and faster model, if the trade-offs make sense for you, and you'll also gain excellent insight into where your LLM outputs are going off the rails by being able to see the model's workings in the intermediate generation documents.