This is the second article from my “street-fighting RAG” series: quick, dirty, and effective tricks for better Retrieval-Augmented Generation apps, based on developing systems for a crime-based online game.
While working on LLM-driven NPCs, I observed significant improvements in several areas by adding a simple component, in-memory free-text search (IMFTS), to augment the main inference process. It’s a straightforward approach that helped us navigate common LLM trade-offs, and this article shares how.
IMFTS: In-memory Free Text Search, i.e. running a traditional keyword search over a bunch of text, as provided by tools like Lunr (Node.js), Whoosh (Python), or Bleve (Go).
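To make that concrete, here’s a minimal sketch of the idea using Whoosh with an in-memory index; the schema and the single sample fact are illustrative stand-ins, not my actual data model.

```python
# A minimal in-memory keyword index with Whoosh. The schema and the single
# sample fact are illustrative stand-ins, not the real fact base.
from whoosh.fields import Schema, TEXT, ID
from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import MultifieldParser, OrGroup

schema = Schema(
    title=TEXT(stored=True),
    type=ID(stored=True),
    definition=TEXT(stored=True),
)

# Build the index entirely in RAM: no files, no external search service.
ix = RamStorage().create_index(schema)
writer = ix.writer()
writer.add_document(
    title="Endurance (END)",
    type="stat",
    definition="One of the three working stats for in-game jobs.",
)
writer.commit()

# Query it like any other keyword search engine; OrGroup means any matching
# term contributes, so partial matches still come back.
parser = MultifieldParser(["title", "definition"], schema=ix.schema, group=OrGroup)
with ix.searcher() as searcher:
    for hit in searcher.search(parser.parse("how do i improve end"), limit=5):
        print(hit["title"], "-", hit["definition"])
```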
Prompt augmentation with the results of IMFTS has allowed me to reduce the number of inference calls I make, which saves time and money and helps strip away unnecessary architectural complexity.
LLM-based apps (which my NPCs are) are constrained by trade-offs between cost, latency, and answer quality.
Improvement along one of those factors usually comes at the cost of one of the others. However, adding IMFTS can help “for free” in some instances, improving answer quality with a latency cost measured in milliseconds and (essentially) no extra financial cost.
If your app accepts free-text input from a user, you’ll be very lucky if you can get away with a single inference call to the model. For a traditional RAG app (which most LLM apps seem to be these days), you’ll at least need to generate an embedding that captures the meaning of the user query, and then summarize source data that you’ve found using that embedding.
But you’re probably also doing at least one of: rewriting or expanding the user’s query, checking answers against known facts to keep the model in scope, or moderating what users ask and what the model says.
In many of these cases you can save an inference call (or two or more) by using free-text in-memory search that’s lightning fast and virtually free.
My app processes each user query through multiple LLM calls: anywhere from 1 to 7 completions, plus occasional embedding queries. This complexity is necessary but expensive, in both time and compute cost.
But I also maintain a lightweight knowledge base of approximately 5,000 “facts” – a curated collection drawing from game APIs, fan glossaries, and manually compiled notes on rules and mechanics – each with a title, type, and concise definition (typically under 250 characters). Each user query is run through an in-memory keyword search against these facts, and the 3-10 most relevant matches are selected and injected directly into my prompts.
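Roughly, the search-and-inject step looks like the sketch below; the fact entries and the naive keyword-overlap scorer are stand-ins for my real fact base and a proper IMFTS library like the Whoosh index above.

```python
# Stand-in fact base: each entry mirrors the title/type/definition shape
# described above. The scoring is a naive keyword overlap, purely to show
# the select-and-inject step; a real app would use Whoosh/Lunr/Bleve.
FACTS = [
    {"title": "Endurance (END)", "type": "stat",
     "definition": "One of the three working stats for in-game jobs."},
    # ... roughly 5,000 more entries
]

def top_facts(query: str, k: int = 10) -> list[dict]:
    """Pick the k facts sharing the most keywords with the query."""
    q_tokens = set(query.lower().split())
    scored = []
    for fact in FACTS:
        text = f"{fact['title']} {fact['definition']}".lower()
        score = sum(1 for token in q_tokens if token in text)
        if score:
            scored.append((score, fact))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [fact for _, fact in scored[:k]]

def fact_block(query: str) -> str:
    """Format the selected facts as a context block to prepend to a prompt."""
    lines = [f"- {fact['title']} ({fact['type']}): {fact['definition']}"
             for fact in top_facts(query)]
    return "Relevant game facts:\n" + "\n".join(lines)

print(fact_block("how do i improve end"))
```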
This might sound like a step backward from traditional RAG with embeddings, but it’s a complementary approach: while embedding-based retrieval excels at semantic understanding, IMFTS shines at exact matches on domain-specific vocabulary, guaranteed precision, and near-zero latency and cost.
By front-loading relevant context before more expensive operations, I’m essentially giving my LLM calls a “head start” that often eliminates entire steps in the pipeline.
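One way to picture that head start is the ordering below; the three helpers are placeholders for the pipeline stages described in this article, not real APIs.

```python
# The ordering is the point here: the keyword lookup runs first, costs
# nothing, and every later (paid) call sees its output. All three helpers
# are placeholders for the stages discussed in this article.
def imfts_context(query: str) -> str:
    """In-memory keyword search over the fact base: milliseconds, no API cost."""
    return "Relevant game facts:\n- Endurance (END): a working stat for in-game jobs."

def rewrite_query(query: str, facts: str) -> str:
    """One completion call; cheaper and more accurate because the facts are present."""
    return query  # placeholder: call your model of choice here

def rag_answer(query: str, facts: str) -> str:
    """Embedding lookup plus a final completion: the expensive path."""
    return "..."  # placeholder; sometimes skippable when the facts already answer it

def handle(query: str) -> str:
    facts = imfts_context(query)             # step 0: free and instant
    rewritten = rewrite_query(query, facts)  # step 1: grounded query rewrite
    return rag_answer(rewritten, facts)      # step 2: only now pay for retrieval
```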
My characters inhabit a world where certain words carry very specific meanings far outside what an off-the-shelf LLM would expect; “Endurance”, also referred to as “END” by players, is a stat related to your in-game job and has no relationship to in-game gym training or combat mechanics.
When a user searches for “how do i improve end”, we have little hope of expanding the query correctly by relying purely on an off-the-shelf model. But if we run IMFTS first over our definition list, we pull out:
"Endurance (END)": one of the three working stats for in-game jobs
which we can feed to the query-improvement inference step, rewriting the user’s query as “how do I improve my working stat Endurance for my in-game job” and giving us a much better search embedding.
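The query-improvement prompt ends up looking roughly like the sketch below; the wording is illustrative rather than my production prompt, but the structure (glossary hits above the raw query) is the important part.

```python
# Illustrative query-rewrite prompt with the IMFTS hit injected. The exact
# wording is made up; the point is that the keyword hits arrive before the
# rewrite happens, so the model knows what "end" means here.
REWRITE_PROMPT = """You rewrite player questions so they can be matched against game documentation.
Use ONLY the glossary entries below to expand abbreviations and game-specific terms.

Glossary entries (from keyword search):
- Endurance (END): one of the three working stats for in-game jobs

Player question: how do i improve end

Rewritten question:"""
# A completion against this prompt yields something like:
# "How do I improve my working stat Endurance (END) for my in-game job?"
```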
My characters also inhabit a violent, criminal, and drug-fuelled world of constrained realism (see my chain-of-thought article). This applies to both items and actions: you can buy (and abuse) Xanax in the game but not Percocet; you can buy throwing stars but not a boomerang; and you can attack other players but you can’t kidnap them.
This causes several issues for a “helpful chatbot” trained on real-world knowledge. If a player asks how to travel to Lithuania (which is not a valid in-game location), the embedding search will pick up knowledge chunks about travelling to valid in-game locations like Tokyo, extrapolate from those, miss the fact that Lithuania’s absence from the game content means you can’t travel there at all, and end up directing the user to the in-game travel agency.
Having a keyword search over our fact base allows us to construct a chain-of-thought prompt that says “if you don’t see this in the keyword search it isn’t there”.
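In prompt form, that closed-world instruction looks something like the sketch below; again, the wording and the sample fact are illustrative.

```python
# Illustrative closed-world grounding prompt. The key move is telling the
# model to treat the keyword-search results as the complete universe of
# game facts, rather than guessing from real-world knowledge.
GROUNDING_PROMPT = """You are an in-game helper. The facts below are everything the
keyword search returned for this question. Treat the list as complete: if a place,
item, or mechanic the player mentions is not in it, it does not exist in the game,
and you should say so rather than extrapolate from real-world knowledge.

Retrieved game facts:
- Travel: flights are available to in-game destinations such as Tokyo.

Player question: how do I travel to Lithuania?"""
```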
This mechanism is also helpful for our content-moderation strategy, which is tricky! We use a similar approach to Llama Guard but need to let much more through than a normal LLM would. It’s entirely valid to ask how to attack other players with a knife, the best kind of ammunition to use for causing damage, ways to improve your shoplifting, etc.
However, the game has no gendered or sexual violence, no home-brewing of meth, and while the game has all sorts of things you can use as weapons, you can’t use the in-game gasoline to set fire to another player’s house. A player asking how to use Xanax to subdue their date should be flagged as potential real-world harm and reprimanded, and the characters shouldn’t give advice on making meth (or Sudafed), despite the underlying LLM presumably knowing how to do both. On the other hand, a player asking how to stab another player is talking (we have to assume) about using the in-game mechanic, and should be told about acquiring and equipping melee weapons and initiating attacks.
Ideally, we perform as much of this content-moderation stage as cheaply and quickly as possible, without the several inference steps that a full RAG-based search and generation would entail. IMFTS lets us hand the moderation stage a very quick overview of the relevant game mechanics: “stab” instantly brings back a list of stabbing weapons, so moderation can see it maps to an in-game mechanic, while “crystal meth” brings back zero results, letting us reject the prompt quickly rather than falling back to the expensive RAG search to establish the same thing.
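A toy version of that pre-filter might look like this; the mechanics table and the prompt wording are stand-ins for the real index and moderation prompt.

```python
# Toy version of the moderation pre-filter. The mechanics table and prompt
# wording are stand-ins; the idea is that zero keyword hits lets us reject
# a prompt cheaply, before any embedding search or generation runs.
GAME_MECHANICS = {
    "stab": ["Combat: melee weapons include knives, katanas, and throwing stars."],
    "shoplifting": ["Crimes: shoplifting is a trainable low-level crime."],
}

def moderation_prompt(query: str) -> str:
    hits = [line
            for term, lines in GAME_MECHANICS.items()
            if term in query.lower()
            for line in lines]
    if hits:
        # "stab" maps to an in-game mechanic, so the moderation step can
        # treat the question as game talk and let it through.
        context = "These in-game mechanics match the question:\n" + "\n".join(hits)
    else:
        # "crystal meth" matches nothing in-game, so the moderation step can
        # refuse quickly instead of waiting on the full RAG pipeline.
        context = "No in-game mechanics match the question."
    return f"{context}\n\nPlayer question: {query}\n\nIs this about in-game mechanics?"

print(moderation_prompt("how do I stab another player?"))
```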
The common theme binding these examples is the mismatch between the general-purpose embedding space of off-the-shelf models and the specific vocabulary and concepts of our game world. This naturally raises the question: why not fine-tune a model?
There are two main approaches to consider: fine-tuning the large generative model itself, or fine-tuning a smaller, dedicated embedding model just for retrieval tasks.
Trying to fine-tune the main generative LLM would quickly present some significant hurdles. It’s computationally expensive, requires careful dataset curation, and sacrifices flexibility: we’d be locked into a specific base model architecture, unable to easily experiment with or switch to newer, potentially better or cheaper, general-purpose models as they become available.
Fine-tuning a dedicated embedding model looked more promising at first glance. It could potentially improve the semantic understanding of game-specific queries for the RAG part of the pipeline. However, several challenges remain, primarily related to data. While we have access to a large corpus of game forum data spanning 20 years, it contains a mix of outdated rules, player speculation, and evolving mechanics. Using this directly for fine-tuning without extensive (and ongoing) cleaning risks embedding incorrect or contradictory information.
Furthermore, a key advantage of our current IMFTS approach is agility. Game rules change, new items are added, and mechanics get tweaked. We can update our “facts” database almost instantly, and the changes are immediately reflected in the IMFTS results. Fine-tuning, whether for the main LLM or an embedding model, requires retraining or updating cycles, introducing latency between a game change and the NPC’s knowledge of it.
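Concretely, “retraining” in our world is just reloading a file of facts and rebuilding a small in-memory index; the file name and schema below are assumptions about the setup, not my actual code.

```python
# Rebuilding the whole in-memory index when the facts change. The file name
# and schema are assumptions; the point is that this takes well under a
# second for ~5,000 facts, so rule changes are reflected immediately.
import json
from whoosh.fields import Schema, TEXT, ID
from whoosh.filedb.filestore import RamStorage

FACT_SCHEMA = Schema(title=TEXT(stored=True), type=ID(stored=True),
                     definition=TEXT(stored=True))

def rebuild_index(path: str = "facts.json"):
    ix = RamStorage().create_index(FACT_SCHEMA)
    writer = ix.writer()
    with open(path) as fh:
        for fact in json.load(fh):  # list of {"title", "type", "definition"} dicts
            writer.add_document(**fact)
    writer.commit()
    return ix  # swap in for the old index; no training loop involved
```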
For our specific needs – handling unique terminology, ensuring factual accuracy based on current rules, and maintaining flexibility and low cost – using fast, cheap, and easily updatable IMFTS results simply provides more pragmatic value for us right now than diving into complex fine-tuning. That said, fine-tuning a specific component, like a reranking model to better sort results from an initial retrieval pass, might be a viable enhancement down the road if the data challenges can be addressed for that specific task.
My examples come from a fairly extreme situation: a semi-realistic online game where terms are overloaded and there are 20 years of narrative debt. But the core principle here – using fast, cheap keyword search for specific, dynamic, or precisely defined info – applies way beyond that. I see this pattern being valuable anywhere LLMs bump into domain-specific knowledge that general embeddings miss, or where you need absolute precision.
Basically, anywhere you need to guarantee specific facts get injected quickly and accurately into an LLM’s context – especially if that info changes often or needs to be dead-on precise – IMFTS is a pragmatic and powerful addition to your semantic search toolkit.
Building effective LLM applications, especially interactive ones like my game NPCs, means constantly navigating complex trade-offs between cost, latency, and quality under real-world operational constraints. While advanced techniques like fine-tuning have their place, my practical experience shows you can often get significant gains with simpler, more direct methods.
For us, augmenting LLM prompts with results from an in-memory free-text search (IMFTS) turned out to be a highly effective strategy. It directly addresses challenges with domain-specific terminology, helps mitigate hallucinations related to out-of-scope concepts, and provides crucial, timely context for nuanced tasks like content moderation – all with negligible latency and zero inference cost.
Crucially, this approach gives us the agility we need in a dynamic environment where game rules and content change frequently. The ability to instantly update the underlying “facts” database and see those changes reflected immediately is a practical advantage that slower fine-tuning cycles just can’t match. For developers building systems under similar pressures – needing accuracy, speed, cost-efficiency, and adaptability – using a well-curated IMFTS layer isn’t a step backward; it’s a pragmatic and powerful tool in the LLM toolkit.
If you liked it, you might like other stuff I write