RAG chunking isn’t one problem, it’s three

This article is an unfinished draft.
Chunking in RAG refers to splitting a source document into smaller pieces, to isolate relevant information and reduce the number of tokens you hand to an LLM before it writes an answer.
You have a shiny RAG system powering your customer knowledge base. A customer asks how to cancel their Enterprise subscription. Your system retrieves a chunk about ‘cancellation’ from the Terms of Service as strongly relevant … but that chunk doesn’t mention that Enterprise plans require 90 days’ notice; that’s three paragraphs later, under ‘Notice Periods’.
OK, fine. You make your chunks bigger to capture more context, but now your ‘cancellation’ chunk also covers refunds, billing cycles, and payment methods. The embedding is diluted, and searches for ‘refund policy’ start returning this monster chunk. How about overlapping chunks? Better recall, but now you’re stuffing even more redundant information into your prompt; responses take longer to generate and sometimes wander off on a tangent.
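For concreteness, here’s a minimal sketch of the kind of splitter being tuned above. The paragraph-based splitting and the specific numbers are illustrative assumptions, not recommendations:

```python
# A minimal sketch of the splitter being tuned in the scenario above.
# Paragraph-based splitting, and the max_chars / overlap defaults, are
# illustrative assumptions, not recommendations.

def split_into_chunks(text: str, max_chars: int = 1200, overlap_paras: int = 1) -> list[str]:
    """Greedily pack paragraphs into chunks, optionally repeating the last
    few paragraphs of each chunk at the start of the next one."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append(current)
            # Carry the tail of the previous chunk forward as overlap.
            current = current[-overlap_paras:] if overlap_paras else []
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append(current)
    return ["\n\n".join(c) for c in chunks]
```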
The wunderkind junior dev hits on an ingenious solution: use the LLM itself to generate small, punchy summaries of each chunk that pull in the important context. This works well until you get a call from Legal: a customer is quoting the Terms and Conditions, but the Terms and Conditions don’t actually say the thing they’re quoting. The LLM has been quoting its helpful summary instead. Nobody’s happy.
Existing articles often treat the chunk as a singular concept: you split the document into paragraphs, say, and each paragraph must:
- Provide sufficient information to the LLM to help it craft an answer
- Make a high-quality embedding and BM25 (keyword-matching) target
- Be a high-fidelity excerpt of the source document that we can quote back to the user
But that’s three problems to solve, not one. Trying to solve all three with a single chunk of text is painful, which is why we have a whole flotilla of adjunct solutions like re-rankers.
There’s nothing stopping you from solving these problems individually, and well.
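One way to make the split concrete is to anchor a chunk to a location in the source document and keep a separate piece of text for each job. This is only a sketch; the field names below are mine, not an established schema:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """One anchored location in the source document, three jobs,
    three pieces of text. Field names are illustrative."""
    doc_id: str
    heading_path: list[str]  # e.g. ["Terms of Service", "Cancellation"]
    start_char: int          # anchor into the original document
    end_char: int
    source_text: str         # verbatim text: the only thing we ever quote
    llm_text: str            # what we feed the model; may be expanded with context
    search_text: str         # what we embed / BM25-index; may be rewritten entirely
```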
TODO: LLM-Feeding
Main points:
- Talk about expanding chunks (see the sketch after this list)
- Talk about mentioning where the chunk sits in the document hierarchy
- What else?
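As a rough illustration of both points, assuming the Chunk record from the earlier sketch and a list of chunks in document order (the neighbour count and breadcrumb format are arbitrary choices):

```python
# Sketch: expand a chunk with its neighbours before feeding it to the LLM,
# and say where it sits in the document hierarchy. Assumes the Chunk record
# from the earlier sketch; the breadcrumb format is an arbitrary choice.

def build_llm_text(chunks: list[Chunk], index: int, neighbours: int = 1) -> str:
    chunk = chunks[index]
    breadcrumb = " > ".join([chunk.doc_id] + chunk.heading_path)
    start = max(0, index - neighbours)
    end = min(len(chunks), index + neighbours + 1)
    body = "\n\n".join(c.source_text for c in chunks[start:end])
    return f"[{breadcrumb}]\n\n{body}"
```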
TODO: Embeddings and/or keyword search
- This is the problem that reranking tries to solve
- You could instead reverse this: create synthetic questions that are answered by the text, and build the embedding from those (see the sketch after this list)
- Keyword stuffing may be appropriate in some situations
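A sketch of that reversal, again assuming the Chunk record from earlier. The prompt wording is arbitrary, and generate is a placeholder for whatever LLM call you already have, not a specific API:

```python
from typing import Callable

# Sketch: ask an LLM which questions a chunk answers and index those,
# instead of (or alongside) the raw text. `generate` is a placeholder for
# whatever LLM call you already have, not a specific API.

def build_search_text(chunk: Chunk, generate: Callable[[str], str]) -> str:
    prompt = (
        "List three short questions that the following passage answers, "
        "one per line, using the passage's own key terms:\n\n"
        + chunk.source_text
    )
    questions = generate(prompt)
    # Keeping the original text alongside the questions also acts as a mild
    # form of keyword stuffing for BM25.
    return questions + "\n\n" + chunk.source_text
```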
TODO: Source document
- If you’re going to quote a document to the user, you have 100% fucked up if you hallucinate a single word; it’s reasonable (and perhaps important) to mention where we are in the document though; the same sentence can mean very different things if it’s about laws in one country rather than another
- This is also where the anchor lives: all three representations need to point back to a specific location in the source document, and that location is stored here
- So this basically needs to be the exact text, perhaps enriched with document location information (see the sketch after this list)
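A sketch of what quoting might look like, assuming the Chunk record from earlier: the quoted span is a verbatim slice of the source document, and only the location context around it is ours to add:

```python
# Sketch: quote the exact source span, with its location rendered around it.
# Assumes the Chunk record from the earlier sketch.

def render_quote(document_text: str, chunk: Chunk) -> str:
    quoted = document_text[chunk.start_char:chunk.end_char]
    # If this assertion ever fails, something upstream rewrote the source text.
    assert quoted == chunk.source_text
    where = " > ".join([chunk.doc_id] + chunk.heading_path)
    return f'From {where}:\n"{quoted}"'
```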
TODO: Conclusion
- Start with a source document and location, with a semantically sensible chunk
- You have permission to not use that as the embedding/keyword-search chunk, or as what’s fed directly to the LLM (the sketch below ties the three back together at query time)
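To round this off, a query-time sketch of how the three representations could be used together. search and answer are placeholders for whatever retrieval and generation calls you already have, and the shape of the return value is just one possibility:

```python
from typing import Callable

# Sketch: search on search_text, feed llm_text to the model, and hand
# source_text (plus its location) back for quoting. `search` and `answer`
# are placeholders for your own retrieval and generation calls.

def answer_question(
    question: str,
    chunks: list[Chunk],
    search: Callable[[str, list[str]], list[int]],  # returns indices of best matches
    answer: Callable[[str, str], str],              # (question, context) -> answer text
) -> dict:
    hits = search(question, [c.search_text for c in chunks])
    context = "\n\n---\n\n".join(chunks[i].llm_text for i in hits)
    return {
        "answer": answer(question, context),
        "sources": [
            (chunks[i].doc_id, chunks[i].heading_path, chunks[i].source_text)
            for i in hits
        ],
    }
```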