
RAG for Analytics: Letting Your Team Ask Questions in Plain English

March 3, 2026  ·  AI  ·  8 min read


At any company past 50 people, the analytics team becomes a bottleneck. It's not because analysts are slow. It's because the number of people who have data questions grows faster than the number of analysts. The queue fills up. People stop asking questions they expect will take two weeks to answer.

I wrote about Text-to-SQL with LLMs in a previous post — and it does solve part of this. Give people the ability to ask "how many users signed up last week?" without filing a ticket, and a real portion of the queue disappears.

But Text-to-SQL hits a wall quickly. The wall is context.

"What does activated mean in our product?" The SQL system doesn't know. "Why is March always lower?" It can't tell you. "Which cohort analysis should I look at to understand this drop?" It has no idea. These aren't SQL questions. They're questions about your business, your definitions, your history. They require institutional knowledge that lives in analyst heads, Confluence pages, old Slack threads, and metric definition docs that half the company doesn't know exist.

RAG — Retrieval-Augmented Generation — is how you give the LLM access to that knowledge.


What RAG actually is and why it's different from Text-to-SQL

The core idea is simple: before asking the LLM to answer a question, you retrieve the relevant pieces of context from a knowledge base and include them in the prompt. The LLM then answers based on both the question and the retrieved context — not just whatever it learned during pre-training.

Text-to-SQL retrieves schema information and generates a query. RAG retrieves knowledge documents and generates an answer — or a better-informed query, or an explanation, or a recommendation. The distinction matters because the type of knowledge is completely different.

Schema: what tables exist, what columns they have. That's the Text-to-SQL domain.

Knowledge: what "activation" means in your specific product, why the March dip happens every year (it's because enterprise contracts renew in Q4 and March is a slow sign-up month — the kind of thing that's obvious to a senior analyst and opaque to everyone else), which dashboard is the canonical source of truth for a given metric. That's the RAG domain.


What to put in your knowledge base

This is the decision that most determines whether the system is useful. Get it wrong and you retrieve irrelevant documents, and the answers come back either wrong or generic.

The highest-value items I've found:

- Metric definitions: what "activated" or "GMV" means in your product, which table each lives in, and the edge cases (like a separate activation definition for invite-link signups).
- Known seasonality and anomaly explanations: why the March dip happens every year, which historical spikes were incidents rather than growth.
- Business rules: timezone conventions, which transaction statuses count as completed, how refunds are handled.
- Canonical sources: which dashboard or table is the source of truth for each metric.

What not to include: raw data; very long documents (chunk them instead); and documents that have gone stale without anyone updating them (stale context is worse than no context — the model will use it confidently).


Embedding and retrieval — practical, not theoretical

Retrieval works by converting both your documents and the user's question into vectors, then finding the documents whose vectors are most similar to the question vector. The math is cosine similarity. What you need to choose is the embedding model and the vector search library.
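To make that concrete, here's a throwaway NumPy sketch with made-up three-dimensional vectors standing in for embeddings. The point it demonstrates: once vectors are L2-normalized, cosine similarity reduces to a plain dot product, which is why the FAISS index further down uses inner product on normalized embeddings.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize a vector so its dot products equal cosine similarities."""
    return v / np.linalg.norm(v)

# Illustrative vectors, not real embeddings.
a = normalize(np.array([1.0, 2.0, 3.0]))
b = normalize(np.array([2.0, 4.0, 6.0]))   # same direction as a
c = normalize(np.array([-3.0, 0.0, 1.0]))  # orthogonal to a

# On normalized vectors, the dot product IS the cosine similarity.
print(round(float(a @ b), 4))       # → 1.0 (identical direction)
print(round(abs(float(a @ c)), 4))  # → 0.0 (unrelated)
```

Real embedding vectors have hundreds of dimensions, but the geometry is the same: the question vector lands closest to the chunks that talk about the same thing.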

For an internal analytics knowledge base, I'd use either sentence-transformers (specifically all-MiniLM-L6-v2 — fast, small, good enough for semantic similarity on business text) or Gemini's embedding API if you're already in the Google Cloud ecosystem and want to keep the stack consistent. FAISS from Meta is the vector search library to start with — it's fast, runs in memory, and requires zero infrastructure.

Chunking matters more than most tutorials suggest. If your metric definitions document is 4,000 words and you embed it as a single document, retrieval will fail on queries about any individual metric — the vector of the whole document is too diffuse to match any one definition. Chunk by logical unit: one chunk per metric definition, one chunk per known anomaly explanation, one chunk per business rule. Aim for 200-500 tokens per chunk.
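As an illustration of per-unit chunking, here's a minimal splitter. It assumes a hypothetical source format where every definition starts with a "Metric:" line; adapt the delimiter to however your docs are actually structured.

```python
def chunk_metric_doc(text: str) -> list[dict]:
    """Split a metric-definitions document into one chunk per metric."""
    chunks, current = [], []
    for line in text.splitlines():
        # A new "Metric:" line closes the previous chunk.
        if line.startswith("Metric:") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Attach the metadata fields the knowledge base expects.
    return [
        {"text": c,
         "source": "metric_definitions_doc",
         "category": "metric_definition",
         "title": c.splitlines()[0].removeprefix("Metric:").strip()}
        for c in chunks if c.startswith("Metric:")
    ]

doc = """Metric: Activated User
Definition: first transaction within 7 days of signup.

Metric: GMV
Definition: total transaction value before refunds."""

for chunk in chunk_metric_doc(doc):
    print(chunk["title"])  # → Activated User, then GMV
```

The output dicts match the shape that `add_documents` below takes, so the two snap together.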

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import json

class AnalyticsKnowledgeBase:
    def __init__(self, embedding_model: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(embedding_model)
        self.index = None
        self.documents = []   # raw text chunks
        self.metadata = []    # source, category, etc.

    def add_documents(self, docs: list[dict]):
        """
        docs: list of {"text": "...", "source": "metric_definitions",
                        "category": "metric_definition", "title": "..."}
        """
        texts = [d["text"] for d in docs]
        embeddings = self.model.encode(texts, show_progress_bar=True,
                                        normalize_embeddings=True)

        if self.index is None:
            dim = embeddings.shape[1]
            self.index = faiss.IndexFlatIP(dim)  # Inner product = cosine on normalized vecs

        self.index.add(embeddings.astype(np.float32))
        self.documents.extend(texts)
        self.metadata.extend(docs)

    def retrieve(self, query: str, top_k: int = 5) -> list[dict]:
        """Return the top_k most relevant chunks for a query."""
        query_vec = self.model.encode([query], normalize_embeddings=True)
        scores, indices = self.index.search(query_vec.astype(np.float32), top_k)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx == -1:
                continue
            results.append({
                "text": self.documents[idx],
                "score": float(score),
                "metadata": self.metadata[idx]
            })
        return results

    def save(self, path: str):
        faiss.write_index(self.index, f"{path}/faiss.index")
        with open(f"{path}/documents.json", "w") as f:
            json.dump({"docs": self.documents, "meta": self.metadata}, f)

    def load(self, path: str):
        self.index = faiss.read_index(f"{path}/faiss.index")
        with open(f"{path}/documents.json") as f:
            data = json.load(f)
        self.documents = data["docs"]
        self.metadata = data["meta"]

Loading your knowledge base

The loading step is where most of the actual work is. Here's a minimal example of how to structure and ingest metric definitions:

kb = AnalyticsKnowledgeBase()

# Metric definitions — one document per metric
metric_definitions = [
    {
        "text": """Metric: Activated User
Definition: A user is considered 'activated' when they complete their first transaction
within 7 days of signup. This is our primary early-engagement metric and the leading
indicator for 30-day retention. Note: users who sign up via invite link have a separate
activation definition (first social share) tracked in the growth_activations table.
Last updated: Jan 2026.""",
        "source": "metric_definitions_doc",
        "category": "metric_definition",
        "title": "Activated User"
    },
    {
        "text": """Metric: GMV (Gross Merchandise Value)
Definition: Total value of all transactions processed in a given period, before refunds.
This differs from Revenue (which nets out refunds and platform fees). GMV is reported
in the transactions table (status IN ('completed', 'pending_settlement')).
Exclude status = 'reversed' and 'failed'. Always filter to India timezone for daily GMV.
Last updated: Feb 2026.""",
        "source": "metric_definitions_doc",
        "category": "metric_definition",
        "title": "GMV"
    },
]

# Known seasonality/anomaly explanations
context_docs = [
    {
        "text": """Seasonality: March dip in new user signups
Every year, new user signups in March are 15-20% below the Q4 and Q1 monthly averages.
This is not a product issue. It correlates with enterprise budget cycles: most enterprise
clients renew in Q4 (Oct-Dec) and new enterprise procurement is slow in Q1. The consumer
side is not affected — only the B2B acquisition funnel shows this pattern.
Source: Growth analytics retrospective, April 2025.""",
        "source": "seasonality_notes",
        "category": "seasonality",
        "title": "March signup dip"
    },
]

kb.add_documents(metric_definitions + context_docs)
kb.save("./analytics_kb")

The full pipeline: question to answer

Once the knowledge base is built, the pipeline is: embed the question, retrieve the top relevant chunks, build an augmented prompt that includes both the retrieved context and the question, then call Gemini.

import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")

RAG_SYSTEM_PROMPT = """You are an analytics assistant for a data team.
Answer the user's question using the context provided below.
The context contains metric definitions, business rules, and institutional knowledge.

Context:
{context}

Rules:
- If the context directly answers the question, answer from it. Cite the source.
- If the context is partially relevant, use it and say what you're inferring.
- If the context doesn't address the question at all, say so clearly. Do not make up an answer.
- Be specific. If a metric definition is in the context, quote the relevant part.
- Keep answers concise — 2-4 sentences for most questions.

Question: {question}
Answer:"""

def answer_analytics_question(question: str,
                                kb: AnalyticsKnowledgeBase,
                                top_k: int = 4) -> dict:
    # Retrieve relevant context
    retrieved = kb.retrieve(question, top_k=top_k)

    if not retrieved or retrieved[0]["score"] < 0.3:
        # Low confidence retrieval — flag it
        return {
            "answer": ("I don't have specific context for this question in the knowledge base. "
                       "You may want to check with the analytics team directly."),
            "retrieved_docs": [],
            "confidence": "low"
        }

    context_text = "\n\n---\n\n".join([
        f"Source: {r['metadata'].get('title', 'unknown')}\n{r['text']}"
        for r in retrieved
    ])

    prompt = RAG_SYSTEM_PROMPT.format(context=context_text, question=question)
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(prompt)

    return {
        "answer": response.text,
        "retrieved_docs": [r["metadata"].get("title") for r in retrieved],
        "confidence": "high" if retrieved[0]["score"] > 0.6 else "medium"
    }

What a working interaction looks like:

User: "Why is March always lower in the activation numbers?"

System retrieves: the seasonality note about enterprise budget cycles, the activated user definition.

Response: "March activation numbers are typically 15-20% below the Q4/Q1 average due to enterprise procurement cycles, not a product issue. Enterprise clients tend to renew in Q4, making Q1 (and especially March) slow for B2B acquisition. The consumer activation funnel is not affected — this pattern only appears in the enterprise cohort. Source: Growth analytics retrospective, April 2025."

That answer would have required tracking down an analyst, who would have had to remember the context, type it out, and send it. Now it takes 800ms.


Combining RAG with Text-to-SQL

The full picture is RAG feeding into Text-to-SQL, not one replacing the other. When someone asks "show me the activation rate for March 2025," the system should:

1. Retrieve the definition of "activated user" from the knowledge base.
2. Use that definition to construct the correct SQL — filtering to the right table, using the right status conditions, applying the right time-zone handling.
3. Return both the SQL result and the metric definition it used, so the user knows what was counted.

Without the RAG step, the Text-to-SQL system guesses what "activated" means and is sometimes wrong. With it, the query generation is grounded in your actual business logic.
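A sketch of that handoff, reusing the AnalyticsKnowledgeBase from earlier. The prompt wording and the helper name are mine, not a fixed API; the Gemini call is the same one used in the answer pipeline above.

```python
SQL_PROMPT = """You are a SQL generator for an analytics warehouse.

Business context retrieved from the knowledge base:
{context}

Using the definitions above, not your own assumptions, write a SQL query
that answers: {question}
Return only the SQL."""

def build_grounded_sql_prompt(question: str, kb, top_k: int = 3):
    """Steps 1-2: retrieve the relevant definitions, then ground the
    SQL-generation prompt in them."""
    retrieved = kb.retrieve(question, top_k=top_k)
    context = "\n\n---\n\n".join(r["text"] for r in retrieved)
    titles = [r["metadata"].get("title") for r in retrieved]
    return SQL_PROMPT.format(context=context, question=question), titles

# Step 3: generate the SQL and return it alongside the definitions used,
# so the user can see exactly what was counted:
#
#   prompt, titles = build_grounded_sql_prompt(question, kb)
#   sql = genai.GenerativeModel("gemini-1.5-flash").generate_content(prompt).text
#   result = {"sql": sql, "definitions_used": titles}
```

Returning `definitions_used` with the query is the cheap insurance: when a number looks off, the first question is always "what was counted?", and this answers it up front.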


What goes wrong

Stale knowledge base. This is the main failure mode. A metric definition document that was accurate in 2024 but hasn't been updated since the product changed will produce confidently wrong answers. The system has no way to know the document is out of date. Assign someone ownership of the knowledge base — treat it like a product, not a one-time build. A quarterly review is the minimum.
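One cheap guard for the quarterly review is to stamp chunks with a machine-readable last_updated field and flag anything past the review window. That field is an assumption on my part (the example chunks earlier embed the date inside the text; you'd add it to the metadata dicts as well).

```python
from datetime import date

def stale_chunks(metadata_list: list[dict], as_of: date,
                 max_age_days: int = 180) -> list[str]:
    """Return titles of chunks whose last_updated is older than the window.
    Chunks with no date at all default to 1970 and get flagged loudly."""
    flagged = []
    for meta in metadata_list:
        updated = date.fromisoformat(meta.get("last_updated", "1970-01-01"))
        if (as_of - updated).days > max_age_days:
            flagged.append(meta.get("title", "untitled"))
    return flagged

# Example: the GMV definition was refreshed recently; the cohort note was not.
meta = [
    {"title": "GMV", "last_updated": "2026-02-01"},
    {"title": "Old cohort note", "last_updated": "2024-06-15"},
]
print(stale_chunks(meta, as_of=date(2026, 3, 3)))  # → ['Old cohort note']
```

Run it against the saved `metadata.json` on a schedule and the quarterly review becomes a short checklist instead of an archaeology project.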

Retrieval misses. Semantic similarity is not perfect. A question phrased one way may not retrieve the document that answers it, even if that document exists. The fix is iterative: run real questions against the system, see what it retrieves, and add documents or rephrase existing ones to close the gaps. Budget time for this — it's not a one-day job.
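One way to make that iteration systematic is a small retrieval regression check: for every real question you've collected, assert that the chunk you expect shows up in the top-k. The question/title pairs below are illustrative, drawn from the example chunks earlier.

```python
# (question, title of the chunk that should answer it)
EVAL_SET = [
    ("what does activated mean?", "Activated User"),
    ("why are march signups low?", "March signup dip"),
    ("how is GMV different from revenue?", "GMV"),
]

def retrieval_hit_rate(kb, eval_set, top_k: int = 5):
    """Fraction of questions whose expected chunk appears in the top_k,
    plus the misses so you know which documents to add or rephrase."""
    hits, misses = 0, []
    for question, expected_title in eval_set:
        titles = [r["metadata"].get("title")
                  for r in kb.retrieve(question, top_k=top_k)]
        if expected_title in titles:
            hits += 1
        else:
            misses.append((question, titles))
    return hits / len(eval_set), misses
```

Re-run it after every knowledge-base change; a question that silently stops retrieving its document is exactly the kind of regression you won't notice otherwise.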

Hallucination when context is thin. If the retrieved documents are only weakly related to the question (low similarity scores), the LLM may still generate a confident-sounding answer from its pre-training. This is why the confidence thresholding in the code above matters — if retrieval confidence is below 0.3, return an explicit "I don't have context for this" response rather than passing thin documents to the model and hoping for the best.

The cold start problem. The system is useless until the knowledge base has meaningful content. The first week you build this, it will give mediocre answers to most questions because the knowledge base only has what you've had time to write. This is normal. It gets better with every document you add, and the highest-value documents (metric definitions, known seasonality) are usually manageable to write in a day or two.


The honest take on maintenance cost

This system is not "build once, run forever." It requires ongoing maintenance — updating the knowledge base as definitions change, auditing retrieval quality periodically, adding new documents when new questions surface that the system can't answer.

That maintenance is worth it if your analytics team is fielding repeated questions about the same things. If you're answering "what does activated mean?" three times a week across Slack threads, the ROI of writing that definition once and making it retrievable is obvious. If your team is small and the question volume is low, the overhead may not be worth it yet.

The teams where this adds the most value are ones with a large number of metric definitions that aren't consistently documented, a mix of technical and non-technical stakeholders asking data questions, and an analytics team that spends meaningful time answering definitional questions rather than doing analysis. If that's your situation, the build is a week of work and the payoff is ongoing.