February 25, 2026 · AI · 8 min read
We had a Slack alert go off at 2:17am. It said: "ALERT: payment_success_rate — anomaly detected (z-score: -1.84, current: 91.2%, baseline: 93.8%)"
Nobody did anything. The on-call engineer saw it, wasn't sure if it was real, and went back to sleep. By morning, the payment flow had been degraded for several hours before someone manually spotted the trend in the dashboard.
The detection worked perfectly. That was the problem — the detection was the easy part. The hard part was that nobody knew what to do with a z-score at 2am. The alert spoke in the language of statistics. The person reading it needed plain English.
A z-score of -1.84 tells you something is outside the expected range. It does not tell you: is this serious or borderline? Has this happened before on a Sunday? Is there a known event that explains it — a deployment, a bank downtime, end-of-month traffic patterns? Should I wake someone up or wait for the 9am standup?
An anomaly detection system that outputs a sigma value has done roughly 20% of the job. The remaining 80% is turning that signal into something actionable. For a long time that 80% required a human analyst who knew the metrics, knew the business context, knew what "normal Sunday night traffic looks like." That doesn't scale. And it definitely doesn't work at 2am.
This is where LLMs fit — not doing the detection, but doing the interpretation. The translation layer between the statistical signal and the human who needs to act on it.
For payment metrics — success rate, transaction volume, latency — I use a combination of Z-score detection for fast, interpretable flagging and Isolation Forest for catching subtler multi-dimensional anomalies. Neither is exotic. Both are practical for operational monitoring.
The Z-score approach works well when your metric has a reasonably stable distribution within a time window. For payment success rate, I compute a rolling 28-day baseline (same day-of-week weighted, because Sundays genuinely look different) and flag anything beyond ±2σ.
import numpy as np
import pandas as pd
from scipy import stats

def detect_anomaly_zscore(series: pd.Series, window: int = 28) -> dict:
    """
    Rolling Z-score anomaly detection with day-of-week baseline.
    Returns detection result and context for narration.
    """
    current = series.iloc[-1]
    # Use same day-of-week observations for the baseline, excluding
    # the current point so it can't drag its own baseline
    dow = series.index[-1].dayofweek
    same_dow = series[series.index.dayofweek == dow]
    baseline = same_dow.iloc[:-1].iloc[-window:]
    mean = baseline.mean()
    std = baseline.std()
    z_score = (current - mean) / std if std > 0 else 0.0
    return {
        "current_value": current,
        "baseline_mean": round(mean, 4),
        "baseline_std": round(std, 4),
        "z_score": round(z_score, 2),
        "is_anomaly": abs(z_score) > 2.0,
        "direction": "below" if z_score < 0 else "above",
        "pct_deviation": round(((current - mean) / mean) * 100, 2),
        "n_baseline_obs": len(baseline),
    }
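A quick sanity check of the day-of-week baseline idea on synthetic data. The dates, values, and noise level here are all invented for illustration:

```python
import numpy as np
import pandas as pd

# Eight weeks of daily success-rate readings around 93.8%, with a
# sharp drop on the final day (all values are synthetic)
idx = pd.date_range("2026-01-01", periods=56, freq="D")
rng = np.random.default_rng(0)
series = pd.Series(93.8 + rng.normal(0, 0.2, size=56), index=idx)
series.iloc[-1] = 91.2  # the anomalous reading

# Day-of-week baseline (here excluding the final point),
# then a z-score of the current value against that baseline
dow = series.index[-1].dayofweek
baseline = series[series.index.dayofweek == dow].iloc[:-1].iloc[-28:]
z = (series.iloc[-1] - baseline.mean()) / baseline.std()

print(abs(z) > 2.0)  # True: flagged as an anomaly
```

With only eight weeks of history, the baseline is seven observations. That's thin; the 28-week window in the real function exists precisely so one weird Sunday doesn't distort the baseline.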
For multi-metric anomaly detection — catching when volume, success rate, and latency all shift together in a way that individually looks borderline — Isolation Forest handles it better:
from sklearn.ensemble import IsolationForest

def detect_anomaly_isolation(features_df: pd.DataFrame,
                             contamination: float = 0.05) -> tuple:
    """
    Multi-feature anomaly detection. contamination is your expected
    anomaly rate — 0.05 means ~5% of points treated as anomalous.
    Lower this if you're getting too many alerts.
    Returns (predictions, scores): -1/1 labels and anomaly scores.
    """
    model = IsolationForest(contamination=contamination,
                            random_state=42, n_estimators=100)
    model.fit(features_df)
    scores = model.decision_function(features_df)
    predictions = model.predict(features_df)
    return predictions, scores
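A toy run of the multi-feature approach. All numbers are invented, and the final row's shift is exaggerated here so the toy result is stable; in practice the interesting cases are subtler:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# 200 synthetic hourly observations of three payment features,
# plus one final row where all three shift together
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "volume": rng.normal(10_000, 300, 200),
    "success_rate": rng.normal(93.8, 0.4, 200),
    "latency_p95_ms": rng.normal(450, 20, 200),
})
df.loc[len(df)] = [9_100, 92.6, 510.0]  # all three features shift at once

model = IsolationForest(contamination=0.05, random_state=42,
                        n_estimators=100)
model.fit(df)
print(model.predict(df.tail(1))[0])  # -1 means anomalous, 1 means normal
```

The point of the joint model: each feature in that last row might survive a single-metric threshold, but the combination isolates quickly in the trees.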
This is the step most people skip, and it's the most important one. The LLM cannot narrate a useful alert if all you give it is a number and a sigma. You need to build context before you call the model.
What context matters for a payment metric anomaly: the recent trend (was this building gradually or did it appear suddenly?), how rare the current value is historically, how it compares to the same day of week in recent weeks, and any known events (deployments, bank maintenance, holidays) near the anomaly window.
Most of this context can be computed programmatically. You don't need the LLM to figure it out — you need to compute it and hand it to the model.
def build_alert_context(metric_name: str,
                        series: pd.Series,
                        detection_result: dict,
                        known_events: list = None) -> dict:
    """
    Build rich context for LLM narration.
    known_events: list of dicts like
      [{"date": "2026-03-03", "description": "DB migration"}]
    """
    # Recent trend: slope over last 7 days
    recent = series.iloc[-7:]
    trend_slope = np.polyfit(range(len(recent)), recent.values, 1)[0]
    trend_direction = "declining" if trend_slope < -0.001 else (
        "rising" if trend_slope > 0.001 else "stable")

    # How rare is this value historically?
    historical_percentile = stats.percentileofscore(series.iloc[:-1],
                                                    series.iloc[-1])

    # Same-day-of-week comparison: last 4 weeks
    dow = series.index[-1].dayofweek
    same_dow = series[series.index.dayofweek == dow].iloc[-4:]
    dow_avg = same_dow.mean()
    vs_dow_avg = round(((series.iloc[-1] - dow_avg) / dow_avg) * 100, 2)

    # Known events within the last 3 days
    cutoff = str((series.index[-1] - pd.Timedelta(days=3)).date())
    recent_events = [e for e in (known_events or []) if e["date"] >= cutoff]

    return {
        "metric_name": metric_name,
        "current_value": detection_result["current_value"],
        "z_score": detection_result["z_score"],
        "pct_deviation": detection_result["pct_deviation"],
        "direction": detection_result["direction"],
        "trend_7d": trend_direction,
        "historical_percentile": round(historical_percentile, 1),
        "vs_same_dow_avg_pct": vs_dow_avg,
        "recent_events": recent_events,
        "timestamp": str(series.index[-1]),
    }
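For the incident from the intro, the payload handed to the model looks roughly like this. The values are illustrative, not from a real run:

```python
context = {
    "metric_name": "payment_success_rate",
    "current_value": 91.2,
    "z_score": -1.84,
    "pct_deviation": -2.77,        # (91.2 - 93.8) / 93.8 * 100
    "direction": "below",
    "trend_7d": "stable",
    "historical_percentile": 1.8,  # illustrative
    "vs_same_dow_avg_pct": -2.81,  # illustrative
    "recent_events": [],
    "timestamp": "2026-02-22 02:00:00",
}
```

Every field is something the prompt template can interpolate directly; nothing requires the model to do arithmetic or look anything up.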
The goal is a 3-4 sentence alert that a non-analyst can read and act on. The key constraints: don't claim causation, be specific about severity, and always tell them what to do next.
I spent a lot of time on the "don't claim causation" part. LLMs naturally want to explain why something happened, and they will invent a plausible reason if you don't stop them. That's dangerous in an operational alert. "Success rate dropped because of increased fraud" sounds authoritative. It might be completely wrong. I now explicitly tell the model to flag possible reasons as speculative.
import google.generativeai as genai
genai.configure(api_key="YOUR_GEMINI_API_KEY")
NARRATION_PROMPT = """You are an analytics alert system for a payments platform.
Write a clear, plain-English alert message for an on-call engineer.
Anomaly data:
- Metric: {metric_name}
- Current value: {current_value}
- Deviation: {pct_deviation}% {direction} the expected range (z-score: {z_score})
- 7-day trend before this: {trend_7d}
- Historical context: this value is at the {historical_percentile}th percentile historically
- vs. same day of week average: {vs_same_dow_avg_pct}%
- Recent events near this time: {recent_events}
Write 3-4 sentences. Cover:
1. What happened and how significant it is
2. Whether the context makes it more or less concerning (trend, day-of-week, known events)
3. What the on-call engineer should check first
Rules:
- Never state a cause as fact. If suggesting a reason, say "possibly" or "worth checking"
- Be specific with numbers
- Do not use jargon (no "z-score" in the output)
- End with a concrete next action"""
def narrate_alert(context: dict) -> str:
    model = genai.GenerativeModel("gemini-1.5-flash")
    prompt = NARRATION_PROMPT.format(**context)
    response = model.generate_content(prompt)
    return response.text
Putting it together — this is roughly the pipeline running for Razorpay payment metrics. Simplified, but the structure is real.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

def send_slack_alert(metric_name: str, narration: str,
                     context: dict, channel: str = "#payments-alerts"):
    severity = "CRITICAL" if abs(context["z_score"]) > 3 else "WARNING"
    color = "#E53E3E" if severity == "CRITICAL" else "#DD6B20"
    payload = {
        "channel": channel,
        "attachments": [{
            "color": color,
            "title": f"{severity}: {metric_name} anomaly detected",
            "text": narration,
            "footer": (f"z-score: {context['z_score']} | "
                       f"Current: {context['current_value']} | "
                       f"{context['timestamp']}"),
            "footer_icon": "https://nikunjkaushik.com/assets/nk_favicon.png"
        }]
    }
    requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
def run_anomaly_pipeline(metric_name: str,
                         series: pd.Series,
                         known_events: list = None):
    result = detect_anomaly_zscore(series)
    if not result["is_anomaly"]:
        return  # No alert
    context = build_alert_context(metric_name, series, result, known_events)
    narration = narrate_alert(context)
    send_slack_alert(metric_name, narration, context)
    return narration
What the alert looks like on the other end: "Payment success rate has dropped to 91.2%, which is 2.8% below the expected range for Sunday evenings. This follows a stable trend over the past week, making the drop appear sudden rather than a gradual drift. There are no scheduled events logged for this window. Recommended first check: payment gateway error codes in the last 30 minutes and whether a specific bank or instrument type is driving the drop."
That's something an engineer can act on at 2am. The first version — a sigma value — was not.
The LLM will overclaim causation if you let it. "The drop is likely caused by" is the phrase to watch for. Add explicit instructions to use hedging language. Review a week of live alerts and check every causal claim. You'll find some that look authoritative and are pure hallucination.
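One cheap mitigation that helped the review process: a post-generation check that flags unhedged causal language so the alert can be regenerated or routed for review. This is a sketch of my own, not part of the pipeline above, and the phrase lists are starting points, not exhaustive:

```python
import re

# Phrases that assert a cause, and hedges that soften one
CAUSAL_PATTERNS = [r"\bcaused by\b", r"\bbecause of\b",
                   r"\bdue to\b", r"\bthe cause is\b"]
HEDGES = ("possibly", "may ", "might", "worth checking", "could ")

def flags_unhedged_causation(text: str) -> bool:
    """True if the text asserts a cause without any hedging language."""
    lower = text.lower()
    has_causal = any(re.search(p, lower) for p in CAUSAL_PATTERNS)
    has_hedge = any(h in lower for h in HEDGES)
    return has_causal and not has_hedge

print(flags_unhedged_causation(
    "The drop is likely caused by increased fraud."))    # True: reject
print(flags_unhedged_causation(
    "The drop is possibly related to gateway errors."))  # False: passes
```

A regex gate is crude, but it catches the most dangerous failure mode (confident-sounding invented causes) without another model call.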
Threshold tuning still matters — maybe more than before. When your alerts were just numbers, a false positive was annoying. When your alert is a well-written paragraph with a severity label and a recommended action, a false positive feels more authoritative and is more likely to cause unnecessary work. Tune your anomaly thresholds before you add the narration layer. If you're getting 20 alerts a day, the narration makes that worse, not better.
Don't alert on everything. Not every anomaly deserves a Slack message. I set different thresholds for different metrics: z > 2.5 for success rate (high impact), z > 3.0 for transaction volume (noisier metric), z > 2.0 for latency spikes (fast-moving, needs faster response). Running every metric through the LLM narration on every anomaly will cost money and create alert fatigue. Be selective about what triggers the full pipeline versus what gets logged silently for human review.
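Concretely, that gating can be a small config in front of the pipeline. The metric names and numbers mirror the thresholds above; `should_narrate` is a helper I'm sketching here, not part of the pipeline code:

```python
# z-score thresholds per metric; anything below the bar is
# logged for review, not narrated
ALERT_THRESHOLDS = {
    "payment_success_rate": 2.5,  # high impact
    "transaction_volume": 3.0,    # noisier metric
    "latency_p95": 2.0,           # fast-moving, needs faster response
}
DEFAULT_THRESHOLD = 3.0           # conservative for unlisted metrics

def should_narrate(metric_name: str, z_score: float) -> bool:
    return abs(z_score) > ALERT_THRESHOLDS.get(metric_name, DEFAULT_THRESHOLD)

print(should_narrate("payment_success_rate", -2.7))  # True
print(should_narrate("transaction_volume", -2.7))    # False: logged only
```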
The knowledge base of known events needs maintenance. The "recent events" context I pass into the prompt — deployments, bank maintenance windows, holidays — is only useful if it's kept up to date. I have a simple Google Sheet that the engineering team writes to. It's not elegant, but it means the model occasionally generates alerts like "a database migration was scheduled for this window — this may be expected" instead of flagging a known maintenance window as a crisis.
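The events list itself is trivially small to consume. Loading a CSV export of the sheet into the `known_events` format the context builder expects looks like this (the rows here are invented):

```python
import csv
import io

# Two-column export of the shared events sheet (rows invented)
SHEET_CSV = """date,description
2026-02-20,DB migration
2026-02-21,Bank maintenance window
"""

known_events = list(csv.DictReader(io.StringIO(SHEET_CSV)))
print(known_events[0])  # {'date': '2026-02-20', 'description': 'DB migration'}
```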
If you have operational metrics that matter, humans who aren't data scientists reading the alerts, and any kind of on-call rotation — this is worth an afternoon. The detection code is not the investment. The investment is in the context-building layer and in spending time reviewing real alerts to tune the prompt.
It's not worth building if your anomaly volume is low enough that a human analyst reviews every flag anyway, or if your metrics are so domain-specific that the LLM consistently gets the context wrong. In the second case, the prompt engineering rabbit hole gets deep fast and the ROI drops.
The question to ask yourself: when an alert fires today, does the person reading it know what to do? If the answer is "only if the right analyst happens to be awake," you have the problem this solves.