Payment Metrics Anomaly Detection

Fintech · Real-time ML · GCP

Python · Isolation Forest · statsmodels · BigQuery · Google Pub/Sub · Cloud Run

The Problem

At Razorpay's scale, even a 1–2% drop in payment success rate represents tens of crores in failed transactions and direct merchant churn. The challenge is detecting these anomalies fast enough to act: before merchants notice, before support tickets spike, before the damage compounds.

The existing approach was manual: on-call engineers eyeballing dashboards. Mean time to detect (MTTD) was 25–40 minutes for most incidents. In payments, that's an eternity.

The Solution: Layered Anomaly Detection

The system runs two complementary detection methods in parallel, each suited to a different anomaly type:

Statistical control charts (CUSUM): detect gradual drifts and sustained shifts. Well suited to catching slow-burn degradation (e.g., a bank's acceptance rate declining over 2 hours).
Isolation Forest: detects sudden multivariate anomalies by isolating outliers across payment success rate, volume, latency, and error code distributions simultaneously.

Architecture

Metrics flow from payment processing systems into Google Pub/Sub, are aggregated in 5-minute windows in BigQuery, and processed by a Cloud Run service that runs both detectors continuously.
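
To make the windowing step concrete, here is a minimal sketch of rolling raw payment events up into 5-minute feature windows. The production system does this in BigQuery; pandas is used here only as a stand-in to show the shape of the output. The input column names (created_at, status, latency_ms) and the status values are assumptions made for illustration.

```python
import pandas as pd


def aggregate_windows(events: pd.DataFrame, freq: str = "5min") -> pd.DataFrame:
    """Roll raw payment events up into fixed windows of detector features.

    Assumed (hypothetical) input columns:
      created_at  - event timestamp (datetime64)
      status      - 'success', 'failed' or 'timeout'
      latency_ms  - end-to-end latency of the attempt
    """
    df = events.assign(
        is_success=events["status"].eq("success"),
        is_error=events["status"].eq("failed"),
        is_timeout=events["status"].eq("timeout"),
    )
    grouped = df.groupby(pd.Grouper(key="created_at", freq=freq))
    windows = pd.DataFrame({
        "success_rate": grouped["is_success"].mean(),
        "txn_volume": grouped["is_success"].size(),
        "p95_latency_ms": grouped["latency_ms"].quantile(0.95),
        "error_rate": grouped["is_error"].mean(),
        "timeout_rate": grouped["is_timeout"].mean(),
    })
    # Drop empty windows, where the rates are undefined (NaN).
    return windows[windows["txn_volume"] > 0]
```

Each output row is one window carrying the five columns the detector reads: success_rate, txn_volume, p95_latency_ms, error_rate and timeout_rate.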

from sklearn.ensemble import IsolationForest
import numpy as np

class PaymentAnomalyDetector:
    def __init__(self, contamination=0.02):
        self.iso_forest = IsolationForest(
            contamination=contamination,
            n_estimators=200,
            random_state=42
        )
        self.is_fitted = False

    def fit(self, historical_df):
        """Train on 30 days of clean historical data"""
        features = self._extract_features(historical_df)
        self.iso_forest.fit(features)
        self.is_fitted = True
        return self

    def _extract_features(self, df):
        return df[[
            'success_rate',
            'txn_volume',
            'p95_latency_ms',
            'error_rate',
            'timeout_rate'
        ]].values

    def score(self, window_df):
        """Score 5-minute windows. Lower score = more anomalous."""
        if not self.is_fitted:
            raise RuntimeError("Call fit() before score()")
        features = self._extract_features(window_df)
        # decision_function: negative values fall on the anomalous side of the threshold
        scores = self.iso_forest.decision_function(features)
        predictions = self.iso_forest.predict(features)  # -1 = anomaly, 1 = normal
        return scores, predictions
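
As a rough illustration of how these scores behave, the sketch below fits an Isolation Forest with the same hyperparameters on synthetic "healthy" windows and scores one degraded and one typical window. All numbers are made up for the example and are not production data; the column order mirrors the five features above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n = 2000  # synthetic "healthy" 5-minute windows (illustrative values only)

# Columns: success_rate, txn_volume, p95_latency_ms, error_rate, timeout_rate
healthy = np.column_stack([
    rng.normal(0.95, 0.01, n),
    rng.normal(10_000, 500, n),
    rng.normal(800, 50, n),
    rng.normal(0.03, 0.005, n),
    rng.normal(0.01, 0.002, n),
])

model = IsolationForest(contamination=0.02, n_estimators=200, random_state=42)
model.fit(healthy)

windows = np.array([
    [0.70, 9_800, 2_500, 0.20, 0.10],  # degraded: low success, high latency and errors
    [0.95, 10_100, 810, 0.03, 0.01],   # typical healthy window
])
scores = model.decision_function(windows)  # lower = more anomalous
labels = model.predict(windows)            # -1 = anomaly, 1 = normal

for score, label in zip(scores, labels):
    print(f"score={score:+.3f} label={label}")
```

The degraded window should receive a clearly lower score than the typical one. The contamination value sets where the -1/1 cut-off falls on the training scores, so it acts as a sensitivity dial rather than a measured anomaly rate.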

CUSUM for Drift Detection

import numpy as np

def cusum_detect(series: list, threshold: float = 5.0, drift: float = 0.5) -> bool:
    """
    Two-sided CUSUM control chart for detecting sustained shifts.
    Returns True if a significant change is detected.
    """
    baseline_n = 20  # Baseline from first 20 observations
    if len(series) <= baseline_n:
        return False  # Nothing beyond the baseline to test

    mean = np.mean(series[:baseline_n])
    std = np.std(series[:baseline_n])

    cusum_pos, cusum_neg = 0.0, 0.0

    for x in series[baseline_n:]:
        z = (x - mean) / (std + 1e-8)  # Epsilon guards against a zero-variance baseline
        cusum_pos = max(0.0, cusum_pos + z - drift)  # Accumulates upward shifts
        cusum_neg = max(0.0, cusum_neg - z - drift)  # Accumulates downward shifts

        if cusum_pos > threshold or cusum_neg > threshold:
            return True  # Anomaly detected

    return False
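
The threshold and drift parameters trade sensitivity against detection delay. The sketch below is a noise-free idealization of that trade-off, not part of the production detector: it assumes every window after the baseline sits exactly shift_sigma standard deviations away from the baseline mean and counts how many windows the CUSUM statistic needs to cross the threshold. The helper name windows_to_alarm is hypothetical.

```python
from typing import Optional


def windows_to_alarm(shift_sigma: float, threshold: float = 5.0,
                     drift: float = 0.5, max_windows: int = 1000) -> Optional[int]:
    """Windows needed for a constant shift of `shift_sigma` standard deviations
    to trigger the CUSUM alarm, or None if it never does within max_windows."""
    cusum = 0.0
    for n in range(1, max_windows + 1):
        # Same recurrence as the detector, with a constant z-score.
        # abs() covers both directions because the chart is two-sided.
        cusum = max(0.0, cusum + abs(shift_sigma) - drift)
        if cusum > threshold:
            return n
    return None


for shift in (0.5, 1.5, 3.0):
    print(shift, windows_to_alarm(shift))
```

With the defaults, a sustained 1.5-sigma shift adds 1.0 per window and alarms on the 6th window (about 30 minutes of 5-minute windows), a 3-sigma shift alarms on the 3rd, and a shift no larger than the drift allowance never accumulates. Real series are noisy, so actual delays will vary around these figures.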

Alert Logic

Alerts fire only when both detectors agree, which sharply reduces the false positives that plague single-method systems. When both flag an anomaly, a Slack alert is sent with:

Which metrics are anomalous and by how much
Likely affected payment instruments (cards, UPI, netbanking)
A BigQuery deep-link to drill into the raw data
Severity classification (P1/P2/P3) based on the magnitude and duration
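
The dual-confirmation rule and severity classification can be sketched roughly as follows. This is an illustration under stated assumptions, not the production logic: the WindowVerdict structure, the function names, and especially the P1/P2/P3 cut-offs are hypothetical. No Slack call is made; the function only builds a message payload.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WindowVerdict:
    """Detector outputs for one 5-minute window (hypothetical structure)."""
    iso_forest_flag: bool         # Isolation Forest predicted -1
    cusum_flag: bool              # CUSUM crossed its threshold
    success_rate_drop_pct: float  # drop versus baseline, in percentage points
    duration_min: int             # how long the anomaly has persisted


def classify_severity(drop_pct: float, duration_min: int) -> str:
    # Illustrative cut-offs only; the real ones are not stated in this write-up.
    if drop_pct >= 5.0 or duration_min >= 30:
        return "P1"
    if drop_pct >= 2.0 or duration_min >= 15:
        return "P2"
    return "P3"


def build_alert(verdict: WindowVerdict, instruments: List[str],
                bigquery_link: str) -> Optional[dict]:
    """Return an alert payload only when both detectors agree, else None."""
    if not (verdict.iso_forest_flag and verdict.cusum_flag):
        return None
    severity = classify_severity(verdict.success_rate_drop_pct, verdict.duration_min)
    text = (
        f"[{severity}] Payment success rate down "
        f"{verdict.success_rate_drop_pct:.1f} pts for {verdict.duration_min} min. "
        f"Likely affected: {', '.join(instruments)}. Details: {bigquery_link}"
    )
    return {"severity": severity, "text": text}
```

Returning None when only one detector fires keeps the suppression decision in one place, so the caller simply skips sending when there is no payload.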

Results

70% reduction in MTTD

~7 min average detection time

<3% false positive rate

Key Learnings

Seasonality is everything. Payment volumes follow strong intraday and intraweek patterns. The Isolation Forest must be retrained weekly or it starts flagging normal peak traffic as anomalous.
Require dual confirmation. Single-detector systems generate too many false positives at scale. Engineers only trust alerts when the false positive rate is low, and that trust matters more than anything else.
Alert fatigue kills adoption. The first version sent too many P2 alerts. Tightening thresholds and requiring 2 consecutive anomalous windows before alerting reduced volume by 60% with minimal loss in detection sensitivity.
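
The consecutive-window rule from the last point can be sketched as a small stateful gate. The class name and interface are hypothetical; the only thing taken from the system is the requirement of 2 consecutive anomalous windows.

```python
class ConsecutiveWindowGate:
    """Pass an alert through only after `required` anomalous windows in a row."""

    def __init__(self, required: int = 2):
        self.required = required
        self.streak = 0  # current run of consecutive anomalous windows

    def update(self, is_anomalous: bool) -> bool:
        # A single normal window resets the streak.
        self.streak = self.streak + 1 if is_anomalous else 0
        return self.streak >= self.required


gate = ConsecutiveWindowGate(required=2)
flags = [True, False, True, True, True, False]
decisions = [gate.update(flag) for flag in flags]
print(decisions)
```

An isolated anomalous window never alerts, at the cost of one extra window (5 minutes) of delay on real incidents. The gate stays open for as long as the streak continues, so a caller that wants a single notification per incident would alert only on the first True.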