Payment Metrics Anomaly Detection

Fintech · Real-time ML · GCP


Python · Isolation Forest · statsmodels · BigQuery · Google Pub/Sub · Cloud Run

The Problem

At Razorpay's scale, payment success rate drops of even 1–2% represent tens of crores in failed transactions and direct merchant churn. The challenge: detecting these anomalies fast enough to act — before merchants notice, before support tickets spike, before the damage compounds.

The existing approach was manual: on-call engineers eyeballing dashboards. Mean time to detect (MTTD) was 25–40 minutes for most incidents. In payments, that's an eternity.


The Solution: Layered Anomaly Detection

The system runs two complementary detection methods in parallel, each suited to a different anomaly type:

- Isolation Forest for multivariate point anomalies: sudden, correlated deviations across success rate, volume, latency, and error rates
- CUSUM control charts for drift: small but sustained degradations that no single 5-minute window would flag on its own

Architecture

Metrics flow from payment processing systems into Google Pub/Sub, are aggregated in 5-minute windows in BigQuery, and processed by a Cloud Run service that runs both detectors continuously.
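The windowing step itself is just fixed-interval bucketing. A minimal sketch in plain Python (in production this aggregation runs as a BigQuery query; the event field names here are illustrative, not the real schema):

```python
from collections import defaultdict

def aggregate_windows(events, window_secs=300):
    """Bucket raw payment events into fixed 5-minute windows and compute
    per-window volume and success rate (field names are illustrative)."""
    buckets = defaultdict(list)
    for e in events:
        start = e["ts"] - e["ts"] % window_secs  # floor to window boundary
        buckets[start].append(e)
    rows = []
    for start in sorted(buckets):
        evts = buckets[start]
        ok = sum(1 for e in evts if e["status"] == "success")
        rows.append({
            "window_start": start,
            "txn_volume": len(evts),
            "success_rate": ok / len(evts),
        })
    return rows

# One event per minute for 10 minutes → two 5-minute windows
events = [{"ts": t, "status": "success" if (t // 60) % 2 == 0 else "failed"}
          for t in range(0, 600, 60)]
print(aggregate_windows(events))
```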

from sklearn.ensemble import IsolationForest
import numpy as np

class PaymentAnomalyDetector:
    def __init__(self, contamination=0.02):
        self.iso_forest = IsolationForest(
            contamination=contamination,
            n_estimators=200,
            random_state=42
        )
        self.is_fitted = False

    def fit(self, historical_df):
        """Train on 30 days of clean historical data"""
        features = self._extract_features(historical_df)
        self.iso_forest.fit(features)
        self.is_fitted = True
        return self

    def _extract_features(self, df):
        return df[[
            'success_rate',
            'txn_volume',
            'p95_latency_ms',
            'error_rate',
            'timeout_rate'
        ]].values

    def score(self, window_df):
        """Returns anomaly score for a 5-min window. Lower = more anomalous."""
        if not self.is_fitted:
            raise RuntimeError("Call fit() before score()")
        features = self._extract_features(window_df)
        scores = self.iso_forest.decision_function(features)
        predictions = self.iso_forest.predict(features)  # -1 = anomaly, 1 = normal
        return scores, predictions
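As a quick sanity check of the scoring behaviour, here is the same Isolation Forest configuration applied directly to synthetic data: 500 invented "healthy" windows, then one typical window and one obviously degraded window (all distributions and values below are illustrative, not production numbers):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic "healthy" training windows; columns mirror the feature set
# above: success_rate, txn_volume, p95_latency_ms, error_rate, timeout_rate
healthy = np.column_stack([
    rng.normal(0.94, 0.01, 500),    # success_rate
    rng.normal(10_000, 300, 500),   # txn_volume
    rng.normal(800, 40, 500),       # p95_latency_ms
    rng.normal(0.02, 0.005, 500),   # error_rate
    rng.normal(0.01, 0.003, 500),   # timeout_rate
])

iso = IsolationForest(contamination=0.02, n_estimators=200, random_state=42)
iso.fit(healthy)

normal_window = np.array([[0.94, 10_000, 800, 0.02, 0.01]])   # at the healthy mean
degraded_window = np.array([[0.70, 4_000, 2_500, 0.20, 0.08]])  # obvious incident

print(iso.predict(normal_window))    # expect [1]  (normal)
print(iso.predict(degraded_window))  # expect [-1] (anomaly)
```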

CUSUM for Drift Detection

def cusum_detect(series: list, threshold: float = 5.0, drift: float = 0.5) -> bool:
    """
    CUSUM control chart for detecting sustained shifts.
    Returns True if a significant change is detected.
    """
    if len(series) <= 20:
        return False  # Too short to establish a baseline

    mean = np.mean(series[:20])  # Baseline from first 20 observations
    std = np.std(series[:20])

    cusum_pos, cusum_neg = 0, 0

    for x in series[20:]:
        z = (x - mean) / (std + 1e-8)
        cusum_pos = max(0, cusum_pos + z - drift)
        cusum_neg = max(0, cusum_neg - z - drift)

        if cusum_pos > threshold or cusum_neg > threshold:
            return True  # Anomaly detected

    return False
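To see why a sustained drop trips the chart quickly: once the baseline is fixed, each post-shift observation adds roughly |z| - drift to the accumulator. With illustrative numbers (a 2% success-rate drop against a 0.2% baseline standard deviation):

```python
mean, std = 0.94, 0.002          # baseline from the first 20 windows (invented)
drift, threshold = 0.5, 5.0

cusum_neg = 0.0
for x in (0.92, 0.92, 0.92):     # sustained 2% drop in success rate
    z = (x - mean) / std         # roughly -10 standard deviations
    cusum_neg = max(0.0, cusum_neg - z - drift)

print(cusum_neg > threshold)     # the very first window already pushes past 5
```

Each degraded window contributes about 9.5 to the accumulator, so the threshold of 5 is crossed within one observation; random noise, by contrast, keeps resetting the accumulator toward zero.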

Alert Logic

Alerts fire only when both detectors agree, which sharply cuts the false positives that plague single-method systems. When both flag an anomaly, a Slack alert is sent to the on-call channel.
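The gating itself is a simple conjunction. A minimal sketch (the function name and signature are mine, not from the production system):

```python
def should_alert(iso_prediction: int, cusum_triggered: bool) -> bool:
    """Fire only when both detectors agree. `iso_prediction` follows
    sklearn's convention: -1 = anomaly, 1 = normal."""
    return iso_prediction == -1 and cusum_triggered

print(should_alert(-1, True))   # both agree → alert
print(should_alert(-1, False))  # isolation forest alone → suppressed
print(should_alert(1, True))    # CUSUM alone → suppressed
```

Requiring agreement trades a little detection latency for precision: a point anomaly must also show up as a sustained shift (or vice versa) before anyone is paged.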


Results

70% Reduction in MTTD
~7 min Average detection time
<3% False positive rate

Key Learnings