Fintech · Real-time ML · GCP
At Razorpay's scale, even a 1–2% drop in payment success rate represents tens of crores in failed transactions and direct merchant churn. The challenge is detecting these anomalies fast enough to act: before merchants notice, before support tickets spike, before the damage compounds.
The existing approach was manual: on-call engineers eyeballing dashboards. Mean time to detect (MTTD) was 25–40 minutes for most incidents. In payments, that's an eternity.
The system runs two complementary detection methods in parallel, each suited to a different anomaly type:
Metrics flow from the payment processing systems into Google Cloud Pub/Sub, are aggregated into 5-minute windows in BigQuery, and are then scored by a Cloud Run service that runs both detectors continuously.
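To make the 5-minute windowing concrete, here is a minimal sketch of the aggregation step in plain Python. It is a stand-in for illustration only: the pipeline described above does this in BigQuery, and the event fields (`ts`, `status`, `latency_ms`), the status values, and the function names are assumptions rather than the production schema. The output keys mirror the feature names the detector consumes.

```python
from collections import defaultdict
from datetime import datetime
import math


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 5-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def p95(values: list) -> float:
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def aggregate_windows(events: list) -> list:
    """Group raw events into 5-minute windows and compute per-window features."""
    buckets = defaultdict(list)
    for event in events:
        buckets[window_start(event["ts"])].append(event)

    rows = []
    for start in sorted(buckets):
        batch = buckets[start]
        n = len(batch)
        statuses = [e["status"] for e in batch]  # assumed: success / error / timeout
        rows.append({
            "window_start": start,
            "success_rate": statuses.count("success") / n,
            "txn_volume": n,
            "p95_latency_ms": p95([e["latency_ms"] for e in batch]),
            "error_rate": statuses.count("error") / n,
            "timeout_rate": statuses.count("timeout") / n,
        })
    return rows
```

Each returned row corresponds to one window, so a list of rows can be turned into the DataFrame that the detector's feature extraction expects.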
from sklearn.ensemble import IsolationForest
import numpy as np


class PaymentAnomalyDetector:
    def __init__(self, contamination=0.02):
        # contamination is the expected fraction of anomalous windows
        self.iso_forest = IsolationForest(
            contamination=contamination,
            n_estimators=200,
            random_state=42
        )
        self.is_fitted = False

    def fit(self, historical_df):
        """Train on 30 days of clean historical data"""
        features = self._extract_features(historical_df)
        self.iso_forest.fit(features)
        self.is_fitted = True
        return self

    def _extract_features(self, df):
        # One row per 5-minute window, columns in a fixed order
        return df[[
            'success_rate',
            'txn_volume',
            'p95_latency_ms',
            'error_rate',
            'timeout_rate'
        ]].values

    def score(self, window_df):
        """Returns anomaly score for a 5-min window. Lower = more anomalous."""
        if not self.is_fitted:
            raise RuntimeError("Call fit() on historical data before score().")
        features = self._extract_features(window_df)
        scores = self.iso_forest.decision_function(features)
        predictions = self.iso_forest.predict(features)  # -1 = anomaly, 1 = normal
        return scores, predictions
def cusum_detect(series: list, threshold: float = 5.0, drift: float = 0.5) -> bool:
    """
    CUSUM control chart for detecting sustained shifts.
    Returns True if a significant change is detected.
    """
    mean = np.mean(series[:20])  # Baseline from first 20 observations
    std = np.std(series[:20])
    cusum_pos, cusum_neg = 0, 0
    for x in series[20:]:
        z = (x - mean) / (std + 1e-8)
        cusum_pos = max(0, cusum_pos + z - drift)
        cusum_neg = max(0, cusum_neg - z - drift)
        if cusum_pos > threshold or cusum_neg > threshold:
            return True  # Anomaly detected
    return False
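To see how quickly the CUSUM logic above reacts to a sustained shift, the following sketch reimplements the same recurrence with the standard library only and returns the index of the first alarm instead of a boolean. The function name `cusum_first_alarm`, the `baseline_n` parameter, and the synthetic success-rate series are hypothetical and exist only for illustration.

```python
from statistics import fmean, pstdev
from typing import Optional, Sequence


def cusum_first_alarm(series: Sequence[float], baseline_n: int = 20,
                      threshold: float = 5.0, drift: float = 0.5) -> Optional[int]:
    """Return the index of the first observation that trips the alarm, or None."""
    baseline = series[:baseline_n]
    mean = fmean(baseline)
    std = pstdev(baseline)  # population std, the same default as np.std
    cusum_pos = cusum_neg = 0.0
    for i in range(baseline_n, len(series)):
        z = (series[i] - mean) / (std + 1e-8)
        cusum_pos = max(0.0, cusum_pos + z - drift)  # accumulates upward shifts
        cusum_neg = max(0.0, cusum_neg - z - drift)  # accumulates downward shifts
        if cusum_pos > threshold or cusum_neg > threshold:
            return i
    return None


# Synthetic success-rate series (percent): stable baseline, then a sustained drop.
baseline = [97.9, 98.1] * 10   # 20 windows, mean 98.0, std about 0.1
shifted = [97.8] * 6           # drop of roughly 2 baseline standard deviations
alarm_index = cusum_first_alarm(baseline + shifted)
windows_to_detect = alarm_index - len(baseline) + 1
```

In this noise-free example each shifted window adds about 1.5 to the negative sum (a z-score near -2 minus the drift of 0.5), so the threshold of 5.0 is crossed on the fourth shifted window, which at 5-minute windows is roughly 20 minutes after the drop begins. Lowering `threshold` or `drift` shortens that delay at the cost of more false alarms, so the defaults here should be read as a starting point rather than tuned values.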
Alerts fire only when both detectors agree, which suppresses much of the false-positive noise that plagues single-method systems. When both flag an anomaly, a Slack alert is sent with: