User Churn Prediction Model

Fintech  ·  Python · XGBoost · SHAP · SQL


Problem

At a high-growth fintech platform, we were losing roughly 18% of activated users within the first 90 days. Retention campaigns existed but were untargeted — every churned user got the same re-engagement email, regardless of why they left. The result: low response rates, high unsubscribe rates, and wasted campaign budget.

The goal was to build a model that could identify users at risk of churning before they churned, score them by risk level, and enable the CRM team to run differentiated interventions.


Defining Churn

The first challenge was definitional. For a payments platform, "churn" is non-contractual — users don't cancel; they just stop transacting. We defined churn as: no transaction in 60 days for users who had at least one transaction in their first 30 days of activation. This gave us a clean binary label and excluded new users still in the activation funnel.
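That labelling rule can be sketched in a few lines of pandas. This is a minimal illustration, not the production job — the function name and column names (`user_id`, `txn_date`, `activation_date`) are assumptions:

```python
import pandas as pd

def label_churn(df_txns: pd.DataFrame, reference_date: pd.Timestamp) -> pd.DataFrame:
    """Churn = no transaction in the 60 days before reference_date,
    restricted to users with >= 1 transaction in their first 30 days
    after activation (everyone else is still in the funnel)."""
    # Users who transacted within 30 days of activation
    in_first_30d = df_txns['txn_date'] <= df_txns['activation_date'] + pd.Timedelta(days=30)
    activated_users = df_txns.loc[in_first_30d, 'user_id'].unique()

    # Most recent transaction per user
    last_txn = df_txns.groupby('user_id')['txn_date'].max()

    labels = last_txn.loc[last_txn.index.isin(activated_users)].to_frame('last_txn')
    labels['churned'] = (reference_date - labels['last_txn']).dt.days >= 60
    return labels[['churned']].reset_index()
```

Users whose first transaction falls outside their first 30 days never receive a label, which keeps the activation funnel out of the training set.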


Feature Engineering

We built a feature set around three themes — recency and frequency of activity, transaction value and quality, and breadth of product usage — all computed on a rolling 30-day lookback window per user:

import pandas as pd
from datetime import timedelta

def build_features(df_events, reference_date, lookback_days=30):
    """Aggregate per-user behavioural features over a rolling lookback window."""
    cutoff = reference_date - timedelta(days=lookback_days)
    window = df_events[df_events['event_date'] >= cutoff].copy()

    features = (
        window.groupby('user_id')
        .agg(
            txn_count       = ('txn_id',       'count'),    # transaction frequency
            active_days     = ('event_date',   'nunique'),  # distinct days with activity
            avg_txn_value   = ('amount',       'mean'),
            success_rate    = ('is_success',   'mean'),     # share of successful txns
            products_used   = ('product_code', 'nunique'),  # breadth of product usage
            days_since_last = ('event_date',   lambda x: (reference_date - x.max()).days),  # recency
        )
        .reset_index()
    )
    return features

Model

We trained a gradient-boosted classifier (XGBoost) with 5-fold stratified cross-validation. Hyperparameters were tuned with Optuna. Class imbalance (~80:20 non-churn:churn) was handled via scale_pos_weight. We prioritised recall in the top two deciles — we'd rather flag too many at-risk users than miss them.
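Because we optimised for recall in the top risk deciles rather than overall accuracy, the headline evaluation metric can be sketched as below. The function name is ours, and this is a simplified version of the decile analysis:

```python
import numpy as np

def top_decile_recall(y_true, y_score, n_deciles=2):
    """Fraction of all actual churners captured in the top-n risk
    deciles of the score distribution."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    k = int(len(y_score) * n_deciles / 10)    # users in the top-n deciles
    top_idx = np.argsort(y_score)[::-1][:k]   # highest-risk users first
    return y_true[top_idx].sum() / y_true.sum()

# Imbalance handling: with an ~80:20 split, scale_pos_weight is set to
# roughly n_negative / n_positive ≈ 4 in the XGBoost config.
```

A model can score well on AUC while burying churners in the middle deciles; this metric directly measures what the CRM team consumes — the users at the top of the weekly list.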

Final model metrics on holdout:


Interpretability with SHAP

SHAP (SHapley Additive exPlanations) was used to make the model actionable for the CRM team. Instead of a black-box score, each user got a risk score plus the top 3 drivers of their score. This told the CRM team why a user was at risk — enabling personalised messaging.

import shap

explainer   = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_holdout)   # one row of contributions per user

# Top-n features by absolute SHAP contribution for a single user's row
def top_shap_drivers(shap_row, feature_names, n=3):
    sorted_idx = abs(shap_row).argsort()[::-1][:n]
    return [(feature_names[i], round(float(shap_row[i]), 3)) for i in sorted_idx]

df_holdout = X_holdout.copy()   # scored holdout frame passed to the CRM team
df_holdout['shap_drivers'] = [
    top_shap_drivers(shap_values[i], list(X_holdout.columns))
    for i in range(len(X_holdout))
]

Deployment & Impact

The model was deployed as a weekly batch job. Every Monday, users were scored, and the top two risk deciles were passed to the CRM platform with their SHAP-driven personalisation tags, which fed three differentiated CRM interventions.
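The decile cut in the weekly batch can be sketched as follows. This is illustrative only — the function name is ours, and the production job also attaches the SHAP tags:

```python
import pandas as pd

def select_at_risk(scores: pd.Series, n_deciles: int = 2) -> pd.Index:
    """Return the user_ids (the Series index) falling in the top-n
    risk deciles of this week's scores."""
    # Rank first so ties don't collapse decile boundaries; bin 0 = lowest risk
    deciles = pd.qcut(scores.rank(method='first'), 10, labels=False)
    return scores[deciles >= 10 - n_deciles].index
```

Ranking before `qcut` keeps decile sizes stable even when many users share the same score, which matters for keeping the weekly CRM list a predictable size.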

After three months of running the model-driven campaigns against the old blanket approach, the targeted cohort showed a 22% reduction in 90-day churn, translating to approximately ₹8 crore in retained annual revenue.


Key Learnings