Published on: May 3, 2025
Churn prediction is one of the most common ML use cases in industry, and one of the most frequently botched. The model is usually the easy part. The hard parts are defining churn correctly, engineering features that actually carry predictive signal, and closing the loop between a score and a business outcome. This guide covers all three.
There are two types of churn context, and they need different approaches. In contractual settings (subscriptions, SaaS), churn is an explicit event: the user cancels or fails to renew, so the label is observable. In non-contractual settings (retail, marketplaces), nobody ever cancels — churn has to be inferred from inactivity.
For non-contractual contexts, your churn definition should be grounded in data, not intuition. Plot the distribution of inter-purchase gaps. Find the point at which the probability of a user ever transacting again drops sharply — that's your churn threshold. Typically 45–90 days depending on the product's natural usage frequency.
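One way to ground that threshold in data is to look at the quantiles of the inter-purchase gap distribution. A minimal sketch, assuming an events frame with the same `user_id` and `event_date` columns used later in this guide (the function name and quantile choices are illustrative):

```python
import pandas as pd

def interpurchase_gap_quantiles(df_events):
    """Quantiles of the days-between-transactions distribution, across all users."""
    gaps = (
        df_events.sort_values('event_date')
        .groupby('user_id')['event_date']
        .diff()               # time since each user's previous event
        .dt.days
        .dropna()             # first event per user has no prior gap
    )
    # The 90th-95th percentile gap is a defensible starting threshold:
    # only 5-10% of users who did come back took longer than that.
    return gaps.quantile([0.5, 0.9, 0.95])
```

Eyeball these against the 45–90 day range above; a product with weekly natural frequency should land near the low end.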
The biggest modelling mistake in churn prediction is feature leakage — accidentally using data from after the label date. Features must be computed as of a fixed observation date, with the label window coming after that date.
import pandas as pd
from datetime import timedelta

def build_churn_dataset(df_events, observation_date, label_window_days=60):
    """
    observation_date: the cutoff point. Features use data before this.
    label_window_days: if no activity in this window after obs_date → churned.
    """
    label_cutoff = observation_date + timedelta(days=label_window_days)

    # FEATURES: data strictly before observation_date
    features = (
        df_events[df_events['event_date'] < observation_date]
        .groupby('user_id')
        .agg(
            recency=('event_date', lambda x: (observation_date - x.max()).days),
            frequency=('txn_id', 'count'),
            avg_value=('amount', 'mean'),
            active_weeks=('week', 'nunique'),
        )
    )

    # LABELS: any activity in the label window?
    label_users = set(
        df_events[
            (df_events['event_date'] >= observation_date)
            & (df_events['event_date'] < label_cutoff)
        ]['user_id']
    )
    features['churned'] = (~features.index.isin(label_users)).astype(int)
    return features.reset_index()
Raw events rarely have much predictive power on their own. The signal lives in ratios, trends, and velocity measures: spend this month relative to the trailing three-month average, the slope of weekly activity counts, the gap between recent sessions compared with the user's own historical norm.
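A sketch of what such derived features can look like, building on the per-user aggregates from `build_churn_dataset`. The recent-window columns (`txns_last_30d`, `txns_prior_30d`) are illustrative names, not columns the earlier code produces:

```python
import pandas as pd

def add_trend_features(features):
    """Derive ratio/velocity features from per-user activity counts."""
    out = features.copy()
    # Ratio: recent activity vs. the preceding period.
    # > 1 means accelerating usage; < 1 means the user is slowing down.
    out['txn_velocity'] = out['txns_last_30d'] / out['txns_prior_30d'].clip(lower=1)
    # Share of lifetime transactions that happened recently.
    out['recent_share'] = out['txns_last_30d'] / out['frequency'].clip(lower=1)
    return out
```

The `.clip(lower=1)` guards against division by zero for users with no activity in the denominator window.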
Start simple and escalate complexity only when the simpler model fails: a class-weighted logistic regression first, gradient-boosted trees if it underperforms, and anything deeper only with clear evidence that the extra complexity pays for itself.
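As a concrete starting point, here is what that baseline might look like with scikit-learn (a sketch, assuming the feature matrix from `build_churn_dataset`; the pipeline choices are one reasonable default, not the only option):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def baseline_model():
    """Scaled, class-weighted logistic regression for imbalanced churn labels."""
    return make_pipeline(
        StandardScaler(),
        # class_weight='balanced' reweights the minority churn class
        # instead of requiring resampling.
        LogisticRegression(class_weight='balanced', max_iter=1000),
    )
```

Validate it by training on one observation date and scoring on a later one, which also doubles as a leakage check.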
Churn datasets are almost always imbalanced (5–20% churn rate), which makes accuracy useless — at a 15% churn rate, a model that predicts "no churn" for everyone scores 85% accuracy and is completely worthless. Use precision-recall AUC, precision and recall among the top-k highest-scored users, and lift over a random-targeting baseline instead.
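These metrics can be computed in a few lines with scikit-learn. A sketch, assuming arrays of true labels and model scores (the function name and the default k are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate_churn_scores(y_true, scores, k=0.1):
    """PR-AUC, ROC-AUC, and precision among the top-k fraction of scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_top = max(1, int(len(scores) * k))
    top_idx = np.argsort(scores)[::-1][:n_top]   # highest-scored users
    return {
        'pr_auc': average_precision_score(y_true, scores),
        'roc_auc': roc_auc_score(y_true, scores),
        # Of the users the model is most worried about, how many churn?
        'precision_at_k': float(y_true[top_idx].mean()),
    }
```

Precision-at-k is often the metric stakeholders care about most, because k maps directly to how many users the retention team can actually contact.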
A churn score without an intervention plan is analytics theatre. The deployment loop should be: score users on a schedule, route risk tiers to concrete interventions (outreach, offers, product nudges), hold out a random slice that receives no intervention, and measure retention lift against that holdout.
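That routing step can be sketched in a few lines. Everything here — the tier cutoffs, the action names, and the holdout mechanism — is an illustrative assumption, not a prescription:

```python
import random

def route_intervention(user_id, score, holdout_rate=0.1):
    """Map a churn score in [0, 1] to an action, reserving a measurement holdout."""
    # Seed per-user so a user's holdout assignment is stable across runs.
    rng = random.Random(user_id)
    if rng.random() < holdout_rate:
        return 'holdout'              # no action; baseline for lift measurement
    if score >= 0.8:
        return 'account_manager_call'
    if score >= 0.5:
        return 'discount_offer'
    if score >= 0.3:
        return 'email_nudge'
    return 'no_action'
```

The holdout tier is the part teams most often skip, and without it you cannot tell whether the interventions caused any retention lift at all.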
The churn prediction models that fail in production almost always fail for non-technical reasons: vague label definitions, leaky features, or no connection between the model output and a concrete action. Get those three things right and even a logistic regression will drive measurable retention lift. The model sophistication is almost secondary.