Published on: May 3, 2025
Churn prediction is one of the most common ML use cases in industry, and one of the most frequently botched. The model is usually the easy part. The hard parts are defining churn correctly, engineering features that actually carry predictive signal, and closing the loop between a score and a business outcome. This guide covers all three.
There are two types of churn context, and they need different approaches. In contractual settings (subscriptions, SaaS), churn is an explicit event: the user cancels or fails to renew, so the label is observable. In non-contractual settings (retail, marketplaces), nobody ever cancels — churn has to be inferred from inactivity.
For non-contractual contexts, your churn definition should be grounded in data, not intuition. Plot the distribution of inter-purchase gaps. Find the point at which the probability of a user ever transacting again drops sharply — that's your churn threshold. Typically 45–90 days depending on the product's natural usage frequency.
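One way to ground that threshold in data is to look at the quantiles of the inter-purchase gap distribution. A minimal sketch, assuming an events frame with the same `user_id` and `event_date` columns used later in this guide (the function name and quantile choices are illustrative):

```python
import pandas as pd

def interpurchase_gap_quantiles(df_events):
    """Quantiles of the days-between-transactions distribution, across all users."""
    gaps = (
        df_events.sort_values('event_date')
        .groupby('user_id')['event_date']
        .diff()               # time since each user's previous event
        .dt.days
        .dropna()             # first event per user has no prior gap
    )
    # The 90th-95th percentile gap is a defensible starting threshold:
    # only 5-10% of users who did come back took longer than that.
    return gaps.quantile([0.5, 0.9, 0.95])
```

Eyeball these against the 45–90 day range above; a product with weekly natural frequency should land near the low end.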
The biggest modelling mistake in churn prediction is feature leakage — accidentally using data from after the label date. Features must be computed as of a fixed observation date, with the label window coming after that date.
import pandas as pd
from datetime import timedelta

def build_churn_dataset(df_events, observation_date, label_window_days=60):
    """
    observation_date: the cutoff point. Features use data before this.
    label_window_days: if no activity in this window after obs_date → churned.
    """
    label_cutoff = observation_date + timedelta(days=label_window_days)

    # FEATURES: data strictly before observation_date
    features = (
        df_events[df_events['event_date'] < observation_date]
        .groupby('user_id')
        .agg(
            recency=('event_date', lambda x: (observation_date - x.max()).days),
            frequency=('txn_id', 'count'),
            avg_value=('amount', 'mean'),
            active_weeks=('week', 'nunique'),
        )
    )

    # LABELS: any activity in the label window?
    label_users = set(
        df_events[
            (df_events['event_date'] >= observation_date)
            & (df_events['event_date'] < label_cutoff)
        ]['user_id']
    )
    features['churned'] = (~features.index.isin(label_users)).astype(int)
    return features.reset_index()
Raw events rarely have much predictive power on their own. The signal lives in ratios, trends, and velocity measures: spend this month relative to the trailing three-month average, the slope of weekly activity counts, the gap between recent sessions compared with the user's own historical norm.
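A sketch of what such derived features can look like, building on the per-user aggregates from `build_churn_dataset`. The recent-window columns (`txns_last_30d`, `txns_prior_30d`) are illustrative names, not columns the earlier code produces:

```python
import pandas as pd

def add_trend_features(features):
    """Derive ratio/velocity features from per-user activity counts."""
    out = features.copy()
    # Ratio: recent activity vs. the preceding period.
    # > 1 means accelerating usage; < 1 means the user is slowing down.
    out['txn_velocity'] = out['txns_last_30d'] / out['txns_prior_30d'].clip(lower=1)
    # Share of lifetime transactions that happened recently.
    out['recent_share'] = out['txns_last_30d'] / out['frequency'].clip(lower=1)
    return out
```

The `.clip(lower=1)` guards against division by zero for users with no activity in the denominator window.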
Start simple and escalate complexity only when the simpler model fails: a class-weighted logistic regression first, gradient-boosted trees if it underperforms, and anything deeper only with clear evidence that the extra complexity pays for itself.
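As a concrete starting point, here is what that baseline might look like with scikit-learn (a sketch, assuming the feature matrix from `build_churn_dataset`; the pipeline choices are one reasonable default, not the only option):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def baseline_model():
    """Scaled, class-weighted logistic regression for imbalanced churn labels."""
    return make_pipeline(
        StandardScaler(),
        # class_weight='balanced' reweights the minority churn class
        # instead of requiring resampling.
        LogisticRegression(class_weight='balanced', max_iter=1000),
    )
```

Validate it by training on one observation date and scoring on a later one, which also doubles as a leakage check.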
Churn datasets are almost always imbalanced (5–20% churn rate), which makes accuracy useless — at a 15% churn rate, a model that predicts "no churn" for everyone scores 85% accuracy and is completely worthless. Use precision-recall AUC, precision and recall among the top-k highest-scored users, and lift over a random-targeting baseline instead.
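These metrics can be computed in a few lines with scikit-learn. A sketch, assuming arrays of true labels and model scores (the function name and the default k are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate_churn_scores(y_true, scores, k=0.1):
    """PR-AUC, ROC-AUC, and precision among the top-k fraction of scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_top = max(1, int(len(scores) * k))
    top_idx = np.argsort(scores)[::-1][:n_top]   # highest-scored users
    return {
        'pr_auc': average_precision_score(y_true, scores),
        'roc_auc': roc_auc_score(y_true, scores),
        # Of the users the model is most worried about, how many churn?
        'precision_at_k': float(y_true[top_idx].mean()),
    }
```

Precision-at-k is often the metric stakeholders care about most, because k maps directly to how many users the retention team can actually contact.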
A churn score without an intervention plan is analytics theatre. The deployment loop should be: score users on a schedule, route risk tiers to concrete interventions (outreach, offers, product nudges), hold out a random slice that receives no intervention, and measure retention lift against that holdout.
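That routing step can be sketched in a few lines. Everything here — the tier cutoffs, the action names, and the holdout mechanism — is an illustrative assumption, not a prescription:

```python
import random

def route_intervention(user_id, score, holdout_rate=0.1):
    """Map a churn score in [0, 1] to an action, reserving a measurement holdout."""
    # Seed per-user so a user's holdout assignment is stable across runs.
    rng = random.Random(user_id)
    if rng.random() < holdout_rate:
        return 'holdout'              # no action; baseline for lift measurement
    if score >= 0.8:
        return 'account_manager_call'
    if score >= 0.5:
        return 'discount_offer'
    if score >= 0.3:
        return 'email_nudge'
    return 'no_action'
```

The holdout tier is the part teams most often skip, and without it you cannot tell whether the interventions caused any retention lift at all.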
The churn prediction models that fail in production almost always fail for non-technical reasons: vague label definitions, leaky features, or no connection between the model output and a concrete action. Get those three things right and even a logistic regression will drive measurable retention lift. The model sophistication is almost secondary.