A ride-sharing company (Company X) is interested in predicting rider retention. Using rider activity data, we developed a model that identifies which factors best predict retention. We also offer suggestions for putting these insights into practice at Company X.
We have a mix of rider demographics, rider behavior, ride characteristics, and rider/driver ratings of each other. The data span a seven-month period.
Variable | Description |
---|---|
city | City this user signed up in |
phone | Primary device for this user |
signup_date | Date of account registration |
last_trip_date | Last time user completed a trip |
avg_dist | Average distance (in miles) per trip taken in first 30 days after signup |
avg_rating_by_driver | Rider’s average rating from their drivers over all trips |
avg_rating_of_driver | Rider’s average rating of their drivers over all trips |
surge_pct | Percent of trips taken with surge multiplier > 1 |
avg_surge | Average surge multiplier over all of user’s trips |
trips_in_first_30_days | Number of trips user took in first 30 days after signing up |
luxury_car_user | TRUE if user took luxury car in first 30 days |
weekday_pct | Percent of user’s trips occurring during a weekday |
We converted the date columns into datetime objects to calculate the churn outcome variable. Users were identified as having churned if they had not used the ride-share service in the past thirty days:
```python
from datetime import datetime, timedelta
import numpy as np

def convert_dates(df):
    # Parse the date strings into datetime objects
    df['last_trip_date'] = df['last_trip_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
    df['signup_date'] = df['signup_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
    # A user is active if their last trip falls within 30 days of the reference date
    current_date = datetime.strptime('2014-07-01', '%Y-%m-%d')
    active_date = current_date - timedelta(days=30)
    # 1 = churned, 0 = active
    y = np.array([0 if last_trip_date > active_date else 1 for last_trip_date in df['last_trip_date']])
    return y
```
Categorical variables where classes were represented with strings were encoded as numerical classes:
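```python
from sklearn import preprocessing

def label_encode(df, encode_list):
    # Add an integer-coded '<col>_enc' column for each string-valued column
    le = preprocessing.LabelEncoder()
    for col in encode_list:
        # The encoder is refit on each column in turn
        df[col + '_enc'] = le.fit_transform(df[col])
    return df
```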
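To make the encoding concrete, here is a minimal sketch on a tiny made-up frame. The `phone` values below are hypothetical placeholders, not taken from the Company X data. `LabelEncoder` assigns integer codes in sorted order of the class labels, and the mapping can be recovered from `classes_`:

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical values for illustration only
df = pd.DataFrame({'phone': ['iPhone', 'Android', 'iPhone', 'Android']})

le = preprocessing.LabelEncoder()
df['phone_enc'] = le.fit_transform(df['phone'])

# Recover which integer stands for which original label
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)
```

Note that the integer codes carry no meaningful order. Tree-based models can still split on them, but linear models would usually need one-hot encoding instead.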
We discovered that some of the predictor variables (e.g., average distance, number of trips in first 30 days) were strongly positively skewed. These variables also included zero values, so a plain log transform (which is undefined at zero) could not be used to correct the skew.
Skewed data were normalized using an inverse hyperbolic sine transformation:
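```python
import numpy as np

def normalize_inv_hyperbol_sine(df, x):
    # df is passed in explicitly rather than read from a global; arcsinh is defined at zero
    x_arr = np.array(df[x])
    df[x + '_normalized'] = np.arcsinh(x_arr)
```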
This worked well to normalize the data.
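The behavior of the transform can be checked with a short sketch on synthetic data. The `trips` array and the `sample_skewness` helper below are made up for illustration and are not part of our pipeline. The inverse hyperbolic sine is defined as arcsinh(x) = ln(x + sqrt(x^2 + 1)), so it maps zero to zero and behaves like ln(2x) for large x:

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment (population form) of a 1-D array."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Hypothetical right-skewed counts with many zeros (not Company X data)
rng = np.random.default_rng(0)
trips = rng.exponential(scale=3.0, size=10_000).round()
trips_asinh = np.arcsinh(trips)

print("arcsinh(0) =", np.arcsinh(0.0))
print("skew before:", sample_skewness(trips))
print("skew after: ", sample_skewness(trips_asinh))
```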
While examining distributions of the variables, we noticed that the percent of users' trips occurring during a weekday had an interesting distribution, with distinct spikes at 0% and 100% and a roughly Gaussian shape in between.
We decided to create dummy variables to split this variable apart:
- All rides on weekdays
- All rides on weekends
- Mix of weekdays and weekends
```python
def categorize_weekday_pct(df):
    # Mutually exclusive indicator columns for the three weekday_pct groups
    df['all_weekday'] = (df.weekday_pct == 100).astype('int')
    df['all_weekend'] = (df.weekday_pct == 0).astype('int')
    df['mix_weekday_weekend'] = ((df.weekday_pct < 100) & (df.weekday_pct > 0)).astype('int')
    return df
```
Random Forest is a great place to start with a classification problem like this. It's fast, easy to use, and pretty accurate right out of the box. Our Random Forest Classifier produced an F1 Score of 77% on unseen data.
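A baseline of this kind can be sketched in a few lines. This is a minimal example on synthetic data from `make_classification`, not our actual feature matrix, and the hyperparameters shown are illustrative defaults rather than the ones we used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rider features and churn labels
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Score on the held-out split only, so the F1 reflects unseen data
rf_f1 = f1_score(y_test, rf.predict(X_test))
print(f"Held-out F1: {rf_f1:.2f}")
```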
To improve our model fit, we next tried some boosted classification models. While boosted models require more tuning (and therefore take a bit longer to get working than Random Forest), they are usually more accurate than Random Forest.
- Gradient boost
  - Using scikit-learn's `GridSearchCV`, we first performed a grid search to determine the best model parameters for a `GradientBoostingClassifier`. The resultant classifier performed well, with an F1 Score of 83% on unseen data.
- XGBoost
  - Coming soon!
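The grid search for the gradient boosting model can be sketched as follows. The parameter grid, fold count, and synthetic data here are assumptions chosen to keep the example fast; they are not the values from our actual search:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the rider features and churn labels
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grid; a real search would likely cover more values
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1",  # select parameters by cross-validated F1
    cv=3,
)
search.fit(X_train, y_train)

# GridSearchCV refits the best model on all training data by default
gb_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, f"held-out F1: {gb_f1:.2f}")
```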
- Use the best fitting model (above) to obtain predicted probabilities for individuals. Target those with greater than some probability of churning (choose this cutoff by considering profit curve based on confusion matrix).
- Offer discounts or free rides to at-risk users to try and retain them - no need to target users below a certain probability threshold.
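One way to pick the cutoff from a profit curve is sketched below. The `best_threshold` helper, the payoff values, and the toy labels and probabilities are all hypothetical; real payoffs would come from Company X's own estimates of customer value and discount cost:

```python
import numpy as np

def best_threshold(y_true, churn_prob, payoff, thresholds):
    """Return (threshold, expected profit per user) that maximizes profit.

    payoff maps each confusion-matrix cell ('tp', 'fp', 'fn', 'tn')
    to an assumed per-user payoff.
    """
    y_true = np.asarray(y_true)
    churn_prob = np.asarray(churn_prob)
    profits = []
    for t in thresholds:
        targeted = churn_prob >= t  # users who would receive the offer
        tp = np.sum(targeted & (y_true == 1))
        fp = np.sum(targeted & (y_true == 0))
        fn = np.sum(~targeted & (y_true == 1))
        tn = np.sum(~targeted & (y_true == 0))
        total = (tp * payoff["tp"] + fp * payoff["fp"]
                 + fn * payoff["fn"] + tn * payoff["tn"])
        profits.append(total / len(y_true))
    best = int(np.argmax(profits))
    return thresholds[best], float(profits[best])

# Toy example: 1 = churned, with made-up predicted churn probabilities
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
churn_prob = [0.92, 0.81, 0.45, 0.73, 0.35, 0.22, 0.08, 0.64]
# Assumed payoffs: a retained churner is worth 10, a wasted discount costs 2
payoff = {"tp": 10, "fp": -2, "fn": 0, "tn": 0}
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

cutoff, profit = best_threshold(y_true, churn_prob, payoff, thresholds)
print(cutoff, profit)
```

Users with a predicted churn probability below the chosen cutoff would simply not receive an offer.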
Classifiers like random forest and boosted trees are quite robust to skewed and non-normally distributed data. We probably did not need to spend time transforming our data or creating dummy variables for percent of weekday rides.
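A small sketch on synthetic data illustrates why. A monotonic transform such as arcsinh preserves the ordering of values, so a decision tree can find the same partition of the training rows with or without it. The `trips` and `labels` arrays below are made up for illustration and are not Company X data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical skewed, integer-valued predictor with zeros, plus a noisy label
trips = (rng.geometric(p=0.3, size=500) - 1).astype(float)
labels = (trips + rng.normal(scale=2.0, size=500) > 2).astype(int)

X_raw = trips.reshape(-1, 1)
X_asinh = np.arcsinh(trips).reshape(-1, 1)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, labels)
tree_asinh = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_asinh, labels)

# Fraction of training rows on which the two trees give the same prediction
agreement = np.mean(tree_raw.predict(X_raw) == tree_asinh.predict(X_asinh))
print(agreement)
```

The same reasoning carries over to random forests and boosted trees, which are built from such splits.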
Our team included Micah Shanks (github.com/Jomonsugi), Stuart King (github.com/Stuart-D-King), Jennifer Waller (github.com/jw15), and Ian