The Scenario

From https://www.kaggle.com/abhinav89/telecom-customer/version/1.

This data set contains 100 variables and approximately 100,000 records. The variables describe attributes of telecom customers and the factors considered important when dealing with them. The target variable is churn, which indicates whether or not the customer left. We can use this data set to predict which customers will or won't churn based on the available variables.

Import data

In [1]:
import pandas as pd
path = "churn.csv"
df = pd.read_csv(path, delimiter=',', header='infer')
df.head()
Out[1]:
rev_Mean mou_Mean totmrc_Mean da_Mean ovrmou_Mean ovrrev_Mean vceovr_Mean datovr_Mean roam_Mean change_mou ... forgntvl ethnic kid0_2 kid3_5 kid6_10 kid11_15 kid16_17 creditcd eqpdays Customer_ID
0 23.9975 219.25 22.500 0.2475 0.00 0.0 0.0 0.0 0.0 -157.25 ... 0.0 N U U U U U Y 361.0 1000001
1 57.4925 482.75 37.425 0.2475 22.75 9.1 9.1 0.0 0.0 532.25 ... 0.0 Z U U U U U Y 240.0 1000002
2 16.9900 10.25 16.990 0.0000 0.00 0.0 0.0 0.0 0.0 -4.25 ... 0.0 N U Y U U U Y 1504.0 1000003
3 38.0000 7.50 38.000 0.0000 0.00 0.0 0.0 0.0 0.0 -1.50 ... 0.0 U Y U U U U Y 1812.0 1000004
4 55.2300 570.50 71.980 0.0000 0.00 0.0 0.0 0.0 0.0 38.50 ... 0.0 I U U U U U Y 434.0 1000005

5 rows × 100 columns

Generate the X (features) and y (target) dataframes

Create a dataframe (x) containing all the features and another dataframe (y) containing the labels.

In [2]:
x = df[[
 'rev_Mean',
 'mou_Mean',
 'totmrc_Mean',
 'da_Mean',
 'ovrmou_Mean',
 'ovrrev_Mean',
 'vceovr_Mean',
 'datovr_Mean',
 'roam_Mean',
 'change_mou',
 'change_rev',
 'drop_vce_Mean',
 'drop_dat_Mean',
 'blck_vce_Mean',
 'blck_dat_Mean',
 'unan_vce_Mean',
 'unan_dat_Mean',
 'plcd_vce_Mean',
 'plcd_dat_Mean',
 'recv_vce_Mean',
 'recv_sms_Mean',
 'comp_vce_Mean',
 'comp_dat_Mean',
 'custcare_Mean',
 'ccrndmou_Mean',
 'cc_mou_Mean',
 'inonemin_Mean',
 'threeway_Mean',
 'mou_cvce_Mean',
 'mou_cdat_Mean',
 'mou_rvce_Mean',
 'owylis_vce_Mean',
 'mouowylisv_Mean',
 'iwylis_vce_Mean',
 'mouiwylisv_Mean',
 'peak_vce_Mean',
 'peak_dat_Mean',
 'mou_peav_Mean',
 'mou_pead_Mean',
 'opk_vce_Mean',
 'opk_dat_Mean',
 'mou_opkv_Mean',
 'mou_opkd_Mean',
 'drop_blk_Mean',
 'attempt_Mean',
 'complete_Mean',
 'callfwdv_Mean',
 'callwait_Mean',
 'months',
 'uniqsubs',
 'actvsubs',
 'new_cell',
 'crclscod',
 'asl_flag',
 'totcalls',
 'totmou',
 'totrev',
 'adjrev',
 'adjmou',
 'adjqty',
 'avgrev',
 'avgmou',
 'avgqty',
 'avg3mou',
 'avg3qty',
 'avg3rev',
 'avg6mou',
 'avg6qty',
 'avg6rev',
 'prizm_social_one',
 'area',
 'dualband',
 'refurb_new',
 'hnd_price',
 'phones',
 'models',
 'hnd_webcap',
 'truck',
 'rv',
 'ownrent',
 'lor',
 'dwlltype',
 'marital',
 'adults',
 'infobase',
 'income',
 'numbcars',
 'HHstatin',
 'dwllsize',
 'forgntvl',
 'ethnic',
 'kid0_2',
 'kid3_5',
 'kid6_10',
 'kid11_15',
 'kid16_17',
 'creditcd',
 'eqpdays',
 'Customer_ID'
       ]]


y =  df[['churn']]

#check columns in new df
list(x)
Out[2]:
['rev_Mean',
 'mou_Mean',
 'totmrc_Mean',
 'da_Mean',
 'ovrmou_Mean',
 'ovrrev_Mean',
 'vceovr_Mean',
 'datovr_Mean',
 'roam_Mean',
 'change_mou',
 'change_rev',
 'drop_vce_Mean',
 'drop_dat_Mean',
 'blck_vce_Mean',
 'blck_dat_Mean',
 'unan_vce_Mean',
 'unan_dat_Mean',
 'plcd_vce_Mean',
 'plcd_dat_Mean',
 'recv_vce_Mean',
 'recv_sms_Mean',
 'comp_vce_Mean',
 'comp_dat_Mean',
 'custcare_Mean',
 'ccrndmou_Mean',
 'cc_mou_Mean',
 'inonemin_Mean',
 'threeway_Mean',
 'mou_cvce_Mean',
 'mou_cdat_Mean',
 'mou_rvce_Mean',
 'owylis_vce_Mean',
 'mouowylisv_Mean',
 'iwylis_vce_Mean',
 'mouiwylisv_Mean',
 'peak_vce_Mean',
 'peak_dat_Mean',
 'mou_peav_Mean',
 'mou_pead_Mean',
 'opk_vce_Mean',
 'opk_dat_Mean',
 'mou_opkv_Mean',
 'mou_opkd_Mean',
 'drop_blk_Mean',
 'attempt_Mean',
 'complete_Mean',
 'callfwdv_Mean',
 'callwait_Mean',
 'months',
 'uniqsubs',
 'actvsubs',
 'new_cell',
 'crclscod',
 'asl_flag',
 'totcalls',
 'totmou',
 'totrev',
 'adjrev',
 'adjmou',
 'adjqty',
 'avgrev',
 'avgmou',
 'avgqty',
 'avg3mou',
 'avg3qty',
 'avg3rev',
 'avg6mou',
 'avg6qty',
 'avg6rev',
 'prizm_social_one',
 'area',
 'dualband',
 'refurb_new',
 'hnd_price',
 'phones',
 'models',
 'hnd_webcap',
 'truck',
 'rv',
 'ownrent',
 'lor',
 'dwlltype',
 'marital',
 'adults',
 'infobase',
 'income',
 'numbcars',
 'HHstatin',
 'dwllsize',
 'forgntvl',
 'ethnic',
 'kid0_2',
 'kid3_5',
 'kid6_10',
 'kid11_15',
 'kid16_17',
 'creditcd',
 'eqpdays',
 'Customer_ID']
In [3]:
#show unique values in the dataframe column
df.churn.unique()
Out[3]:
array([1, 0])

Standardize & encode data

When we’re getting our data ready for our machine learning models, it’s important to consider scaling and encoding.

Scaling is a method used to standardise the range of data. This matters because if one field stores age (between 18 and 90) and another stores salary (between 10,000 and 200,000), the machine learning algorithm might bias its results towards the larger numbers, assuming they're more important. The scikit-learn documentation puts it this way: "If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected."

Using scikit-learn's StandardScaler, we can convert each feature to have a mean of zero and a standard deviation of one, removing that potential bias from the model.

For some models this is an absolute requirement, as certain algorithms expect the data to be normally distributed and centred around zero.

Encoding is simple: machine learning algorithms can only accept numerical features. If an input variable takes the values Male and Female, we can encode them as 0 and 1 so that they can be used in the machine learning model.
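
To make the idea concrete, here is a minimal sketch of what the encoding and scaling steps in the next cell actually do. The two-column frame is made up for illustration and is not part of the churn data:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (made up for illustration): one numeric and one categorical column
toy = pd.DataFrame({'age': [18, 45, 90], 'gender': ['Male', 'Female', 'Male']})

# get_dummies one-hot encodes the categorical column into 0/1 indicator columns
encoded = pd.get_dummies(toy)   # columns: age, gender_Female, gender_Male

# StandardScaler rescales each column to mean ~0 and standard deviation 1
scaled = StandardScaler().fit_transform(encoded)
print(encoded.columns.tolist())
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))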

In [4]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np

#encoding with get_dummies
x = pd.get_dummies( x )

#fill in NA values with zeros
x = x.fillna(0)

#standardize the scale
x = StandardScaler().fit_transform(x)

#convert to numpy arrays (x is already an array after StandardScaler; y is still a dataframe)
x = np.array(x)
y = np.array(y)

Split data (75% training & 25% testing)

In [5]:
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(x, y, test_size = 0.25, random_state = 42)

Train the model (fit) on the training data

In [6]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

model.fit(train_features, train_labels.ravel())
Out[6]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Predict test data outcomes

In [7]:
predictions = model.predict(test_features)
In [8]:
predictions
Out[8]:
array([1, 1, 0, ..., 0, 1, 0])

Add actual + predicted outcome to a dataframe

In [9]:
dfx = pd.DataFrame(test_labels)
dfx.columns = ['actual']
In [10]:
dfx['predictions'] = pd.DataFrame(predictions)
In [11]:
dfx.head()
Out[11]:
actual predictions
0 1 1
1 0 1
2 0 0
3 1 1
4 0 1

Calculate accuracy

If the prediction was 1 and the outcome was 1, or the prediction was 0 and the outcome was 0, subtracting the two fields gives 0; any other combination gives +1 or -1. So the share of zero differences is the accuracy. The model.score() calls below compute the same thing directly: the mean accuracy of the predictions.
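
As a quick sketch, the subtraction idea can be written in a couple of lines against the dfx frame built above; its result should line up with model.score() on the test set:

# Manual accuracy check described above, reusing the dfx frame from the previous cells
diff = dfx['actual'] - dfx['predictions']
manual_accuracy = (diff == 0).mean()
print(manual_accuracy)   # should match model.score(test_features, test_labels)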

In [12]:
model.score(train_features, train_labels)
Out[12]:
0.6008933333333333
In [13]:
model.score(test_features, test_labels)
Out[13]:
0.5914

Cross Validation Scores

Cross validation reduces the impact of any one particular train/test split on the model score: the model is trained and scored on several different folds of the data and the scores are averaged.

In [14]:
from sklearn.model_selection import cross_val_score

# k-fold requires an unfitted model:
model = LogisticRegression()

# 3-fold cross validation (cv=3): the model is trained and scored 3 times, each time holding out a different fold
scores = cross_val_score(model, x, y.ravel(), cv=3)

# Calculate the mean and standard deviation of the generated scores
scores.mean(), scores.std()
Out[14]:
(0.5701597440510088, 0.00909443372373965)

Let's try a different model type: Random Forest

In [15]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 250, random_state = 42)

model.fit(train_features, train_labels.ravel())
Out[15]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
In [16]:
predictions = model.predict(test_features)
In [17]:
model.score(train_features, train_labels)
Out[17]:
1.0
In [18]:
model.score(test_features, test_labels)
Out[18]:
0.62668

Can we remove some features?

  • Reduces Overfitting
  • Improves Accuracy
  • Reduces Training Time
In [19]:
#store feature importances in a dictionary, keyed by column name
importance = model.feature_importances_

# note: x was one-hot encoded with get_dummies, so the importances are per
# encoded column; zipping against df.columns is only an approximate mapping
dictionary = dict(zip(df.columns, importance))
In [20]:
#create a pandas dataframe from the index
feature_matrix = pd.DataFrame(dictionary, index=[0])
featurex = feature_matrix.T
featurex.columns = ['meas']
In [21]:
#Check the importance score for every column in the DF
sorted_features = featurex.sort_values(by=['meas'], ascending=False)
with pd.option_context("display.max_rows", 10000): 
    print(sorted_features)
                      meas
models            0.029952
change_mou        0.022874
hnd_webcap        0.021463
churn             0.021417
mou_Mean          0.019031
asl_flag          0.018461
change_rev        0.018448
totrev            0.018167
adjmou            0.018122
crclscod          0.018071
adjrev            0.017667
rev_Mean          0.017441
totmou            0.017335
actvsubs          0.017319
new_cell          0.017006
totcalls          0.016946
adjqty            0.016756
mou_cvce_Mean     0.016072
avgrev            0.016064
mou_opkv_Mean     0.016015
avgqty            0.015931
mou_peav_Mean     0.015914
avg3mou           0.015472
totmrc_Mean       0.015172
mouowylisv_Mean   0.015126
mou_rvce_Mean     0.014695
peak_vce_Mean     0.014578
opk_vce_Mean      0.014417
unan_vce_Mean     0.014369
avg3qty           0.014221
avgmou            0.014092
recv_vce_Mean     0.013915
owylis_vce_Mean   0.013680
attempt_Mean      0.013679
complete_Mean     0.013666
plcd_vce_Mean     0.013644
comp_vce_Mean     0.013602
inonemin_Mean     0.013529
drop_blk_Mean     0.013310
drop_vce_Mean     0.012866
mouiwylisv_Mean   0.012755
ovrmou_Mean       0.011318
ovrrev_Mean       0.011274
iwylis_vce_Mean   0.011114
vceovr_Mean       0.010808
avg3rev           0.010745
blck_vce_Mean     0.010715
area              0.010093
cc_mou_Mean       0.008793
da_Mean           0.008615
refurb_new        0.008390
roam_Mean         0.008068
ccrndmou_Mean     0.007792
custcare_Mean     0.007285
callwait_Mean     0.007177
dualband          0.007101
months            0.005273
avg6mou           0.004856
hnd_price         0.004548
threeway_Mean     0.004492
avg6qty           0.003804
uniqsubs          0.003468
datovr_Mean       0.003021
mou_cdat_Mean     0.002822
plcd_dat_Mean     0.002527
adults            0.002468
rv                0.002361
mou_opkd_Mean     0.002193
comp_dat_Mean     0.002151
ownrent           0.002114
mou_pead_Mean     0.002102
avg6rev           0.001946
numbcars          0.001863
truck             0.001821
opk_dat_Mean      0.001729
lor               0.001727
peak_dat_Mean     0.001643
ethnic            0.001493
prizm_social_one  0.001406
phones            0.001245
infobase          0.000945
eqpdays           0.000872
unan_dat_Mean     0.000791
drop_dat_Mean     0.000604
HHstatin          0.000386
recv_sms_Mean     0.000330
blck_dat_Mean     0.000324
dwlltype          0.000321
callfwdv_Mean     0.000167
Customer_ID       0.000134
income            0.000121
kid16_17          0.000089
dwllsize          0.000083
kid6_10           0.000077
kid3_5            0.000071
forgntvl          0.000035
creditcd          0.000032
kid0_2            0.000006
kid11_15          0.000002
marital           0.000000
In [22]:
#create a new DF with only scores above a certain threshold
df_limited = df[[
    'models',
    'change_mou',
    'hnd_webcap',
    'mou_Mean',
    'change_rev',
    'asl_flag',
    'crclscod',
    'adjmou',
    'totrev',
    'adjrev',
    'rev_Mean',
    'actvsubs',
    'totmou',
    'new_cell',
    'totcalls',
    'adjqty',
    'mou_cvce_Mean',
    'avgrev',
    'avgqty',
    'mou_opkv_Mean',
    'mou_peav_Mean',
    'avg3mou',
    'mouowylisv_Mean',
    'totmrc_Mean',
    'mou_rvce_Mean',
    'peak_vce_Mean'
]]

Encode & standardize the limited feature set

In [23]:
#encoding with get_dummies
x2 = pd.get_dummies( df_limited )

#fill in NA values with zeros
x2 = x2.fillna(0)

#standardize the scale
x2 = StandardScaler().fit_transform(x2)

#convert dataframes to numpy arrays
x2 = np.array(x2)

Train the model with the new limited dataset

In [24]:
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(x2, y, test_size = 0.25, random_state = 42)
In [25]:
model = RandomForestClassifier(n_estimators = 250, random_state = 42)
model.fit(train_features, train_labels.ravel())
Out[25]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
In [26]:
predictions = model.predict(test_features)
In [27]:
model.score(train_features, train_labels)
Out[27]:
1.0
In [28]:
model.score(test_features, test_labels)
Out[28]:
0.60244

As you can see, you can reduce the feature set significantly while only losing a couple of points of test accuracy (0.627 with all features vs 0.602 with the reduced set)

Other methods we could (but didn't) use to reduce features

Recursive Feature Elimination (RFE) performs a search for the best performing feature subset. It iteratively fits the model, drops the weakest features, and ranks the ones that remain.

In [ ]:
from sklearn.feature_selection import RFE

#Recursive feature elimination - keep only the k most important features
k = 10
rfe = RFE(model, n_features_to_select=k)
fit = rfe.fit(train_features, train_labels.ravel())
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))

Hyperparameter tuning

n_estimators

n_estimators is the number of trees you want to include in your forest. As you remember, the trees vote and the majority wins. A higher number of trees generally gives you better model performance, but training becomes very slow. The key is to use as many trees as your processor can reasonably handle, as this makes your predictions more accurate and more stable when released into the wild.
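
A minimal sketch of how you might eyeball that trade-off, reusing the train/test split from the earlier cells (the n_estimators values here are arbitrary examples, not recommendations):

import time
from sklearn.ensemble import RandomForestClassifier

# Compare test accuracy and fit time as the forest grows
for n in [50, 100, 250]:
    start = time.time()
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(train_features, train_labels.ravel())
    print(n, round(rf.score(test_features, test_labels), 4),
          round(time.time() - start, 1), "seconds")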

max_features

This is the maximum number of features a random forest is allowed to consider when looking for the best split in an individual decision tree. Here are the options:

  • None – every split can consider all of the features (note that for RandomForestClassifier, 'auto' is actually equivalent to 'sqrt', not to all features)
  • Sqrt – takes the square root of the total number of features at each split. E.g. with 100 features, that would be 10 features per split
  • 0.2 – takes 20% of the features at each split

Increasing the maximum number of features usually (but not always) improves model accuracy, but it slows the model down.

min_samples_leaf

The leaf is the end node of a decision tree. A smaller leaf makes the tree more prone to capturing noise in the training data and overfitting. Keeping the minimum leaf size at around 50 observations should help remove some of the overfitting. A configuration combining max_features and min_samples_leaf is sketched below.
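
Putting the last two parameters together, a minimal sketch might look like this (the values are illustrative assumptions, not tuned settings):

from sklearn.ensemble import RandomForestClassifier

# Illustrative values only: consider sqrt(n_features) at each split
# and require at least 50 observations in every leaf
rf = RandomForestClassifier(n_estimators=250,
                            max_features='sqrt',
                            min_samples_leaf=50,
                            random_state=42)
rf.fit(train_features, train_labels.ravel())
print(rf.score(test_features, test_labels))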

bootstrap

The below came from https://www.quora.com/Why-does-random-forest-use-sampling-with-replacement-instead-of-without-replacement

“When random forest runs, it samples the data. If it samples the data, selects a datapoint, records its value and then places it back into the dataset (so it can be sampled again), it is known as 'with replacement'. This allows random variation between the trees.

In overly simplified terms, bootstrapping aggregates many possible forests built from sub-samples of your original data, to reduce overfitting; replacement is at the heart of building a bootstrapped sample.

The basic idea of bootstrapping is that you use your sample as a population. And from it you sample repeatedly, with replacement, to build other samples.

Replacement is an integral part of this process because you are trying to create other possible sample distributions.

Assuming that your sample is a random and reasonably accurate representation of the population, by repeatedly sampling with replacement (given that each observation in the original data is equally likely to be sampled each time an observation is requested) you are building multiple sub-samples of the same dimension as the original sample, with slightly different distributions of the variables.

You get an array of different means and standard deviations for each sub sample, each of which could have been an outcome of your original sample from the populations.

The idea is that when the subsets are aggregated (which takes us from the bootstrap part to the bagging part: Bootstrap Aggregation = Bagging), you obtain a more likely representation of the aggregate of the TRUE population than you would using the original sample by itself.

It reduces the likelihood of building a tree which fits the sampled data, but does not generalize well.”
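
To make "sampling with replacement" concrete, here is a tiny sketch on made-up numbers (nothing to do with the churn data):

import numpy as np

# Each bootstrap sample is the same size as the original data, drawn with
# replacement, so values can repeat and some get left out entirely
np.random.seed(0)
data = np.array([1, 2, 3, 4, 5])
for _ in range(3):
    sample = np.random.choice(data, size=data.size, replace=True)
    print(sample, sample.mean())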

In [ ]:
# GridSearchCV needs to be imported from model_selection
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}

# Create a base model
model = RandomForestClassifier()

# Instantiate the grid search model
best = GridSearchCV(estimator=model, param_grid=param_grid,
                    cv=32, n_jobs=-1, verbose=2)

best.fit(x, y.ravel())
best.score(x, y.ravel())

Grid search is super slow, let's use randomized search instead

GridSearchCV does an exhaustive search over every combination you define in the parameter grid. Random search doesn't: it randomly samples combinations looking for the best one, and we define how many it tries with n_iter.

In [30]:
from sklearn.model_selection import RandomizedSearchCV
In [31]:
# Create a base model
model = RandomForestClassifier()

# Instantiate the random search model: score n_iter random combinations
# from the parameter grid below, each with 5-fold cross validation
best = RandomizedSearchCV(model, {
    'bootstrap': [True, False],
    'n_estimators': [500, 1000, 1500, 2000],
    'max_depth': [110, 400],
    'min_samples_leaf': [10, 20, 50]
}, cv=5, return_train_score=True, iid=True, n_iter=3)

best.fit(train_features, train_labels.ravel())
print(best.best_score_)
print(best.best_estimator_)
print(best.best_params_)      
0.6053733333333333
RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=400, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
{'n_estimators': 1000, 'min_samples_leaf': 10, 'max_depth': 400, 'bootstrap': False}