Bank Churn

R Data Engineering



Name: Leigh Davis

Objective

In every business, there will always be customers coming in and going out. Customer churn ratio is the measure of the number of individuals or items moving out of a collective group over a specific period. Losing customers is expensive for businesses and understanding why the business loses customers can assist in taking corrective action. I will be observing factors to deduce why customers are being lost using trend analysis and linear regression to show relationships among variables. I will also be using a machine learning algorithm to predict whether a customer will churn or not. The data used in this case were taken from Kaggle, Churn Modeling (see references).

Importing Library

# Import library
library(dplyr) # This was the subject of my first project in STAT 484; provides set of verbs that help solve common data manipulation challenges
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret) # Contains functions to streamline the model training process for complex regression and classification problems
## Loading required package: ggplot2
## Loading required package: lattice
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

Data Preparation

# Import Data
df <- read.csv(file = "~/Documents/PSU/R/Data/Churn_Modelling.csv", row.names = 1, stringsAsFactors = T)

# Take out irrelevant columns : CustomerId, Surname
churn <- df %>% 
  dplyr::select(-c(CustomerId, Surname)) %>% 
  mutate(HasCrCard = as.factor(HasCrCard),
         IsActiveMember = as.factor(IsActiveMember),
         Exited = as.factor(Exited))

# Checking for NA values
sum(is.na(churn))
## [1] 0
# Description
summary(churn)
##   CreditScore      Geography       Gender          Age            Tenure      
##  Min.   :350.0   France :5014   Female:4543   Min.   :18.00   Min.   : 0.000  
##  1st Qu.:584.0   Germany:2509   Male  :5457   1st Qu.:32.00   1st Qu.: 3.000  
##  Median :652.0   Spain  :2477                 Median :37.00   Median : 5.000  
##  Mean   :650.5                                Mean   :38.92   Mean   : 5.013  
##  3rd Qu.:718.0                                3rd Qu.:44.00   3rd Qu.: 7.000  
##  Max.   :850.0                                Max.   :92.00   Max.   :10.000  
##     Balance       NumOfProducts  HasCrCard IsActiveMember EstimatedSalary    
##  Min.   :     0   Min.   :1.00   0:2945    0:4849         Min.   :    11.58  
##  1st Qu.:     0   1st Qu.:1.00   1:7055    1:5151         1st Qu.: 51002.11  
##  Median : 97199   Median :1.00                            Median :100193.91  
##  Mean   : 76486   Mean   :1.53                            Mean   :100090.24  
##  3rd Qu.:127644   3rd Qu.:2.00                            3rd Qu.:149388.25  
##  Max.   :250898   Max.   :4.00                            Max.   :199992.48  
##  Exited  
##  0:7963  
##  1:2037  
##          
##          
##          
## 
head(churn)
##   CreditScore Geography Gender Age Tenure   Balance NumOfProducts HasCrCard
## 1         619    France Female  42      2      0.00             1         1
## 2         608     Spain Female  41      1  83807.86             1         0
## 3         502    France Female  42      8 159660.80             3         1
## 4         699    France Female  39      1      0.00             2         0
## 5         850     Spain Female  43      2 125510.82             1         1
## 6         645     Spain   Male  44      8 113755.78             2         1
##   IsActiveMember EstimatedSalary Exited
## 1              1       101348.88      1
## 2              1       112542.58      0
## 3              0       113931.57      1
## 4              0        93826.63      0
## 5              1        79084.10      0
## 6              0       149756.71      1

Upon preparation and first inspection of the data, we notice that there are no NA values so we may continue on to the data analysis.

Data Analysis

# I could use ggplot to graph with shorter code but I wanted to exemplify the functions in this course
op <- par(mfrow = c(1, 3))
boxplot(churn$Balance, churn$EstimatedSalary, main="Balance and Est. Salary", names=c("Balance","Est. Salary"), col = c("#fcba0377", "#0390fc77"))
abline(v = c(1.5), lty = 1)
boxplot(churn$Age, main=c("Age"), xlab=c("Age"), col = "#0000cc77")
boxplot(churn$NumOfProducts, main=c("Num. of Products"), xlab=c("Num. of Products"), col = "#fce80377")

boxplot(churn$CreditScore, main=c("Credit Score"), xlab=c("Credit Score"), col = "#00cccc77")
boxplot(churn$Tenure, main=c("Tenure"), xlab=c("Tenure"), col = "#cc00cc77")

We are able to visualize the numerical data in the boxplots above. We will now calculate the rate percentage of customers that are churning within the dataset.

# Note rows in data set
nrow(churn)
## [1] 10000
# Churn rate proportion in total
(churn.rate <- churn %>% 
  group_by(Exited) %>% 
  count(Exited) %>% 
  mutate(Percent = n/nrow(churn) * 100))
## # A tibble: 2 × 3
## # Groups:   Exited [2]
##   Exited     n Percent
##   <fct>  <int>   <dbl>
## 1 0       7963    79.6
## 2 1       2037    20.4

From the 10000 rows of data, we calculated that 79.63% are customers are not churn, and 20.37% are customers that churn. As a bonus of information, below is the churn rate based on geography.

# Churn rate proportion by Geography
(churn.geo <- churn %>% 
  group_by(Geography) %>% 
  count(Exited) %>% 
  mutate(Percent = n/nrow(churn) * 100))
## # A tibble: 6 × 4
## # Groups:   Geography [3]
##   Geography Exited     n Percent
##   <fct>     <fct>  <int>   <dbl>
## 1 France    0       4204   42.0 
## 2 France    1        810    8.1 
## 3 Germany   0       1695   17.0 
## 4 Germany   1        814    8.14
## 5 Spain     0       2064   20.6 
## 6 Spain     1        413    4.13
pie(churn.geo$Percent, labels= paste(churn.geo$Percent,"%"), col = c("#cfbaf077", "#cfbaf0", "#ffcfd277", "#ffcfd2", "#fbf8cc77",  "#fbf8cc"), radius = 0.6, main = "Churn by Geography") 
legend("topleft", c(rep(paste(unique(churn.geo$Geography),"not churn"),length.out=3), rep(paste(unique(churn.geo$Geography),"churn"),length.out=3)), cex = 0.8, fill = c("#cfbaf077", "#ffcfd277", "#fbf8cc77", "#cfbaf0", "#ffcfd2", "#fbf8cc"), ncol = 2)

Train-Test Split

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model. It is critical to partition the data into training and testing sets when using learning algorithms such as Linear Regression, Random Forest, Naïve Bayes classification, Logistic Regression, and Decision Trees etc. Splitting helps to avoid over-fitting and to improve the training data set accuracy. The following code splits 80% of the data selected randomly into training set and the remaining 20% sample into test data set.

# Set seed for reproducibility
set.seed(123) 

# Split the churn data into training and test sets
split <- sample(nrow(churn), nrow(churn)*0.8)
train_churn <- churn [split, ]
test_churn <- churn [-split, ]

# Checking if distribution of partition data is correct for train set
table(train_churn$Exited)
## 
##    0    1 
## 6366 1634
prop.table(table(train_churn$Exited))
## 
##       0       1 
## 0.79575 0.20425
# Checking if distribution of partition data is correct for test set
table(test_churn$Exited)
## 
##    0    1 
## 1597  403
prop.table(table(test_churn$Exited))
## 
##      0      1 
## 0.7985 0.2015

Creating the Model

We will be creating a model that can perform well on unknown data. Logistic Regression has the advantage of an easy-to-interpret algorithm similar to linear Regression. The method used is stepwise Regression in the backward direction. We utilize the train data for training and predicting the algorithm. To test the trained model’s performance, will use the testing data set.

To train a classification model, there is mainly three steps: 1. Splitting Data into Training and Testing Set 2. Model Training/ Tuning 3. Model Testing

The Exited variable will be used as the target variable to predict whether a bank customer will churn or not.

# Logistic Regression
mlr <- glm(Exited ~., data = train_churn, family = "binomial")
mlrback <- step(object = mlr, direction = "backward", trace = F)
predlr <- predict(mlrback, test_churn, type = "response")

# Classifying Class and Predicting
pred_exited <- ifelse(predlr > 0.5,"1","0")
pred_exited <- as.factor(pred_exited)

# Model Summary
summary(mlrback)
## 
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age + 
##     Tenure + Balance + NumOfProducts + IsActiveMember, family = "binomial", 
##     data = train_churn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3534  -0.6577  -0.4534  -0.2652   3.0006  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.374e+00  2.651e-01 -12.727  < 2e-16 ***
## CreditScore      -6.941e-04  3.138e-04  -2.212   0.0269 *  
## GeographyGermany  7.877e-01  7.582e-02  10.388  < 2e-16 ***
## GeographySpain    3.887e-02  7.881e-02   0.493   0.6219    
## GenderMale       -5.479e-01  6.097e-02  -8.986  < 2e-16 ***
## Age               7.398e-02  2.905e-03  25.469  < 2e-16 ***
## Tenure           -1.970e-02  1.048e-02  -1.879   0.0602 .  
## Balance           2.609e-06  5.735e-07   4.550 5.37e-06 ***
## NumOfProducts    -1.024e-01  5.288e-02  -1.937   0.0527 .  
## IsActiveMember1  -1.090e+00  6.468e-02 -16.854  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8099.8  on 7999  degrees of freedom
## Residual deviance: 6835.6  on 7990  degrees of freedom
## AIC: 6855.6
## 
## Number of Fisher Scoring iterations: 5
# Print confidence intervals
exp(confint(mlrback))
## Waiting for profiling to be done...
##                       2.5 %     97.5 %
## (Intercept)      0.02034036 0.05750259
## CreditScore      0.99869142 0.99992065
## GeographyGermany 1.89512220 2.55116670
## GeographySpain   0.89018089 1.21251321
## GenderMale       0.51293370 0.65144412
## Age              1.07070598 1.08296897
## Tenure           0.96055023 1.00083550
## Balance          1.00000148 1.00000373
## NumOfProducts    0.81344299 1.00087219
## IsActiveMember1  0.29593155 0.38134996

The summary model results show several variables that have a significant effect on predicting the target class. The significant variables in predicting are Germany in Geography, Male Gender, Age, Balance, and Active Member.

Next we will run a confusion matrix using the ‘caret’ package. A confusion matrix is used to see the performance of a classification model and compare the predicted results with actual data.

# Confusion Matrix
confusionMatrix(pred_exited, test_churn$Exited, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1530  314
##          1   67   89
##                                           
##                Accuracy : 0.8095          
##                  95% CI : (0.7916, 0.8265)
##     No Information Rate : 0.7985          
##     P-Value [Acc > NIR] : 0.1149          
##                                           
##                   Kappa : 0.2321          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.2208          
##             Specificity : 0.9580          
##          Pos Pred Value : 0.5705          
##          Neg Pred Value : 0.8297          
##              Prevalence : 0.2015          
##          Detection Rate : 0.0445          
##    Detection Prevalence : 0.0780          
##       Balanced Accuracy : 0.5894          
##                                           
##        'Positive' Class : 1               
## 

From the data above, we can see that the model’s accuracy is 81%, Sensitivity 22%, Specificity 96%, and Pos Pred Value 57%. It can be concluded that this model is quite good at predicting by looking at its accuracy, which reaches 81%.

Conclusion

Linear Regression is a simple but powerful predictive modeling technique. The performances of the models are fairly good with accuracies at 81%.

Some interesting findings in the dataset:

Through this project, we learned that a clean and quality data set is important since it will directly influence the performance of machine learning models and the independent variables have to be relevant in order to contribute to the prediction of the target variable.

References

Churn Modelling. (2019, April 3). Kaggle. https://www.kaggle.com/datasets/shrutimechlearn/churn-modelling
RPubs - Customer Churn Analysis. (2021, January 30). https://rpubs.com/anitaowens/customerchurn
RPubs - Classification Problem: Predicting Customer Bank Churn. (2021, July 7). https://rpubs.com/maalghozi/classification_cust_churn
A Short Introduction to the caret Package. (n.d.). https://cran.r-project.org/web/packages/caret/vignettes/caret.html