R Data Engineering
Name: Leigh Davis
In every business, there will always be customers coming in and going out. Customer churn ratio is the measure of the number of individuals or items moving out of a collective group over a specific period. Losing customers is expensive for businesses and understanding why the business loses customers can assist in taking corrective action. I will be observing factors to deduce why customers are being lost using trend analysis and linear regression to show relationships among variables. I will also be using a machine learning algorithm to predict whether a customer will churn or not. The data used in this case were taken from Kaggle, Churn Modeling (see references).
# Import library
library(dplyr) # This was the subject of my first project in STAT 484; provides set of verbs that help solve common data manipulation challenges
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret) # Contains functions to streamline the model training process for complex regression and classification problems
## Loading required package: ggplot2
## Loading required package: lattice
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Import Data
df <- read.csv(file = "~/Documents/PSU/R/Data/Churn_Modelling.csv", row.names = 1, stringsAsFactors = T)
# Take out irrelevant columns : CustomerId, Surname
churn <- df %>%
dplyr::select(-c(CustomerId, Surname)) %>%
mutate(HasCrCard = as.factor(HasCrCard),
IsActiveMember = as.factor(IsActiveMember),
Exited = as.factor(Exited))
# Checking for NA values
sum(is.na(churn))
## [1] 0
# Description
summary(churn)
## CreditScore Geography Gender Age Tenure
## Min. :350.0 France :5014 Female:4543 Min. :18.00 Min. : 0.000
## 1st Qu.:584.0 Germany:2509 Male :5457 1st Qu.:32.00 1st Qu.: 3.000
## Median :652.0 Spain :2477 Median :37.00 Median : 5.000
## Mean :650.5 Mean :38.92 Mean : 5.013
## 3rd Qu.:718.0 3rd Qu.:44.00 3rd Qu.: 7.000
## Max. :850.0 Max. :92.00 Max. :10.000
## Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
## Min. : 0 Min. :1.00 0:2945 0:4849 Min. : 11.58
## 1st Qu.: 0 1st Qu.:1.00 1:7055 1:5151 1st Qu.: 51002.11
## Median : 97199 Median :1.00 Median :100193.91
## Mean : 76486 Mean :1.53 Mean :100090.24
## 3rd Qu.:127644 3rd Qu.:2.00 3rd Qu.:149388.25
## Max. :250898 Max. :4.00 Max. :199992.48
## Exited
## 0:7963
## 1:2037
##
##
##
##
head(churn)
## CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard
## 1 619 France Female 42 2 0.00 1 1
## 2 608 Spain Female 41 1 83807.86 1 0
## 3 502 France Female 42 8 159660.80 3 1
## 4 699 France Female 39 1 0.00 2 0
## 5 850 Spain Female 43 2 125510.82 1 1
## 6 645 Spain Male 44 8 113755.78 2 1
## IsActiveMember EstimatedSalary Exited
## 1 1 101348.88 1
## 2 1 112542.58 0
## 3 0 113931.57 1
## 4 0 93826.63 0
## 5 1 79084.10 0
## 6 0 149756.71 1
Upon preparation and first inspection of the data, we notice that there are no NA values so we may continue on to the data analysis.
# I could use ggplot to graph with shorter code but I wanted to exemplify the functions in this course
op <- par(mfrow = c(1, 3))
boxplot(churn$Balance, churn$EstimatedSalary, main="Balance and Est. Salary", names=c("Balance","Est. Salary"), col = c("#fcba0377", "#0390fc77"))
abline(v = c(1.5), lty = 1)
boxplot(churn$Age, main=c("Age"), xlab=c("Age"), col = "#0000cc77")
boxplot(churn$NumOfProducts, main=c("Num. of Products"), xlab=c("Num. of Products"), col = "#fce80377")
boxplot(churn$CreditScore, main=c("Credit Score"), xlab=c("Credit Score"), col = "#00cccc77")
boxplot(churn$Tenure, main=c("Tenure"), xlab=c("Tenure"), col = "#cc00cc77")
We are able to visualize the numerical data in the boxplots above. We will now calculate the rate percentage of customers that are churning within the dataset.
# Note rows in data set
nrow(churn)
## [1] 10000
# Churn rate proportion in total
(churn.rate <- churn %>%
group_by(Exited) %>%
count(Exited) %>%
mutate(Percent = n/nrow(churn) * 100))
## # A tibble: 2 × 3
## # Groups: Exited [2]
## Exited n Percent
## <fct> <int> <dbl>
## 1 0 7963 79.6
## 2 1 2037 20.4
From the 10000 rows of data, we calculated that 79.63% are customers are not churn, and 20.37% are customers that churn. As a bonus of information, below is the churn rate based on geography.
# Churn rate proportion by Geography
(churn.geo <- churn %>%
group_by(Geography) %>%
count(Exited) %>%
mutate(Percent = n/nrow(churn) * 100))
## # A tibble: 6 × 4
## # Groups: Geography [3]
## Geography Exited n Percent
## <fct> <fct> <int> <dbl>
## 1 France 0 4204 42.0
## 2 France 1 810 8.1
## 3 Germany 0 1695 17.0
## 4 Germany 1 814 8.14
## 5 Spain 0 2064 20.6
## 6 Spain 1 413 4.13
pie(churn.geo$Percent, labels= paste(churn.geo$Percent,"%"), col = c("#cfbaf077", "#cfbaf0", "#ffcfd277", "#ffcfd2", "#fbf8cc77", "#fbf8cc"), radius = 0.6, main = "Churn by Geography")
legend("topleft", c(rep(paste(unique(churn.geo$Geography),"not churn"),length.out=3), rep(paste(unique(churn.geo$Geography),"churn"),length.out=3)), cex = 0.8, fill = c("#cfbaf077", "#ffcfd277", "#fbf8cc77", "#cfbaf0", "#ffcfd2", "#fbf8cc"), ncol = 2)
The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model. It is critical to partition the data into training and testing sets when using learning algorithms such as Linear Regression, Random Forest, Naïve Bayes classification, Logistic Regression, and Decision Trees etc. Splitting helps to avoid over-fitting and to improve the training data set accuracy. The following code splits 80% of the data selected randomly into training set and the remaining 20% sample into test data set.
# Set seed for reproducibility
set.seed(123)
# Split the churn data into training and test sets
split <- sample(nrow(churn), nrow(churn)*0.8)
train_churn <- churn [split, ]
test_churn <- churn [-split, ]
# Checking if distribution of partition data is correct for train set
table(train_churn$Exited)
##
## 0 1
## 6366 1634
prop.table(table(train_churn$Exited))
##
## 0 1
## 0.79575 0.20425
# Checking if distribution of partition data is correct for test set
table(test_churn$Exited)
##
## 0 1
## 1597 403
prop.table(table(test_churn$Exited))
##
## 0 1
## 0.7985 0.2015
We will be creating a model that can perform well on unknown data. Logistic Regression has the advantage of an easy-to-interpret algorithm similar to linear Regression. The method used is stepwise Regression in the backward direction. We utilize the train data for training and predicting the algorithm. To test the trained model’s performance, will use the testing data set.
To train a classification model, there is mainly three steps: 1. Splitting Data into Training and Testing Set 2. Model Training/ Tuning 3. Model Testing
The Exited variable will be used as the target variable to predict whether a bank customer will churn or not.
# Logistic Regression
mlr <- glm(Exited ~., data = train_churn, family = "binomial")
mlrback <- step(object = mlr, direction = "backward", trace = F)
predlr <- predict(mlrback, test_churn, type = "response")
# Classifying Class and Predicting
pred_exited <- ifelse(predlr > 0.5,"1","0")
pred_exited <- as.factor(pred_exited)
# Model Summary
summary(mlrback)
##
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age +
## Tenure + Balance + NumOfProducts + IsActiveMember, family = "binomial",
## data = train_churn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3534 -0.6577 -0.4534 -0.2652 3.0006
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.374e+00 2.651e-01 -12.727 < 2e-16 ***
## CreditScore -6.941e-04 3.138e-04 -2.212 0.0269 *
## GeographyGermany 7.877e-01 7.582e-02 10.388 < 2e-16 ***
## GeographySpain 3.887e-02 7.881e-02 0.493 0.6219
## GenderMale -5.479e-01 6.097e-02 -8.986 < 2e-16 ***
## Age 7.398e-02 2.905e-03 25.469 < 2e-16 ***
## Tenure -1.970e-02 1.048e-02 -1.879 0.0602 .
## Balance 2.609e-06 5.735e-07 4.550 5.37e-06 ***
## NumOfProducts -1.024e-01 5.288e-02 -1.937 0.0527 .
## IsActiveMember1 -1.090e+00 6.468e-02 -16.854 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8099.8 on 7999 degrees of freedom
## Residual deviance: 6835.6 on 7990 degrees of freedom
## AIC: 6855.6
##
## Number of Fisher Scoring iterations: 5
# Print confidence intervals
exp(confint(mlrback))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.02034036 0.05750259
## CreditScore 0.99869142 0.99992065
## GeographyGermany 1.89512220 2.55116670
## GeographySpain 0.89018089 1.21251321
## GenderMale 0.51293370 0.65144412
## Age 1.07070598 1.08296897
## Tenure 0.96055023 1.00083550
## Balance 1.00000148 1.00000373
## NumOfProducts 0.81344299 1.00087219
## IsActiveMember1 0.29593155 0.38134996
The summary model results show several variables that have a significant effect on predicting the target class. The significant variables in predicting are Germany in Geography, Male Gender, Age, Balance, and Active Member.
Next we will run a confusion matrix using the ‘caret’ package. A confusion matrix is used to see the performance of a classification model and compare the predicted results with actual data.
# Confusion Matrix
confusionMatrix(pred_exited, test_churn$Exited, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1530 314
## 1 67 89
##
## Accuracy : 0.8095
## 95% CI : (0.7916, 0.8265)
## No Information Rate : 0.7985
## P-Value [Acc > NIR] : 0.1149
##
## Kappa : 0.2321
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.2208
## Specificity : 0.9580
## Pos Pred Value : 0.5705
## Neg Pred Value : 0.8297
## Prevalence : 0.2015
## Detection Rate : 0.0445
## Detection Prevalence : 0.0780
## Balanced Accuracy : 0.5894
##
## 'Positive' Class : 1
##
From the data above, we can see that the model’s accuracy is 81%, Sensitivity 22%, Specificity 96%, and Pos Pred Value 57%. It can be concluded that this model is quite good at predicting by looking at its accuracy, which reaches 81%.
Linear Regression is a simple but powerful predictive modeling technique. The performances of the models are fairly good with accuracies at 81%.
Some interesting findings in the dataset:
The rate of churn in highest in Germany.
Males , Age, Balance, and Active Member are significant variables in predicting.
Having a credit card is not a good predictor for churn status mainly due to the high credit card ownership in Germany, France and Spain.
Through this project, we learned that a clean and quality data set is important since it will directly influence the performance of machine learning models and the independent variables have to be relevant in order to contribute to the prediction of the target variable.