Classifying Bank Customer Data Using R? Use K-Means Clustering

Table of Contents

Academic Declaration

01. Abstract

02. Introduction and Problem identification

03. K-means clustering

4.1.What is k means Clustering and how to use it on the selected data set

4.2.Advantages of k means clustering as a technique

4.3.Disadvantages of k means clustering as a technique.

4.4.Limitations of k means clustering as a technique

4.5.Why was K means clustering chosen for the provided data set

4.6.Advantages of applying K-means clustering to the selected data set.

4.7.Disadvantages of applying K-means clustering to the selected data set.

4.8.Limitations of applying K-means clustering to the selected data set.

04. Solution design and development

5.1. High-level Design of Proposed Solution

5.2. Data Pre-Processing

5.3. Finding the value of K

5.4. Data Analysis – Train Data

5.5. Applying business strategies to clusters of Train Data based on Analysis Findings

Cluster 01:

Cluster 2

Cluster 3

5.6. Implementing the different techniques discussed above on Test data to analyse the correctness of Train data and find the value of k.

5.7. Data Analysis – Test Data

5.8. Comparison – Train Data Vs Test Data

5.8.1. Comparing Train Data findings to Test Data findings

05. Conclusion:

06. Future Development of k means clustering in relation to ABI

07. References

Table of Figures

Figure 1 – Highlevel Design of Proposed Solution

Figure 2 – Initial Picture of Data Set

Figure 3 – 1st Stage of Data Pre-processing – Arranging the Data in Separate Columns

Figure 4 – Selecting 5000 Random Data using Kutools for Microsoft Excel

Figure 5 – R Code to call in the Data Set

Figure 6 – Train Data Set Loaded in R

Figure 7 – R Code to Calculate the Elbow value

Figure 8 – Elbow Value in R for Train Data

Figure 9 – Elbow Value / k Value = 3, as produced by R for Train Data

Figure 10 – Finding the k Value using Weka for Train Data

Figure 11 – Cluster Means and Clustering Vectors (Fg-1) as produced by R for Train Data

Figure 12 – Clustering Vectors (Fg-2) as produced by R for Train Data

Figure 13 – Clustering Vectors (Fg-3) as produced by R for Train Data

Figure 14 – Cluster Means of Train Data

Figure 15 – Applying Percentrank Values on Train Dataset to Convert data from Numerical to Categorical Labels.

Figure 16 – Final Findings of Train Data in Three Clusters

Figure 17 – Term Deposit Subscription Comparison – Cluster Wise – Train Data

Figure 18 – Client Loan Status Comparison – Cluster Wise – Train Data

Figure 19 – Mean wise Call Duration in Seconds – Comparison – Cluster Wise – Train Data

Figure 20 – Customer Age Group Categorization – Comparison – Cluster Wise – Train Data

Figure 21 – Test Data Set Loaded in R

Figure 22 – Elbow Value in R for Test Data

Figure 23 – Elbow Value / k Value = 3, as produced by R for Test Data

Figure 24 – Cluster Means and Clustering Vectors (Fg-1) as produced by R for Test Data

Figure 25 – Clustering Vectors (Fg-2) as produced by R for Test Data

Figure 26 – Final Findings of Test Data in Three Clusters

Figure 27 – Term Deposit Subscription Comparison – Cluster Wise – Test Data

Figure 28 – Client Loan Status – Comparison – Cluster Wise – Test Data

Figure 29 – Mean wise Call Duration in Seconds – Comparison – Cluster Wise – Test Data

Figure 30 – Customer Age Group Categorization – Comparison – Cluster Wise – Test Data

Figure 31 – Final Findings of Train Data Vs Test Data in Three Clusters

Figure 32 – Cluster 1 Comparison of Train Vs Test Data for Selected Attributes

Figure 33 – Cluster 2 Comparison of Train Vs Test Data for Selected Attributes

Figure 34 – Cluster 3 Comparison of Train Vs Test Data for Selected Attributes

Table of Tables

Table 1 – High Level WBS

Table 2 – Data Attribute Discussion (Original, Selected for Study and Reasons of Exclusion)

Table 3 – Q2. Marital Status of Client?

Table 4 – Q3. Has the Client got a loan?

Table 5 – Q4. Last contact month of year?

Table 6 – Q8. Outcome of the previous marketing campaign?

Table 7 – Q9. Has the client subscribed a term deposit?

Table 8 – Final Selection of 4 Attributes in reference to the original List of Attributes and Respective Q values

01.  Abstract

Segmenting customers by demographic factors is a widely used practice in marketing, including in the banking sector, even though the correlation between these factors and customer behavior is often weak.

Segmenting the collected customer data by expected benefits and attitudes instead gives the bank a way to address the gap between individual services and cost-saving standardization.

To start this process, k-means cluster analysis was applied to a selected dataset to group similar customer behaviors, recorded as answers to a set of attitude-related questions formulated by the bank, with the ultimate goal of finding out whether customers were willing to sign up for a term deposit offered by the bank.

Some of the clusters generated in this manner were more homogeneous than others, and the profile of each customer segment was obtained from the demographic differences and qualities within it.

Following this process, three characteristic customer groups were identified as clusters, each with its own characteristics or preferences compared with the others. Suitable and profitable business strategies were then proposed for each cluster, based on its behavior, to encourage its customers to sign up for the term deposit offered by the bank.

02.  Introduction and Problem identification

This report is based on a dataset obtained from a Portuguese banking institution. It lists attributes of customers contacted by phone during one of its direct marketing campaigns to discuss the likelihood of the customer signing up for a term deposit at the bank.

There were 41189 instances of customer contacts in the original dataset. We preprocessed the data to reduce it from the original 21 attributes to 9 and randomly selected 5000 records, as the study is required to be based on this number. The first 3000 records are treated as training data and the remaining 2000 as test data.

The final analysis was based on a further filtered set of 4 attributes, which were more meaningful than the remaining 5.

The table below lists the workflow of this report and the tasks completed by each team member.

Table 1 – High Level WBS

Task Description: Assigned To Team Member
  1. Data Preprocessing: Hoshani FS
  2. Data Analysis using R: Maliheh Gordoghli
  3. Evaluation: Maliheh Gordoghli and Hoshani FS
  4. Documentation: Maliheh Gordoghli and Hoshani FS

03. K-means clustering

4.1.    What is k means Clustering and how to use it on the selected data set

Clustering models aim to categorize a dataset by its attributes into groups of similar observations, known as “clusters”. Observations within a given cluster are similar to one another and dissimilar to observations in other clusters.

Much like the human brain, clustering groups objects of a similar nature by applying pattern reasoning. Clustering can be done using different methods, such as partitioning, hierarchical, density-based and grid-based methods. This report applies the partitioning method, which divides the selected dataset into a predetermined number “K” of non-empty subsets. (Vercellis, 2009)

The fundamental idea of K-means clustering is to find K centroids (cluster averages) around which the data can be grouped, thereby breaking the dataset down into K groups. K-means clustering requires the data to be in numerical form, so we preprocessed the data from its original form into an integer representation. (The conversion of data during preprocessing is explained further in Section 5.2.)
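As a minimal sketch of this idea (not the code used in this report), R's built-in kmeans() function can be run on a small made-up numeric dataset. The data frame toy, its column names and its values are hypothetical and only illustrate the shape of the input and output.

```r
# Hypothetical toy data: two numeric attributes forming two obvious groups.
toy <- data.frame(
  age      = c(22, 25, 27, 61, 64, 66),
  duration = c(100, 120, 110, 900, 950, 920)
)

set.seed(1)  # k-means starts from randomly chosen centroids
# K = 2 groups; nstart = 5 repeats the random start and keeps the best run.
fit <- kmeans(toy, centers = 2, nstart = 5)

fit$cluster  # cluster number assigned to each row
fit$centers  # mean of each attribute within each cluster (the centroids)
fit$size     # number of rows in each cluster
```

Each row receives one cluster number, and each centroid is simply the column-wise mean of the rows assigned to it.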

4.2.    Advantages of k means clustering as a technique

Compared with other techniques, K-means clustering is easy to implement when applied correctly and is well suited to large datasets. Because of this ease of use, it produces results efficiently.

In contrast to hierarchical clustering, k-means tends to produce close-knit clusters. When the centroids are recomputed, an instance can change cluster (move to another cluster), so the clusters adjust automatically as the data changes.

Most importantly, it can group data points by their qualities or behaviors, so that each cluster contains data of a similar nature and the clusters differ from one another in the behavior of the data points within them.

4.3.    Disadvantages of k means clustering as a technique.

As most datasets will be unfamiliar to the analyst, the k value is often hard to predict, and the initial seeds and the order of the data may have a strong impact on the final results. K-means is also sensitive to scaling, so normalization and standardization of the data have a direct impact on the final results.
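The sensitivity to scaling can be explored with a short sketch, again on hypothetical data. It assumes the base R functions scale() for standardization and kmeans() with the nstart argument to reduce the effect of the initial seeds. In this made-up data the unscaled run is driven almost entirely by duration because of its larger numeric range, while after scaling age carries equal weight, so the two groupings may differ.

```r
# Hypothetical data: age in years and call duration in seconds.
toy <- data.frame(
  age      = c(25, 26, 27, 65, 66, 67),
  duration = c(100, 300, 180, 280, 120, 320)
)

set.seed(42)
# Unscaled: duration has the larger numeric range, so it dominates the distance.
raw_fit <- kmeans(toy, centers = 2, nstart = 10)

# Standardized: each column is rescaled to mean 0 and standard deviation 1.
scaled_fit <- kmeans(scale(toy), centers = 2, nstart = 10)

# Cross-tabulate the two assignments to see whether they agree.
table(raw = raw_fit$cluster, scaled = scaled_fit$cluster)
```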

4.4.    Limitations of k means clustering as a technique

During clustering, the model may produce clusters with no data in them, and carrying such empty clusters forward can directly affect the final result.

Where outliers are present, the Sum of Squared Error will usually be higher and the resulting cluster centroids (also known as prototypes) may become less representative.

Having to guess or choose the k value for the number of clusters is another limitation of k-means clustering. With 2D data this is a fairly easy choice to make, but as the dimensionality of the data increases the problem becomes more complex and the user cannot predict the appropriate number of clusters just by inspecting the data. (Singh, Malik, & Sharma, 2011)

4.5.    Why was K means clustering chosen for the provided data set

By using K-means clustering on the selected dataset, we were able to create meaningful interpretations of the phenomena of interest by segmenting the customers who participated in the marketing campaign according to their attributes. This reveals clusters that represent the purchasing behavior of particular customer groups and their response to this specific marketing campaign, which can later be used in other promotions as well as for the current problem.

Furthermore, clustering supports current business decisions based on the available data rather than future predictions. By analyzing the selected dataset with K-means clustering, we can identify how likely each customer group is to sign up for the bank’s term deposit.

4.6.    Advantages of applying K-means clustering to the selected data set.

The biggest advantage of using K-means clustering on the selected dataset is its efficiency. As the number of records is high, a person would take a long time to analyze them manually and produce a result. Since time is money in a business scenario, k-means clustering lets us reach a tangible result in very little time, which helps business decision making at the banking institution.

Furthermore, despite the complexity of the data, k-means clustering is easy to implement, and when the centroids are recomputed, instances are free to move between clusters. Also, each iteration of k-means takes on the order of k*n*d time, in which “k” represents the number of clusters, “n” represents the number of examples/instances, and “d” represents the time needed to compute the Euclidean distance between 2 points. (Guttag, 2017) For the train data in this report (k = 3, n = 3000), that is 9000 distance computations per iteration.

4.7.    Disadvantages of applying K-means clustering to the selected data set.

While k-means clustering helps make efficient business decisions, customer preferences and market behavior may change, so the model cannot be used on its own to make future predictions. (Can K-means clustering alone predict future trend from historical data? How to analyze the clustered data to predict future trends?, 2015)

Such changes may also alter the K value. Furthermore, the K value suggested by the algorithm may sometimes be wrong or need to be revised as the data changes, and the algorithm will often not produce the same results on each run. (Singh, Malik, & Sharma, 2011)

4.8.    Limitations of applying K-means clustering to the selected data set.

K-means clustering can only be performed on numerical data, so all categorical data in the selected dataset had to be converted from its original values into meaningful numerical values. The process may also terminate at a local optimum, and the order of the data has a direct impact on the result. (MACQUEEN, 1967)

04.  Solution design and development

5.1. High-level Design of Proposed Solution

Figure 1 – Highlevel Design of Proposed Solution

5.2. Data Pre-Processing

The selected dataset was initially available as a Microsoft Excel file in which the values were separated only by semicolons.

Figure 2 – Initial Picture of Data Set

Here, all data was selected and split into separate columns using the Text to Columns option on the Data ribbon.

Figure 3 – 1st Stage of Data Pre-processing – Arranging the Data in Separate Columns

As the report calls for only 5000 records and the original dataset had over 41000, any record with an “Unknown” or “Null” value in any attribute was removed.
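The same two steps (splitting on semicolons and dropping rows with unknown values) could also be done directly in R. The sketch below is an alternative to the Excel steps described above, not the procedure used in this report; the three-row extract and its column subset are made up for illustration.

```r
# Hypothetical three-row extract in the same semicolon-separated layout.
raw_text <- 'age;marital;loan;y
30;"married";"no";"no"
41;"unknown";"yes";"no"
52;"single";"unknown";"yes"'

# sep = ";" splits the fields into columns, as Text to Columns does in Excel.
bank <- read.csv(text = raw_text, sep = ";", stringsAsFactors = FALSE)

# Flag rows in which any attribute is missing or recorded as "unknown".
has_unknown <- apply(bank, 1, function(row) any(is.na(row) | row == "unknown"))

# Keep only the complete rows.
clean <- bank[!has_unknown, ]
```

For a file on disk, the text argument would be replaced by the path to the file.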

As mentioned earlier, the original dataset had 21 attributes and this report only addresses selected attributes as discussed below.

Table 2 – Data Attribute Discussion (Original, Selected for Study and Reasons of Exclusion)

Original Attribute Number and Name | Reasons of Exclusion from Study | Attribute after renaming and with New Tag

# bank client data:
1 – age (numeric) | | Q1 – Age of Client
2 – job: type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’) | The presence of unknown values; also, since it’s a banking institution, customers may not always provide accurate information about their employment status (e.g. self-employed) to avoid the burden of taxation. |
3 – marital: marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed) | | Q2 – Marital Status of Client
4 – education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’) | The presence of unknown values; the attribute also does not really affect the final outcome. |
5 – default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’) | The original attribute was not meaningful enough to identify what the findings represented. |
6 – housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’) | As the client can have two types of loans, this attribute was removed for ease of study. |
7 – loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’) | All unknown values of this attribute were converted into ½ Yes and ½ No answers, as meaningful data manipulation was allowed. | Q3 – Has the Client got a loan

# related with the last contact of the current campaign:
8 – contact: contact communication type (categorical: ‘cellular’, ‘telephone’) | The client was contacted by phone in any case, whether mobile or landline. As the basis of communication was “telephone only”, this attribute was removed. |
9 – month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’) | | Q4 – Last Contact Month of Year
10 – day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’) | It was not clear how this affected the final outcome, so it was removed. |
11 – duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. | | Q5 – Last Contact duration in seconds

# other attributes:
12 – campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact) | | Q6 – Number of contacts performed during this campaign and for the client
13 – pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) | Since the attribute refers to a previous event, it was removed. |
14 – previous: number of contacts performed before this campaign and for this client (numeric) | This attribute was later deemed not useful, as it did not really affect the outcome of the final result. | Q7 – Number of contacts performed before this campaign and for this client
15 – poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’) | | Q8 – Outcome of the previous marketing campaign

# social and economic context attributes
16 – emp.var.rate: employment variation rate – quarterly indicator (numeric) | Was removed as no obvious correlation between the attribute and the final outcome was noticed. |
17 – cons.price.idx: consumer price index – monthly indicator (numeric) | |
18 – cons.conf.idx: consumer confidence index – monthly indicator (numeric) | |
19 – euribor3m: euribor 3 month rate – daily indicator (numeric) | |
20 – nr.employed: number of employees – quarterly indicator (numeric) | Not a meaningful attribute to understand and so was removed. |

Output variable (desired target):
21 – y: has the client subscribed a term deposit? (binary: ‘yes’, ‘no’) | | Q9 – Has the client subscribed for a term deposit

As shown above, the number of attributes was reduced to 9. Each attribute was renamed to express its meaning more clearly while retaining its original denotation, and, as the data was to be analyzed in R, the final 9 attributes were labelled with Q values.

As k-means clustering in R only produces results with numerical data, we converted the data in the selected attributes to integer values while retaining the original meaning.

If each Q value listed above is treated as a question, the data in each attribute column can be treated as the answers to that question. As these answers were expressed in different words, they were converted in Microsoft Excel, as listed below, so that they could be used with k-means clustering.

Q1 – Age of Client? – As it is already numerical, data under this attribute did not require any conversion.

Table 3 – Q2. Marital Status of Client?

Original Data Name Represented using numerical value
Single 1
Married 2
Divorced 3

Table 4 – Q3. Has the Client got a loan?

Original Data Name Represented using numerical value
No 1
Yes 2

Table 5 – Q4. Last contact month of year?

Original Data Name Represented using numerical value
Mar 3
Apr 4
May 5
June 6
July 7
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12

Q5. Last contact duration, in Seconds? As it is already numerical, data under this attribute did not require any conversion.

Q6. Number of contacts performed during this campaign and for the client? As it is already numerical, data under this attribute did not require any conversion.

Q7. Number of contacts performed before this campaign and for the client? As it is already numerical, data under this attribute did not require any conversion.

Table 6 – Q8. Outcome of the previous marketing campaign?

Original Data Name Represented using numerical value
Failure 1
Success 2

Table 7 – Q9. Has the client subscribed a term deposit?

Original Data Name Represented using numerical value
No 1
Yes 2
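The mappings in Tables 3 to 7 were applied in Excel. For comparison, a minimal sketch of the same conversion in R is shown here, assuming the lower-case category names of the original dataset. The data frame answers and its rows are hypothetical.

```r
# Hypothetical rows holding the categorical answers as text.
answers <- data.frame(
  marital = c("single", "married", "divorced"),
  loan    = c("no", "yes", "no"),
  month   = c("mar", "may", "dec"),
  y       = c("no", "no", "yes"),
  stringsAsFactors = FALSE
)

# match() returns the position of each value in the lookup vector,
# which reproduces the integer codes used in Tables 3, 4 and 7.
encoded <- data.frame(
  Q2 = match(answers$marital, c("single", "married", "divorced")),
  Q3 = match(answers$loan, c("no", "yes")),
  # Months map to their calendar number (Table 5: mar = 3 ... dec = 12).
  Q4 = match(answers$month, tolower(month.abb)),
  Q9 = match(answers$y, c("no", "yes"))
)
```

One design point worth keeping in mind is that k-means treats these codes as distances, so coding marital status as 1, 2 and 3 places “single” further from “divorced” than from “married”.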

Once all data had been converted into numerical values, 5000 random records were selected using the free “Kutools” add-in for Microsoft Excel. Note the “#” column values highlighted in yellow, which are no longer consecutive and so show that the rows were picked at random.

Figure 4 – Selecting 5000 Random Data using Kutools for Microsoft Excel

In the new random dataset of 5000 records, the first 3000 rows were selected as training data and the remaining 2000 rows as test data.
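The random selection and the 3000/2000 split could be reproduced in R roughly as follows. This is a sketch under assumptions: cleaned is a hypothetical stand-in for the preprocessed dataset, and the seed value is arbitrary.

```r
set.seed(123)  # fixed seed so the same rows are drawn on every run

# Hypothetical stand-in for the cleaned dataset (one row per contact).
cleaned <- data.frame(id = 1:30000,
                      Q1 = sample(18:90, 30000, replace = TRUE))

# Draw 5000 distinct rows at random, then split them 3000 / 2000.
picked <- cleaned[sample(nrow(cleaned), 5000), ]
train  <- picked[1:3000, ]
test   <- picked[3001:5000, ]
```

Because sample() draws without replacement by default, no record can appear in both the train and the test set.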

5.3. Finding the value of K

The k value for a given dataset is normally chosen using prior knowledge about the classes or by considering other attributes of the dataset. However, the selected dataset is unfamiliar to us and we hold no prior knowledge about it. Assuming a k value, repeatedly trying different values of k and evaluating the quality of the results in R, or performing hierarchical clustering on a subset of the train data, would all be extremely time consuming. (Guttag, 2017)

However, as we cannot proceed without a proper k value, we first mitigated the dependence on initial centroids by trying multiple sets of randomly chosen initial centroids and selecting the best result, as in the pseudocode below. We then analysed the change in the average distance of data points to the centroids as K increased.

best = kMeans(points)
for t in range(numTrials):
    C = kMeans(points)
    if dissimilarity(C) < dissimilarity(best):
        best = C
return best

(Lecture 12: Clustering – MIT OpenCourseWare, 2016)

In the above code, C is a candidate clustering (a set of k clusters), and dissimilarity(C) is the sum of the variability of all k clusters.

The variability of a cluster is the sum of the squared Euclidean distances between the centroid and each point within the cluster.
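The restart loop above can be sketched in R. This is an illustration rather than the code used in this report: the data in pts is randomly generated, and the tot.withinss value returned by kmeans() (the total within-cluster sum of squares) is assumed here to play the role of the dissimilarity.

```r
set.seed(7)
# Hypothetical numeric data: 60 points around three made-up centres.
pts <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 5),  ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)

num_trials <- 20
best <- kmeans(pts, centers = 3)
for (trial in seq_len(num_trials)) {
  candidate <- kmeans(pts, centers = 3)  # new random initial centroids
  # Keep the candidate if its total within-cluster sum of squares is lower.
  if (candidate$tot.withinss < best$tot.withinss) {
    best <- candidate
  }
}

# kmeans() can also perform the restarts itself through the nstart argument.
best_nstart <- kmeans(pts, centers = 3, nstart = num_trials)
```

In practice the nstart form is shorter and does the same job of keeping the run with the lowest total within-cluster sum of squares.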

A common method of choosing the appropriate number of clusters is to compare the Sum of Squared Error (SSE) for a number of cluster solutions. SSE is the sum of the squared distances between each point in a cluster and its cluster centroid, which is why it can be considered an accepted measure of error. Because each additional cluster lets points sit closer to a centroid, SSE is normally expected to decrease as the number of clusters increases. We can therefore choose the most suitable number of clusters graphically, from a plot of SSE against the number of clusters: the appropriate cluster solution is the point at which the decrease in SSE slows down markedly, which produces an “elbow” in the plot. (Gove, 2017)
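A minimal sketch of such an SSE plot in R is given here. It is not the code shown in the figures of this report; mydata is randomly generated for illustration, and the range of 1 to 10 clusters is an arbitrary choice.

```r
set.seed(11)
# Hypothetical numeric data with three made-up groups.
mydata <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

max_k <- 10
wss <- numeric(max_k)
for (k in 1:max_k) {
  # Total within-cluster sum of squares (SSE) for a k-cluster solution.
  wss[k] <- kmeans(mydata, centers = k, nstart = 10)$tot.withinss
}

# The "elbow" is the k after which the curve flattens out.
plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Within-cluster sum of squares (SSE)")
```

With data like this, the curve drops steeply up to the number of real groups and only slowly afterwards, which is the elbow that the method looks for.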

As k was an unknown value, considering the above, the Elbow method was chosen to find the most suitable value for K.

To do this, we first loaded the train dataset into R using the R code shown below. (I.Kabacoff, 2017)

Figure 5 – R Code to call in the Data Set

By running the above code, the below result was provided by R.

Figure 6 – Train Data Set Loaded in R

Then we entered “>mydata<- D” and ran the code shown below. (I.Kabacoff, 2017)

Figure 7 – R Code to Calculate the Elbow value

Which produced the following: 

Figure 8 – Elbow Value in R for Train Data

Figure 9 – Elbow Value / k Value = 3, as produced by R for Train Data

Looking at the result produced by R using the Elbow method for the train data, we took k to be equal to 3.

Furthermore, to support the above finding, we also ran the train dataset of 3000 records in Weka. The findings below confirm the k value as 3.

Figure 10 – Finding the k Value using Weka for Train Data

Hence, the customers who participated in the campaign were divided into 3 clusters according to the answers (customer behaviours) they had provided, and the k value of 3 was passed to R using the code below. Before running the k-means code, we had to increase the number of records printed by R by using the options code with a number above our dataset size of 3000, so that R would display all records rather than limiting the output to 1000 (how to increase the limit for max.print in R, 2012):

>options(max.print = 999999)

>getOption("max.print")

>kmeans(D, 3, 1000)

Which produced the below result.

Figure 11 – Cluster Means and Clustering Vectors (Fg-1) as produced by R for Train Data

Figure 12 – Clustering Vectors (Fg-2) as produced by R for Train Data

Figure 13 – Clustering Vectors (Fg-3) as produced by R for Train Data

R reported the sizes of the 3 clusters as below:

K-means clustering with 3 clusters of sizes 815, 2037, 148

Cluster means:

Figure 14 – Cluster Means of Train Data

By analysing the above values, we concluded that the client’s decision of whether or not to subscribe to a term deposit (recorded in Q9) was more clearly related to Q1, Q3 and Q5 than to the rest of the selected attributes. We therefore chose the four Q values Q1, Q3, Q5 and Q9, as listed below, for the final analysis.

Table 8 – Final Selection of 4 Attributes in reference to the original List of Attributes and Respective Q values

Attributes, with names and Q values, that have a direct impact on the client’s decision
Ref to original Dataset # New # Description
1 Q1 Age of Client
7 Q3 has the client got a loan
11 Q5 last contact duration, in seconds
21 Q9 has the client subscribed a term deposit

To determine which cluster each row belongs to, we copied all clustering vector values into an Excel sheet. As the values are printed horizontally across several lines, we converted them into a single vertical column. This column was then pasted into the original train dataset file so that we could analyze which data row belongs to which cluster.
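The copy-and-paste step can be avoided by attaching the cluster vector to the data inside R. The sketch below assumes a small hypothetical data frame D in place of the real train data; the output file name is also made up.

```r
set.seed(3)
# Hypothetical stand-in for the numeric train data D.
D <- data.frame(
  Q1 = c(25, 31, 47, 52, 63, 70),
  Q5 = c(120, 1400, 150, 1350, 900, 950)
)

fit <- kmeans(D, centers = 3, iter.max = 1000, nstart = 10)

# fit$cluster has one entry per row of D, in the same row order,
# so it can be added as a new column directly.
D$cluster <- fit$cluster

# Optional: save the labelled rows to a CSV file that Excel can open.
out_file <- file.path(tempdir(), "train_with_clusters.csv")
write.csv(D, out_file, row.names = FALSE)
```

The saved file then already has the cluster number next to every data row, ready for sorting and filtering.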

To begin categorizing the data and give the Client’s age a qualitative label in three ranges (Low, Medium and High) for ease of analysis, we used the Excel formula “=PERCENTRANK([Values], A2)” in a new column. This returns the rank of each Client’s age as a decimal between 0 and 1, that is, its percentile among the data points.

As the above result is in decimals, the formula “=IF(PERCENTRANK([Values], A2) < 1/3, "LOW", IF(PERCENTRANK([Values], A2) < 2/3, "MEDIUM", "HIGH"))” was then entered to assign the label for each Client’s age range precisely (McGraw-Herdeg, 2012).
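A rough R counterpart of this labelling is sketched here. It assumes a rank-based percent rank (the number of smaller values divided by n − 1); the handling of tied values may differ from Excel, and the ages are hypothetical.

```r
# Hypothetical client ages.
age <- c(25, 35, 45, 55, 65)

# Rank-based percent rank: share of the other values that are smaller.
pct_rank <- (rank(age, ties.method = "min") - 1) / (length(age) - 1)

# Same thresholds as the Excel formula: below 1/3, below 2/3, otherwise.
age_label <- ifelse(pct_rank < 1/3, "LOW",
             ifelse(pct_rank < 2/3, "MEDIUM", "HIGH"))

data.frame(age, pct_rank, age_label)
```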

Then the findings for Q3 – Has the client got a loan, Q5 – Last contact duration, in seconds, and Q9 – Has the client subscribed a term deposit were converted back to categorical data using Excel’s Find and Replace. (No data was altered from its original value; it was only converted back to its original form from an integer representation for ease of analysis.) The data was then grouped into clusters using Excel’s Sort and Filter option.

Figure 15 – Applying Percentrank Values on Train Dataset to Convert data from Numerical to Categorical Labels.

5.4. Data Analysis – Train Data

The aim of the data analysis on the selected dataset is to find a suitable strategy for each cluster of customers, depending on their attributes, to encourage them to sign up for the term deposit at the bank.

Hence, we considered each cluster in turn and filtered the data by the outcome of the term deposit subscription. We then checked the client’s loan status, the call duration and the age-wise categorization, which produced the following results:
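The per-cluster checks described above were done with Excel filters. As an illustration only, the same kind of summary could be computed in R as follows; the data frame labelled and all of its values are made up and are not the findings of this report.

```r
# Hypothetical labelled rows: cluster number plus the four selected attributes.
labelled <- data.frame(
  cluster = c(1, 1, 2, 2, 2, 3),
  Q1 = c(40, 44, 28, 30, 26, 61),                 # age of client
  Q3 = c("no", "yes", "yes", "yes", "no", "no"),  # has the client got a loan
  Q5 = c(1400, 1420, 150, 140, 160, 600),         # call duration in seconds
  Q9 = c("yes", "yes", "no", "no", "no", "yes")   # subscribed a term deposit
)

# Mean age and mean call duration within each cluster.
means_by_cluster <- aggregate(cbind(Q1, Q5) ~ cluster,
                              data = labelled, FUN = mean)

# Share of "no"/"yes" subscriptions within each cluster (rows sum to 1).
subscription_share <- prop.table(table(labelled$cluster, labelled$Q9),
                                 margin = 1)

# Share of clients without and with a loan within each cluster.
loan_share <- prop.table(table(labelled$cluster, labelled$Q3), margin = 1)
```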

Figure 16 – Final Findings of Train Data in Three Clusters

Figure 17 – Term Deposit Subscription Comparison – Cluster Wise – Train Data

Considering the above chart, the Term Deposit success rate is high in Cluster 1, medium in Cluster 3 and low in Cluster 2.

Further analysis was conducted on Q1, Q3, and Q5, where the selected attributes and behaviors of customers were assessed to identify the reasons for the different success and failure rates in each cluster.

Based on these findings as a whole, we proposed profitable strategies to increase customers’ willingness to sign up for a term deposit at the bank.

Figure 18 – Client Loan Status Comparison – Cluster Wise – Train Data

The above graph shows that the share of clients with a loan is highest in Cluster 2, medium in Cluster 1 and lowest in Cluster 3.

Figure 19 – Mean wise Call Duration in Seconds – Comparison – Cluster Wise – Train Data

The above chart shows that the mean call duration of Cluster 1 is very high at 1,405.77 seconds, compared with Cluster 2, which is medium, and Cluster 3, which is low.

Figure 20 – Customer Age Group Categorization – Comparison – Cluster Wise – Train Data

The above chart shows that the age range of customers is medium in Cluster 1, low in Cluster 2 and high in Cluster 3.

5.5. Applying business strategies to clusters of Train Data based on Analysis Findings

Cluster 01:

Facts:

Term Deposit Subscription Rate and Call Duration = High,

Age group of customers, Client Loan Status = Medium

Assumptions:

Customers in this cluster have an idea about financial investments and understand the risk of investment, but are vigilant and so require more information about the bank and the benefits of the term deposit before signing up for it.

Strategies:

  • The bank must provide outstanding customer service and other services to customers in this cluster, as, judging by the success rate, this group contains more potential customers than clusters 2 and 3.
  • Conduct surveys, collect feedback and publish it on the bank website as encouragement to other savers, and use the feedback to improve the organization’s products and services where necessary.

Cluster 2

Facts:

Term Deposit Subscription, Age group of customers and Call Duration = Low

Client Loan Status = High

Assumptions:

As the client age level is low, most clients in this cluster are young, at the beginning of their careers or still in college, and have other financial commitments/obligations such as mortgages, study loans, vehicle loans etc. (they need to save before starting to invest).

Considering their young age, their knowledge of financial investment may be low and their fear of investment risk may be high.

Strategy:

Keep following up with the clients in this cluster, as their call levels are low in comparison to the other two clusters.

During these follow-up contacts, educate them about the term deposit and its benefits, and then encourage them to sign up while further explaining the benefits of investing and of starting to save early. Inform them that once they sign up for the bank’s term deposit, the bank can offer them special pension schemes, special rates on loans and credit card facilities, discounts, etc.

Cluster 3

Facts:

Term Deposit Subscription Rate and Call Duration = Medium

Customer Age Group = High

Client Loan Status = Low.

Assumptions:

As the Term Deposit Subscription Rate is Medium and Loan Status is Low while Age is High, it can be assumed that most elderly customers who are established and retired are in this cluster. They have an idea about financial investments and their benefits but, considering their age, are afraid to invest; even so, some seem to be motivated.

Strategy:

For customers who sign up for the term deposit, the bank should arrange a special scheme designed for this age group that is distinctly different from those for the other two clusters and provides benefits such as elderly home facilities, insurance facilities and funeral support. The bank could also partner with other organizations to offer additional services that it does not provide itself (vehicle maintenance, special rates on medical care).

5.6. Implementing the different techniques discussed above on Test data to analyze the correctness of Train data and find the value of k.

Applying Elbow Technique using R for Test Data to determine the value of k.

Figure 21 – Test Data Set Loaded in R

Figure 22 – Elbow Value in R for Test Data

Figure 23 – Elbow Value / k Value = 3, as produced by R for Test Data

The above result shows that the k value for the test data equals the k value for the train data, which is 3.
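A minimal sketch of how an elbow curve of this kind can be computed in R is given below. It is not the code behind Figure 22: the matrix `D` is synthetic stand-in data (an assumption, since the real test data is not reproduced here), and the range of k from 1 to 8 is chosen only for illustration.

```r
# Synthetic stand-in for the numeric data set D (assumption: three loose groups).
set.seed(42)
D <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(100, mean = 0, sd = 0.5), rnorm(100, mean = 4, sd = 0.5))
)

# Total within-cluster sum of squares (WSS) for k = 1..8.
wss <- numeric(8)

# For k = 1 the WSS is simply the total sum of squares around the overall mean.
wss[1] <- (nrow(D) - 1) * sum(apply(D, 2, var))

# For k >= 2, run k-means and record its total within-cluster sum of squares.
for (k in 2:8) {
  wss[k] <- kmeans(D, centers = k, iter.max = 1000, nstart = 10)$tot.withinss
}

# The "elbow" is the k after which WSS stops dropping sharply.
print(data.frame(k = 1:8, wss = round(wss, 1)))
```

Calling `plot(1:8, wss, type = "b")` afterwards draws the curve; the elbow is read off by eye, so the chosen k remains a judgment call rather than a computed optimum.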

Then, using k-means clustering in R, we categorized the test data into 3 clusters as shown below, using the code: >kmeans(D, 3, 1000) (that is, 3 centers and a maximum of 1000 iterations).

Figure 24 – Cluster Means and Clustering Vectors (Fg-1) as produced by R for Test Data

Figure 25 – Clustering Vectors (Fg-2) as produced by R for Test Data
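For readers who want to reproduce output of the kind shown in the two figures above, the following sketch spells out the same call with named arguments. The data in `D` is synthetic (an assumption), and `set.seed()` and `nstart = 25` are additions that the original call did not use; they make the run repeatable and less dependent on the random starting centroids.

```r
# Synthetic stand-in for the numeric data set D (assumption).
set.seed(123)
D <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(100, mean = 0, sd = 0.5), rnorm(100, mean = 4, sd = 0.5))
)

# Same call as kmeans(D, 3, 1000), with the positional arguments named.
# nstart = 25 repeats the algorithm from 25 random starts and keeps the best fit.
fit <- kmeans(D, centers = 3, iter.max = 1000, nstart = 25)

print(fit$centers)        # cluster means, one row per cluster
print(fit$size)           # number of observations in each cluster
print(head(fit$cluster))  # clustering vector: cluster number of each row
```

The `centers` matrix corresponds to the "Cluster Means" part of the printed output and `cluster` to the "Clustering Vector".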

5.7. Data Analysis – Test Data

Attribute and Cluster-wise breakdown of findings for Test Data

Figure 26 – Final Findings of Test Data in Three Clusters

Figure 27 – Term Deposit Subscription Comparison – Cluster Wise – Test Data

Referring to the above chart, the difference between the Success and Failure rates is lowest in Cluster 1 at 5.46%, while Cluster 3 has the second-highest difference at 33.90%, and Cluster 2, the most prominent, displays a massive 61.07% difference between Failure and Success rates.

Figure 28 – Client Loan Status – Comparison – Cluster Wise – Test Data

Looking at the above graph, the difference between Failure and Success rates is almost the same in all three clusters: Cluster 1 = 32.58%, Cluster 2 = 39.01%, and Cluster 3 = 32.02%.

Figure 29 – Mean wise Call Duration in Seconds – Comparison – Cluster Wise – Test Data

Looking at the above chart, it can be noticed that the Bank has had to invest most of its time talking to customers in Cluster 3, which shows a massive mean call duration of 938.1525 seconds, while Cluster 1 is a little under half of that at 418.4595 seconds and Cluster 2 is the lowest at only 147.097 seconds.

Figure 30 – Customer Age Group Categorization – Comparison – Cluster Wise – Test Data

Referring to the above chart, medium-aged customers are most numerous in Cluster 1, and the share of clients in the low age range is almost the same in all three clusters, with Cluster 3 leading. Overall, there is not much statistical difference between the age groups across the clusters.
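Cluster-wise breakdowns like the ones charted in this section can be derived directly in R once each record carries its cluster number. The sketch below is illustrative only: the data frame `clients` and its columns `cluster`, `subscribed` and `duration` are hypothetical names filled with random values, not the bank data, so its numbers will not match the figures.

```r
# Hypothetical data: random cluster labels, subscription outcome and call duration.
set.seed(1)
n <- 300
clients <- data.frame(
  cluster    = sample(1:3, n, replace = TRUE),
  subscribed = sample(c("no", "yes"), n, replace = TRUE),
  duration   = round(runif(n, min = 30, max = 1200))
)

# Percentage of failures ("no") and successes ("yes") within each cluster.
counts <- table(clients$cluster, clients$subscribed)
rates  <- round(100 * prop.table(counts, margin = 1), 2)

# Absolute difference between failure and success rates per cluster.
gap <- abs(rates[, "no"] - rates[, "yes"])

# Mean call duration in seconds per cluster.
mean_duration <- aggregate(duration ~ cluster, data = clients, FUN = mean)

print(rates)
print(gap)
print(mean_duration)
```

`margin = 1` makes each row (cluster) sum to 100%, which is what a within-cluster success versus failure comparison needs.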

5.8. Comparison – Train Data Vs Test Data

Shown below is a comparison between the Train Data and the Test Data.

Figure 31 – Final Findings of Train Data Vs Test Data in Three Clusters

Comparing the two sets of data, a clear difference can be identified, but this difference in results may also be due to the number of records in each data set.

5.8.1. Comparing Train Data findings to Test Data findings

Shown below is a comparison of Train and Test data for all three clusters based on their attributes, along with the difference for each attribute in each cluster.

Figure 32 – Cluster 1 Comparison of Train Vs Test Data for Selected Attributes

Figure 33 – Cluster 2 Comparison of Train Vs Test Data for Selected Attributes

Figure 34 – Cluster 3 Comparison of Train Vs Test Data for Selected Attributes

Train Data Mean Values

Test Data Mean Values

In general machine learning practice, to avoid overfitting, a data set is often divided into two sets, train (2/3 portion) and test (1/3 portion). The selected machine learning technique is run on the train data set to obtain a result, and the same technique is then applied to the test data to check the correctness of the results obtained from the train data.
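A 2/3 to 1/3 split of this kind can be sketched in R as follows. This is an assumption about how such a split might be scripted, not the procedure used in this study (where the random selection was done in Excel); the data frame `bank` and its columns are hypothetical.

```r
# Hypothetical data frame standing in for the customer records.
set.seed(2024)
bank <- data.frame(
  age      = sample(18:90, 300, replace = TRUE),
  duration = round(runif(300, min = 30, max = 1200))
)

# Draw two thirds of the row indices at random for the train set.
n <- nrow(bank)
train_idx <- sample(seq_len(n), size = floor(2 * n / 3))

train <- bank[train_idx, ]   # 2/3 portion
test  <- bank[-train_idx, ]  # remaining 1/3 portion

print(c(train = nrow(train), test = nrow(test)))
```

Sampling indices without replacement guarantees that no record appears in both sets.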

However, in clustering there is usually no split between train and test data. Clustering is unsupervised, so there are no known labels against which to check a result, and k-means starts from randomly chosen centroids, so each run can give a different result as data points move from one cluster to another. Hence there is no single correct answer, and for this reason the method should be applied to one data set only (either train or test).
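The run-to-run behaviour described above can be examined with a short experiment. The sketch below uses synthetic data (an assumption) and compares two k-means runs started from different random seeds; the seeds and object names are illustrative.

```r
# Synthetic stand-in for the numeric data set D (assumption).
set.seed(7)
D <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(100, mean = 0, sd = 0.5), rnorm(100, mean = 4, sd = 0.5))
)

# Two runs that differ only in their random starting centroids.
set.seed(1)
fit_a <- kmeans(D, centers = 3, iter.max = 1000)
set.seed(2)
fit_b <- kmeans(D, centers = 3, iter.max = 1000)

# Cross-tabulate the two clustering vectors. Cluster numbers are arbitrary,
# so the same group may be called 1 in one run and 3 in the other.
agreement <- table(run_a = fit_a$cluster, run_b = fit_b$cluster)
print(agreement)

# Re-using a seed reproduces a run exactly.
set.seed(1)
fit_c <- kmeans(D, centers = 3, iter.max = 1000)
same_as_a <- identical(fit_a$cluster, fit_c$cluster)
print(same_as_a)
```

Because cluster numbers carry no meaning of their own, two results are better compared through a cross-tabulation like this, or through their centroids, than by matching "Cluster 1" to "Cluster 1".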

When the train findings are compared with the test findings, it can be noticed that the results do not match. This could be for the reason mentioned above, or it could be that k = 3, as selected via the Elbow Method, is not the most suitable value for clustering the dataset. (It must be noted that while the Elbow Method produced a k value of 3, we also tried an arbitrarily selected k value of 4 and obtained results almost similar to those for k = 3.)

05. Conclusion:

In the current fast-paced world, the ultimate goal of every business is to achieve competitive advantage over its competitors in the same business environment. Data plays a most important role in this cycle and, where used right, companies such as the banking institution discussed herein could gain outstanding results.

The term Data, as discussed above, stands for the different statistical customer behaviours or attributes which the Banking Institute may use to implement profitable strategies. After using suitable methods to filter and analyze the data, the Bank can group customers by behaviour in accordance with its business needs, as Potential, Loyal, Impulsive, Time Wasting and High Maintenance, and then apply profitable strategies to manage each of these groups through customer relationship management. The aim is to satisfy customers while improving the profits and market share of the Bank by making the right business decisions at the right time.

As for the problems faced while applying the selected technique to the given data set: in clustering, the dataset is not normally divided into train and test sections, because data points can change cluster at each run, which produces different results.

Furthermore, the R software used for applying the technique to the dataset is very error-sensitive and not very user-friendly as an application, and it often produced miscalculations, which was time-consuming. This could be because R is too complex an application or because the model required more tuning before producing a correct result.

06. Future Development of k means clustering in relation to ABI

The main problem in k-means clustering is predicting the number of clusters. The tuning of the algorithm used to calculate the k value has a direct impact on the final result, and choosing the right k value is very important to avoid operating with empty clusters, which are often the result of an incorrect initial number of clusters that causes dead centroids.

As there are not many algorithms available that can calculate the k value successfully, the algorithm calls for further improvement so that it can classify datasets of any size, from small to big, with similar response speed and a higher degree of accuracy.

As the dataset discussed herein was categorical, k-modes clustering, a clustering method for categorical data, could also have been used to analyze the data. But the output of this technique is unknown, as this study was based on k-means.

However, k-means was chosen instead because attributes such as Client Age and, most importantly, Call Duration were numerical, and converting them to meaningful qualitative labels may have been almost impossible.

We assume that if different attributes had been chosen, or if the data manipulation had been done differently, this would still have produced a result (unknown, as no attempt was made) which may have differed from the result discussed herein.

If the Bank wishes, it may consider different attributes in the future to obtain a different result and apply strategies based on the new findings, which could be used to increase its market share.

Furthermore, we believe that if software such as Weka had been used instead of R, the model might have produced a different result. However, the accuracy of that result is unknown.

As there are only a handful of software packages able to produce results in relation to ABI, more software such as R must be developed, but with higher levels of user-friendliness and accessibility.

07. References

Can K-means clustering alone predict future trend from historical data? How to analyze the clustered data to predict future trends? (2015). Retrieved from www.quora.com: https://www.quora.com/Can-K-means-clustering-alone-predict-future-trend-from-historical-data-How-to-analyze-the-clustered-data-to-predict-future-trends

Gove, R. (2017, December). Using the elbow method to determine the optimal number of clusters for k-means clustering. Retrieved from https://bl.ocks.org: https://bl.ocks.org/rpgove/0060ff3b656618e9136b

Guttag, J. (2017, May 19). Clustering, MIT OpenCourseWare. Retrieved from https://www.youtube.com: https://www.youtube.com/watch?v=esmzYhuFnds

how to increase the limit for max.print in R. (2012, March). Retrieved from www.stackoverflow.com: https://stackoverflow.com/questions/6758727/how-to-increase-the-limit-for-max-print-in-r

Kabacoff, R. I. (2017). Cluster Analysis. Retrieved from https://www.statmethods.net: https://www.statmethods.net/advstats/cluster.html

Lecture 12: Clustering – MIT OpenCourseWare. (2016). Retrieved from www.ocw.mit.edu: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/lecture-slides-and-files/MIT6_0002F16_lec12.pdf

MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281-297.

McGraw-Herdeg, M. (2012, December 19). Excel: how do you assign top third, middle third and bottom third rankings to a list of values in excel in a way that will adjust the rankings as new values are added? Retrieved from www.quora.com: https://www.quora.com/Excel-how-do-you-assign-top-third-middle-third-and-bottom-third-rankings-to-a-list-of-values-in-excel-in-a-way-that-will-adjust-the-rankings-as-new-values-are-added

Santini, M. (2016). Machine. Advantage & Disadvantages of k-Means and Hierarchical Clustering (Unsupervised Learning), 5.

Singh, K., Malik, D., & Sharma, N. (2011). Evolving Limitation in K-means algorithm in data mining and their removal. IJCEM International Journal of Computational Engineering & Management, 109.

The R Project for Statistical Computing. (n.d.). Retrieved from www.r-project.org: https://www.r-project.org/

Vercellis, C. (2009). Business Intelligence: Data Mining and Optimization for Decision Making. West Sussex: John Wiley & Sons Ltd.

Voges, K. E., & Pope, N. K. (2006). Business Applications and Computational Intelligence. Hershey: Idea Group Publishing.

