Implementation of Decision Support System for Automatic Approval of Loan by Analysing Applicants Credit Payment Behaviour

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of



This paper evaluates the behaviour of credit card holders in Taiwan and estimates consumer creditworthiness by applying several machine learning techniques, namely Logistic Regression, Random Trees, Bayesian Network and Neural Network, to a customer credit dataset. The dataset was extracted from the `UCI Machine Learning Repository’ (Lichman; 2013) and partitioned into training and testing datasets for analysis and evaluation.

The objective of this project is to implement a decision support system that can help organizations approve loans automatically by analysing the credit payment history of customers. To minimize credit risk from the banking perspective, the study concentrates on predicting the probability of customer default.

Each algorithm selects important predictor variables to train its predictive model. To improve performance, variables other than customer payment status are also taken into consideration when forecasting default customers. Performance of the implemented models is evaluated by comparing the prediction accuracy of each model on both the training and testing datasets. Among the four algorithms used, Logistic Regression was observed to have the highest ability to predict default customers.

Keywords: Credit Card, Loan Approval, Machine Learning, Customer Behaviour.

1 Introduction

According to (Yeh and Lien; 2009), credit card holders in Taiwan suffered from a major credit card debt crisis in 2006, and the crisis was expected to worsen in the third quarter of that year. To raise their market share, banks in Taiwan over-extended credit and offered more of it to unqualified applicants. Over the same period, use of credit cards for personal spending increased regardless of the card holders’ repayment capacity, which led to the accumulation of high credit balances relative to their personal bank accounts. This situation created a critical economic condition for both banks and credit card users and made it difficult to maintain a clean cash flow (Yeh and Lien; 2009). A well-organized financial institution focuses more on predicting financial risk than on managing an economic crisis (Borodzicz; 2005). Financial transactions and customer payment history are the main sources of information for analysing consumer credit payment behaviour and forecasting default customers.

Data mining comprises various methods for exploring data and turning it into meaningful knowledge (Jiawei and Kamber; 2003). In the domain of information technology, data mining plays a significant role in identifying trends in data and unseen relationships between its attributes. Machine learning procedures take the revealed patterns as input for analysis and can be used for building clusters, classifying data and selecting features (Cios et al.; 1998). According to (Venkatesh and Jacob; 2016), the application of data mining methods in banking is increasing continuously because machine learning algorithms have a greater capability to capture meaningful insight from data. Various classification algorithms fall under machine learning and can be used to segregate data into proposed categories. (Venkatesh and Jacob; 2016) stated that credit card transaction data is increasing daily in the banking sector. In this situation, computer technology plays an important role for banks in managing credit risk and securing transactions, by applying machine learning methods and building prediction models for credit risk evaluation.

(Clover; 2013) presented a study on developing an auto loan approval system for banks to minimize credit risk and gain profit from customer credit. (Agarwal et al.; 2008; Heitfield and Sabarwal; 2004) claimed that most auto loan approval systems make decisions using applicants’ demographic and application information, but this data does not capture consumer credit behaviour, which is more important for such systems. To overcome this problem, banks added consumer credit data to auto loan approval systems so that default customers could be predicted by analysing customers’ credit payment behaviour (Clover; 2013). To manage credit risk in the banking sector, predicting the probability of default and non-default is as important as dividing customers into good and bad categories, because every bank sets its own criteria when offering a credit limit, based on the number of applicants who applied for a loan. According to (Baesens, Setiono, Mues and Vanthienen; 2003), evaluating whether a predicted probability is real is a major challenge because the true probability is unknown. To tackle this challenge, (Clover; 2013) proposed the `Sorting Smoothing Method’ to determine the actual probability of default by estimating the variance between the classification accuracy of the data mining methods involved. This approach enabled further research on credit risk evaluation and on predicting the probability of default.

1.1 Research Question: How can data mining support the implementation of a decision system for automatic approval of applicants’ loans in the banking sector by analysing consumer credit payment behaviour?

1.2 Project Purpose: The objective of this project is to implement prediction models using various machine learning algorithms, evaluate the probability value for both default and non-default customers, and finally identify the algorithm with the highest prediction accuracy.

1.3 Paper Structure: This paper is divided into six sections. The first section gives a brief introduction to the research project. The second section summarizes the literature on credit risk evaluation models. The third section describes the methodology and the strategy followed to implement the project. The fourth section gives information about the tools and algorithms used for implementation. The fifth section evaluates and compares the results produced by the implemented models. The last section identifies the best prediction model and suggests future work.

2 Related Work

According to (Krichene and Krichene; 2017), investigation of credit risk assessment advanced after the failure of banks in Asia. Identifying risk in the banking and finance domain is important for acknowledging and reducing future uncertainty for small and medium scale organizations. As stated by (Wu et al.; 2014), Business Intelligence plays an important role in analysing consumer credit data and helps determine the most influential risk parameters. (Kasiyanto; 2016) stated that credit card transaction data is growing significantly due to the rapid expansion of online payment systems. For example, the PayPal online payment system has captured the market globally and had around 170 million customers as of September 2015 (Perez; 2015). In 2009, 2.5 trillion transactions were made from around 200 countries, and card payment transactions across the globe were estimated at around 10000 every second (Source: American Bankers Association, March 2009). In recent years data mining has proven its importance in various areas including consumer behavioural scoring, fraud recognition and risk evaluation. Neural networks have played a significant role in analysing trends in data and uncovering complex associations between parameters (Jiawei and Kamber; 2003).

The financial and banking framework is very complicated across the globe and difficult to understand. This complexity creates a barrier to organizational development. Risk is highly uncertain because it depends directly on the economy. This situation led researchers and analysts to derive predictive models for risk computation. To address it, (Ni; 2010) implemented a component selection model which classifies risks of similar characteristics. This technique chose components from the data based on the likeness between parameters and eliminated unwanted data to refine the prediction result. The method works on the concept of filtering and wrapping. The algorithm assigns a grade value to each selected group and checks recursively for the least error value. Credit risk is a major cause of other risks in the banking sector. It is not feasible to eliminate the risk entirely, but it can be reduced to an acceptable level that gives banks some confidence in making safe transactions. (Shiri et al.; 2012) designed a model for credit risk evaluation and fraud detection, but the model was not suited to determining the intensity of risk, and this gap leaves room for further study of credit risk assessment.

According to (Yu et al.; 2010), various machine learning methods were explored for credit risk assessment, including Decision Tree, Artificial Neural Network and Support Vector Machines. These algorithms were applied and evaluated on German and Australian credit data. In this experiment, the precision and recall of each algorithm were compared to choose the prediction model with the highest accuracy, and the results showed that Support Vector Machines did not give satisfactory outcomes. In data mining, recall is the proportion of actual positive cases that are correctly predicted, and precision is the proportion of predicted positive cases that are correct. (Ghatasheh; 2014) claimed that Support Vector Machines do not produce significant results when the training dataset is small. On the other hand, the Decision Tree method is easy to understand and has a greater capability of predicting outcomes than Support Vector Machines when the training dataset is large. Support Vector Machine is a machine learning technique which analyses data for classification purposes (Cortes and Vapnik; 1995). Decision Tree is also a machine learning technique; it has a tree-like structure containing root, internal and leaf nodes. Each non-leaf node evaluates the input data against a condition, and each leaf assigns a category. The aim of the decision tree algorithm is to design a system which forecasts the value of a target variable by analysing the input dataset (Rokach and Maimon; 2014). In addition, (Yu et al.; 2010) proposed a model for credit risk evaluation that combines the Decision Tree and Support Vector Machine techniques.
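To make the two metrics concrete, the following minimal Python sketch computes precision (the share of predicted positives that are correct) and recall (the share of actual positives that are found). The function name and the toy labels are illustrative assumptions and are not taken from the studies cited above.

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (hypothetical): 1 = default, 0 = non-default.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
precision, recall = precision_recall(actual, predicted)
print(precision, recall)
```

In this toy case three of the four predicted defaulters are real defaulters, and three of the four real defaulters are found, so both metrics come out at 0.75.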

Naive Bayes is a classification technique derived from Bayes’ theorem with the assumption that predictor variables are independent (Zhang; 2004). It is an efficient and commonly used method for building classification rules. (Freitas; 2014) applied the Naive Bayes algorithm to credit score evaluation and examined its importance in credit risk assessment, which further enabled researchers to estimate credit scores from customers’ credit payment behaviour. Naive Bayes was also applied to a Kenyan private bank dataset to improve the effectiveness of the classification model and to evaluate its performance (Wagacha; 2002). In that analysis, the use of appropriate attributes produced effective classification results from the developed classifier. (Malekipirbazari and Aksakalli; 2015) proposed a Random Forest model for credit risk evaluation, but this model did not predict customers’ payment behaviour. Although the Random Forest algorithm performs well in classifying data, the implemented model classified some good customers as bad and vice versa. Random Forest is a machine learning algorithm used for classification of data. The algorithm builds decision trees from the training dataset and assigns the target variable to a specific category based on the decision rule of each tree (Liaw et al.; 2002a).

(Vallini et al.; 2009) applied Multiple Discriminant Analysis (MDA) and Artificial Neural Network (ANN) methods to a dataset of Italian organizations to forecast the possibility of risk for small and medium scale organizations. Both MDA and ANN are data classification techniques in data mining. The prediction accuracy of the two techniques was 65.9% and 68.4% respectively. These results were not strong enough to justify deploying the MDA and ANN models for risk computation. Numerous approaches have been considered for forecasting credit risk, but their complications have not been explained by any considerable accuracy measurement. As stated by (Miguéis et al.; 2013), despite extensive analysis of credit risk evaluation, there is no agreement on the most suitable classification methods to apply. (Baesens, Van Gestel, Viaene, Stepanova, Suykens and Vanthienen; 2003) found that conflicts can arise when comparing the results of various methods. This situation has pushed researchers to continue investigating credit risk assessment. This paper follows the approach of (Clover; 2013) by incorporating various machine learning methods examined in the literature review to implement the prediction model for customer behavioural analysis.

3 Methodology

3.1 Selection of Implementation Strategy

The research project focused on implementing a model to predict customers’ behaviour by analysing individual credit transaction history in the banking sector. Various methodologies were reviewed for developing the predictive model. After the scope of the project was understood, the `Cross Industry Standard Process for Data Mining’, generally known by its short form CRISP-DM (Shearer; 2000), was chosen as the strategy for building the predictive model. This process model is widely used by data mining professionals to find solutions to data mining problems. In a survey undertaken to identify the best model for implementing data mining processes, CRISP-DM received the majority of votes (Piatetsky-Shapiro; 2014). The diagram below represents the flow of the research project implementation.

Figure 1: CRISP-DM Model. (Image Source: Wikipedia)

As shown in the diagram above, the process is divided into six stages. The flow of the model is iterative, so that necessary changes can be made at any stage whenever required.

3.2 Problem Identification and Data Acquisition

Problem understanding is the first step of the project implementation and an important stage for defining the aim of the project. The objective of this project is to protect banks from financial loss. This paper presents a study on minimizing financial loss for banks by evaluating customers’ credit payment history and predicting default customers from the analysis. In line with the project statement, multiple datasets were examined to decide the most appropriate one for the study. The required data had to contain sufficient demographic information about customers and at least 6 months of customers’ credit payment history. The other candidate datasets were smaller than the one used for this study. Among several sources, the credit dataset of a financial institution in Taiwan was selected for this research project. The data was extracted from the `UCI Machine Learning Repository’ (Lichman; 2013). The extracted dataset holds 30000 records and 25 variables.

3.3 Data Preparation

Preparing an accurate dataset is a very important stage in the data analysis process, because using wrong data can lead the analysis down an incorrect path and ultimately produce erroneous output. Hence, preparing quality data for analysis is an important task (Pyle; 1999). For this project, considering the size and number of attributes of the dataset, RStudio, SPSS Modeler and SPSS Statistics were used for data pre-processing, analysis and model building. RStudio was used to check the data for missing and duplicate values. SPSS Statistics was used to encode the variable names; each variable was encoded to a specific value for ease of use. SPSS Modeler was used to define the data types of the variables and to cross-check missing values. The data was confirmed to be in the correct format before any technique was applied, which kept the implementation of the project on a correct path. The graphical output below, generated from RStudio, shows that the data does not contain any missing values.

Figure 2: Graph to check missing values.
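The check for missing and duplicate values described above can be sketched in a few lines. The study used RStudio for this step; the Python version below is only an illustrative stand-in, and the function name, column names and sample rows are assumptions rather than the project's actual script or data.

```python
def audit_rows(rows):
    """Count rows that contain a missing field and rows that exactly repeat an earlier row."""
    seen = set()
    missing = 0
    duplicates = 0
    for row in rows:
        # A field counts as missing when it is None or an empty/blank string.
        if any(value is None or str(value).strip() == "" for value in row.values()):
            missing += 1
        key = tuple(sorted(row.items()))  # order-independent fingerprint of the row
        if key in seen:
            duplicates += 1
        seen.add(key)
    return missing, duplicates


# Hypothetical sample rows; column names are assumed for illustration.
sample = [
    {"ID": 1, "LIMIT_BAL": 20000, "AGE": 24},
    {"ID": 2, "LIMIT_BAL": 120000, "AGE": None},  # missing age
    {"ID": 1, "LIMIT_BAL": 20000, "AGE": 24},     # duplicate of the first row
]
missing, duplicates = audit_rows(sample)
print(missing, duplicates)
```

A dataset that passes the check, as reported for the credit data here, would return zero for the missing count.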

3.4 Outlier Detection

After the check for missing values, an outlier detection test was performed on the input data. In data analytics, outlier detection helps to recognize data entries which differ from the general observation values (Maddala and Lahiri; 1992). For example, a value such as 200 should not appear in the age column. Here, a scatter plot in SPSS Statistics was used to detect outliers, and the result is shown in the graph below. The output suggests that the data does not contain any outlier values. Hence the data is appropriate for further processing.

Figure 3: Outlier Detection
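The study inspected a scatter plot to look for outliers. As a complementary illustration of the same idea, the sketch below flags values that fall outside a plausible range, using the age example from the text. The range check is a swapped-in technique, and the bounds, function name and sample values are assumptions.

```python
def out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside the inclusive range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


# Hypothetical ages; 200 is the implausible entry mentioned in the text.
ages = [24, 35, 200, 41, 79]
flagged = out_of_range(ages, low=18, high=100)  # bounds are assumed, not from the paper
print(flagged)
```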

3.5 Prediction Modelling and Evaluation

The data prepared in the earlier stage is taken as input for the predictive models. The input data is partitioned into training (80%) and testing (20%) data using the partition node in SPSS Modeler. Four machine learning techniques were applied to the credit dataset: Logistic Regression, Bayesian Network, Random Trees and Neural Network. All four models were trained on the training (80%) dataset and later validated on the testing dataset. Demographic statistics and suitable graphs were produced to show important features of the dataset; these are discussed in more detail in the descriptive part of the implementation section. Each model reported the percentage of customers predicted correctly and wrongly. Performance of the implemented prediction models was estimated by comparing the prediction accuracy of each algorithm. The architecture shown below is the designed model of the project implementation; it combines all the prediction models used in this project and was developed with SPSS Modeler.

Figure 4: Predictive Models Architecture

4 Implementation

The implementation procedure of the project is divided into two parts. The first part presents a descriptive analysis of the dataset, and the second part is a comparative study which evaluates the predictive algorithms to identify the one that best predicts a default customer.

The data used in this project is made up of 25 variables. It consists of the customers’ demographic information, payment history for a period of six months, and the total amount of credit given for both individual credit and supplementary credit. The dataset comprises 24 explanatory variables (14 continuous and 10 categorical) and 1 dichotomous dependent variable. More information on the dataset is presented in the table below.

Figure 5: Variables used in the study and their Definition

For instance, the Pay 0 to Pay 6 columns represent the customers’ payment status from April to September. The status is grouped into 10 categories: for example, -2 stands for no consumption, -1 stands for credit that was paid in full, and values of 1 and above stand for the number of months by which payment was delayed.

Data pre-processing was conducted using the data audit node in SPSS Modeler, which found the data to be 100% complete. The data consists of 30000 cases. The partition node in SPSS Modeler was used to split the data into two partitions: 23929 cases (approximately 80% of the data) were used for training and model building, and 6071 cases (approximately 20%) were used for testing and validation of the model.
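The reported partition sizes (23929 and 6071) are close to, but not exactly, 80% and 20% of 30000. That pattern is what one would expect if each case is assigned to a partition independently at random rather than by an exact cut. The sketch below illustrates this kind of assignment; it is an assumption about how such a split can be produced, not a reproduction of the SPSS Modeler node, and the seed and function name are hypothetical.

```python
import random


def partition(n_cases, train_share=0.8, seed=42):
    """Assign each case independently to training (with probability train_share) or testing."""
    rng = random.Random(seed)  # fixed seed (assumed) so the split is repeatable
    train, test = [], []
    for case_id in range(n_cases):
        (train if rng.random() < train_share else test).append(case_id)
    return train, test


train, test = partition(30000)
print(len(train), len(test))
```

Because the assignment is random, the partition sizes vary slightly around 24000 and 6000 from one seed to another.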

4.1 Descriptive Analysis

Descriptive analysis is a method which explains the features of data. It summarizes quantitative data and produces graphs, which helps in understanding patterns in the dataset (Mann; 2007). Descriptive analysis of the dataset was performed with SPSS Statistics. Initially, the demographic variables were checked for null values. Then frequencies were calculated for the demographic and other variables. The output generated from the frequency evaluation is explained below.

The table shown below indicates that none of the demographic categorical variables had missing values.

Figure 6: Valid cases against missing cases

The table below presents frequencies of the demographic variables for both defaulting and non-defaulting customers.

From the table, we can see that 77.9% of the customers were not defaulters and 22.1% were defaulters. More women than men took up credit: 60.4% of customers were female and 39.6% were male. Most of the customers had university as their highest level of education (46.8%), followed by those holding a master’s degree (35.3%) and those who attended high school only (16.4%). The largest group of customers was single (53.2%), followed by married customers (45.5%). Only 1.1% were divorced.

Figure 7: Demographic statistics of the Customers

The table below shows that the average amount of given credit was 167,484 NT dollars with a standard deviation of 129,747 NT dollars. The lowest credit limit was 10,000 NT dollars and the highest was 1,000,000 NT dollars. The average age of those who took up credit was 35 years with a standard deviation of 9 years, which suggests that most of the people who take up credit are middle aged. The youngest customer was 21 years old and the oldest was 79 years old.

Figure 8: Demographic Statistics for Continuous Variables

The table below illustrates that most of the customers (53.2%) were using revolving funds. No customers delayed payment for 9 months or more; the longest delay was 8 months, which accounted for 0.02% of customers.

Figure 9: A Cross Tabulation of the Customers Payment Status and Default Payment from April to September

From the graph below, we can infer that most of the defaulters were using revolving funds. It can also be seen that the likelihood of default increases with the number of months delayed. More than 50% of the customers who delayed for two months or more went on to default.

Figure 10: Distribution of Customers Across Payment Status

4.2 Implementation of Predictive Models

The following are the four algorithms to be evaluated:

1. Logistic Regression

2. Bayesian Network

3. Random Tree

4. Neural Network

Each algorithm’s performance was measured by its overall prediction accuracy. Finally, a conclusion was drawn about the best predictive model. The analysis was conducted using the various nodes in SPSS Modeler.

4.2.1 Logistic Regression

The first prediction model built for the project is Logistic Regression. Logistic Regression is a classification method in data mining and is among the most popular classification techniques. It is mostly used when the target variable is binary, such as good/bad or boy/girl (Walker and Duncan; 1967). A total of 23929 cases (approximately 80% of the data) were used for building the model, and 6071 cases (approximately 20%) were used for testing and validation. Customers were randomly assigned to the two groups using the partition node in SPSS Modeler. The case processing summary indicated that the data had 0% missing values. Using the logistic regression node in SPSS Modeler, an automated forward stepwise procedure was used to arrive at a model containing the strongest predictor variables. At each step, a variable is tested for its importance to the model using a Chi-square test, which evaluates the relationship between variables. The model is designed so that new variables can be added in future if required. When a new variable is included in the model, it is compared with the existing predictor variables to check whether it explains credit default behaviour better. If the newly entered predictor variable is found to predict better, the existing predictor variable is removed from the model. The forward stepwise procedure continues until all the predictor variables have been tested; a variable that satisfies the criteria is included, and otherwise it is removed. During parameter estimation for the model, the variables below were found to be statistically significant.

i. Amount of given credit in NT dollars

ii. Sex

iii. Education level

iv. Marital Status

v. Age in years

vi. Repayment Status in September

vii. Amount of bill statement in August

viii. Amount of previous payment in September

ix. Amount of previous payment in August

x. Amount of previous payment in April
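As an illustration of how a Chi-square test relates a categorical predictor to the default outcome, the sketch below computes the Pearson chi-square statistic for a contingency table. This is one common form of the test, shown under stated assumptions; the exact statistic used inside the stepwise procedure is not specified in the text, and the counts are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the predictor and the outcome were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows = predictor category, columns = (non-default, default).
table = [[90, 10],
         [60, 40]]
stat = chi_square_statistic(table)
print(stat)
```

For a 2x2 table there is one degree of freedom, so a statistic above the 5% critical value of 3.841 would indicate a significant association between the predictor and default.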

By following the simple logistic regression equation, the loan default model is evaluated.

From the above equation, the loan default model is estimated as below.

Where Bi is the coefficient estimated in the variable selection process.
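For reference, the general form of a binary logistic regression model, written with the Bi notation used here, can be sketched as follows. This is the standard textbook form and not the fitted model of this study; the symbols p (probability of default), X_i (the i-th predictor) and k (the number of predictors) are introduced for illustration only.

```latex
\[
\ln\!\left(\frac{p}{1-p}\right) = B_0 + \sum_{i=1}^{k} B_i X_i
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-\left(B_0 + \sum_{i=1}^{k} B_i X_i\right)}}
\]
```

A positive B_i raises the predicted probability of default as X_i increases, and a negative B_i lowers it.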

The table shown below gives the Pearson and Deviance goodness-of-fit results, which test whether the predicted probabilities differ from the observed values.

Figure 11: Goodness of Fit Logistic Regression

As the significance values for both Pearson and Deviance are greater than 0.05, the model fit is appropriate.

4.2.2 Neural Network

The next technique used for model building is Neural Network. It is a machine learning method with a strong capability for identifying and representing relationships between variables. The inspiration behind neural networks is to build an intelligent system which functions like a human brain. The Multilayer Perceptron (MLP) model was followed to develop the predictive model. This procedure trains the system on historical data, learns the association between input and output data, and attempts to generate an outcome when the output is unknown (Yegnanarayana; 2009).

The Neural Net node in SPSS Modeler was used to train the model. The data used for the Neural Network was the same as for the logistic model: 80% training data and 20% testing data. The main objective when building a default customer predictive model is to achieve the highest accuracy. The enhanced model accuracy option was selected to boost the predictive model. Probability was used to enhance model accuracy in determining the most valuable inputs.

The table shown below lists the significant predictor variables, which were identified through the highest-probability-wins technique. Column V5 shows the probabilities for combinations of variable categories; the nearer the value is to 1, the more valuable the variable.

Figure 12: Significant Predictor Variables: Neural Network

Model Gain:

The graph shown below depicts the fitness of the Neural Network model for prediction. The red diagonal line represents a random model and the blue line represents our model. The random model is an assumed baseline that selects customers arbitrarily. The graph shows that the blue line model is better than the red line model in terms of % gain, as the 60th percentile results in a 70.2% gain.

Figure 13: Model Gain: Neural Network
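One way to read the gain figure is that the top 60% of customers, ranked by predicted probability of default, contain 70.2% of all defaulters, whereas a random selection of 60% would be expected to contain about 60%. The sketch below computes such a cumulative gain under that reading. The function name, scores and labels are hypothetical and serve only to illustrate the calculation.

```python
def cumulative_gain(scores, labels, fraction):
    """Share of all positives captured in the top `fraction` of cases ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = int(round(fraction * len(ranked)))  # number of cases in the selected top share
    captured = sum(label for _, label in ranked[:cutoff])
    total = sum(labels)
    return captured / total if total else 0.0


# Hypothetical predicted default probabilities and true labels (1 = default).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
gain_at_60 = cumulative_gain(scores, labels, 0.6)
print(gain_at_60)
```

In this toy example the top 60% of cases capture three of the four defaulters, a gain of 75% against the 60% expected from random selection.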

4.2.3 Bayesian Network

Bayesian Network is a widely preferred method in machine learning for problems that involve uncertainty and in which probability is an important factor (Murphy; 1998). The graph of a Bayesian network consists of nodes and lines. Nodes denote the random parameters and lines denote the associations between parameters. For this project the Bayesian Network could visualize the association between the target variable, default payment, and the predictor variables.


Figure 14: Markov Bayesian Network.

The Bayes Net node in SPSS Modeler was used to build the Bayesian Network predictive model. The data used for training was the same as for the previous models. A Markov Blanket was used to structure the Bayesian Network model, in which the target node is shielded by all its child and parent nodes. `Markov Blanket’ is a supervised method used to form a Bayesian Network, and it assists in predicting the behaviour of the target variable (Pearl; 2014). Figure 14 is the model generated for the credit input dataset.
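In the standard definition, the Markov blanket of a node consists of its parents, its children, and the other parents of its children. The sketch below extracts that set from a small directed graph. The graph structure and node names are purely hypothetical and do not reproduce the network learned in this study.

```python
def markov_blanket(parents_of, target):
    """Markov blanket of `target` in a DAG given as a {node: [parents]} mapping."""
    parents = set(parents_of.get(target, []))
    children = {node for node, ps in parents_of.items() if target in ps}
    # Other parents of the target's children ("co-parents").
    co_parents = {p for child in children for p in parents_of[child]} - {target}
    return parents | children | co_parents


# Hypothetical structure; node names are illustrative only.
dag = {
    "default": ["sex"],
    "pay_sep": ["default", "age"],
    "pay_aug": ["default"],
    "bill_aug": ["pay_aug"],
    "sex": [],
    "age": [],
}
blanket = markov_blanket(dag, "default")
print(sorted(blanket))
```

Here `bill_aug` is left out of the blanket because it is connected to the target only through `pay_aug`.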

The figure shown above represents the Markov Bayesian Network; each box shows the distribution of the respective explanatory variable. The importance of a predictor variable for default customer prediction is represented by the intensity of the blue colour on the bars: the darker the blue, the more important the variable. From the figure above, the following variables were found to be important:

i. Credit repayment status in August

ii. Repayment Status in September

iii. Repayment status in July

iv. Sex

v. Age

vi. April payment Status

vii. April Bill Amount

viii. August bill Amount

ix. May Bill amount

x. July Bill Amount

Model Gain:

The graph shown below depicts the fitness of the Bayesian Network model for prediction. The red diagonal line represents a random model and the blue line represents our model. The graph shows that the blue line model is better than the red line model in terms of % gain, as the 60th percentile results in a 69.2% gain.

Figure 15: Model Gain: Bayesian Network

4.2.4 Random Trees

The Random Trees procedure is an enhanced method of classifying target variables in which the algorithm uses the generated trees to predict the outcome of the target variable for new observations. This method tries to determine the most significant decision rules, those with the highest forecasting rate (Liaw et al.; 2002b).

The Random Trees node in SPSS Modeler was used for growing and pruning the random trees. The data used for training was the same as for the earlier models. The following predictor variables were found to be important from the results.

i. September repayment Status

ii. Amount of Previous Payment in September

iii. Amount of Previous payment in August

iv. Repayment status in June

v. Amount of given credit in NT Dollars

The table shown below gives details of the top decision rules identified by the Random Trees algorithm. The `Interestingness Index’ column represents the probability of accurately predicting a default customer with the decision rule. The top 5 decision rules, ranked by this probability, are displayed by default.

Figure 16: Top Decision Rules for Default Customer Prediction

Model Gain:

The graph shown below depicts the fitness of the Random Trees model for prediction. The model gain was generated using the Graphical Evaluation node in SPSS Modeler. The graph shows that the 60th percentile leads to a 68.65% gain.

Figure 17: Model Gain: Random Trees

5 Evaluation

In this section, all models are evaluated to decide which is best at predicting default customers. The evaluation is based on the prediction accuracy of the four algorithms employed. For each method, the total number of customers predicted correctly and incorrectly is calculated for both the training and the testing dataset. The correct and erroneous predictions are reported as counts and percentages.

Figure 18: Predictive Models Evaluation: Prediction Accuracy

Based on the prediction analysis of the training and testing data above, Logistic Regression produced the best model, with 81.54% correct predictions and 18.46% misclassifications. It was followed closely by the Neural Network, with 81.5% correct predictions and 18.5% misclassifications.
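The accuracy measure used in this comparison can be sketched as below. The study itself used SPSS Modeler, so the helper names `accuracy_report` and `best_model` and the toy labels are hypothetical; the sketch only shows how counts and percentages of correct and wrong predictions are derived and compared across models.

```python
def accuracy_report(actual, predicted):
    """Count correct and wrong predictions and express both as percentages."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = len(actual)
    correct = sum(a == p for a, p in zip(actual, predicted))
    wrong = total - correct
    return {
        "correct": correct,
        "wrong": wrong,
        "correct_pct": 100 * correct / total,
        "wrong_pct": 100 * wrong / total,
    }


def best_model(reports):
    """Return the name of the model with the highest percentage correct."""
    return max(reports, key=lambda name: reports[name]["correct_pct"])


# Hypothetical toy labels (1 = default, 0 = non-default), not the study's data.
toy_actual = [1, 0, 0, 1, 0]
toy_reports = {
    "model_a": accuracy_report(toy_actual, [1, 0, 1, 1, 0]),
    "model_b": accuracy_report(toy_actual, [0, 0, 1, 1, 0]),
}
```

The same report would be computed separately for the training and the testing partition, so that a large gap between the two can be spotted.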

6 Conclusion and Future Work

The main objective of this study was to implement various predictive models for forecasting default as well as non-default customers in banking and other financial organizations, and to approve customer loan applications automatically by analysing an individual’s credit payment behaviour. After a thorough investigation of the various data mining algorithms, the Logistic Regression technique was found to have the highest level of prediction accuracy. The implemented system evaluates the probability of a customer defaulting. This probability can help banks set specific criteria when approving a client’s loan application. In future, the performance of the implemented system could be improved by training the predictive model on a larger dataset than the existing one. Future research should also include more explanatory variables in the model, which could further improve prediction accuracy.
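The probability-based approval criterion described above can be sketched as follows, assuming a fitted logistic regression model. The intercept, coefficients, threshold value and function names are hypothetical placeholders and are not taken from the implemented system; a bank would choose the threshold according to its own risk appetite.

```python
import math


def default_probability(intercept, coefficients, features):
    """Logistic model: probability of default for one applicant."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def loan_decision(p_default, threshold=0.5):
    """Approve when the predicted default probability is below the threshold;
    otherwise refer the application for manual review."""
    return "approve" if p_default < threshold else "refer"


# Hypothetical coefficients and applicant features, for illustration only.
p = default_probability(-1.0, [0.8, 0.3], [1.0, 0.0])
decision = loan_decision(p, threshold=0.5)
```

Lowering the threshold makes approval stricter, at the cost of referring more applicants who would in fact have repaid.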


I would like to express my sincere gratitude to my supervisor, Prof. Keith Maycock, for the continuous support he gave me during the completion of my Master’s thesis. His guidance helped me immensely throughout the research and writing. I will always be thankful to my guide.


Agarwal, S., Ambrose, B. W. and Chomsisengphet, S. (2008). Determinants of automobile

loan default and prepayment.

Baesens, B., Setiono, R., Mues, C. and Vanthienen, J. (2003). Using neural network rule extraction and decision tables for credit-risk evaluation, Management Science 49(3): 312–329.

Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J. and Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring, Journal of the Operational Research Society 54(6): 627–635.

Borodzicz, E. (2005). Risk, crisis and security management, Wiley.

Cios, K. J., Pedrycz, W. and Swiniarski, R. W. (1998). Rough sets, Data Mining Methods for Knowledge Discovery, Springer, pp. 27–71.

Clover, M. (2013). Yeh, tsun-siou lee., The role of credit card behavior in auto loan grant

decision. An application of survival table. Banks and Bank Systems 8(1): 112.

Cortes, C. and Vapnik, V. (1995). Support-vector networks, Machine Learning 20(3): 273–


Freitas, A. A. (2014). Comprehensible classification models: a position paper, ACM SIGKDD Explorations Newsletter 15(1): 1–10.

Ghatasheh, N. (2014). Business analytics using random forest trees for credit risk prediction: A comparison study, International Journal of Advanced Science and Technology 72: 19–30.

Heitfield, E. and Sabarwal, T. (2004). What drives default and prepayment on subprime auto loans?, The Journal of Real Estate Finance and Economics 29(4): 457–477.

Jiawei, H. and Kamber, M. (2003). Data mining: Concepts and techniques, (The Morgan Kaufmann Series in Data Management Systems), vol. 2.

Kasiyanto, S. (2016). Security issues of new innovative payments and their regulatory challenges, Bitcoin and Mobile Payments, Springer, pp. 145–179.

Krichene, A. and Krichene, A. (2017). Using a naive Bayesian classifier methodology for loan risk assessment: Evidence from a Tunisian commercial bank, Journal of Economics, Finance and Administrative Science 22(42): 3–24.

Liaw, A., Wiener, M. et al. (2002a). Classification and regression by randomForest, R News 2(3): 18–22.

Liaw, A., Wiener, M. et al. (2002b). Classification and regression by randomForest, R News 2(3): 18–22.

Lichman, M. (2013). UCI machine learning repository.


Maddala, G. S. and Lahiri, K. (1992). Introduction to econometrics, Vol. 2, Macmillan

New York.

Malekipirbazari, M. and Aksakalli, V. (2015). Risk assessment in social lending via random forests, Expert Systems with Applications 42(10): 4621–4631.

Mann, P. S. (2007). Introductory statistics, John Wiley & Sons.

Miguéis, V. L., Benoit, D. F. and Van den Poel, D. (2013). Enhanced decision support in credit scoring using Bayesian binary quantile regression, Journal of the Operational Research Society 64(9): 1374–1383.

Murphy, K. (1998). A brief introduction to graphical models and Bayesian networks.

Ni, H. (2010). Consumer credit risk evaluation by logistic regression with self-organizing map, Natural Computation (ICNC), 2010 Sixth International Conference on, Vol. 1, IEEE, pp. 205–209.

Pearl, J. (2014). Probabilistic reasoning in intelligent systems: networks of plausible

inference, Morgan Kaufmann.

Perez, S. (2015). PayPal launches PayPal.Me, a simpler way to request money using your own personalized URL.

Piatetsky-Shapiro, G. (2014). Kdnuggets methodology poll.

Pyle, D. (1999). Data preparation for data mining, Vol. 1, Morgan Kaufmann.

Rokach, L. and Maimon, O. (2014). Data mining with decision trees: theory and applications, World Scientific.

Shearer, C. (2000). The CRISP-DM model: the new blueprint for data mining, Journal of Data Warehousing 5(4): 13–22.

Shiri, M. M., Amini, M. T. and Raftar, M. B. (2012). Data mining techniques and predicting corporate financial distress, Interdisciplinary Journal of Contemporary Research in Business 3(12): 61–68.

Vallini, C., Ciampi, F. and Gordini, N. (2009). Using artificial neural networks analysis for small enterprise default prediction modeling: Statistical evidence from Italian firms, 2009 Oxford Business & Economics Conference Proceedings, Association for Business and Economics Research (ABER), pp. 1–26.

Venkatesh, A. and Jacob, S. G. (2016). Prediction of credit-card defaulters: A comparative study on performance of classifiers, International Journal of Computer Applications


Wagacha, P. W. (2002). Machine learning notes on: I. classifier learning and generalization, II. data preparation, III. validation methods, Institute of Computer Science, University of Nairobi, http://www.uonbi.ac.ke/acad_depts/ics/course_material/machine_learning/MLNotes.pdf.

Walker, S. H. and Duncan, D. B. (1967). Estimation of the probability of an event as a function of several independent variables, Biometrika 54(1-2): 167–179.

Wu, D. D., Chen, S.-H. and Olson, D. L. (2014). Business intelligence in risk management: Some recent progresses, Information Sciences 256: 1–7.

Yegnanarayana, B. (2009). Artificial neural networks, PHI Learning Pvt. Ltd.

Yeh, I.-C. and Lien, C.-h. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients, Expert Systems with Applications 36(2): 2473–2480.

Yu, H., Huang, X., Hu, X. and Cai, H. (2010). A comparative study on data mining algorithms for individual credit risk evaluation, Management of e-Commerce and e-Government (ICMeCG), 2010 Fourth International Conference on, IEEE, pp. 35–38.

Zhang, H. (2004). The optimality of naive Bayes, AA 1(2): 3.

