Implementation of Decision Support System for Automatic Approval of Loan by Analysing Applicants Credit Payment Behaviour
Abstract
The paper evaluates the behaviour of credit card holders in Taiwan and estimates consumer creditworthiness by employing machine learning techniques, including Logistic Regression, Random Trees, Bayesian Network and Neural Network, on a customer credit dataset. For this research project, the dataset is extracted from the 'UCI Machine Learning Repository' (Lichman; 2013) and then partitioned into training and testing datasets for analysis and evaluation.
The objective of this project is to implement a decision support system that can help organizations approve loans automatically by analysing the credit payment history of customers. To minimize credit risk from a banking perspective, the proposed study concentrates on predicting the probability of default.
Each algorithm selects important predictor variables to train its predictive model. To improve performance, variables other than customer payment status are also taken into consideration when forecasting default customers. The performance of the implemented models is evaluated by comparing the prediction accuracy rate of each model on both the training and testing datasets. Among the four algorithms used, Logistic Regression was observed to have the highest ability to predict default customers.
Keywords: Credit Card, Loan Approval, Machine Learning, Customer Behaviour.
1 Introduction
According to (Yeh and Lien; 2009), credit card holders in Taiwan suffered from a major
credit card debt crisis in 2006, and the crisis was expected to worsen in the third quarter
of that year. To raise their market share, banks in Taiwan exceeded their credit limits and
offered more credit to unqualified applicants. Over the same period, the use of credit cards
for personal needs increased regardless of the holders' payment capacity, which resulted in
the accumulation of large credit balances relative to their personal bank accounts. This
situation made it difficult for both banks and credit card users to maintain a clean cash
flow and gave rise to a critical economic condition (Yeh and Lien; 2009). A well-organized
financial institution focuses more on predicting financial risk than on managing an economic
crisis (Borodzicz; 2005). Financial transactions and customer payment history are the main
sources of information for analysing consumer credit payment behaviour and forecasting
default customers.
Data mining comprises various methods for exploring data and presenting it as meaningful
knowledge (Jiawei and Kamber; 2003). In the domain of information technology, data mining
plays a significant role in identifying trends in data and unseen relationships between its
attributes. Machine learning procedures take the revealed patterns as input for analysis and
can be used for building clusters, classifying data and selecting features (Cios et al.;
1998). According to (Venkatesh and Jacob; 2016), the application of data mining methods in
banking is increasing continuously because machine learning algorithms have a greater
capability of capturing meaningful insight from data. Various classification algorithms fall
under machine learning and can be used to segregate data into proposed categories.
(Venkatesh and Jacob; 2016) stated that credit card transaction data is increasing daily in
the banking sector. In this situation, computer technology plays an important role for banks
in managing credit risk and securing transactions, by applying machine learning methods and
building prediction models for credit risk evaluation.
(Clover; 2013) presented a study on developing an auto loan approval system for banks
to minimize credit risk and gain profit from customer credit. (Agarwal et al.; 2008;
Heitfield and Sabarwal; 2004) claimed that most auto loan approval systems make decisions
based on applicants' demographic and application information, but this data does not capture
consumer credit behaviour, which is more important for an auto loan approval system. To
overcome this problem, banks added consumer credit data to auto loan approval systems so
that default customers could be predicted by analysing customers' credit payment behaviour
(Clover; 2013). To manage credit risk in the banking sector, predicting the probability of
default and non-default is as important as dividing customers into good and bad categories,
because every bank sets its own criteria when offering credit limits to customers, based on
the number of applicants who applied for a loan. According to (Baesens, Setiono, Mues and
Vanthienen; 2003), evaluating whether a predicted probability is realistic is a major
challenge because the true probability is unknown. To tackle this challenge, (Clover; 2013)
proposed the 'Sorting Smoothing Method' to determine the actual probability of default by
estimating the variance between the classification accuracy of the various data mining
methods involved. This approach enabled researchers to carry out further analysis on credit
risk evaluation and on predicting the probability of default.
1.1 Research Question: How can data mining support the implementation of a decision
system for automatic approval of applicants' loans in the banking sector by analysing
consumer credit payment behaviour?
1.2 Project Purpose: The objective of this project is to implement prediction models
using various machine learning algorithms, evaluate the probability value for both default
and non-default customers, and finally identify the algorithm with the maximum prediction
accuracy rate.
1.3 Paper Structure: The research project paper is divided into six sections. The first
section gives a brief introduction to the research project. The second section summarizes
the literature on credit risk evaluation models derived by researchers. The third section
describes the methodology, explaining the strategy followed to implement the project. The
fourth section gives information about the tools and algorithms used for project
implementation. The fifth section evaluates and compares the results produced by the
implemented models. The last section identifies the best prediction model and suggests
future work.
2 Related Work
According to (Krichene and Krichene; 2017), after the failure of banks in the Asian
continent, investigation into credit risk assessment moved a stage ahead. Identifying risk
in the banking and finance domain is an important step in acknowledging and reducing future
uncertainty for small and medium scale organizations. As stated by (Wu et al.; 2014),
Business Intelligence plays an important role in analysing consumer credit data and helps in
determining the most influential parameters of risk. (Kasiyanto; 2016) stated that credit
card transaction data is increasing significantly due to the rapid expansion of online
payment systems. For example, the PayPal online payment system has captured the market
globally and had around 170 million customers as of September 2015 (Perez; 2015). In 2009,
2.5 trillion transactions were made across around 200 countries, and card payment
transactions across the globe were estimated at around 10000 every second (Source: American
Bankers Association, March 2009). In recent years, data mining has proven its importance in
various areas, including consumer behavioural scoring, fraud recognition and risk
evaluation. Neural networks have played a significant role in analysing trends in data and
uncovering complex associations between parameters (Jiawei and Kamber; 2003).
The financial institution and banking framework is very complicated across the globe and
very difficult to understand. This complex framework creates barriers to organizations'
development. Risk is highly uncertain because it depends directly on the economy. This
critical situation led researchers and analysts to derive predictive models for risk
computation. To address this scenario, (Ni; 2010) implemented a component selection model
which groups risks with similar characteristics. This technique chose components from the
data based on the similarity between parameters and eliminated unwanted data to refine the
prediction result. The method works on the concept of filtering and wrapping. The algorithm
assigns a grade value to each selected group and checks recursively for the least error
value. Credit risk is a major cause of other risks in the banking sector. It is not feasible
to eliminate the risk completely, but it can be reduced to an acceptable level that gives
banks some confidence in making safe transactions. (Shiri et al.; 2012) designed a model for
credit risk evaluation and fraud detection, but this model was not suited to determining the
intensity of risk, and this gap leaves room for further study of credit risk assessment.
According to (Yu et al.; 2010), various machine learning methods were explored for
credit risk assessment, including Decision Tree, Artificial Neural Network and Support
Vector Machines. These algorithms were applied and evaluated on German and Australian
credit data. In this experiment, the precision and recall values of each algorithm were
compared to choose the prediction model with the maximum accuracy rate, and the results
showed that Support Vector Machines did not give satisfactory outcomes. In data mining,
precision is the proportion of predicted positive outcomes that are actually positive, and
recall is the proportion of actual positive outcomes that are predicted correctly.
(Ghatasheh; 2014) claimed that Support Vector Machines do not produce significant results
when the training dataset is small. On the other hand, the Decision Tree method is easy to
understand and has a greater capability of predicting outcomes than Support Vector Machines
when the training dataset is large. Support Vector Machine is a machine learning technique
which analyses data for classification purposes (Cortes and Vapnik; 1995). Decision Tree is
also a machine learning technique; it has a tree-like structure containing root, internal
and leaf nodes. Each non-leaf node evaluates the input data against a condition, and the
leaf nodes assign the categories. The aim of the decision tree algorithm is to design a
system which forecasts the value of a target variable by analysing the input dataset
(Rokach and Maimon; 2014). In addition, (Yu et al.; 2010) proposed a model for credit risk
evaluation which combines the Decision Tree and Support Vector Machine techniques.
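The precision and recall measures used in such comparisons can be illustrated with a minimal sketch. The function name, the label encoding (1 for a default customer, 0 otherwise) and the toy labels are assumptions made for illustration; they are not taken from the cited studies.

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for one class, given equal-length label lists."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or never occurs.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical labels: 1 = default customer, 0 = non-default customer.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three predicted defaulters are real defaulters, while only two of the four real defaulters are found, so a model can look precise and still miss many defaults.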
Naive Bayes is a classification technique derived from Bayes' theorem with the assumption
that predictor variables are independent given the class (Zhang; 2004). It is an efficient
and commonly used method for building classification rules. (Freitas; 2014) applied the
Naive Bayes algorithm to credit score evaluation and examined the importance of the Naive
Bayes classification method in credit risk assessment, which further enabled researchers to
estimate credit scores based on customers' credit payment behaviour. Analysts employed the
Naive Bayes method on a Kenyan private bank dataset to improve the effectiveness of the
classification model and to evaluate its performance (Wagacha; 2002). In this analysis, the
use of appropriate attributes produced effective classification results from the developed
classifier. (Malekipirbazari and Aksakalli; 2015) proposed a Random Forest model for credit
risk evaluation, but this model did not predict customers' payment behaviour. Although the
Random Forest algorithm performs well in classifying data, the implemented model classified
good customers into the bad category and vice versa. Random Forest is a machine learning
algorithm used for classification of data. The algorithm builds decision trees from the
training dataset and assigns the target variable to a specific category based on the
decision rule of each tree (Liaw et al.; 2002a).
(Vallini et al.; 2009) applied Multiple Discriminant Analysis (MDA) and Artificial
Neural Network (ANN) methods to a dataset of Italian organizations to forecast the
possibility of risk for small and medium scale organizations. Both MDA and ANN are data
classification techniques in data mining. The prediction accuracy rates of the two
techniques were 65.9% and 68.4% respectively. These results were not strong enough to deploy
the MDA and ANN models for risk computation. Numerous approaches have been considered for
forecasting credit risk, but their complexity is not justified by a considerable gain in
accuracy. As stated by (Miguéis et al.; 2013), despite deep analysis of credit risk
evaluation, there is no agreement on the most suitable classification methods to apply.
(Baesens, Van Gestel, Viaene, Stepanova, Suykens and Vanthienen; 2003) found that conflicts
can arise when comparing the results of various methods. This situation pushed researchers
to continue investigating credit risk assessment. This paper followed the approach of
(Clover; 2013) by incorporating various machine learning methods identified in the
literature review to implement the prediction model for customer behavioural analysis.
3 Methodology
3.1 Selection of Implementation Strategy
The research project focused on implementing a model to predict customers' behaviour
by analysing individual credit transaction history in the banking sector. To develop a
predictive model for the proposed study, various methodologies were reviewed. After
understanding the scope of the project, the 'Cross Industry Standard Process for Data
Mining', generally known by its short form CRISP-DM (Shearer; 2000), was followed to
implement the project. This process model is widely used by data mining professionals. In a
survey undertaken to decide the best model for data mining process implementation, CRISP-DM
received the most votes (Piatetsky-Shapiro; 2014). The diagram below represents the flow of
the research project implementation.
Figure 1: CRISP-DM Model. (Image Source: Wikipedia)
As shown in the above diagram, the process is divided into six stages. The flow of the
model is iterative, so that necessary changes can be made in any stage whenever required.
3.2 Problem Identification and Data Acquisition
Problem understanding is the basic step of this project implementation and is an
important stage for defining the aim of the project. The objective of this project is to
protect banks from financial loss. This paper presents a study on minimizing financial loss
for banks by evaluating customers' credit payment history and predicting default customers
from that analysis. In line with the project statement, multiple datasets were examined to
decide the most appropriate dataset for the proposed study. The data required for the
research project had to contain enough demographic information about customers and at least
6 months of customers' credit payment history. The other datasets considered were small
compared to the data used for this study. Among several sources, the credit dataset of one
financial institution in Taiwan was selected for this research project. The data is
extracted from the 'UCI Machine Learning Repository' (Lichman; 2013). The extracted dataset
holds 30000 records and 25 variables.
3.3 Data Preparation
Preparing an accurate dataset is a very important stage in the data analysis process,
because using wrong data for analysis can lead down an incorrect path and ultimately produce
erroneous output. Hence, preparing quality data for analysis is an important task (Pyle;
1999). For this project, considering the size and number of attributes of the dataset,
RStudio, SPSS Modeler and SPSS Statistics were used for data pre-processing, analysis and
model building. RStudio was used to check the data for missing and duplicate values. SPSS
Statistics was used to encode the variable names; each variable was encoded to a specific
value for ease of use. SPSS Modeler was used to define the data type of the variables and to
cross-check missing values. The data was verified to be in the correct format before any
techniques were applied, keeping the implementation of the project on a correct path. Below
is the graphical output generated from RStudio to check for missing values; the graph shows
that the data does not contain any missing values.
Figure 2: Graph to check missing values.
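The missing-value and duplicate checks described above were carried out in RStudio and SPSS Modeler. As a rough illustration of what such a check involves, the sketch below uses plain Python on a list of row dictionaries; the function name, the column names and the sample rows are hypothetical and are not the project's actual code or data.

```python
def audit_rows(rows):
    """Count missing values per column and fully duplicated rows.

    rows is a list of dicts mapping column name -> value; None and the empty
    string are treated as missing.
    """
    missing = {}
    seen = set()
    duplicates = 0
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] = missing.get(column, 0) + 1
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return missing, duplicates


# Hypothetical sample rows, for illustration only.
sample = [
    {"ID": 1, "AGE": 24, "LIMIT_BAL": 20000},
    {"ID": 2, "AGE": None, "LIMIT_BAL": 120000},
    {"ID": 1, "AGE": 24, "LIMIT_BAL": 20000},
]
print(audit_rows(sample))
```

A dataset that passes the check, as reported for the credit data, would return an empty dictionary of missing counts and zero duplicates.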
3.4 Outlier Detection
After checking for missing values, an outlier detection test was performed on the input
data. In data analytics, outlier detection is a test which helps to recognize data entries
that differ from the general observation values (Maddala and Lahiri; 1992). For example, a
value such as 200 should not appear in the age column. Here, a scatter plot in SPSS
Statistics is used to detect outliers, and the result is shown in the graph below. From the
output generated, the data does not appear to contain any outlier values. Hence the data is
appropriate for further processing.
Figure 3: Outlier Detection
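The study inspected outliers visually with a scatter plot. A simple rule-based alternative, shown here only as a sketch, is to flag values that fall outside a plausible range for each column. The ranges, column names and sample rows below are assumptions chosen for illustration and are not thresholds used in the project.

```python
# Hypothetical plausible ranges per column (inclusive); not taken from the study.
PLAUSIBLE_RANGES = {
    "AGE": (18, 100),
    "LIMIT_BAL": (0, 5_000_000),
}


def find_outliers(rows, ranges=PLAUSIBLE_RANGES):
    """Return (row_index, column, value) for every value outside its range."""
    flagged = []
    for index, row in enumerate(rows):
        for column, (low, high) in ranges.items():
            value = row.get(column)
            if value is not None and not (low <= value <= high):
                flagged.append((index, column, value))
    return flagged


# Hypothetical rows: the second one contains the implausible age of 200.
sample = [
    {"AGE": 35, "LIMIT_BAL": 167000},
    {"AGE": 200, "LIMIT_BAL": 50000},
]
print(find_outliers(sample))
```

A range check only catches impossible values; unusual but valid values would still need a plot or a statistical rule to be noticed.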
3.5 Prediction Modelling and Evaluation
The data prepared in the earlier stage is taken as input for building the predictive
models. The input data is partitioned into training (80%) and testing (20%) data using the
Partition node in SPSS Modeler. Four machine learning techniques were applied to the credit
dataset: Logistic Regression, Bayesian Network, Random Trees and Neural Network. All four
models were trained on the training (80%) dataset and later validated on the testing
dataset. Demographic statistics and suitable graphs were produced to show important features
of the dataset; these are discussed in more detail in the descriptive part of the
implementation section. Each model reported the percentage of customers predicted correctly
and wrongly. The performance of the implemented prediction models was estimated by comparing
the prediction accuracy rate of each algorithm. The architecture shown below is the designed
model of the project implementation and combines all the prediction models used in this
project. This architecture was developed with the help of SPSS Modeler.
Figure 4: Predictive Models Architecture
4 Implementation
The implementation procedure of the project is divided into two parts. The first part
presents a descriptive analysis of the dataset, and the second part involves a comparative
study which evaluates the predictive algorithms in order to identify the algorithm that best
predicts a default customer.
The data used in this project for analysis is made up of 25 variables. It consists of the
customers' demographic information, payment history for a period of six months, and the
total amount of credit given, covering both individual credit and supplementary credit. The
dataset consists of a total of 24 explanatory variables (14 continuous variables and 10
categorical variables) and 1 dichotomous dependent variable. More information on the
dataset is presented in the table below.
Figure 5: Variables used in the study and their Definition
For instance, the Pay 0 to Pay 6 columns represent the customers' payment status from
April to September. The status has been categorized into 10 categories based on the
payment status of the credit: -2 stands for no consumption, -1 stands for a loan that was
paid in full, and values of 1 and above stand for credit that has been delayed by that
number of months.
Data pre-processing was conducted using the Data Audit node in SPSS Modeler, where the
data was found to be 100% complete. The data consists of 30000 cases. The Partition node in
SPSS Modeler was used to split the data into two partitions: approximately 80% of the data
(23929 cases) was used for training and model building, and approximately 20% of the data
(6071 cases) was used for testing and validation of the model.
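The reported split of 23929 and 6071 cases is close to, but not exactly, 80% and 20% of 30000. One plausible explanation, assumed here rather than confirmed by the source, is that each case is assigned to a partition independently at random with probability 0.8, so the partition sizes vary slightly around their targets. The sketch below shows that kind of assignment; the seed and function name are arbitrary choices for illustration.

```python
import random


def partition_cases(n_cases, train_share=0.8, seed=42):
    """Assign each case index to training or testing independently at random."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for case_id in range(n_cases):
        if rng.random() < train_share:
            train.append(case_id)
        else:
            test.append(case_id)
    return train, test


train, test = partition_cases(30000)
print(len(train), len(test), round(100 * len(train) / 30000, 2))
```

With this scheme the training share lands near 80% without being exactly 24000 cases, which matches the pattern of the figures reported above (23929 / 30000 is about 79.76%).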
4.1 Descriptive Analysis
Descriptive analysis is a method which explains features of data. It summarizes
quantitative data and produces graphs for the same, which helps in understanding patterns in
the dataset (Mann; 2007). Descriptive analysis for the dataset was performed with SPSS
Statistics. Initially, the demographic variables were checked for null values. Further,
frequencies for demographic and other variables were calculated. The output generated from
the frequency evaluation is explained below.
The table shown below indicates that none of the demographic categorical variables had
missing values.
Figure 6: Valid cases against missing cases
The table below presents frequencies of the demographic variables, for both default and
non-default customers.
From the table, we can see that 77.9% of the customers were not defaulters and 22.1% were
defaulters. It can be deduced from the table that more females than males take up credit:
60.4% were female and 39.6% were male. We can also see that most of the people who took up
credit had university as their highest level of education (46.8%), followed by those who
hold a master's degree at 35.3% and those who attended high school at only 16.4%. The table
shows that the largest group of people who acquired credit were single at 53.2%, followed by
married people at 45.5%. Those who were divorced made up only 1.1%.
Figure 7: Demographic statistics of the Customers
The table below shows that the average credit limit was 167,484 NT dollars with a
standard deviation of 129,747 NT dollars. The lowest credit limit was 10,000 NT dollars and
the highest was 1,000,000 NT dollars. The average age of those who took up credit was 35
years with a standard deviation of 9 years. It can be inferred that most of the people who
take up credit are middle aged. The youngest person who took up credit was 21 years old and
the oldest was 79 years old.
Figure 8: Demographic Statistics for Continuous Variables
The table below illustrates that most of the customers (53.2%) were using revolving
funds. No customers delayed payment for 9 months or more; the longest delay was 8 months,
for 0.02% of customers.
Figure 9: A Cross Tabulation of the Customers' Payment Status and Default Payment from April to September
From the graph below, we can infer that most of the defaulters were using revolving
funds. It can also be seen that as the number of months of delay increases, the chance of
default also increases. We can also see that more than 50% of the customers who delayed for
two months or more went on to default.
Figure 10: Distribution of Customers Across Payment Status
4.2 Implementation of Predictive Models
The following are the four algorithms to be evaluated:
1. Logistic Regression
2. Bayesian Network
3. Random Tree
4. Neural Network
The performance of each algorithm was measured by its overall prediction accuracy.
Finally, a conclusion was drawn on the best predictive model. The analysis was conducted
using the various nodes in SPSS Modeler.
4.2.1 Logistic Regression
The first prediction model built for the project is Logistic Regression. Logistic
Regression is a classification method in data mining and is among the most popular
classification techniques. This method is mostly used when the target variable is binary,
such as good/bad or boy/girl (Walker and Duncan; 1967). A total of 23929 cases
(approximately 80% of the data) were used for building the model and 6071 cases
(approximately 20% of the data) were used for testing and validation. Customers were
randomly assigned to the two groups using the Partition node in SPSS Modeler. The case
processing summary indicated that the data had 0% missing values. Using the Logistic
Regression node in SPSS Modeler, an automated forward stepwise procedure was used to arrive
at a model containing the strongest predictor variables. At each step, a variable is tested
for its importance to the model using the Chi-square test, which evaluates the relationship
between variables. The model is designed so that new variables can be added in future if
required. When a new variable is included in the model, it is compared with the existing
predictor variables to check whether it explains the behaviour of credit default better. If
the newly entered predictor variable is found to be better in terms of prediction, the
existing predictor variable is removed from the model. The forward stepwise procedure
continues until all the predictor variables have been tested; each variable is included if
it satisfies the criteria and removed otherwise. During parameter estimation for the model,
the variables below were found to be statistically significant.
i. Amount of given credit in NT dollars
ii. Sex
iii. Education level
iv. Marital Status
v. Age in years
vi. Repayment Status in September
vii. Amount of bill statement in August
viii. Amount of previous payment in September
ix. Amount of previous payment in August
x. Amount of previous payment in April
By following the simple logistic regression equation, the loan default model is evaluated.
From the above equation, the Loan Default Model is estimated as below,
where Bi is the coefficient estimated in the variable selection process.
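The equations referred to here are not reproduced in this text, and the estimated coefficient values are not stated. For orientation only, the standard form of a binary logistic regression model, written with the coefficients Bi mentioned above, is given below. The intercept B_0 and the notation x_1, ..., x_k for the selected predictor variables are assumptions of this sketch, not the authors' fitted model.

```latex
\[
P(\text{default} = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-\left(B_0 + \sum_{i=1}^{k} B_i x_i\right)}}
\qquad\Longleftrightarrow\qquad
\ln\frac{P}{1 - P} = B_0 + \sum_{i=1}^{k} B_i x_i
\]
```

In this form a positive Bi raises the predicted probability of default as x_i increases, and a negative Bi lowers it.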
The table shown below gives the Pearson and Deviance goodness-of-fit results, which test
whether the predicted probabilities differ from the observed values.
Figure 11: Goodness of Fit Logistic Regression
As the significance values are greater than 0.05 for both Pearson and Deviance, the
model fit is appropriate.
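To make the two goodness-of-fit statistics more concrete, the sketch below computes the Pearson chi-square and the deviance for grouped binary data using their textbook formulas. The grouping into (cases, defaults, predicted probability) triples and the numbers used are hypothetical. How SPSS forms its groups and derives the significance values is not shown in the source, and the p-value step, which needs the chi-square distribution, is left out.

```python
import math


def goodness_of_fit(groups):
    """Return (pearson, deviance) for grouped binary outcomes.

    groups is a list of (n_cases, n_defaults, predicted_probability) tuples,
    with every predicted probability strictly between 0 and 1.
    """
    pearson = 0.0
    deviance = 0.0
    for n, y, p in groups:
        if not 0.0 < p < 1.0:
            raise ValueError("predicted probability must lie strictly between 0 and 1")
        expected = n * p
        pearson += (y - expected) ** 2 / (n * p * (1 - p))
        # Terms with a zero observed count contribute nothing to the deviance.
        if y > 0:
            deviance += 2 * y * math.log(y / expected)
        if y < n:
            deviance += 2 * (n - y) * math.log((n - y) / (n * (1 - p)))
    return pearson, deviance


# Hypothetical group: 100 customers, 30 observed defaults, predicted default rate 0.2.
print(goodness_of_fit([(100, 30, 0.2)]))
```

Both statistics are zero when observed and expected counts agree exactly and grow as predictions drift away from the observations, which is why large significance values are read as an acceptable fit.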
4.2.2 Neural Network
The next technique used for model building is Neural Network. It is a machine learning
method with a strong capability of identifying and representing relationships between
variables. The inspiration behind neural networks is to build an intelligent system which
functions like a human brain. The Multilayer Perceptron (MLP) model is used to develop the
predictive model. This procedure trains the system using historical data, learns the
association between input and output data, and attempts to generate an outcome when the
output is unknown (Yegnanarayana; 2009).
The Neural Net node in SPSS Modeler was used to train the model. The data used for
training the Neural Network was the same as that used for the logistic model: 80% training
data and 20% testing data. The main objective in building a default customer predictive
model is to achieve the highest accuracy. The enhanced model accuracy option was selected to
boost the predictive model. Probability was used to enhance model accuracy in determining
the most valuable inputs.
The table shown below presents the significant predictor variables, which were obtained
through the highest-probability-wins technique. Column V5 shows the probabilities of the
category combinations of variables; the nearer the value is to 1, the more valuable the
variable.
Figure 12: Significant Predictor Variables - Neural Network
Model Gain:
The graph shown below depicts the fitness of the Neural Network model for prediction.
The red diagonal line represents a random model and the blue line represents our model.
The random model is an arbitrary assumed baseline model. The graph shows that the blue-line
model is better than the red-line model in terms of % gain, as the 60th percentile yields a
70.2% gain.
Figure 13: Model Gain: Neural Network
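A gain figure of this kind is commonly read as the share of all defaulters found within the top-scoring percentage of customers. Assuming that reading, the sketch below computes a cumulative gain from actual labels and model scores. The function name and toy data are hypothetical, and the exact computation behind the SPSS chart is not given in the source.

```python
def cumulative_gain(actual, scores, percentile):
    """Percentage of all positives captured in the top `percentile` percent of cases.

    actual holds 1 for a default customer and 0 otherwise; scores holds the
    model's predicted default probability for the same cases.
    """
    total_positives = sum(actual)
    if total_positives == 0:
        raise ValueError("gain is undefined when there are no positive cases")
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    cutoff = round(len(ranked) * percentile / 100)
    captured = sum(label for _, label in ranked[:cutoff])
    return 100 * captured / total_positives


# Hypothetical scores for ten customers, three of whom defaulted.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(cumulative_gain(actual, scores, 30))
```

A random ordering would capture about 30% of defaulters in the top 30% of cases, which is the diagonal baseline the model's curve is compared against.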
4.2.3 Bayesian Network
Bayesian Network is a widely preferred machine learning method for problems that involve
uncertainty and in which probability is an important factor (Murphy; 1998). The graph of a
Bayesian network consists of nodes and lines. Nodes denote the random variables and lines
denote the associations between them. For this project, the Bayesian Network could visualize
the association between the target variable, default payment, and the predictor variables.
Figure 14: Markov Bayesian Network.
The Bayes Net node was used in SPSS Modeler to build the Bayesian Network predictive
model. The data used for training was the same as for the previous models. A Markov Blanket
was used to structure the Bayesian Network model, in which the target node is shielded by
all of its child and parent nodes. The 'Markov Blanket' is a supervised method used to form
a Bayesian Network and it assists in predicting the behaviour of the target variable (Pearl;
2014). Figure 14 is the model generated for the credit input dataset.
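As an illustration of the Markov blanket idea, the sketch below extracts the blanket of a target node from a directed graph. It follows the usual textbook definition, which includes the parents of the target's children in addition to the parents and children named above. The graph and its node names are hypothetical and do not reproduce the network in Figure 14.

```python
def markov_blanket(parents_of, target):
    """Return the Markov blanket of `target` in a directed acyclic graph.

    parents_of maps each node to the list of its parent nodes. The blanket is
    the target's parents, its children, and the other parents of those children.
    """
    parents = set(parents_of.get(target, ()))
    children = {node for node, node_parents in parents_of.items() if target in node_parents}
    co_parents = {parent for child in children for parent in parents_of[child]}
    return (parents | children | co_parents) - {target}


# Hypothetical network structure, for illustration only.
network = {
    "default": ["sex"],
    "pay_sep": ["default", "age"],
    "bill_aug": ["pay_sep"],
    "sex": [],
    "age": [],
}
print(sorted(markov_blanket(network, "default")))
```

Here "bill_aug" falls outside the blanket because it is connected to the target only through "pay_sep", so it adds no information once the blanket variables are known.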
The figure shown above represents the Markov Bayesian Network; each box shows the
distribution of the respective explanatory variable. The importance of a predictor variable
for default customer prediction is represented by the intensity of the blue colour on the
bars: the darker the blue, the more important the variable. From the figure above, the
following variables were found to be important:
i. Repayment status in August
ii. Repayment status in September
iii. Repayment status in July
iv. Sex
v. Age
vi. Payment status in April
vii. Bill amount in April
viii. Bill amount in August
ix. Bill amount in May
x. Bill amount in July
Model Gain:
The graph shown below depicts the fitness of the Bayesian Network model for prediction.
The red diagonal line represents a random model and the blue line represents our model.
The graph shows that the blue-line model is better than the red-line model in terms of %
gain, as the 60th percentile yields a 69.2% gain.
Figure 15: Model Gain: Bayesian Network
4.2.4 Random Trees
The Random Trees procedure is an enhanced method of classifying target variables, in
which the algorithm uses the generated trees to predict the outcome of the target variable
for new observations. This method tries to determine the most significant decision rules,
those with the highest forecasting rate (Liaw et al.; 2002b).
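The general idea of predicting with many randomly grown trees can be sketched with a heavily simplified stand-in: an ensemble of one-split trees (decision stumps), each trained on a bootstrap sample and a randomly chosen feature, combined by majority vote. This is not the SPSS Random Trees implementation. The function names, the toy rows of (months of delay, previous payment amount) and the labels are assumptions made for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that gives the fewest training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for high_label in (0, 1):  # label predicted when value >= threshold
            errors = sum(
                (high_label if row[feature] >= threshold else 1 - high_label) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, high_label)
    return feature, best[1], best[2]


def predict_stump(stump, row):
    feature, threshold, high_label = stump
    return high_label if row[feature] >= threshold else 1 - high_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each using one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample], feature)
        )
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical rows: (months of delay, previous payment amount); 1 = default.
rows = [(0, 5000), (0, 4000), (1, 3000), (3, 500), (4, 0), (2, 200)]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (5, 0)), predict_forest(forest, (0, 6000)))
```

Averaging many weak, differently trained trees is what makes the ensemble more stable than any single tree; a production implementation would grow deeper trees and consider several features at each split.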
The Random Trees node in SPSS Modeler was used for growing and pruning the random
trees. The data used for training was the same as for the earlier models. The following
predictor variables were found to be important from the results.
i. September repayment Status
ii. Amount of Previous Payment in September
iii. Amount of Previous payment in August
iv. Repayment status in June
v. Amount of given credit in NT Dollars
The table shown below gives details of the top decision rules identified by the Random
Trees algorithm. The 'Interestingness Index' column in the table represents the probability
of accurate prediction of a default customer, derived from the decision rule. Based on this
prediction accuracy probability, the top 5 decision rules are displayed by default.
Figure 16: Top Decision Rules for Default Customer Prediction
Model Gain:
The graph shown below depicts the fitness of the Random Trees model for prediction.
Using the Graphical Evaluation node in SPSS Modeler, the model gain was generated for the
Random Trees model. The graph shows that the 60th percentile leads to a 68.65% gain.
Figure 17: Model Gain: Random Trees
5 Evaluation
In this section all models were evaluated to decide the best model for predicting default
customers. The evaluation is based on the prediction accuracy of all four algorithms
employed. For each method, the total number of customers predicted accurately and wrongly is
calculated for both the training and testing datasets. The accurate and erroneous
predictions are reported as counts and percentages.
Figure 18: Predictive Models Evaluation: Prediction Accuracy
From the above prediction analysis of both the training and testing data, Logistic
Regression produced the best model, with 81.54% correct predictions and 18.46%
misclassifications. It was followed closely by the Neural Network, with 81.5% correct
predictions and 18.5% misclassifications.
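The counts and percentages used in this comparison follow directly from the actual and predicted labels. A minimal sketch of the calculation is given below; the function name `accuracy_report` and the toy labels are hypothetical and are not taken from the study's data.

```python
def accuracy_report(y_true, y_pred):
    """Count correct and wrong predictions and express both as percentages."""
    total = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    wrong = total - correct
    return {
        "correct": correct,
        "wrong": wrong,
        "correct_pct": round(100.0 * correct / total, 2),
        "wrong_pct": round(100.0 * wrong / total, 2),
    }


# Hypothetical toy labels: 1 = default, 0 = non-default.
test_true = [1, 0, 0, 1, 0, 0, 0, 1]
test_pred = [1, 0, 0, 0, 0, 0, 1, 1]

report = accuracy_report(test_true, test_pred)
print(report)
```

The same function would be applied once to the training partition and once to the testing partition for each model; the correct and wrong percentages always sum to 100, as the 81.54% and 18.46% reported for Logistic Regression do.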
6 Conclusion and Future Work
The main objective of this study was to implement various predictive models for
forecasting default and non-default customers in banking and other financial
organizations, and to approve customer loan applications automatically by analysing
an individual’s credit payment behaviour. After a thorough investigation of the
various data mining algorithms, the Logistic Regression technique was found to have
the highest level of prediction accuracy. The implemented system evaluates the
probability that a customer will default. This probability can help banks set specific
criteria for approving a client’s loan application. In future, the performance of the
implemented system could be improved by training the predictive models on a larger
dataset than the existing one. Future research should also include more explanatory
variables in the model, which would go a long way towards improving prediction accuracy.
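As an illustration of how a default probability could be turned into an approval criterion, the sketch below applies a logistic function to a weighted sum of predictors and compares the result with a bank-chosen threshold. The weights, bias, features and threshold are hypothetical placeholders, not the coefficients estimated in this study.

```python
import math


def default_probability(features, weights, bias):
    """Logistic model: probability of default from a weighted sum of predictors."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def loan_decision(p_default, threshold=0.5):
    """Reject when the estimated default probability reaches the threshold."""
    return "reject" if p_default >= threshold else "approve"


# Hypothetical applicant: [repayment delay in months, scaled previous payment].
weights = [1.2, -0.8]   # assumed coefficients, for illustration only
bias = -1.0             # assumed intercept, for illustration only
applicant = [2, 0.5]

p = default_probability(applicant, weights, bias)
print(round(p, 3), loan_decision(p, threshold=0.5))
```

A more risk-averse bank would lower the threshold, rejecting more applications in exchange for fewer approved defaulters.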
Acknowledgements
I would like to express my sincere gratitude to my supervisor, Prof. Keith Maycock,
for the continuous support he gave me during the completion of my Masters Thesis.
His guidance helped me immensely throughout the research and writing. I will always
be thankful to him.
References
Agarwal, S., Ambrose, B. W. and Chomsisengphet, S. (2008). Determinants of automobile
loan default and prepayment.
Baesens, B., Setiono, R., Mues, C. and Vanthienen, J. (2003). Using neural network
rule extraction and decision tables for credit-risk evaluation, Management Science
49(3): 312–329.
Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J. and Vanthienen,
J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring,
Journal of the Operational Research Society 54(6): 627–635.
Borodzicz, E. (2005). Risk, crisis and security management, Wiley.
Cios, K. J., Pedrycz, W. and Swiniarski, R. W. (1998). Rough sets, Data Mining Methods
for Knowledge Discovery, Springer, pp. 27–71.
Clover, M. (2013). Yeh, tsun-siou lee., The role of credit card behavior in auto loan grant
decision. An application of survival table. Banks and Bank Systems 8(1): 112.
Cortes, C. and Vapnik, V. (1995). Support-vector networks, Machine Learning 20(3):
273–297.
Freitas, A. A. (2014). Comprehensible classification models: a position paper, ACM
SIGKDD Explorations Newsletter 15(1): 1–10.
Ghatasheh, N. (2014). Business analytics using random forest trees for credit risk
prediction: A comparison study, International Journal of Advanced Science and
Technology 72: 19–30.
Heitfield, E. and Sabarwal, T. (2004). What drives default and prepayment on subprime
auto loans?, The Journal of Real Estate Finance and Economics 29(4): 457–477.
Jiawei, H. and Kamber, M. (2003). Data mining: Concepts and techniques, The Morgan
Kaufmann Series in Data Management Systems, Vol. 2.
Kasiyanto, S. (2016). Security issues of new innovative payments and their regulatory
challenges, Bitcoin and Mobile Payments, Springer, pp. 145–179.
Krichene, A. and Krichene, A. (2017). Using a naive Bayesian classifier methodology for
loan risk assessment: Evidence from a Tunisian commercial bank, Journal of Economics,
Finance and Administrative Science 22(42): 3–24.
Liaw, A., Wiener, M. et al. (2002a). Classification and regression by randomForest, R
News 2(3): 18–22.
Liaw, A., Wiener, M. et al. (2002b). Classification and regression by randomForest, R
News 2(3): 18–22.
Lichman, M. (2013). UCI machine learning repository.
URL: http://archive.ics.uci.edu/ml
Maddala, G. S. and Lahiri, K. (1992). Introduction to econometrics, Vol. 2, Macmillan
New York.
Malekipirbazari, M. and Aksakalli, V. (2015). Risk assessment in social lending via
random forests, Expert Systems with Applications 42(10): 4621–4631.
Mann, P. S. (2007). Introductory statistics, John Wiley & Sons.
Miguéis, V. L., Benoit, D. F. and Van den Poel, D. (2013). Enhanced decision support
in credit scoring using Bayesian binary quantile regression, Journal of the Operational
Research Society 64(9): 1374–1383.
Murphy, K. (1998). A brief introduction to graphical models and Bayesian networks.
Ni, H. (2010). Consumer credit risk evaluation by logistic regression with self-organizing
map, Natural Computation (ICNC), 2010 Sixth International Conference on, Vol. 1,
IEEE, pp. 205–209.
Pearl, J. (2014). Probabilistic reasoning in intelligent systems: networks of plausible
inference, Morgan Kaufmann.
Perez, S. (2015). PayPal launches PayPal.me, a simpler way to request money using your
own personalized URL.
Piatetsky-Shapiro, G. (2014). KDnuggets methodology poll.
Pyle, D. (1999). Data preparation for data mining, Vol. 1, Morgan Kaufmann.
Rokach, L. and Maimon, O. (2014). Data mining with decision trees: theory and
applications, World Scientific.
Shearer, C. (2000). The CRISP-DM model: the new blueprint for data mining, Journal of
Data Warehousing 5(4): 13–22.
Shiri, M. M., Amini, M. T. and Raftar, M. B. (2012). Data mining techniques and
predicting corporate financial distress, Interdisciplinary Journal of Contemporary
Research in Business 3(12): 61–68.
Vallini, C., Ciampi, F. and Gordini, N. (2009). Using artificial neural networks analysis
for small enterprise default prediction modeling: Statistical evidence from Italian firms,
2009 Oxford Business & Economics Conference Proceedings, Association for Business
and Economics Research (ABER), pp. 1–26.
Venkatesh, A. and Jacob, S. G. (2016). Prediction of credit-card defaulters: A
comparative study on performance of classifiers, International Journal of Computer
Applications 145(7).
Wagacha, P. W. (2002). Machine learning notes on: I. classifier learning and
generalization, II. data preparation, III. validation methods, Institute of Computer
Science, University of Nairobi,
http://www.uonbi.ac.ke/acad depts/ics/course material/machine learning/MLNotes.pdf.
Walker, S. H. and Duncan, D. B. (1967). Estimation of the probability of an event as a
function of several independent variables, Biometrika 54(1-2): 167–179.
Wu, D. D., Chen, S.-H. and Olson, D. L. (2014). Business intelligence in risk management:
Some recent progresses, Information Sciences 256: 1–7.
Yegnanarayana, B. (2009). Artificial neural networks, PHI Learning Pvt. Ltd.
Yeh, I.-C. and Lien, C.-h. (2009). The comparisons of data mining techniques for the
predictive accuracy of probability of default of credit card clients, Expert Systems with
Applications 36(2): 2473–2480.
Yu, H., Huang, X., Hu, X. and Cai, H. (2010). A comparative study on data mining
algorithms for individual credit risk evaluation, Management of e-Commerce and
e-Government (ICMeCG), 2010 Fourth International Conference on, IEEE, pp. 35–38.
Zhang, H. (2004). The optimality of naive Bayes, AA 1(2): 3.