Developing credit scorecards using logistic regression and classification and regression trees
Financial institutions receive thousands of credit applications daily, and consumer credit has become increasingly important in the economy. Credit scoring is the evaluation of the risk associated with granting credit to applicants. It is used to predict the probability that a prospective or current loan applicant will default or become delinquent; in other words, it is used to distinguish between good and bad payers. A scorecard is the tool used in credit scoring: it is a statistical model that considers the correlation between the different characteristics of an applicant’s historic behaviour and tries to predict the applicant’s future behaviour. Various data mining techniques are used to build a scorecard. Before the scorecard is developed, the data needs to be extracted and cleaned. A Masterfile analysis is then conducted to determine the Good/Bad definition; to achieve this, the performance window is used to monitor the accounts opened in that period and determine whether or not they went bad. The sample window is the period used to develop the scorecard. Roll rate analysis is used to confirm the definition. This is done by comparing the worst delinquency status in a specific month 𝑥 with the delinquency status in the next month, and then calculating the percentage of accounts that maintained their delinquency status, “rolled forward” into the next delinquency status, or improved. Once the definition is confirmed, development of the scorecard may begin. Logistic regression is the most commonly used technique in the market for the development of a scorecard. In logistic regression the dependent variable is binary: the event of interest either has or has not occurred. When a credit scoring model is built using logistic regression, outliers are not present because continuous predictors are converted to uniform scores, and no correlation of more than 0.5 may be present between predictors.
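As an illustration of the roll rate analysis described above, the following is a minimal sketch rather than the procedure used in the study. It assumes delinquency statuses are coded as integers (0 = up to date, 1 = one cycle past due, and so on); the function name `roll_rates` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def roll_rates(transitions):
    """Percentage of accounts that improved, stayed or rolled forward.

    transitions: iterable of (status_month_x, status_next_month) pairs,
    using a hypothetical integer coding where a higher number means a
    worse delinquency status.
    """
    counts = defaultdict(Counter)
    for status_now, status_next in transitions:
        if status_next < status_now:
            outcome = "improved"
        elif status_next == status_now:
            outcome = "stayed"
        else:
            outcome = "rolled_forward"
        counts[status_now][outcome] += 1

    rates = {}
    for status_now, outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        rates[status_now] = {
            outcome: 100.0 * outcome_counts[outcome] / total
            for outcome in ("improved", "stayed", "rolled_forward")
        }
    return rates


# Made-up data for illustration only: four accounts start month x in
# status 1 and four start in status 3.
sample = [(1, 0), (1, 1), (1, 2), (1, 2), (3, 4), (3, 4), (3, 4), (3, 2)]
rates = roll_rates(sample)
for status in sorted(rates):
    print(status, rates[status])
```

A status from which a large share of accounts roll forward and few recover is a natural candidate for the "bad" definition.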
The aim of logistic regression is to find the best-fitting model to describe the relationship between the dependent variable and a set of independent variables; the outcome variable of this model is binary. Although logistic regression is the most commonly used statistical technique in building scorecards, other techniques can also be used, such as Classification and Regression Trees (CART). CART is a non-parametric machine learning technique that is generally used in predictive modelling. It is a step-by-step process that constructs a decision tree by either splitting or not splitting each node of the tree into two daughter nodes. CART can discover complex interactions between predictors that might be impossible to detect using traditional techniques. CART uses binary recursive splitting; where the dependent variable is categorical, the tree classifies each observation into the “class” into which the dependent variable falls. Scorecards were developed using the logistic regression and CART methods respectively. Under the methodology proposed in this research, the logistic regression model performed better than the CART model: it produced a stronger Gini, selected variables that were more stable over time, and selected variables that had no correlation between them.
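To make the Gini comparison concrete, the sketch below computes a Gini coefficient from model scores. It assumes that "Gini" here refers to the statistic commonly used in credit scoring, Gini = 2 × AUC − 1, and that each model outputs a predicted probability of being bad. The function name and the data are hypothetical and are not results from the study.

```python
def gini_coefficient(bad_probabilities, is_bad):
    """Gini = 2 * AUC - 1, computed by comparing every bad/good pair.

    bad_probabilities: predicted probability of being bad, per account.
    is_bad: 1 (or True) if the account went bad, otherwise 0 (or False).
    """
    bads = [p for p, bad in zip(bad_probabilities, is_bad) if bad]
    goods = [p for p, bad in zip(bad_probabilities, is_bad) if not bad]
    if not bads or not goods:
        raise ValueError("need at least one good and one bad account")

    # AUC is the share of bad/good pairs in which the bad account has the
    # higher predicted probability; ties count as half.
    concordant = 0.0
    for bad_p in bads:
        for good_p in goods:
            if bad_p > good_p:
                concordant += 1.0
            elif bad_p == good_p:
                concordant += 0.5
    auc = concordant / (len(bads) * len(goods))
    return 2.0 * auc - 1.0


# Made-up scores for illustration only.
probabilities = [0.9, 0.8, 0.3, 0.2, 0.1]
outcomes = [1, 0, 1, 0, 0]
example_gini = gini_coefficient(probabilities, outcomes)
print(round(example_gini, 4))
```

A Gini of 1 means the model ranks every bad account above every good one, while a Gini of 0 means its ranking is no better than random. The pairwise loop is quadratic in the number of accounts, so it suits small illustrations rather than full portfolios.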