Variable selection in logistic regression using exact optimisation approaches
Abstract
Logistic regression modelling has been, and still remains, one of the most frequently used methods for solving binary classification problems, where the target variable of interest can take on one of two values. Furthermore, the logistic regression model formulation can also be extended to multi-class problems, where the response variable in question assumes more than two categorical levels. The extensive use of logistic regression models can most likely be attributed to the various beneficial properties that these models exhibit over more advanced machine learning algorithms, such as their overall simplicity and their ability to produce descriptive final solutions that are easy to interpret. For this reason, logistic regression modelling is especially popular in application domains such as medical research and the financial industry.
As is the case with most machine learning approaches and other statistical modelling techniques, variable selection is often required when developing a logistic regression model. In fact, in most problem settings the input data set will consist of many potential predictor variables, and it is up to the modeller to find a suitable subset of these features that describes the problem as accurately as possible. Obtaining a model that is based on a smaller set of inputs is generally considered good practice and entails many benefits, such as the ability to yield interpretable final models and to produce predictions that are more stable over time. Many variable selection techniques for logistic regression modelling applications exist, including computationally friendly approaches such as step-wise regression or penalised
regression methods like the lasso and the elastic net. However, the work contained within this thesis is specifically directed towards the concept of best subset selection in regression modelling, which involves selecting a maximum of q variables from a total of p possible features in the input space and subsequently obtaining the optimal q-variable model amongst all possible models consisting of q predictors. Best subset selection is far more resource intensive and time consuming than more conventional variable selection techniques, even for moderately sized data sets; however, it can produce models that are provably optimal. In this thesis, a linearised approximation of the log-likelihood objective function is presented as a potential alternative to the iterative fitting methods employed by logistic regression.
This linearised objective function is solved using linear programming techniques, such as the well-known simplex method. A modified version of the linearised logistic regression model is proposed, which facilitates best subset variable selection. The resulting model is a mixed integer linear programming problem that incorporates a cardinality constraint on the number of variables. The suggested approach retains many attractive properties, such as its ability to quantify the quality of the final variable selection solution, its independence from the subjective choice of p-values inherent to typical step-wise variable selection approaches, and its capability to approach optimality within increasingly short computing times when appropriately configured, even for large input data sets. Computational results are presented to demonstrate the advantages of employing an exact mathematical programming approach towards variable selection in logistic regression applications. Empirical evidence suggests that the resulting model produces accurate and parsimonious solutions that are similar to, or sometimes better than, the benchmark, while still maintaining the beneficial properties listed above. Ultimately, the results documented in this thesis suggest that viable solutions can be obtained for hard optimisation problems, such as best subset selection, within appropriate time frames using an ordinary computer.
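To make the described formulation concrete, the following is a minimal illustrative sketch of a cardinality-constrained, linearised logistic regression problem; it is not the exact model developed in the thesis. The tangent points $\eta_k$, the big-$M$ bound $M$ and the cardinality limit $q$ are illustrative assumptions. The convex term $\log(1+e^{\eta})$ in the negative log-likelihood is approximated from below by tangent cuts, so that the resulting problem is a mixed integer linear programme:

\begin{align*}
\min_{\beta,\, t,\, z} \quad & \sum_{i=1}^{n} \big( t_i - y_i\, x_i^{\top}\beta \big) \\
\text{s.t.} \quad & t_i \ \ge\ \log\!\big(1+e^{\eta_k}\big) + \sigma(\eta_k)\big(x_i^{\top}\beta - \eta_k\big), && i=1,\dots,n,\ \ k=1,\dots,K, \\
& -M z_j \ \le\ \beta_j \ \le\ M z_j, && j=1,\dots,p, \\
& \sum_{j=1}^{p} z_j \ \le\ q, \qquad z_j \in \{0,1\}, && j=1,\dots,p,
\end{align*}

where $\sigma(\eta)=1/(1+e^{-\eta})$ denotes the logistic function. Each binary variable $z_j$ indicates whether predictor $j$ enters the model, the big-$M$ constraints force $\beta_j=0$ whenever $z_j=0$, and the final constraint enforces best subset selection with at most $q$ active coefficients.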