An experimental analysis of handling missing data: multiple imputation and maximum likelihood approaches
Abstract
The study evaluated the performance of two novel missing data handling methods, multiple imputation (MI) and maximum likelihood (ML). These methods are seldom compared with each other in the literature; instead, they are usually compared with traditional methods such as list-wise deletion and mean substitution. The intent of the study was to use the model selection criteria of multiple linear regression (MLR) analysis to determine whether there is bias between the actual selection criteria (from the complete dataset) and the estimated selection criteria (obtained after applying MI and ML). In addition, the study assessed whether the severity of bias varies between MI and ML, and whether the degree of missingness affects the performance of these missing data handling methods. The data used in the study were collected by Statistics South Africa (Stats SA) through the Income and Expenditure Survey (IES). A total of 25328 observations were used. The missingness mechanism of interest was missing at random (MAR). Ten
datasets were obtained from the original dataset through a simulation of 2 MAR scenarios, with each scenario comprising 5 datasets. The datasets under each MAR scenario differed in the degree of missingness, which ranged from 10% to 50%. The simulated missing data were addressed using MI and ML, and step-wise regression was used to estimate the model parameters and selection criteria after the missing data were addressed. The algorithms used for handling missing data were Full Information Maximum Likelihood (FIML)
and Expectation Maximisation (EM) for ML, and Fully Conditional Specification (FCS) and Markov Chain Monte Carlo (MCMC) for MI. The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) were chosen to represent the efficient and consistent selection criteria respectively. Selection criteria from the complete dataset were used only for comparison purposes. A lower absolute error in AIC or BIC indicated that an algorithm performed better than the others. The results revealed that, for both MAR scenarios, FIML performed better when 10% of the data were missing, but its error rate increased as the degree of missingness increased. The EM algorithm generally performed poorly in both MAR scenarios, with the highest error rate across all degrees of missingness. The MI algorithms generally performed well, and their error rates did not differ remarkably from each other. The error rate for both FIML and MCMC decreased gradually as the
degree of missingness increased.
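
The evaluation logic summarised above (simulate MAR missingness at 10% to 50%, handle the missing values, refit the regression, then take the absolute error of AIC and BIC against the complete-data values) can be illustrated with a minimal Python sketch. It is not the study's procedure. The data are synthetic rather than the IES data, the model is a one-predictor regression rather than step-wise MLR, and every function name (fit_simple_ols, aic_bic, make_mar, mean_fill) is hypothetical. Mean substitution is used only as a placeholder where FIML, EM, FCS or MCMC would be applied.

```python
import math
import random


def fit_simple_ols(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual sum of squares)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss


def aic_bic(rss, n, k):
    """Gaussian AIC and BIC without the additive constant n*(ln(2*pi) + 1).

    k is the number of estimated parameters.
    """
    base = n * math.log(rss / n)
    return base + 2 * k, base + k * math.log(n)


def make_mar(xs, ys, rate, rng):
    """Delete y values with a probability that depends only on the observed x.

    Rows with x above the median are three times as likely to lose y as the
    rest, so the overall missing fraction is roughly `rate` (assumed design).
    """
    cutoff = sorted(xs)[len(xs) // 2]
    out = []
    for x, y in zip(xs, ys):
        p = 1.5 * rate if x > cutoff else 0.5 * rate
        out.append(None if rng.random() < p else y)
    return out


def mean_fill(ys):
    """Placeholder handler (mean substitution); NOT one of the MI/ML algorithms."""
    observed = [y for y in ys if y is not None]
    mean = sum(observed) / len(observed)
    return [mean if y is None else y for y in ys]


# Synthetic complete dataset (assumption: y = 1 + 2x + standard normal noise).
rng = random.Random(0)
n = 500
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

# "Actual" selection criteria from the complete data.
_, _, rss_full = fit_simple_ols(xs, ys)
aic_full, bic_full = aic_bic(rss_full, n, k=3)  # intercept, slope, error variance

# "Estimated" criteria after handling missingness at each degree.
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    ys_filled = mean_fill(make_mar(xs, ys, rate, rng))
    _, _, rss_est = fit_simple_ols(xs, ys_filled)
    aic_est, bic_est = aic_bic(rss_est, n, k=3)
    print(rate, abs(aic_est - aic_full), abs(bic_est - bic_full))
```

Because n and k are held fixed in this sketch, the AIC and BIC absolute errors coincide. They can differ when the selected model, and hence k, changes between the complete and the handled data, as may happen under step-wise selection.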