Comparative study of neural networks and design of experiments to the classification of HIV status
This research addresses the novel application of design of experiments, artificial neural networks and logistic regression to study the effect of demographic characteristics on the risk of acquiring HIV infection among antenatal clinic attendees in South Africa. The annual antenatal HIV survey is the only major national indicator of HIV prevalence in South Africa, and it is vital for understanding changes in the HIV epidemic over time. The annual antenatal clinic data contain the following demographic characteristics for each pregnant woman: age (herein called mother's age), partner's age (herein father's age), population group (race), level of education, gravidity (number of pregnancies), parity (number of children born), and HIV and syphilis status. This project applied a screening design of experiments technique to rank the effects of individual demographic characteristics on the risk of acquiring an HIV infection. There are various screening design techniques, such as fractional or full factorial and Plackett-Burman designs. In this work, a two-level fractional factorial design was selected for screening. In addition to screening designs, this project employed response surface methodologies (RSM) to estimate interaction and quadratic effects of demographic characteristics, using central composite face-centered and Box-Behnken designs. Furthermore, this research presents the novel application of multilayer perceptron (MLP) neural networks to model the demographic characteristics of antenatal clinic attendees. A review report was produced on the application of neural networks to modelling HIV/AIDS around the world; this report enhances our understanding of the extent to which neural networks have been applied to study the HIV/AIDS pandemic. Finally, a binary logistic regression technique was employed to benchmark the results obtained by the design of experiments and neural network methodologies.
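As an illustration of the screening idea described above, the following minimal sketch builds a two-level half-fraction design and ranks factors by the size of their main effects. It is not the design used in the study: the choice of five factors, the names in `FACTORS`, the generator (last factor equal to the product of the others) and the response values are all assumptions made for the example, and the responses are synthetic.

```python
from itertools import product

# Hypothetical factor names based on the characteristics listed above.
FACTORS = ["mother_age", "father_age", "education", "gravidity", "parity"]


def half_fraction(factors):
    """Return the runs of a 2^(k-1) design coded as -1 (low) and +1 (high).

    The first k-1 factors form a full factorial; the last factor is set to
    the product of the others (the design generator, e.g. E = ABCD).
    """
    *base, generated = factors
    runs = []
    for levels in product((-1, 1), repeat=len(base)):
        run = dict(zip(base, levels))
        sign = 1
        for level in levels:
            sign *= level
        run[generated] = sign
        runs.append(run)
    return runs


def main_effect(runs, responses, factor):
    """Mean response at the high level minus mean response at the low level."""
    high = [y for run, y in zip(runs, responses) if run[factor] == 1]
    low = [y for run, y in zip(runs, responses) if run[factor] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


design = half_fraction(FACTORS)

# Synthetic response for illustration only (not data from the study).
responses = [3 * run["mother_age"] + run["education"] for run in design]

# Rank factors by the absolute size of their estimated main effect.
ranking = sorted(
    FACTORS,
    key=lambda f: abs(main_effect(design, responses, f)),
    reverse=True,
)
```

With five factors the half fraction needs 16 runs instead of the 32 of a full factorial, which is the economy that makes such designs attractive for screening.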
The two-level fractional factorial design demonstrated that HIV prevalence was highly sensitive to changes in the mother's age (15-55 years) and her level of education (Grades 0-13). The central composite face-centered and Box-Behnken designs, employed to study the individual and interaction effects of demographic characteristics on the spread of HIV in South Africa, demonstrated that the HIV status of an antenatal clinic attendee was highly sensitive to changes in the pregnant mother's age and her educational level. In addition, the interaction of the mother's age with other demographic characteristics was found to be an important determinant of the risk of acquiring an HIV infection. Furthermore, the central composite face-centered and Box-Behnken designs illustrated that, individually, the pregnant mother's parity and her partner's age had no marked effect on her HIV status. However, the pregnant woman's parity and her male partner's age did show marked effects on her HIV status in two-way interactions with other demographic characteristics. The multilayer perceptron (MLP) sensitivity test also showed that the age of the pregnant woman had the greatest effect on the risk of acquiring an HIV infection, while her gravidity and syphilis status had the lowest effects. The MLP modelling produced the same results as the screening and response surface methodologies. The binary logistic regression technique was compared with a Box-Behnken design to further elucidate the differential effects of demographic characteristics on the risk of acquiring HIV amongst pregnant women. The two methodologies indicated that the age of the pregnant woman and her level of education had the most profound effects on her risk of acquiring an HIV infection. To facilitate the comparison of the performance of the classifiers used in this study, a receiver operating characteristic (ROC) curve was applied.
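To make the ROC comparison concrete, the sketch below computes ROC points and the area under the curve (AUC) for a single classifier by sweeping a decision threshold over its predicted scores. The function names, labels and scores are illustrative assumptions, not the study's data or its SAS workflow; classifiers can be compared by computing the AUC for each on the same labelled cases.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    labels: 1 for a positive case, 0 for a negative case.
    scores: predicted probability (or any score) of being positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each distinct score is one step.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and scores for illustration only.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_auc = auc(roc_points(example_labels, example_scores))
```

An AUC near 1 indicates that positive cases tend to receive higher scores than negative ones, while a value near 0.5 indicates ranking no better than chance.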
Theoretically, ROC analysis provides tools to select optimal models and to discard suboptimal ones independently of the cost context or the class distribution. SAS Enterprise Miner™ was employed to develop the required ROC curves. To validate the results obtained by the above classification methodologies, a credit scoring add-on in SAS Enterprise Miner™ was used to build binary target scorecards, comprising HIV-positive and HIV-negative datasets, for probability determination. The process involved grouping variables using weights of evidence (WOE) prior to performing a logistic regression to produce predicted probabilities. Creating bins for the scorecard enables the study of the inherent relationship between demographic characteristics and an individual's HIV status. This technique increases the understanding of the risk-ranking ability of the scorecard method, while offering the added advantage of being predictive.
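The weights-of-evidence step can be sketched as follows. This is a minimal illustration under stated assumptions, not the SAS Enterprise Miner implementation: it uses the common credit-scoring convention WOE = ln(share of non-events in the bin / share of events in the bin), treats HIV-positive as the event, and uses invented bin labels and counts.

```python
import math


def weights_of_evidence(bins):
    """Compute the WOE of each bin.

    bins maps a bin label to a (non_events, events) pair of counts.
    Every bin must contain at least one event and one non-event,
    otherwise the logarithm is undefined.
    """
    total_non_events = sum(non for non, _ in bins.values())
    total_events = sum(evt for _, evt in bins.values())
    woe = {}
    for label, (non, evt) in bins.items():
        share_non = non / total_non_events
        share_evt = evt / total_events
        woe[label] = math.log(share_non / share_evt)
    return woe


# Invented counts for two hypothetical age bands (illustration only).
example_bins = {
    "band_a": (50, 50),   # (HIV-negative count, HIV-positive count)
    "band_b": (150, 50),
}
example_woe = weights_of_evidence(example_bins)
```

Under this convention a negative WOE marks a bin where events are over-represented relative to the whole sample, and a positive WOE marks one where they are under-represented. The WOE values then replace the raw variable as the input to the logistic regression.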