Application of data mining and machine learning techniques for geohydrological datasets in South Africa C de Bruyn orcid.org/0000-0003-3011-8563 Dissertation accepted in fulfilment of the requirements for the degree Master of Science in Environmental Sciences with Hydrology and Geohydrology at the North-West University Supervisor: Dr SR Dennis Graduation October 2023 24963623 Centre for Water Sciences and Management – North-West University, South Africa ACKNOWLEDGEMENTS Firstly, I would like to thank YHVH, my Creator, and his Son, Yeshua, for giving me the strength to push onwards through this endeavour. I am thankful to Dr Rainier Dennis and the Centre of Water Sciences and Management who granted me this opportunity and who guided me through this process, equipping me with the proper tools and knowledge. To my parents, for all the love and support through this tough time. I would not have been able to pursue this degree, let alone finish it without them. Also, my brother for giving me advice on atopic which was completely new to me at the start of this study. And my fiancé, for encouraging me to finish what I started. Finally, I am grateful for having a friend in Lohan Bredenhann, who supported me and gave technical advice regarding this academic pursuit. i Centre for Water Sciences and Management – North-West University, South Africa ABSTRACT A desktop study was conducted to research data-driven modelling techniques to classify relationships between borehole parameters and the relevant geological setting. Borehole surveying and drilling is a costly endeavour and by applying data mining and machine learning techniques to national groundwater databases and other available national datasets such as spatial data, better insight and improvements on management of groundwater resources can result. Five machine learning algorithms were tested on a consolidated dataset and their performances compared in order to establish which algorithm yielded the most accurate results. It was established that Random Forest Regression and Classification could be used to model yield, and Support Vector Regression and Random Forest Classification could model static water levels. The algorithm was tested on three case study areas, based on Vegter regions. The results indicated that static water levels could be modelled with high rates of accuracy, but yield modelling was not as successful, and a lot of uncertainty still remains as to the drivers behind water strike yield. Keywords: data mining, machine learning, groundwater resource management, geohydrological datasets, data-driven modelling; water level modelling, yield modelling. ii Centre for Water Sciences and Management – North-West University, South Africa TABLE OF CONTENTS ACKNOWLEDGEMENTS .......................................................................................................... I ABSTRACT .............................................................................................................................. II LIST OF TABLES .................................................................................................................... IX LIST OF FIGURES .................................................................................................................... X LIST OF EQUATIONS ........................................................................................................... XIV LIST OF ABBREVIATIONS .................................................................................................... XV CHAPTER 1: INTRODUCTION ................................................................................................. 1 1.1 Background ...................................................................................................... 1 1.2 Problem statement ........................................................................................... 2 1.3 Aims and objectives ......................................................................................... 3 1.3.1 Aims .................................................................................................................................. 3 1.3.2 Objectives......................................................................................................................... 3 1.4 Basic hypothesis .............................................................................................. 3 1.5 Scope of research ............................................................................................ 3 1.6 Assumptions and limitations ........................................................................... 4 1.7 Research contribution ...................................................................................... 4 1.8 Dissertation structure ...................................................................................... 4 CHAPTER 2: LITERATURE REVIEW ...................................................................................... 6 2.1 Introduction ...................................................................................................... 6 2.2 Data mining ....................................................................................................... 6 2.2.1 Datasets ........................................................................................................................... 8 iii Centre for Water Sciences and Management – North-West University, South Africa 2.2.2 Data-mining methods ................................................................................................... 11 2.3 Modelling and forecasting of geohydrological settings .............................. 12 2.3.1 Model types .................................................................................................................... 13 2.3.1.1 Process-based modelling ............................................................................................ 13 2.3.1.2 Data-driven modelling .................................................................................................. 14 2.4 Data-driven modelling techniques ................................................................ 15 2.4.1 Decision tree (model trees) ......................................................................................... 15 2.4.2 Naive Bayes / Bayesian Classifiers............................................................................ 16 2.4.3 Artificial Neural Networks ............................................................................................. 17 2.4.3.1 Structure of an artificial neural network ..................................................................... 17 2.4.4 K-Nearest neighbours .................................................................................................. 18 2.4.5 Support Vector Machines ............................................................................................ 20 2.4.6 Linear regression .......................................................................................................... 24 2.4.7 Fuzzy Logic / Fuzzy rule-based systems (FRBS) .................................................... 24 2.5 Statistical evaluation/ model evaluation ....................................................... 25 2.5.1 Metrics for regression ................................................................................................... 26 2.5.1.1 Mean square error/ root mean square error ............................................................. 26 2.5.1.2 Mean absolute error/ mean absolute percentage error .......................................... 26 2.5.1.3 R square/ adjusted R square ...................................................................................... 27 2.5.2 Confusion matrix and associated metrics for classification .................................... 27 2.6 Borehole parameters / geohydrological characterisation ........................... 31 iv Centre for Water Sciences and Management – North-West University, South Africa 2.7 Geohydrological studies already conducted by using machine learning ........................................................................................................... 31 2.8 Machine learning in the context of South African policy ............................. 32 2.9 Conclusion ...................................................................................................... 32 CHAPTER 3: NATIONAL GROUNDWATER DATASETS ...................................................... 33 3.1 Data quality ..................................................................................................... 33 3.1.1 Measuring Data Quality ............................................................................................... 34 3.2 National Groundwater Datasets and Data Availability ................................. 36 3.2.1 National Groundwater Archive .................................................................................... 36 3.2.2 Groundwater Resources Information Project ........................................................... 40 3.3 Available Data Discussion ............................................................................. 42 3.3.1 National Groundwater Archive .................................................................................... 42 3.3.1.1 Completeness ................................................................................................................ 42 3.3.1.1.1 Schema completeness ................................................................................................. 42 3.3.1.1.2 Column completeness .................................................................................................. 43 3.3.1.2 Consistency ................................................................................................................... 44 3.3.1.3 Free-of-error................................................................................................................... 44 3.3.2 Groundwater Resources Information Project ........................................................... 44 3.3.2.1 Completeness ................................................................................................................ 45 3.3.2.1.1 Schema completeness ................................................................................................. 45 3.3.2.1.2 Column completeness .................................................................................................. 46 3.3.2.2 Consistency ................................................................................................................... 46 v Centre for Water Sciences and Management – North-West University, South Africa 3.3.2.3 Free-of-error................................................................................................................... 47 3.4 Spatial datasets .............................................................................................. 47 CHAPTER 4: METHODOLOGY .............................................................................................. 49 4.1 Data acquisition .............................................................................................. 49 4.1.1 NGA data acquisition process ..................................................................................... 50 4.1.2 GRIP data acquisition process.................................................................................... 50 4.1.3 GIS data acquisition process ...................................................................................... 52 4.2 Data processing .............................................................................................. 53 4.2.1 Data processing - Phase 1 .......................................................................................... 53 4.2.1.1 NGA ................................................................................................................................ 53 4.2.1.2 GRIP ............................................................................................................................... 54 4.2.1.3 GIS .................................................................................................................................. 54 4.2.2 Data processing - Phase 2 .......................................................................................... 54 4.2.3 Data processing - Phase 3 .......................................................................................... 54 4.3 Computer methods ......................................................................................... 57 4.4 Algorithms ...................................................................................................... 57 4.4.1 Static Water Level ......................................................................................................... 58 4.4.1.1 Regression ..................................................................................................................... 58 4.4.1.1.1 Multiple Linear Regression .......................................................................................... 59 4.4.1.1.2 Support Vector Regression ......................................................................................... 59 4.4.1.1.3 Decision Tree Regression ........................................................................................... 59 4.4.1.1.4 Random Forest Regression ........................................................................................ 59 vi Centre for Water Sciences and Management – North-West University, South Africa 4.4.1.1.5 Regression model selection ........................................................................................ 59 4.4.1.1.6 Comparison with established geohydrological software ......................................... 60 4.4.1.2 Classification .................................................................................................................. 60 4.4.1.2.1 K-Nearest neighbour classification ............................................................................. 61 4.4.1.2.2 Support vector classification........................................................................................ 61 4.4.1.2.3 Naive Bayes classification ........................................................................................... 61 4.4.1.2.4 Decision-tree classification .......................................................................................... 61 4.4.1.2.5 Random-forest classification ....................................................................................... 61 4.4.1.2.6 Classification model selection ..................................................................................... 61 4.4.2 Average water strike yield ........................................................................................... 62 4.4.2.1 Regression and model selection ................................................................................ 62 4.4.2.2 Classification and model selection ............................................................................. 63 4.5 Assumptions and limitations ......................................................................... 64 CHAPTER 5: CASE STUDIES ................................................................................................ 65 5.1 Lowveld case study ........................................................................................ 65 5.1.1 Background .................................................................................................................... 67 5.1.2 Water Level Predictions ............................................................................................... 71 5.1.3 Yield predictions ............................................................................................................ 77 5.2 Eastern Bushveld Complex Case study ........................................................ 78 5.2.1 Background .................................................................................................................... 78 5.2.2 Water level predictions ................................................................................................. 84 5.2.3 Yield predictions ............................................................................................................ 90 vii Centre for Water Sciences and Management – North-West University, South Africa 5.3 Taung-Prieska Belt case study ...................................................................... 91 5.3.1 Background .................................................................................................................... 93 5.3.2 Water level predictions ................................................................................................. 96 5.3.3 Yield predictions .......................................................................................................... 100 CHAPTER 6: RESULTS AND DISCUSSION ........................................................................ 103 6.1 Water level modelling ................................................................................... 103 6.2 Yield modelling ............................................................................................. 105 CHAPTER 7: CONCLUSIONS AND RECOMMENDATIONS ............................................... 106 BIBLIOGRAPHY ................................................................................................................... 108 ANNEXURES ........................................................................................................................ 116 8.1 Annexure A – NGA database ....................................................................... 116 8.2 Annexure B – GRIP database example ....................................................... 120 8.3 Annexure C – Model Scripts ........................................................................ 121 8.3.1 Regression ................................................................................................................... 121 8.3.2 Classification ................................................................................................................ 125 8.4 Annexure D – Maps ...................................................................................... 130 viii Centre for Water Sciences and Management – North-West University, South Africa LIST OF TABLES Table 2-1: Difference between continuous and categorical data. Excerpt from the UCI Machine learning repository dataset ‘Adult’ (Kohavi & Becker, 1996) ..... 10 Table 2-2: Kappa value partitioning and associated labels (Landis & Koch, 1977) .......... 29 Table 3-1: Data quality dimensions (Pipino et al., 2002) ....................................................... 33 Table 3-2: Schema completeness results for a selection of the NGA located in the Limpopo Province .................................................................................................... 43 Table 3-3: Schema completeness results for the GRIP ........................................................ 45 Table 4-1: Assigned yield classes ............................................................................................ 55 Table 4-2: Water level regression model performance metrics ........................................... 60 Table 4-3: Water level classification model performance metrics ....................................... 62 Table 4-4: Yield regression model performance metrics ...................................................... 63 Table 4-5: Yield classification model performance metrics .................................................. 63 Table 5-1: Borehole data distribution for chosen Vegter regions ........................................ 65 Table 5-2: Borehole density for the Lowveld region .............................................................. 67 Table 5-3: Borehole density for the Eastern Bushveld Complex region ............................. 79 Table 5-4: Borehole density for the Taung-Prieska Belt region ........................................... 93 Table 6-1: Static water level model results obtained from case studies ........................... 103 Table 6-2: Yield model results obtained from case studies ................................................ 105 Table 8-1: NGA available features for export ........................................................................ 116 Table 8-2: Column completeness results for the NGA ........................................................ 118 Table 8-3: Column completeness results for the GRIP ....................................................... 119 ix Centre for Water Sciences and Management – North-West University, South Africa LIST OF FIGURES Figure 2-1: CRISP-DM standard process (adapted from Larose (2005)). ............................. 7 Figure 2-2: Intersections of disciplines that influence data mining and machine learning (adapted from Mitchell-Guthrie (2014)). ................................................ 11 Figure 2-3: Data-mining methods (adapted from García et al., 2015). ................................. 12 Figure 2-4: Example structure of a decision tree (Tehrany et al., 2013). ............................. 15 Figure 2-5: Basic structure of a neural network (Larose & Larose, 2019). .......................... 18 Figure 2-6: K-nearest neighbour illustration (Alaliyat, 2008). ................................................. 19 Figure 2-7: Support vector machine classification for a binary class problem. (a) Possible separating hyperplanes. (b) Maximum-margin hyperplane (Russell & Norvig, 2010). ........................................................................................ 21 Figure 2-8: Support vector machine classification for a linear inseparable problem. (a) Two-dimensional dataset with a circular decision boundary. (b) The same dataset mapped into a three-dimensional space. The data takes on a cone shape and the circular decision boundary becomes linear. (c) One-dimensional dataset with no clear decision boundary. (d) Two- dimensional space due to applied kernel function (Russell & Norvig, 2010; Noble, 2006). ................................................................................................. 23 Figure 2-9: Confusion matrix structure (a) for a 2-class classification and (b) for a 4- class classification problem (Sirsat, 2019; Diez, 2018). ..................................... 28 Figure 2-10: Example output confusion matrix of a spam filter. (a) Sensitivity, (b) specificity, (c) precision and (d) accuracy (Sirsat, 2019). .................................. 30 Figure 3-1: Annual growth in NGDB and NGA records from 1985 to 2008 as adapted from DWA (2009) ..................................................................................................... 36 Figure 3-2: NGA borehole distribution and density per 10’ x 10’ grid (DWS, 2020). ......... 38 Figure 3-3: NGA Site Map (NGA, s.a.(c)). ................................................................................. 39 x Centre for Water Sciences and Management – North-West University, South Africa Figure 3-4: GRIP borehole distribution ...................................................................................... 41 Figure 3-5: Column completeness overview for a selection of the NGA ............................. 43 Figure 3-6: Column completeness overview for the GRIP ..................................................... 46 Figure 4-1: Distribution and overlap of boreholes from both the NGA and the GRIP databases .................................................................................................................. 51 Figure 4-2: Assignment process of GIS data to a single borehole ........................................ 52 Figure 4-3: Distribution of yield values in different size classes ............................................ 56 Figure 4-4: Types of machine-learning algorithms and the R libraries used in each ......... 57 Figure 5-1: Locality map of the Lowveld groundwater region ............................................... 66 Figure 5-2: Borehole distribution in the Lowveld region – static water levels and yield .... 68 Figure 5-3: Borehole distribution in the Lowveld region – pumping test parameters ........ 69 Figure 5-4: Time series water levels for borehole 2329BB00004 ......................................... 70 Figure 5-5: Lowveld static water level and elevation correlation .......................................... 71 Figure 5-6: Lowveld elevation and drainage map ................................................................... 72 Figure 5-7: Lowveld predicted water level correlation............................................................ 73 Figure 5-8: Lowveld numerical water level predictions .......................................................... 75 Figure 5-9: Lowveld water level classification prediction ....................................................... 76 Figure 5-10: Lowveld predicted yield .......................................................................................... 77 Figure 5-11: Lowveld yield classification confusion matrix ...................................................... 78 Figure 5-12: Locality map of the Eastern Bushveld Complex groundwater region .............. 80 Figure 5-13: Borehole distribution in the Eastern Bushveld Complex region – static water levels and yield .............................................................................................. 81 xi Centre for Water Sciences and Management – North-West University, South Africa Figure 5-14: Borehole distribution in the Eastern Bushveld Complex region – pumping test parameters ........................................................................................................ 82 Figure 5-15: Time series water levels for borehole 2429BDC0001 ........................................ 83 Figure 5-16: Eastern Bushveld Complex static water level correlation .................................. 84 Figure 5-17: Eastern Bushveld Complex elevation and drainage map ................................... 85 Figure 5-18: Eastern Bushveld Complex predicted water level correlation ........................... 86 Figure 5-19: Eastern Bushveld Complex predicted water level prediction correlation ........ 88 Figure 5-20: Eastern Bushveld Complex water level classification confusion matrix ........... 89 Figure 5-21: Eastern Bushveld Complex predicted yield ......................................................... 90 Figure 5-22: Eastern Bushveld Complex yield classification confusion matrix ..................... 91 Figure 5-23: Locality map of the Taung-Prieska Belt groundwater region ............................ 92 Figure 5-24: Borehole distribution in the Taung-Prieska Belt region – static water levels and yield ......................................................................................................... 94 Figure 5-25: Time series water levels for borehole 2624DC00033 ........................................ 95 Figure 5-26: Taung-Prieska Belt static water level correlation ................................................ 96 Figure 5-27: Taung-Prieska Belt elevation and drainage map ................................................. 97 Figure 5-28: Taung-Prieska Belt predicted water level correlation ......................................... 98 Figure 5-29: Taung-Prieska Belt water level prediction correlation ........................................ 99 Figure 5-30: Taung-Prieska Belt water level classification confusion matrix ....................... 100 Figure 5-31: Taung-Prieska Belt predicted yield ..................................................................... 101 Figure 5-32: Taung-Prieska Belt yield classification confusion matrix .................................. 102 Figure 8-1: Eastern Bushveld Complex – Baseflow .............................................................. 130 Figure 8-2: Eastern Bushveld Complex - Lithology ............................................................... 131 xii Centre for Water Sciences and Management – North-West University, South Africa Figure 8-3: Eastern Bushveld Complex – Geology ................................................................ 132 Figure 8-4: Eastern Bushveld Complex - Precipitation ......................................................... 133 Figure 8-5: Eastern Bushveld Complex - Recharge .............................................................. 134 Figure 8-6: Eastern Bushveld Complex - Runoff.................................................................... 135 Figure 8-7: Eastern Bushveld Complex - Storativity ............................................................. 136 Figure 8-8: Lowveld - Baseflow ................................................................................................ 137 Figure 8-9: Lowveld - Lithology ................................................................................................ 138 Figure 8-10: Lowveld - Geology ................................................................................................. 139 Figure 8-11: Lowveld - Precipitation .......................................................................................... 140 Figure 8-12: Lowveld - Recharge ............................................................................................... 141 Figure 8-13: Lowveld - Runoff .................................................................................................... 142 Figure 8-14: Lowveld - Storativity .............................................................................................. 143 Figure 8-15: Taung-Prieska Belt - Baseflow ............................................................................. 144 Figure 8-16: Taung-Prieska Belt - Lithology ............................................................................. 145 Figure 8-17: Taung-Prieska Belt - Geology .............................................................................. 146 Figure 8-18: Taung-Prieska Belt - Precipitation ....................................................................... 147 Figure 8-19: Taung-Prieska Belt - Recharge ............................................................................ 148 Figure 8-20: Taung-Prieska Belt - Runoff .................................................................................. 149 Figure 8-21: Taung-Prieska Belt - Storativity ............................................................................ 150 xiii Centre for Water Sciences and Management – North-West University, South Africa LIST OF EQUATIONS (2-1) Naive Bayes mechanism ......................................................................................... 16 (2-2) Simpel linear regression model ............................................................................. 24 (2-3) Multiple linear regression model ............................................................................ 24 (2-4) Root mean square error .......................................................................................... 26 (2-5) Mean absolute error ................................................................................................ 27 (2-6) Classification accuracy for a 2-class confusion matrix ....................................... 29 (2-7) Classification accuracy for a multi-class confusion matrix ................................. 29 (2-8) Cohen’s Kappe coefficient ...................................................................................... 29 (2-9) Theoretical expected classification accuracy ...................................................... 29 (2-10) F1 score ..................................................................................................................... 30 (3-1) Completeness factor of a dataset .......................................................................... 35 (3-2) Consistency factor of a dataset .............................................................................. 35 (3-3) Free-of-error metric of a dataset ........................................................................... 35 xiv Centre for Water Sciences and Management – North-West University, South Africa LIST OF ABBREVIATIONS ACC Classification Accuracy ANN Artificial Neural Network CRISP-DM Cross-Industry Standard Process for Data Mining CSV Comma Separated Value DDM Data-driven modelling DT Decision Tree DWS Department of Water and Sanitation FN False Negative FP False Positive FRBS Fuzzy Rule Based Systems GIS Geographic Information System GRIP Groundwater Resources Information Project IBL Instance-based learning IDE Integrated Development Environment KDD Knowledge Discovery in Databases K-NN K-Nearest Neighbour MAE Mean Absolute Error MAP Mean Annual Precipitation MAPE Mean Absolute Percentage Error MLP Multilayer Perceptron MLR Multiple Linear Regression MODFLOW Modular Finite-Difference Groundwater Flow Model MS Microsoft MSE Mean Square Error NGA National Groundwater Archive NGDB National Groundwater Database RBF Radial basis functions RFC Random Forest Classification RFR Random Forest Regression RMSE Root Mean Square Error SVM Support Vector Machines SVR Support Vector Regression TN True Negative TP True Positive USGS United States Geological Survey xv Centre for Water Sciences and Management – North-West University, South Africa CHAPTER 1: INTRODUCTION This chapter introduces the research problem and the context in which the research took place, and it outlines the aims and objectives for the present project. It also serves as a ‘road map’ for topics that will be discussed. 1.1 Background Water has been an indispensable resource since the dawn of time. Surface water was the first to be utilised, mainly for fishing and hunting. Upon the advent of agriculture and animal husbandry, ancient civilizations dating back to biblical times realised that more water would be required for the sustenance of an expanding agriculture. The first book of the Bible, Genesis, mentions that the patriarch Isaac dug wells with great success. Such interventions led to substantial growth in agriculture and irrigation, especially in the arid regions of southern Asia and northern Africa (Meinzer, 1934). Villholth and Giordano (2007) note that surface water became more important over time and that the public was more occupied with surface water resources than groundwater. However, with the increase in awareness regarding water quality and quantity, the public interest in groundwater grew. Groundwater can be more accessible than surface water depending on geographic locations, and drilling and pumping techniques improved, making groundwater a favourable option in agriculture and industry. It should be noted, however, that the increase in usage of this resource generates the need for appropriate management practices to mitigate the plethora of groundwater problems that results from this (Villholth & Giordano, 2007). Africa, especially sub-Saharan Africa (Taylor et al., 2009), is rapidly urbanising, and a tripling of the population is predicted from 2000 to 2050. This increase is directly proportional to the demand for easily accessible, potable water resources. Groundwater has become the primary source of water for most domestic households across Africa, since it offers a low-cost alternative, enjoys wide spatial distribution and has a generally potable quality. Taylor et al., (2009) note that reliable groundwater data are scarce in some parts, restricting the ability to formulate abstraction policies to manage the aquifers being abstracted from. Large portions of Southern Africa fall within hyper-arid, arid, or semi-arid climate parameters (Xu & Beekman, 2019). Due to this, groundwater resources are of increasing importance, not least also since many regions in South Africa have been struck by severe droughts throughout history 1 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa and surface water resources have variable reliance. Groundwater ostensibly acts as a buffer between drought conditions and water supply, given that groundwater is generally reliable and exhibits resistance to hydrological droughts, generally has good quality, and is mostly free of organic matter, while it can typically be found in near proximity to where it is required (Allwright et al., 2013; Shirmohammadi et al, 2013). Various sectors involved in development and management of the natural environment require yield estimates of boreholes to sustainably meet water demands and manage groundwater resources (Allwright et al, 2013). Shirmohammadi et al. (2013) note that groundwater overexploitation is a significant issue for developing countries and that groundwater level plays a key role in the sustainable yield of groundwater resources. The modelling of groundwater levels in South Africa is therefore vital for continuous management of sub-surface water resources. Data generation is rapidly increasing in this digital era and a legion of discoveries are yet to be made in terms of this endless source. The process of computer-assisted analysis and extraction of new insights from large quantities of data are known as data mining or knowledge discovery (Babovic, 2005). Zaki & Meira Jr (2014) indicate that data mining is an interdisciplinary field that combines study areas such as database systems, machine learning, pattern recognition, and statistics. By combining data mining with the environmental sector, new patterns can be generated for analysis, and novel management techniques may be discovered for implementation. Zhu et al. (2022) discuss the application of machine learning in the water quality domain from modelling the movement of pollutants in surface and groundwater to management of water supply systems by predicting changes in water production given certain parameters as well as the monitoring of wastewater quality to streamline wastewater treatment plant (WWTP) management. One infers that further insights may well be gained in die field of groundwater studies by utilising and mining massive databases such as the National Groundwater Archive (NGA) and other publicly available databases such as the Groundwater Resources Information Project (GRIP). 1.2 Problem statement Physical surveying methods for groundwater supply in poorly understood regions are resource intensive and incur great costs, with high risk of being either unsuccessful, or not able to supply the water demand in full. This may lead to the need for additional borehole drilling. Drilling should primarily focus on areas with high probabilities of water bearing subsurface units with higher yields to ensure water demands will be met with minimal costs (Khan et al., 2023). Therefore, a need 2 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa exists to identify the most appropriate area easily and efficiently for successful water supply without the need to survey an extensive area. Identifying potential surveying areas and drilling boreholes with a high rate of accuracy relative to the desired water level and water supply are key aspects when it comes to ensuring that this costly endeavour will have a valuable outcome. The NGA and GRIP databases are readily available for exploitation and, by researching data mining and machine learning techniques, a cost-effective analysis of data can be obtained that will aid decision making. 1.3 Aims and objectives 1.3.1 Aims The principal aim of this study is to implement data mining and machine learning techniques to classify relationships between borehole parameters and their relevant geological settings. Drilling of boreholes is costly and, by mining the national databases, enhanced insight can be gained that will improve the management of boreholes and wellfields. 1.3.2 Objectives The objectives of this study are as follows: 1. Compiling a single database containing borehole and other relevant information from the national groundwater datasets as well as data obtained from geographic information systems. 2. Testing data mining techniques and machine-learning algorithms on the created database. 3. Validating identified methods by applying these to predefined case studies where actual data are available for these. 1.4 Basic hypothesis By applying data-mining and machine-learning techniques on borehole data, geohydrological characterisation of unexplored areas could be enhanced. 1.5 Scope of research This study will mainly focus on predicting two aspects of groundwater, namely groundwater level and yield as these are the critical parameters when siting new boreholes. Although the prediction of actual aquifer parameters is not attempted, borehole yield and water level are related to the 3 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa afore mentioned parameters. Groundwater level and yield prediction as a result of in situ geological conditions is the main focus of this study. The study solely rely on exiting data in the national datasets and no additional data acquisition was undertaken. 1.6 Assumptions and limitations The assumptions made in this study is that there exists an underlying relationship amongst parameters in the database that can be leveraged to successfully predict the parameters in question for this study, i.e., water level and yield. The intrinsic limitation to any study is data. Since machine learning required an adequate dataset to extract relationships, a possible limitation is the completeness of the dataset used. In an attempt to reduce the impact of the identified limitation, the Limpopo dataset was targeted to test the methodology, as this is the most complete borehole dataset across a large area in South Africa. 1.7 Research contribution Khan et al. (2023) notes that borehole drilling and surveying are very costly and optimal borehole location selection is key to ensure the sustainable management of this vital resource. The research is considered a building block in the development of a tool or a means of assistance to aid in the decision-making process to sustainably develop the groundwater supply. The application of machine learning to existing databases could expedite this process, delivering insights not previously known about areas, and ensure optimisation of field surveys and reducing costs associated with well field development. 1.8 Dissertation structure The dissertation is structured as follows: 1. Chapter 1: Introduction a. General introduction describing the background of why the research was done, the problem statement, scope of work, and the aims and objectives of the study. 2. Chapter 2: Literature Review a. Data-mining, machine learning and geohydrological modelling background is briefly discussed to familiarise the reader with what data-mining can achieve in the context of geohydrology and the different types of modelling that is used in the sector. b. Various data-driven techniques are explored and discussed. 4 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa c. Statistical evaluation and metrics of accuracy of models are discussed. d. Geohydrological characterisation is briefly discussed along with studies conducted relating to the topic of machine learning and the geohydrological study field. 3. Chapter 3: National Groundwater Datasets a. Three databases, namely the NGA, GRIP and GIS, were discussed on the context of data quality and data availability. 4. Chapter 4: Methodology a. A methodology was compiled based on the literature review findings and discussed along with assumptions and limitations. 5. Chapter 5: Case Studies a. Three study areas were selected, and each discussed in the following contexts: i. Background, locality, and groundwater specific data analysis. ii. Water level predictions based on the methodology. iii. Yield predictions based on the methodology. 6. Chapter 6: Results and Discussion a. The findings obtained during application of the methodology to each study area are discussed and validity of the hypothesis is examined. 7. Chapter 7: Conclusions and Recommendations a. A conclusion is reached based on the results and as it relates to other studies. b. Recommendations are discussed for future research. 5 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa CHAPTER 2: LITERATURE REVIEW 2.1 Introduction The appropriate management of freshwater resources, especially groundwater, is critical in order to attain the maximum value from these assets without crippling their sustainability (Maliva, 2016). Maliva (2016) suggests that numerical groundwater modelling is a crucial tool for evaluating and managing groundwater. Kenda et al. (2018) note that the conventional process-based models, as stipulated by Maliva (2016), rely on prior and intricate knowledge of the aquifer dynamics so that extremely specific sets of data are required. These factors have caused a shift towards data- driven modelling. Computer-assisted analysis such as machine learning and data mining are powerful and useful tools in any scientific field where vast amounts of data are available. Sahoo et al. (2017) comment on the intricacy of accurately modelling a system with complex underlying physical processes due to the substantial amounts data needed for development and calibration, as is the case with geohydrological modelling. Therefore, it has become popular to explore data-driven modelling techniques, that is, machine learning, to interpret large datasets without prior or deep knowledge about the subject matter. This literature review aims to investigate the topic of machine learning and the popular algorithms used for predictive analysis and how it has been and can be used in the field of geohydrological modelling. 2.2 Data mining Vast amounts of new data are being generated every day as a result of increased internet usage, business related services, surveys, academic studies, and the progress in storage and connection of technology (García et al., 2015). These datasets are too massive for manual analyses, and this has led to a need for gathering useful information and structured knowledge through the utilization of data mining (García et al., 2015). Data mining is the practice of detecting underlying patterns in data through data acquisition and preparation as well as processing by means of mathematical or statistical techniques and, finally, analysis (Aggarwal, 2015; Larose, 2005). Hand et al. (2001) define data mining as ‘the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner’. 6 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Data mining examines data to discover previously unseen patterns and plays a critical role in knowledge discovery in databases (KDD) (Bramer, 2016; García et al., 2015; Hand et al., 2001). Associations and summaries engendered by data mining are referred to as models or patterns, which include linear equations, rules, clusters, graphs, tree structures, and recurrent patterns in time series (Hand et al. 2001). Data mining is normally applied to secondary data, meaning that the data used was primarily collected for another purpose (Hand, 2013). Data mining is a pipeline process with approximately six phases (Larose, 2005; Aggarwal, 2015), the latter which will be discussed later. According to Larose (2005), data mining requires something analogous to a standard operating procedure. This is known as cross-industry standard process for data mining, or CRISP-DM, which is illustrated in Figure 2-1. García et al. (2015) however note that the steps of data mining are different for each individual or project. Nonetheless, CRISP-DM offers a good framework for structuring a data-mining project. Figure 2-1: CRISP-DM standard process (adapted from Larose (2005)). Based on Aggarwal (2015), who divided the workflow into three steps, García et al. (2015), who incorporated a hybridised version of the KDD process, and the CRISP-DM standards (Larose, 2005), an aggregated workflow is presented below. Problem statement or understanding phase, the stage in which the problem is identified and the method of application determined. Proper understanding of relevant concepts is attained 7 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa through research and integrated into the problem-solving process. García et al. (2015) note that relevant prior knowledge, as obtained from experts, is vital to the reliability of the application. Data collection is the process in which data are collected from various sources and arranged into databases for processing. Although this step might seem affable, it is very important to make good selections, as the data chosen will considerably impact the data-mining process. Data preparation includes cleaning of noisy and erratic data, combining multiple data sources by means of a data dump, implementing data transformation where data are converted into standard useable formats for the chosen data-mining method, and data reduction. Aggarwal (2015) terms this phase feature extraction and data cleaning. Data are seldom sourced in a ‘ready to use’ form, that is, functional for data mining algorithms. It is a crucial step in the process so as to ensure that data used in the mining process are useful. Data-mining-appropriate formats include multidimensional, time series, or semi-structured data. Missing and incorrect data may be cleaned by estimation or correction or these may be omitted from the set. Modelling is the stage where new information and patterns are derived from the data by using proper methods. These include choosing the appropriate data mining task, and Bramer (2016) notes four main types of data mining: numerical prediction, clustering, classification, or association. Consider that various models may be built. Once adequate techniques have been selected, parameter calibration can be performed and the model must continuously be validated. Evaluation is a critical step for determining the quality of the results from the preceding stage and test the validity of the created models. In other words, circle back to the initial phase and insure that all objectives have been met by the model results (Larose, 2005). Result exploitation is the direct application of the knowledge gained, integrating it for another purpose, or creating tools for others to use. 2.2.1 Datasets Hand et al. (2001) define datasets as ‘a set of measurements taken from some environment or process’. A dataset consists of a group of n objects (entities, individuals, cases or records) where, for each object, the same p measurement is structured in an n × p data matrix. Multiple p measurements could be taken and are known as variables or attributes (Hand et al. 2001). The form in which data is available will be specific to each situation: nevertheless, there are distinctions 8 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa to be made. Hand et al. (2001) and Aggarwal (2015), for instance, differentiate between quantitative and categorical data. Quantitative attributes, also termed continuous or numeric ones, can be measured on a numerical scale and, depending on the type of measurement it represents, can be any value (Hand et al. 2001). Aggarwal (2015) notes that some numerical values are numeric in the sense of having a natural order. Quantitative data are the most common type and also the most useful to work with from a statistical standpoint, as numerous mathematical calculations can be done with such data. Any other type of data may not necessarily represent a numeric value which, in turn, makes it more difficult to incorporate these into a dataset that will be usable for an algorithm (Aggarwal, 2015). Categorical attributes can only consist of discrete values (Hand et al., 2001). Santner and Duffy (1989) and Hand et al. (2001) comment that the measurement scales of discrete data find themselves on the ordinal or nominal scale. Ordinal scales categorise data into groups and orders within the group, meaning that the data can have a natural order such as low/ medium/ high) whereas nominal scales merely categorise data into groups where no particular order is present, that is, involving only discrete categories such as true of false (Santner & Duffy, 1989, Hand et al., 2001). Aggarwal (2015) notes that, more often than not, categorical data can be of a binary nature, meaning that only two categories are present. This can be converted into useable numeric values for an algorithm in the form of 0 or 1. Table 2-1 reflects these distinctions: columns in purple are quantitative variables, whereas columns in blue are examples of categorical variables. 9 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Table 2-1: Difference between continuous and categorical data. Excerpt from the UCI Machine learning repository dataset ‘Adult’ (Kohavi & Becker, 1996) education- marital- capital- capital- hours- native- income age workclass fnlwgt education occupation relationship race sex num status gain loss per-week country threshold Never- United- 39 State-gov 77516 Bachelors 13 Adm-clerical Not-in-family White Male 2174 0 40 <= 50k married States Self-emp- Married-civ- Exec- United- 50 83311 Bachelors 13 Husband White Male 0 0 13 <= 50k not-inc spouse managerial States Handlers- United- 38 Private 215646 HS-grad 9 Divorced Not-in-family White Male 0 0 40 <= 50k cleaners States th Married-civ- Handlers- United-53 Private 234721 11 7 Husband Black Male 0 0 40 <= 50k spouse cleaners States Married-civ- 28 Private 338409 Bachelors 13 Prof-specialty Wife Black Female 0 0 40 Cuba <= 50k spouse Married-civ- Exec- United- 37 Private 284582 Masters 14 Wife White Female 0 0 40 <= 50k spouse managerial States Married- 49 Private 160187 9th 5 spouse- Other-service Not-in-family Black Female 0 0 16 Jamaica <= 50k absent Self-emp- Married-civ- Exec- United- 52 209642 HS-grad 9 Husband White Male 0 0 45 > 50k not-inc spouse managerial States Never- United- 31 Private 45781 Masters 14 Prof-specialty Not-in-family White Female 14084 0 50 > 50k marries States Married-civ- Exec- United- 42 Private 159449 Bachelors 13 Husband White Male 5178 0 40 > 50k spouse managerial States 10 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Hand et al. (2001) note that data can occur in various relationships and configurations. Data could be arranged sequentially in time series or they can describe spatial relationships. In the case of the former, data mining might address the entire time series or just a section thereof, whereas the latter considers singular instances only in the context of others. Structures of datasets play an integral part in data mining. Complex data structures require complex models and algorithms (Hand et al., 2001). 2.2.2 Data-mining methods Thus far, we have ascertained that data mining aims to establish patterns within large datasets in order to gain a deeper understanding of the particular data. Various means by which this can be accomplished exist and are used for different applications. Machine learning is one such method that will be explored. Figure 2-2 demonstrates the various disciplines intersecting within the computer science and statistics field as originally illustrated by Mitchell-Guthrie (2014). Figure 2-2: Intersections of disciplines that influence data mining and machine learning (adapted from Mitchell-Guthrie (2014)). 11 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Teng and Gong (2018) define machine learning as learning techniques that automate the process of gaining knowledge. Therefore, data mining is the study of gaining knowledge and machine learning is the method used for the acquisition of said knowledge. Data mining can be divided into two major method categories: those relating to prediction and those relating to description (García et al., 2015). Figure 2-3 below illustrates reflects the different techniques available for prediction and description. Figure 2-3: Data-mining methods (adapted from García et al., 2015). For this study, the feasibility of the following prevalent machine learning techniques and algorithms will be researched: 1. Decision trees 2. Bayesian classifiers 3. Neural networks 4. K-Nearest neighbour 5. Support vector machines 6. Linear regression 7. Fuzzy logic 2.3 Modelling and forecasting of geohydrological settings According to Wheater et al. (2007), a model is ‘a simplified representation of a real world system’. Goltz and Huang (2017) define a model as an approximation of a system based on assumptions and simplified. Devi et al. (2015) state that the purpose of modelling is to better understand underlying processes of systems and predicting their behaviour, which is reiterated by Goltz and Huang (2017). 12 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Goltz and Huang (2017) note that a system such as that of the subsurface is tremendously challenging to model due to the numerous unknowns and uncertainties that accompany its existence. Desirable models are those that simulate reality best with the minimum amount of parameters and model complexity. Therefore, if an understanding of governing processes can be gained, geohydrological modelling will be an essential tool for water-resource management in complex systems (Devi et al., 2015). 2.3.1 Model types Hydrological and geohydrological models can be broadly categorised into empirical models, conceptual models, and physically based models (Wheater et al., 2007; Goltz and Huang, 2017). Physical models are small-scale representations of reality, whereas conceptual models are based on a theory or perceived logic of the system (Goltz and Huang, 2017). Elefteriadou (2014) notes that the difference between empirical models and mathematical models is that the former is based on field observations, whereas the latter is based on mathematical equations that describe relationships in the system. Solomatine and Ostfeld (2008) propose a condensed approach and classify models into process-based or data-driven ones, which will be discussed in brief further detail below. Oyebode et al. (2014) note that, due to commonly applied process-based techniques for modelling hydrological settings and scenarios, data-driven techniques have not been entirely incorporated into the field of hydrology. Sun et al. (2022) state that, although data- driven groundwater models have become increasingly popular at small scale, not enough research is being done at the local or regional spatial scales. 2.3.1.1 Process-based modelling Process-based modelling, also referred to as ‘knowledge-driven’ modelling, is based on comprehensive descriptions of hydrological processes and first-order principles of physics. These are conceptual and physically based models (Solomatine & Ostfeld, 2008). Wheater et al., (2007) discuss conceptual models, and notes that these are based on prior information in the form of conceptual representations of processes that are deemed to be important. The model must be calibrated in accordance with observed data of the catchment of interest to obtain a set of parameters that characterise the catchment being modelled. Physically based models rely on catchment processes and equations of motion that are numerically solved by using a grid (Wheater et al., 2007). 13 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Modelling is extremely complex and requires wide-ranging levels of detail, causing the resultant models to simplify some processes; it will suffer from mis-calibration, over-parameterization, parameter instability, insensitivity or redundancy, high computational requirements, and huge data demand (Oyebode et al., 2014). The most popular process-based groundwater model is MODFLOW. The modular finite-difference groundwater flow model (MODFLOW) was developed by the U.S. Geological Survey (USGS), and its feature simulations include water flow of confined and unconfined aquifers and recharge from precipitation, evapotranspiration, rivers, and streams (Provost et al., 2009). There are a variety of versions of MODFLOW, where each focuses on a different area of specialisation (Kumar, 2019). With a view to the massive amounts of data required for process-based modelling, Wheater et al. (2007) note that considerable computing power is needed to run the modelling task. Depending on the model used, a certain amount of computing power and time are needed. Therefore, the need arises to achieve the result of physically based models by means of quicker development and ease of use (Oyebode et al., 2014), and data-driven modelling could be a possible solution. 2.3.1.2 Data-driven modelling Data-driven modelling (DDM) aims to establish correlations between input variables and output objective data by means of statistical regression, and it does not take into consideration any physical processes of the modelled system (Jing et al., 2022). Therefore, the input data are analysed to characterise a system with a limited assumption in order to establish connections between the input and output variables. These include statistical models and machine learning methods (Solomatine & Ostfeld, 2008; Oyebode et al., 2014). Solomatine and Ostfeld (2008) discuss the advantage contemporary methods over empirical modelling, noting that the former solves numerical prediction problems, allows for recreating nonlinear functions, classification and grouping of data, and the construction of rule-based systems. Solomatine and Ostfeld (2008) note that the water resource community have reservations about the relevance of data-driven models as these are not associated to the physical principles of the system. While traditional statistical models are considered accurate enough, every situation is unique, and the most adequate model must be selected (Solomatine & Ostfeld, 2008). DDMs include artificial neural networks (ANNs) comprising the multilayer perceptron (MLP), radial basis functions (RBFs), fuzzy rule-based systems (FRBSs), instance-based learning (IBL), tree- based methods, evolutionary computational methods (gene expression programming), and support vector machines (SVMs) (Oyebode et al., 2014; Solomatine & Ostfeld, 2008). 14 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 2.4 Data-driven modelling techniques Sivakumar and Berndtsson (2010) note that one of the foremost reasons for the escalation of mathematical techniques in science is the advancement in computer and measurement technology. Increases in computational power and data of higher quality have turned data-driven modelling into a preferred approach (Sivakumar & Berndtsson, 2010). This section will discuss the different data mining techniques that are listed in Section 0 above. 2.4.1 Decision tree (model trees) A decision tree (DT) uses the structure of a tree and its branches to represent possible decision paths and their respective outcomes (Grus, 2015). Larose and Larose (2019) defines a decision tree as containing a set of decision nodes, linked by branches spreading out towards a terminating leaf node, as depicted in Figure 2-4. Figure 2-4: Example structure of a decision tree (Tehrany et al., 2013). Larose and Larose (2019) explain that the purpose of a decision tree is to terminate in a set of leaf nodes where the records contained in each leaf node has the exact same classification. Root nodes are placed at the top of the decision tree, whereby variables are tested at the subsequent decision nodes, and each outcome results in a branch. Branches can either lead to another decision node or a terminating leaf node (Larose & Larose, 2019). Grus (2015) divides decision trees into classification trees and regression trees which return categorical and numerical outputs respectively. Advantages in using decision trees is that the process by which they classify data is immediately apparent to the user and therefore make them easy to understand and interpret. Decision trees can handle mixed attributes, such as quantitative, categorical, and missing ones with ease (Grus, 15 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 2015; Hand et al., 2001). Hand et al. (2001) also note the speed by which they can classify new records and that they are powerfully predictive tools, since they are flexible. Hand et al. (2001) discuss the problem of overfitting training data in decision trees. This occurs when the splitting of the decision nodes continues until a leaf node only contains a distinct data point or data points with identical input variables. García (2015) argues that noisy training data could impact the overfitting of a decision tree, and suggests an algorithm such as C4.5 that uses pruning strategies to reduce overfitting. C4.5 is one of many decision tree algorithms, and Saha (2018) indicates that it has significant advantages over other decision tree algorithms as it mitigates overfitting, can be used for both classification and regression, and can process incomplete data. 2.4.2 Naive Bayes / Bayesian Classifiers The Bayes and naive Bayes classifiers are used for probabilistic classification tasks and makes use of the Bayes theorem (Zaki & Meira Jr., 2014). Joyce (2019) defines Bayes’ theorem as a mathematical formula calculating conditional probabilities. Conditional probability could be described as ‘the probability of a hypothesis H conditional on a given body of data E is the ratio of the unconditional probability of the conjunction of the hypothesis with the data to the unconditional probability of data alone’ (Joyce, 2019). Zaki & Meira Jr. (2014) explain that the Bayes classifier uses the Bayes theorem to predict the class based on the label that maximises the probability. Grus (2015) explains the mechanism of Naive Bayes by example of spam filtering. Event S is ‘the message is spam’ and event V is ‘the message contains the word Earn $’. Bayes’ Theorem predicts the probability P that spam messages contain the word ‘Earn $’ using Equation (2-1). 𝑃(𝑆 | 𝑉) = [𝑃(𝑉 |𝑆)𝑃(𝑆)]/[𝑃(𝑉 |𝑆)𝑃(𝑆) + 𝑃(𝑉 |¬ 𝑆)𝑃(¬ 𝑆)] (2-1) Zaki & Meira Jr. (2014) note that the full Bayes classifier ineffectually deals with datasets with large number of dimensions, while it suffers from estimation-related problems according to them. Naive Bayes is a surprisingly effective classifier due to the simple assumption that is made, namely that all the attributes of the dataset is independent. This is the key difference between the Bayes classifier and the naive Bayes classifier. 16 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 2.4.3 Artificial Neural Networks Neural networks aim to replicate the data processing and non-linear learning abilities of a network of neurons (Rojas, 1996; Larose & Larose, 2019). Neural networks are models based on the neuron structure of the human brain and are used for cognitive tasks such as learning and optimisation (Müller et al., 1995). Larose and Larose (2019) explain that the basic function of a neuron is to gathering inputs from other neurons or, in case of artificial neural networks, a dataset, combining 𝑛 inputs by means of a combination function, producing a non-linear response by means of an activation function, and sending it forward to other neurons. Grus (2015) notes that a neuron fires only when the calculation exceeds some threshold, that is: if the activation function produces an adequate response, it will send the response forward; otherwise, it will produce no output. Neural networks are used on account of their forecasting and classification abilities, non- parametric nature, and capacity to generalize (Gaur, 2012). Although artificial neural networks are robust for using complicated non-linear data (Larose & Larose, 2019), they do not lend insight into how exactly they are solving the problem (Grus, 2015; Rojas, 1996). 2.4.3.1 Structure of an artificial neural network Larose and Larose (2019) explain that the structure of an artificial neural network (ANN) consists of nodes, layers, connections, and weights (Figure 2-5). Layers contain nodes, where every node connects to every other node in the next layer. Nodes within the same layer, however, remain unconnected. Connections between nodes has an associated weight (W1A) that is arbitrarily allocated a value between 0 and 1 at initialisation (Larose & Larose, 2019). Müller et al. (1995) define neural network models as a directed graph with four distinct properties: 1) a variable (𝑛𝑖) associated with each node 𝑖; 2) links (𝑖𝑘) between nodes 𝑖 and 𝑘 that have an associated real-value weight (𝑤𝑖𝑘); 3) a real-valued bias (𝜗𝑖) for each node i; and 4) a transfer function (𝑓𝑖[𝑛𝑘 , 𝑤𝑖𝑘 , 𝜗𝑖(𝑘 ≠ 𝑖)]) defined for each node 𝑖. 17 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 2-5: Basic structure of a neural network (Larose & Larose, 2019). According to Gaur (2012), neural networks can be divided into three categories: feed-forward networks, feedback networks, and self-organisation networks. Feed-forward networks are predominantly used for prediction and pattern recognition (Gaur, 2012), and this is the focus of the present section. Feedback networks are largely used for associative memory and optimisation calculation, whereas self-organising networks are used for cluster analysis. Feed-forward networks only allow for single direction flow from input towards output, without the possibility of looping (Larose & Larose, 2019). Rojas (1996) indicates that, in the absence of cycling (looping), results of the computation is overt and no synchronisation of the computing units are necessary. As mentioned, and as stated also by Mijwel (2018), ANN does not lend any insight into the behaviour of the system being modelled. Mijwel (2018) further explains the disadvantages of ANN: 1) the networks are excessively dependent on hardware and processing power, 2) the optimum result is not necessarily achieved due to the duration of the network, and 3) the experience of the user influences the integrity of the structure. 2.4.4 K-Nearest neighbours K-Nearest neighbour (K-NN) methods belong to the class of instance-based learning (IBL) algorithms (Oyebode et al., 2014). Nearest-neighbour methods are fairly simple and rely on the principle of predicting a new data point by only considering those closest to it (Grus, 2015). 18 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Oyebode et al. (2014) explain that IBL algorithms store information from training samples, while the information is subsequently applied to render a classification for a new instance. This is possible due to the retrieval of relevant information from the nearest neighbours (Oyebode et al., 2014). To classify a data point with similar input vectors as those of the adjacent points, the k nearest points are examined for a particular input vector and assign the new point to the class majority (Hand et al. 2001). Closest data points are calculated by using Euclidean distance, which is the measurement of the proximity of a feature vector of a specified distance and a training samples’ feature vector. Hand et al. (2001) explain that k-nearest neighbour is based on probabilities. In Figure 2-6, the centre circle represents a new data point, while squares and triangles are training data consisting of two distinct classes. The solid circle indicates k = 3, therefore the three nearest points are used to classify the new point. The dotted circle indicates k = 5, then the five nearest points will be used for classification (Alaliyat, 2008). Figure 2-6: K-nearest neighbour illustration (Alaliyat, 2008). Theoretically, a small portion of variables clustered around the new data point are used, with a radius equal to the distance to the kth nearest neighbour. Subsequently, probability proportions are calculated for the likelihood of the point belonging to each possible class in the small portion. 19 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa The maximum probability class is then assigned to the new point (Hand et al., 2001). Class or labels can be true or false, which is predicated on there being a condition to fulfil; alternatively, they can be categorical (Grus, 2015). Hand et al. (2001) explain the process of choosing a k value. At the very basic form k = 1, but this does not make for a stable classifier, as it has high variance. Reliable predictions can be made by steadily increasing k, keeping in mind the distance of points included with a higher k value, that is, it reduces variance but increases bias. Data-adaptive approaches appear to be the best technique for choosing k. Try several values, noting the misclassification rate of each, choosing k based upon the best performing value. The performance can then be verified on the testing data (Hand et al., 2001). Increasing dimensionality (adding variables) causes the data to become sparser which ultimately influences the true probability (Hand et al., 2001). Grus (2015) notes that K-NN makes no mathematical assumptions: it only requires a distance aspect and the assumption that grouped points are similar. Therefore, it is easily programmable and requires no optimisation (Hand et al., 2001). Hand et al. (2001) also note that K-NN is well adapted to manage missing values. High-dimensionality is a cause for concern in most models, and K-NN performs poorly with large amounts of variables (Hand et al., 2001; Grus, 2015). A potential consequence of using k-nearest neighbour is not knowing the drivers of the phenomenon or system being studied (Grus, 2015) as it does not build a model, but depends on recalling all training data (Hand et al., 2001). There are also the problems of computing time and storage requirements. For large training sets consisting of n data points, each data point is visited and p operations performed to calculate distance. This process requires considerable time and memory (Hand et al., 2001). 2.4.5 Support Vector Machines In its simplest form, a support vector machine (SVM) is an algorithm that learns through example and assigns labels to new data points (Noble, 2006). Noble (2006) explains that an SVM is fundamentally a mathematical unit capable of capitalising on a mathematical function concerning a particular dataset. Neelamegam (2013) notes that the SVM is an effective method for classification, pattern recognition, and regression, due to its high generalisation capacity concerning input data with high dimensionality. Noble (2006) specifies that SVM classification consists of four basic concepts: 1) the separating hyperplane, 2) the maximum-margin hyperplane, 3) the soft margin, and 4) the kernel function. 20 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa The separating hyperplane is a linear separator passing through the dataset in order to separate two classes (Neelamegam, 2013; Noble, 2006). Russell and Norvig (2010) note that many separating hyperplanes can exist, as displayed in Figure 2-7a. Comparing logistic regression with SVM, one finds that logistic regression establishes a separating hyperplane based on all the data points, minimising the loss. Alternatively, SVM calculates the hyperplane to use based on a small selection of points that are considered to be more significant than the rest. Except for the support vectors, that is, the points closest to the hyperplane, all other points therefore have an associated weight of zero. This allows SVM to minimise generalisation loss and is known as the maximum- margin hyperplane (Russell & Norvig, 2010), as depicted in Figure 2-7b. (a) (b) Figure 2-7: Support vector machine classification for a binary class problem. (a) Possible separating hyperplanes. (b) Maximum-margin hyperplane (Russell & Norvig, 2010). Russell & Norvig (2010) argue that the purpose of the maximum-margin hyperplane is to minimise generalisation loss through selecting the hyperplane that is farthest away from the training data points. The margin is the width of the area between the dashed lines of Figure 2-7b. Choosing a hyperplane that is in close proximity to the purple squares, but further from the blue triangles, may result in a situation where testing data points of the purple class fall outside the decision boundary and are incorrectly classified as belonging to blue (Russell & Norvig, 2010). However, this does not mean that all data can be faultlessly separated linearly without an anomalous example, as explained by Noble (2006). SVM algorithms can be adapted by means of 21 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa the use of a soft margin. A soft margin is a membrane of sorts that allows a certain number of data points to reside on the other side of the hyperplane without affecting the final result. The number of data points allowed across and the distance from the hyperplane must be specified by the user (Noble, 2006). The final concept, and arguably the most useful, is the kernel function. Figure 2-8a illustrates a linearly inseparable dataset. The input data can be re-expressed and mapped to a new input space with appropriately higher dimension: subsequently, and two-dimensional data can be defined by three features, as indicated in Figure 2-8b (Russell & Norvig, 2010). Noble (2006) also notes that one-dimensional data (Figure 2-8c) can be mapped to a two-dimensional input space. This is illustrated in Figure 2-8d, where the original expression values have merely been squared, so that a linear distinction can be made between the purple and blue instances. Thus, the kernel function can mathematically project low-dimensional data into a high-dimensional input space (Noble, 2006). 22 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa (a) (b) (c) (d) Figure 2-8: Support vector machine classification for a linear inseparable problem. (a) Two-dimensional dataset with a circular decision boundary. (b) The same dataset mapped into a three-dimensional space. The data takes on a cone shape and the circular decision boundary becomes linear. (c) One-dimensional dataset with no clear decision boundary. (d) Two-dimensional space due to applied kernel function (Russell & Norvig, 2010; Noble, 2006). 23 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 2.4.6 Linear regression Montgomery et al. (2012) define regression analysis as ‘a statistical technique for investigating and modelling the relationship between variables’. A model with unbent regression parameters is considered to be linear regression (Yan & Su, 2009). Linear-regression models can be either simple or multiple in nature, while other models include polynomial regression ones and nonlinear- regression ones. Simple linear-regression models are models with a single regressor and a response variable that form a straight line (Montgomery et al., 2012). Equation (2-2) is a simple linear-regression model. 𝑦 = 𝛽0 + 𝛽1𝑥 + 𝜀 (2-2) Where β0 = intercept β1 = slope ε = random error component Multiple linear-regression models are those that comprise two or more regressors, while their response variable may possibly be related to k regressors (Montgomery et al., 2012). Equation (2-3) is a multiple linear-regression model. 𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + ⋯ + 𝛽𝑘𝑥𝑘 + 𝜀 (2-3) 2.4.7 Fuzzy Logic / Fuzzy rule-based systems (FRBS) Zadeh (1988) states that ‘fuzzy logic is concerned with the formal principles of approximate reasoning’, as opposed to the exact reasoning used in classical logic systems, although precise reasoning could be used for limitation. In essence, fuzzy logic focuses on modelling imprecise reasoning in order to retrieve an estimated answer based on partial, inaccurate, or undependable knowledge. Fuzzy logic and fuzzy-logic-based process control has been implemented in a variety of ways, viz. automatic train operation, robot control, speech recognition, and stabilization control (Zadeh, 1988). Classical logic systems fall short in two ways: firstly, they do not make available a structure in which the meaning of proposals articulated in the statement are characterised. Secondly, if meanings can be characterised by means of symbolic representation, there are no tools for interpretation. Fuzzy logic alleviates these problems by characterising variables from the 24 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa statement as elastic constraints, and the question is answered by inference from the propagation of the elastic constraints (Zadeh, 1988). IF-THEN rules are central to fuzzy logic and fuzzy set theory, and information can be denoted in a system of IF-THEN rules where antecedents and outcomes are fuzzy rather than explicit (Aranibar, 1994). Set theory is a branch in mathematical logic pertaining to the theory of well- defined collections of objects (sets) where objects are called members of the set (Bagaria, 2019). Sammut and Webb (2017) define fuzzy sets as those that are distinguished by having a membership function that allocates a degree of membership to all objects in the set. Membership has a value in the range of [0, 1], where 0 is no certain membership and 1 is a certain membership, and all values in between signify partial membership. Thus, fuzzy logic can account for concepts that are more efficiently represented in a spectrum rather than a binary true-or-false classification (Sammut & Webb, 2017). Aranibar (1994) further notes that rules in the systems are activated relative to the membership function of the match between the antecedents and the input, allowing for basic interpolation due to the imprecise nature of the antecedents. This interpolation reduces the number of IF-THEN rules required to define the input-output relationship (Aranibar, 1994). According to Sammut and Webb (2017), fuzzy systems are computing structures based on the concepts of fuzzy logic and fuzzy sets. These structures are partitioned into four main components: 1) a knowledge base, 2) a fuzzification interface, 3) an inference engine, and 4) a defuzzification interface. The knowledge base includes the fuzzy rules and a database defining the linguistic terms of each linguistic input and output variable. The fuzzification interface converts the precise input variables into imprecise fuzzy variables. This is achieved by assigning computed membership values to each variable according to the linguistic terms defined in the knowledge base. The inference engine computes the activation degree and the output of each rule defined in the knowledge base. The defuzzification interface does the inverse of the fuzzification interface by transforming the fuzzy variables into precise outputs (Sammut & Webb, 2017). (Kapitanova, et al., 2012) note that a significant disadvantage of fuzzy logic is that it generates a large rule-base, which requires significant amounts of memory and processing power. The reason for this large rule-base is that the number of rules increases exponentially with the number of variables used. 2.5 Statistical evaluation/ model evaluation Models have to be evaluated according to how accurately they perform and what the error rate is (Larose & Larose, 2019). 25 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 2.5.1 Metrics for regression The most commonly used metrics to evaluate the performance of regression models include mean square error (MSE) or root mean square error (RMSE), mean absolute error (MAE) or mean absolute percentage error (MAPE), and R squared (R2) or adjusted R squared (Wu, 2020; Brownlee, 2021). 2.5.1.1 Mean square error/ root mean square error Mean square error is an outright measurement of how well the model fits the observed system (Wu, 2020). Mean square error is calculated by summing the square of the prediction error (observed – predicted) and dividing by the total number of data entries in the set (Wu, 2020). Melville and Sindhwani (2017) note that (R)MSE emphasises greater absolute errors. Cichosz (2015) indicates that MSE has a slight disadvantage, namely the effect of changed scale due to squaring. This complicates that understanding of the errors. RMSE is simply the square root of the MSE value and is calculated by using equation (2-4) (Melville and Sindhwani, 2017). It is used more frequently due to its ease of interpretation, as the value is smaller (Wu, 2020). Cichosz (2015) explains that the monotonic nature of the root square function measures uniformly, therefore making RMSE and MSE virtually the same. The only difference occurs around the ease of interpretation (Cichosz, 2015). Σ{𝑖}(𝑃𝑖 − 𝑟𝑖) 2 𝑅𝑀𝑆𝐸 = √ (2-4) 𝑁 Where Pi = predicted value at i ri = observed value at i N = total number of entries 2.5.1.2 Mean absolute error/ mean absolute percentage error Melville and Sindhwani (2017) note that MAE is the most commonly used metric, which may be due to its straightforward nature (Cichosz, 2015). MAE is the averaged absolute difference between a set of observed values and predicted values and is given in equation (2-5) (Melville and Sindhwani, 2017). Wu (2020) states that MAE treats all errors in the same manner, as opposed to MSE, which squares errors to give larger penalisations to larger errors. 26 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Σ{ 𝑖 }|𝑃𝑖 − 𝑟𝑖| 𝑀𝐴𝐸 = (2-5) 𝑁 Where Pi = predicted value at i ri = observed value at i N = total number of entries 2.5.1.3 R square/ adjusted R square R square, also known as the coefficient of determination, indicates the variance explained by the model (Akossou & Palm, 2013). Values closer to 1 indicate a perfect model fit, whereas lower to negative values indicate a poor to inadequate model (Cichosz, 2015). R squared is biased, as noted by Akossou and Palm (2013), because it gradually increases as new variables are added to the model. Adjusted R square was introduced because R square did not cater for overfitting; added independent variables were penalised to curb this (Wu, 2020). 2.5.2 Confusion matrix and associated metrics for classification The confusion matrix is an appropriate starting point for evaluation, as it can accommodate 2- class or M-class classification problems and registers the associations between the classifier outputs and the actual label (Diez, 2018). Sirsat (2019) explains that the four outputs of a 2-class classification confusion matrix include true positive (TP), false positive (FP), true negative (TN), and false negative (FN), as illustrated in Figure 2-9a. TP represents the number of data points predicted correctly as positive, whereas false positives are the number of data points that have been incorrectly predicted as being positive where, in reality, they are negative. TN represents the number of data points predicted correctly as being negative, while a FN is the number of data points incorrectly predicted as negative where, in reality, they are positive (Sirsat, 2019). 27 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa (a) (b) Figure 2-9: Confusion matrix structure (a) for a 2-class classification and (b) for a 4-class classification problem (Sirsat, 2019; Diez, 2018). Diez (2018) explains that the M-class confusion matrix has elements nij – where i denotes row identifier and j the column identifier – that are indicative of cases correctly classified. Figure 2-9b depicts the correctly classified nij elements in diagonal shaded squares, whereas all other elements are misclassified. Even though the confusion matrix neatly displays all classifier output information, it is not convenient for comparison and discussion purposes (Diez, 2018). Therefore, additional metrics must be extracted from the confusion matrix. The mathematically expressed metrics facilitate in-depth evaluation criteria for a model (Sirsat, 2019). Sirsat (2019) differentiates among sensitivity, specificity, accuracy, and precision. Sensitivity measures the TP rate, which is the positive data points labelled as positive. Ideally, the TP value should be greater than that of the FN value so as to ensure a high sensitivity (Figure 2-10a). Specificity measures the TN rate, which is the negative data points labelled as negative. As in the case of sensitivity, specificity should have a high value (Figure 2-10b). Precision is the proportion of the total number of correctly predicted positive data points and the total number of predicted positive data points (Figure 2-10c) (Sirsat, 2019). Sirsat (2019) defines accuracy as the ratio of total number (probability) of predictions that are correctly predicted (Figure 2-10d), which can be calculated by using Equation (2-6) or Equation (2-7) as found in Diez (2018). Diez (2018) explains that classification accuracy (ACC) must be sensibly examined for the reason that it depends on the number of classes and cases. For instance, a 2-class classification problem will have a 50% chance of a case belonging in either class (Diez, 2018). 28 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 𝑇𝑃 + 𝑇𝑁 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = (2-6) (𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁) ∑𝑀 𝑖=1 𝑛𝑖𝑖 𝐴𝐶𝐶 = (2-7) 𝑁 Diez (2018) notes that Equation (2-7) has two drawbacks, namely that diagonal values are omitted and classes with reduced numbers of cases have a lower weight in the calculation. In contrast, Cohen’s kappa coefficient (ᴋ) utilises the entire confusion matrix and is calculated by using Equation (2-8) (Diez, 2018), where p0 is classification accuracy (ACC). 𝑝0 − 𝑝𝑒 𝜅 = (2-8) 1 − 𝑝𝑒 Equation (2-9) is used to calculate pe, a theoretical expected classification accuracy (Landis & Koch, 1977) where n:i is the sum of i-th column and ni: the sum of i-th row. ∑𝑀 𝑖=1 𝑛:𝑖𝑛𝑖: 𝑝 = (2-9) 𝑒 𝑁2 Landis and Koch (1977) partition the 0.00 to 1.00 scale of the kappa value into six categories, labelling each category as a measure of strength of agreement (Table 2-2). Table 2-2: Kappa value partitioning and associated labels (Landis & Koch, 1977) Kappa value Strength of agreement < 0.00 Poor 0.00 – 0.20 Slight 0.21 – 0.40 Fair 0.41 – 0.60 Moderate 0.61 – 0.80 Substantial 0.81 – 1.00 Almost Perfect 29 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Sirsat (2019) notes that the F1 score, that is, Equation (2-10), is a valuable measure to distinguish between models based upon their sensitivity and precision values and is calculated for each class or label. 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦 𝐹1 𝑆𝑐𝑜𝑟𝑒 = 2 × (2-10) 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦 (a) Predicted Class (b) Predicted Class Spam Not-Spam Spam Not-Spam Spam TP = 45 FN = 20 Spam TP = 45 FN = 20 Actual Class Actual Class Not-Spam FP = 5 TN = 30 Not-Spam FP = 5 TN = 30 (c) Predicted Class (d) Predicted Class Spam Not-Spam Spam Not-Spam Spam TP = 45 FN = 20 Spam TP = 45 FN = 20 Actual Class Actual Class Not-Spam FP = 5 TN = 30 Not-Spam FP = 5 TN = 30 Figure 2-10: Example output confusion matrix of a spam filter. (a) Sensitivity, (b) specificity, (c) precision and (d) accuracy (Sirsat, 2019). 30 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 2.6 Borehole parameters / geohydrological characterisation The present study focuses on water level and yield. Taylor and Alley (2001) and Kenda et al. (2018) state that water levels, particularly those found in observation wells, are primary indicators of hydrologic stresses that affect aquifers which, in turn, affect recharge, storage, and discharge. Sahoo et al. (2017) also comment on the importance of groundwater levels for sustainable planning and management. Taylor and Alley (2001) note a variety of factors that may influence groundwater levels. These include aquifer types such as confined and unconfined ones, the balance of recharge, storage and discharge from the aquifer, and the physical characteristics of the aquifer-- the latter which include porosity, permeability, and the thickness and composition of the geological units. Climatic and other hydrological factors also play a significant role in groundwater level fluctuations, such as the intensity and duration of precipitation events, the extent to which baseflow contributes to surface water bodies, and evapotranspiration. Countless variables are present in the complex system of groundwater dynamics, and Sahoo et al. (2017) recognise this: therefore, the need arises to investigate possible solutions to gather more knowledge regarding the inner workings of and relationships among these variables. One of the many imperative functions of a boreholes is water supply (Gaaloul et al., 2018). Yield is an important aspect to keep in mind in terms of a borehole used for water supply, which is dependent on the physical environment within which a borehole is located. Freeze and Cherry (1979) define groundwater yield as the maximum abstraction rate that may be allowed to safeguard against declining water levels where they are deemed to be unacceptable. Therefore, an indirect relationship exists between water levels and groundwater yield. As mentioned, fluctuations in water level indicate stresses on the system, which could impact water supply. 2.7 Geohydrological studies already conducted by using machine learning Numerous studies focusing on different fields within hydrology and geohydrology have been conducted with the aim of modelling and forecasting groundwater level fluctuations by using data- driven techniques. Sahoo et al. (2017) made use of the Multilayer Perceptron network to model fluctuations of groundwater levels in agricultural regions in the United States. Kenda et al. (2018) used a variety of regression algorithms to predict groundwater level fluctuations in the Ljubljana Polje aquifer in central Slovenia. Arabameri et al. (2019) achieved good predictions of groundwater potential in Iran by using machine-learning algorithms and a variety of different input data. 31 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Caté et al. (2017) used six machine-learning algorithms to predict the presence of gold in drill logs. Although this study does not interface directly with geohydrology, the same data-driven models may be used to characterise geological units that greatly influence groundwater. For instance, Bougher (2009) used gamma ray logs to train K-nearest neighbour classifiers to predict stratigraphic units. The possibilities for the implementation of machine learning are nearly infinite when it comes to data-driven models. The input data can be altered to suit virtually any need, and the algorithms can be tweaked, adjusted, or completely changed to find the best fit for the purpose at hand. That is why research that employs machine learning and data mining is becoming popular. 2.8 Machine learning in the context of South African policy Policies regarding water are widespread in South Africa. The research does not target any specific policy, but does support the National Water Act (RSA, 1998) in terms of sustainable use. Important principles of the National Water Act include the sustainable use water that allows for social and economic development, that every citizen of South Africa have access to water, and that the water being used must not be wasted. The National Water Act (NWA) is the foundation of water management within South Africa, with a large focus on ensuring water availability to all users without stressing water resources, such as groundwater over abstraction. The NWA considers classification, reserve, international obligations, inter-basin transfers, strategic use, and future use before authorising any further water use. The machine learning techniques could assist in developing the groundwater supply in underdeveloped areas of good yielding aquifers to ensure the overall sustainability of the groundwater supply. 2.9 Conclusion Extant literature evidences that cost-effective and timely predictions are key aspects when it comes to the management of water resources. Data-driven modelling in a geohydrological context is a worthy endeavour in this respect and could, with a respectable degree of probability, facilitate the surveying and drilling of new boreholes. 32 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa CHAPTER 3: NATIONAL GROUNDWATER DATASETS This chapter examines the two main South African national groundwater datasets, namely the National Groundwater Archive (NGA) and the Groundwater Resources Information Project (GRIP). It is centred on data availability and discusses the available data and their quality. Spatial datasets that cover the whole of South Africa will also be discussed, as they may well contain critical data that are not necessarily captured by the other databases. 3.1 Data quality Data quality is an essential part of research, as it determines the overall quality as well as the replicability of the results (Oliveira et al., 2005; Rosli et al., 2016). Batini and Scannapieco (2016) state that the term quality has been defined as the ‘totality of characteristics of a product that bear on its ability to satisfy stated or implied needs’ or ‘fitness for intended use’. Rosli et al. (2018) note that any conclusions built upon poor quality data may be invalid. Therefore, it is crucial to gain understanding of that which data quality entails and how data can be scrutinised to ensure that these are of good quality. Batini and Scannapieco (2016) observe that data quality is frequently associated exclusively with accuracy, while they point out that it also relies on data completeness, consistency, and currency. According to Rosli et al. (2016) and Rosli et al. (2018), countless research has been conducted on the basis of publicly available data repositories and that issues have been raised about the quality of these datasets and how to overcome these. Quality issues include noise, missing data, incorrect data, duplicate data, and inconsistent data. Pipino et al. (2002) define various data quality dimensions as presented in Table 3-1 below. Table 3-1: Data quality dimensions (Pipino et al., 2002) Dimensions Definitions Accessibility the extent to which data are available, or easily and quickly retrievable the extent to which the volume of data are appropriate for the task at Appropriate amount of data hand Believability the extent to which data are regarded as true and credible the extent to which data are not missing and are of sufficient breadth Completeness and depth for the task at hand 33 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Concise representation the extent to which data are compactly represented Consistent representation the extent to which data are presented in the same format the extent to which data are easy to manipulate and apply to different Ease of manipulation tasks Free-of-error the extent to which data are correct and reliable the extent to which data are presented in appropriate languages, Interpretability symbols, and units, and clear definitions Objectivity the extent to which data are unbiased, unprejudiced, and impartial Relevancy the extent to which data are applicable and helpful for the task at hand the extent to which data are highly regarded in terms of their source or Reputation content the extent to which access to data are restricted appropriately to Security maintain their security the extent to which the data are sufficiently up-to-date for the task at Timeliness hand Understandability the extent to which data are easily comprehended the extent to which data are beneficial and provide advantages from its Value-Added use Rosli et al. (2016) further explain the need for additional information known as metadata. These the purpose, meaning, and context of data, therefore facilitating a better understanding of the data in question. Metadata also aim to avoid any misinterpretation that could arise from data (Rosli et al., 2016). 3.1.1 Measuring Data Quality Pipino et al. (2002) discuss three prevalent functional forms for the performance of objective assessments: simple ratio, minimum or maximum operation, and weighted average. The present project employs simple ratio, as the weighted average approach relies on a weighting factor assigned to a variable based on its overall importance, whereas the purpose of this study is to let the data mining algorithms detect the more important variables organically. Simple ratio quantifies the ratio between desired or undesired outcomes and total outcomes, and are represented in the convention of 1 and 0 where 1 is most and 0 least desirable. Data quality 34 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa dimensions evaluated by using simple ratio are completeness, concise representation, consistent representation, ease of manipulation, free-of-error, and relevancy (Pipino et al., 2002). Completeness can be categorised into three types: schema completeness, column completeness, and population completeness. Schema completeness is the most abstract perspective and measures the degree to which entries within rows and columns as a collective are complete and not absent. Column completeness is viewed from a data perspective and considers the absent values of individual columns within the table. Population completeness suggests that a column should contain a range of values entailing that, when values within the range are absent, the population is incomplete. Completeness of each of these three types can be calculated by using the ratio of incomplete items to total number of items, and subtracting the ration from 1, as found in Equation (3-1) (Pipino et al., 2002). 𝐼𝑛𝑐𝑜𝑚𝑝𝑙𝑒𝑡𝑒 𝑖𝑡𝑒𝑚𝑠 𝐶𝑜𝑚𝑝𝑙𝑒𝑡𝑒𝑛𝑒𝑠𝑠 = 1 − (3-1) 𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑖𝑡𝑒𝑚𝑠 Consistent representation centres on identical data values that are represented in the same format throughout the entire table. Consistency can be measured by taking the ratio of violations (with regard to a consistency type) to total number of consistency checks and subtracting the ratio from 1, as found in Equation (3-2) (Pipino et al., 2002). 𝑉𝑖𝑜𝑙𝑎𝑡𝑖𝑜𝑛𝑠 𝐶𝑜𝑛𝑠𝑖𝑠𝑡𝑒𝑛𝑐𝑦 = 1 − (3-2) 𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑜𝑛𝑠𝑖𝑠𝑡𝑒𝑛𝑐𝑦 𝑐ℎ𝑒𝑐𝑘𝑠 The free-of-error metric portrays data accuracy and correctness and can be calculated by dividing the number of erroneous data units by the total number of data units and subtracting from 1, as found in Equation (3-3) (Pipino et al., 2002). Pipino et al. (2002) also note that clearly defined sets of criteria are required to establish that which is a data unit and subsequently that which will be an error for that data unit. Thus, a degree of precision must be specified to ensure a threshold so as to determine when a data unit is correct, erroneous, or tolerable in a certain circumstance. 𝐸𝑟𝑟𝑜𝑛𝑒𝑜𝑢𝑠 𝑑𝑎𝑡𝑎 𝑢𝑛𝑖𝑡𝑠 𝐹𝑟𝑒𝑒-𝑜𝑓-𝑒𝑟𝑟𝑜𝑟 = 1 − (3-3) 𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑑𝑎𝑡𝑎 𝑢𝑛𝑖𝑡𝑠 35 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 3.2 National Groundwater Datasets and Data Availability 3.2.1 National Groundwater Archive The web-enabled National Groundwater Archive (NGA), managed by the directorate of Surface and Groundwater Information, is a centralised database containing most of South Africa’s boreholes, referred to as geosites by the NGA (NGA, s.a.(d)). The NGA was preceded by the National Groundwater Database (NGDB) comprising an estimated 225 000 boreholes (DWA, 2009) and registered users were allowed to capture information to further expand the database as well as update, view, and extract data (NGA, s.a.(d)). The NGA currently contains approximately 270 000 geosites. The purpose of the NGA is to assess national and regional groundwater resources with a view to the sustainable management of these valuable assets (NGA, s.a.(d)). Consider that, according to a report by DWA (2009), it is not compulsory to capture geosite information on the database. Therefore, no incentive exists for users to upload valuable groundwater information that they might possess. The report by DWA (2009) notes that a common problem with the NGDB centres on the lack of quality. Few geosite records are complete, critical data such as pumping tests and aquifer information are scarce, and a noticeable decline is seen in geosite capture data (Figure 3-1). This is a major cause for concern regarding data mining, as vast amounts of data are preferable to conduct analysis. Annual growth in records on groundwater database 30000 25000 20000 15000 10000 5000 0 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 Year Figure 3-1: Annual growth in NGDB and NGA records from 1985 to 2008 as adapted from DWA (2009) 36 Chané de Bruyn, M.Sc. Dissertation Number of records Centre for Water Sciences and Management – North-West University, South Africa The distribution of the geosites contained within the NGA is depicted in Figure 3-2. It is evident that a majority of the higher density distribution areas occur in the Limpopo Province and neighbouring provinces, which is also the area covered by the GRIP database. The latter will be discussed in the subsequent section. The NGA has a data disclaimer stating that the use of data is limited to academic, research, and personal purposes only. Permission should be requested from the Directorate: Surface and Groundwater Information if data are to be used for commercial purposes. The NGA also discloses that the data supplied has no implied warranty as to the suitability for purpose, accuracy, or completeness. Errors may be reported for corrections or enhancements to the Department of Water and Sanitation (DWS) (NGA, s.a.(a)). The NGA also provides a glossary that acts as metadata of a kind describing and clarifying the different attributes found in the database (NGA, s.a.(b)). The export options are quite extensive and geosites can be selected based on many different criteria including, but not limited to, drainage region and farm name or map number. Drainage region filters in terms of geosites based on quaternary catchment names, but several other criteria are available to filter by. After selecting the desired geosites, various attributes can be exported, including geosite information, water levels, abstractions, lithology, and so on. The site map (NGA, s.a.(c)) reflects the attributes available for export, as depicted in Figure 3-3. A comma-separated- value file (CSV) will be emailed to the user upon processing. Data can also be requested by means of email by completing a data request form. These include groundwater data and groundwater-chemistry data. 37 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 3-2: NGA borehole distribution and density per 10’ x 10’ grid (DWS, 2020). 38 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 3-3: NGA Site Map (NGA, s.a.(c)). 39 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 3.2.2 Groundwater Resources Information Project The Groundwater Resources Information Project (GRIP) was initially introduced in the Limpopo Province to collect and verify groundwater-related data with the goal of presenting these to engineers and planners for incorporation into studies (DWA, 2009). It aims to maintain a broad- gauged groundwater dataset consisting of verified data with the goal of making these freely available to its users (GRIP, s.a.) and ensuring the assimilation of groundwater data into the management of water resources. The DWA report (DWA, 2009) suggests that a GRIP for every province is desirable, and it has been introduced in Kwazulu-Natal and the Eastern Cape. However, implementation is hampered due to a lack of resources. The report goes on to mention that the GRIP in the Eastern Cape Province has been severely hindered as a result of the lack of resources such as funding and human resources (DWA, 2009). The GRIP database is divided into online data that are readily available for export and a request for further data. The online data acquisition process includes selecting an area such as a district municipality, H-area, local municipality or settlement, and exporting an Excel spreadsheet containing the desired information. The online data includes borehole name, alternative borehole names, coordinates, depth, latest water levels, yield, duty cycle, equipment, and water class. Data that are only available upon request include geology, borehole test data, chemical analysis, borehole construction logs, equipment, time series water levels, and photos. The distribution of boreholes within the GRIP database is illustrated in Figure 3-4. 40 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 3-4: GRIP borehole distribution 41 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 3.3 Available Data Discussion Data available from each database will be discussed in the context with the data quality measurements discussed in Section 0 - Rosli et al. (2016) further explain the need for additional information known as metadata. These the purpose, meaning, and context of data, therefore facilitating a better understanding of the data in question. Metadata also aim to avoid any misinterpretation that could arise from data (Rosli et al., 2016). Measuring Data Quality, to gain insights into the quality of each database and how this might affect any analysis performed by using data-mining. 3.3.1 National Groundwater Archive The NGA gives a data disclaimer (NGA, s.a.(a)) regarding the quality of the data contained in the database by stating that ‘all data is supplied with no expressed or implied warranty as to its suitability for purpose, geometric accuracy or completeness’. This database is extremely large and contains numerous attributes that may be exported. Performing the data quality measures set out in Section 0 on the entire database will be laborious. Therefore, the data quality measures will only be applied to a selected number of attributes deemed to have a significant influence on the geohydrological setting. For the purpose of comparison to the GRIP, data from the primary catchment areas A, B, and X will be consolidated into a single dataset on which the data quality measures will be performed. 3.3.1.1 Completeness 3.3.1.1.1 Schema completeness Within the confines of the Limpopo Province border, the total number of unique boreholes logged in the NGA database at the time of this study was 65 530. A total of 373 possible attributes exist for each borehole identifier in the database. Therefore, as mentioned, only a few critical attributes will be discussed. These attributes include lithology, water strike, chemistry, abstraction, discharge, depth, and water levels. For the selected attributes, a maximum of 1 507 190 data entries is possible. The total number of populated cells is 555 009. Therefore, the schema completeness factor equates to 0.37 for the selection of attributes when Equation (3-1) is applied, or 37%. The schema completeness statistics are summarised in Table 3-2. 42 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Table 3-2: Schema completeness results for a selection of the NGA located in the Limpopo Province Attribute Total Number of unique entries 65 530 Number of attributes per entry 22 Maximum number of possible data points 1 507 190 Number of functional data points 555 009 Schema Completeness Factor 0.37 3.3.1.1.2 Column completeness Equation (3-1) was used to calculate column completeness of the selected attributes for each of the columns. The results are tabulated in Annexure A, Table 8-2, and are sorted from the most complete to the least complete attributes. In summary, the column completeness is very poor with regard to the chosen attributes. However, it should be taken into account that the completeness factor is determined by the number of boreholes and that an incredible number of boreholes are present. Therefore, it may appear that an attribute such as, say, lithology is incomplete compared to the 65 530 boreholes, but there are 25 154 boreholes that enjoy lithology information. Although a column completeness factor of 1 is desirable, one may be able to conduct a reasonably good analysis of the lithology if so desired. A total of 21.74% of the columns are complete at a factor of 1 and this is attributed to the most basic information regarding the borehole such as the name and location. Figure 3-5 below depicts and compares column completeness. Column Completeness for selection of NGA attributes HCO3 Groundwater Occurrence Temperature pH Class pH Value Abstraction Quantity Fracturing Degree (Lithology) Depth To Bottom (Water Strike) Electrical Conductivity Weathering Degree (Lithology) Seepage Value (Water Strike) Discharge Rate Water Level Depth To Top (Water Strike) Water Strike Type Lithology Name Borehole Diameter Borehole Depth DataOwner Longitude Latitude GeositeType Identifier 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Column Completeness Factor Figure 3-5: Column completeness overview for a selection of the NGA 43 Chané de Bruyn, M.Sc. Dissertation Attributes Centre for Water Sciences and Management – North-West University, South Africa 3.3.1.2 Consistency No obvious inconsistencies were noted in the selection. No typographical errors were detected in the borehole names, which is important for the sake of ascertaining that all the information can be retrieved for the relevant borehole. Suspect data occurred in the depth to top of water strikes, since nine had a reading of 999.99 and forty-four 9999.99. These values were used to flag the fact that no data were available and only constituted 0.19% of the total, while it would in all likelihood not have had a profound impact if removed from data mining. Typographical errors play a major part in data inconsistency, and care should be taken to ensure their minimisation. A key component when it comes to avoiding errors is to confirm the language settings of the device being used, as this can cause unintended typographical errors. This may be the cause of faults around some entries in the diameter data, where entries such as 216 mm (8.5ö) are present and should rather read 216 mm (8.50)”. 3.3.1.3 Free-of-error Section 0 notes that a specific criterion is required in order to classify a data point as erroneous or not. Currently, no such a criterion exists for the NGA. As noted in section 3.2.1, the NGA specifies that erroneous data may occur. It was impossible at the time of conducting this study to determine this parameter, albeit a very important one. 3.3.2 Groundwater Resources Information Project Upon exporting the borehole data directly from the GRIP website, the Excel spreadsheets contain the following columns; GRIP site ID number, GRIP borehole number, H-area, quaternary catchment area, regional borehole number, alternative borehole number 1 & 2, farm name, farm number, province, district municipality, local municipality, settlement name, settlement ID, alternative settlement name, longitude, latitude, borehole depth, water level, water level date taken, depth to pump intake, discharge rate, duty cycle, daily abstraction, equipment, power, quality, and comments. An example of the spreadsheet can be found in Annexure B – GRIP database example. The time frame within which water level measurement entries fall, spans from 1900-01-01 to 2015- 05-15. Thus, GRIP covers a reasonable period of historic data, although up-to-date data will always be desirable. The date 1900-01-01 is used as a flag value to indicate that the date is unknown. Three other possible erroneous dates include 2088-08-13, 3005-03-09 and 1004-12- 44 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 09. These are most likely typographical errors, which are important causes for concern (Gardner, 1992). Lastly, a date occurs as 07/09/2009, which is not in accordance with the date style of the database. The majority of dates are in the format YYYY-MM-DD. This ties into the data quality parameter of consistent representation. There is also a plethora of zeros (‘0’) in the date column. No metadata are readily available to indicate the exact meaning of the zero, but it could be assumed that it is representative of a missing date for this column. Columns with the same zero value abound. The assumption is made that this is representative of missing values or measurements. This is based upon the fact that, for attributes with numerical data types such as borehole depth, water level, discharge rate, depth to pump intake, duty cycle and daily abstraction, no unpopulated cells occur within the Excel spreadsheet, while this could also be indicative of a missing value. Also, in multiple instances, the attributes depth to pump intake, discharge rate, duty cycle, and daily abstraction are populated by a zero value, therefore indicating missing measurements. 3.3.2.1 Completeness 3.3.2.1.1 Schema completeness The borehole data from both Limpopo and Mpumalanga yield a total borehole entry count of 25 431 unique boreholes. Each has 27 attributes, amounting to a maximum of 686 637 possible data points in the entire dataset: this number excludes the first column, as the latter contains only the number of boreholes. Of this, 573 898 cells are populated, which includes the zero values discussed in Section 0. There are 110 913 zero value data points. Hence, there are in actuality 462 985 data points and this equates to a dataset that is 67% complete in its unchanged original state. The GRIP dataset therefore has a completeness factor of 0.67 when calculated in terms of Equation (3-1). It should be noted that this completeness factor was calculated for all available data. The completeness factor will change in the event of removal of attributes that are deemed unimportant or redundant. The schema completeness statistics are summarised in Table 3-3. Table 3-3: Schema completeness results for the GRIP Attribute Total Number of unique entries 25 431 Number of attributes per entry 27 Maximum number of possible data points 686 637 Number of functional data points 462 985 Schema Completeness Factor 0.67 45 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 3.3.2.1.2 Column completeness Equation (3-1) was also used for each of the columns in an unaltered GRIP dataset. The results are tabulated in Table 8-3 in Annexure A and are sorted from the most complete to the least complete attributes. A total of 46.43% of the columns are complete at a factor of 1. To visualise and compare column completeness, please see Figure 3-6. Column Completeness for the GRIP Alternative settlement name Alternative borehole number 1 Regional borehole number Alternative borehole number 2 Quality Daily abstractions Duty cycle Discharge rate Depth to pump intake Water level date taken Water level Borehole depth Equipment Comment Power Latitude Longitude Settlement ID Settlement name Local municipality District municipality Province Farm number Farm name Quaternary catchment area H Area GRIP borehole number GRIP site ID number 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Column Completeness Factor Figure 3-6: Column completeness overview for the GRIP 3.3.2.2 Consistency GRIP contains several inconsistencies. Various columns contain text data where some values are placed between two quotation marks. For example, the majority of the farm name column entries do not contain quotation marks, but multiple data points do contain them. These values do not cause duplicates upon removing the quotation marks. The export engine adds the quotation marks around text strings if spaces are present. Other entries could be considered to be duplicates as a result of typographical errors such as spelling mistakes or different spelling of the same name. This is only the case where farm names are considered, though, and will likely not 46 Chané de Bruyn, M.Sc. Dissertation Attributes Centre for Water Sciences and Management – North-West University, South Africa affect the outcome of data mining in the present project. One example of this can be found in the farm name column where "ZEBEDIELA ESTATES" and "ZEBEDEIELA ESTATES" are contained within the same column. Where alternative borehole names are concerned, inconsistencies occur regarding spacing within the names. On average, and in most cases, the alternative names consist of a combination of characters and letters with spaces absent. Values that once again contained quotation marks, also presented with spaces in the name. This did not cause duplicates but, as mentioned, it is worth noting this inconsistency. Spaces are considered to be characters within an entry within a cell in Excel, and this could influence formulas and analyses performed on a particular set of cells. The problem of analysing consistency within datasets centres on not knowing whether an entry can be considered to be an inconsistency or not. If a space is present somewhere in the borehole ID name, the assumption could be made that, in all likelihood, it amounts to a typographical error and can be easily fixed. Where names are concerned, such as farm names, it can be assumed that, in all likelihood, two or more farms can exist with similar names spelled differently. Therefore, care needs to be exercised when analysing inconsistencies and attempting to fix them. 3.3.2.3 Free-of-error Accuracy is difficult to examine on such a large scale since it is hard to be knowledgeable about the process of measurements that had been followed. 3.4 Spatial datasets Spatially distributed data are key components in many research fields. Visualisation of data on a geospatial level may offer crucial insights that might not have been immediately apparent in terms of tabled or a two-dimensional structure. Spatial datasets usually cover a single area of interest such as evaporation. Consequently, data must be congregated from a variety of different sources, as no single dataset will contain each desired attribute. Many factors that could potentially influence the groundwater regime occur on surface level, such as evaporation, rainfall, and runoff, to mention a few. This is confirmed by Lerner and Harris (2009), who connect groundwater and the landscape, noting that anthropogenic activities such as urbanisation tampers with recharge. Therefore, it is essential to incorporate data from spatial datasets into that of the NGA and GRIP databases so as to gain a comprehensive dataset that would expectantly aid the prediction of groundwater levels and the establishment of relationships within the complex system. 47 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Various spatial datasets are publicly available from government entities such as the Department of Forestry, Fisheries and the Environment, which established an E-GIS website containing multiple environmental geospatial datasets (DFFE, 2022). These datasets include land cover ranging from 1990 to 2020, the change in land cover, and other data such as protected areas. Land cover, whether natural or manufactured, influences recharge value (Lerner & Harris, 2009). Lerner and Harris (2009) note that urban surfaces are by and large impermeable, causing less rainfall to penetrate the earth beneath and lowering recharge values. However, it should be noted that the construction and maintenance of urban areas differ widely, also influencing recharge. Lerner and Harris (2009) further explain that vegetation plays a critical role in recharge, and so does agriculture. It has been surmised that croplands have higher recharge rates than those of native vegetation (Lerner & Harris, 2009; Kim & Jackson, 2012). Therefore, land cover is critical to incorporate into the final set used for data mining. The land cover data obtained from DFFE (2022) have been generated by using automated mapping models and Sentinel 2 satellite imagery (DFFE, 2021). The resolution is 20 m and has a calculated accuracy of 85.47% (DFFE, 2021). The Water Research Commission authorised a study regarding the water resources of South Africa, Lesotho, and Swaziland, colloquially known as WR2012. A web-enabled system has been established to provide water resource specialists with all the data, including maps, water resource models, and other tools that resulted from the study (WRC, 2012). The GIS maps available include many geohydrological data such as transmissivity, recharge depth grids, groundwater volume of aquifers, and so on. Other important datasets include rainfall, evaporation, runoff, vegetation, and geology. Consider that many of the groundwater parameters made available by the WR2012 originate from the Groundwater Resources Assessment Phase II, also known as GRAII. Finally, it has to be considered that not all GIS datasets necessarily cover the entirety of South Africa. 48 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa CHAPTER 4: METHODOLOGY The purpose of the present study is to research the validity of data-based approaches, as discussed in the literature review, to substantiate a cause-and-effect relationship between borehole parameters and the geological and geographical settings. This will be achieved by means of a desktop study and investigating data mining and machine-learning techniques. As the name implies, data mining requires extensive amounts of quantitative and qualitative data. For this study, data will be gathered from three different sources, namely the NGA, GRIP and spatial GIS data available for South Africa. Subsequently the data will be processed on an individual basis first to ensure that there are no duplicate entries and to clean up erroneous data. Entries with insufficient data will also be removed during this phase. These datasets will then be into a single dataset in a format that will be usable for the next phase. The next phase is model building. The statistical language R will be used in and integrated development environment like RStudio to compile scripts, where the algorithms discussed in the literature review are implemented by using the data from the previous phase. The algorithms discussed can be classified as either regression or classification algorithms and, in some instances, they can be used for both. Classification and regression model analyses and evaluation will be done. The results will be compared for each instance in order to establish the algorithm that is best suited for the purpose of predicting a continuous numerical water level or yield as well as the best-suited algorithm for predicting a class of water level or yield. Once regression and classification algorithms have been established, the methodology will have been concluded, and implementation will be conducted on three case study areas in South Africa, as found in the next chapter. 4.1 Data acquisition Publicly accessible data will be utilised in this study to ensure a degree of replicability. Data such as this are found in two major databases, namely the NGA which is developed and maintained by the Department of Water and Sanitation, and the GRIP database which is centred around borehole data specific to the Limpopo Province. Another data source used for this study is spatially distributed data. There are numerous repositories for Geographic Information System (GIS) data found online and were discussed in the previous section. The WR2012 was the predominantly 49 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa used source as this database contains numerous valuable geospatial data with a focus on geohydrological data. The acquisition process of data from the various sources will be discussed in the sections below. Section 4.2 will focus on their consolidation into a single set. 4.1.1 NGA data acquisition process For the purpose of consolidating data from various databases into one single dataset, only regions that would correspond geographically with the GRIP database were chosen for export. The GRIP database covers the Limpopo Province and a small area of the Mpumalanga Province. Therefore, primary catchment areas A, B, and X were chosen as geographic areas whereby borehole information was selected and exported. For the selection from the NGA, all quaternary catchments within the primary catchments A, B, and X were used. The exported boreholes and their distribution within the Limpopo Province are indicated in Figure 4-1. Figure 4-1 also depicts the overlap between the NGA boreholes and those of the GRIP database. After selecting the geographic area from which boreholes were to be exported, attributes were selected based upon their relevance to the geohydrological setting. Thus, not every available feature was selected for export. It should be taken into account when exporting data from the NGA that it should be done conservatively with regard to the number of attributes chosen in comparison to the number of boreholes. That is, if the number of selected boreholes is large and, if all the desired attributes were to be selected within one export, it would not execute. It is assumed that the server could not send a file of compiled data that were too large. When this was the case, multiple exports were conducted. After retrieving all the desired data in CSV format, the files were imported into Microsoft Access on the basis of the sections in terms of which they were exported. The data processing phase will be discussed further in Section 4.2. 4.1.2 GRIP data acquisition process The search function on the GRIP database webpage allows the user to export borehole data according to district municipality, H-area, and quaternary catchments. For the sake of thoroughness, all three categories were used. The data stretched across the Limpopo Province and parts of the Mpumalanga Province, as illustrated in Figure 4-1. The files were easily exported to a CSV file for further processing. 50 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 4-1: Distribution and overlap of boreholes from both the NGA and the GRIP databases 51 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 4.1.3 GIS data acquisition process Additional data were collected from spatial datasets, as it was assumed that surface features such as rainfall and recharge, vegetation, and land use might have an impact on the borehole parameters. Various geological and geohydrological data were also captured within spatial datasets, aiming to capture subsurface attributes within a vector or raster file. Numerous geohydrological related maps were downloaded from the WR2012 website (WRC, 2012) and imported into GIS software such as QGIS (QGIS Development Team, 2021). The aim was to create an additional dataset with the combined boreholes of the NGA and GRIP which would be used to generate a GIS feature database. All the relevant GIS data were loaded into QGIS along with the borehole locations. QGIS has a join attributes by location function in its data management tools menu that joins a vector layer to a base layer, that is, it joins the attributes of a desired layer to the attributes table of the borehole layer. The resulting output layer was then used as the base layer for appending subsequent attributes. This process was repeated until all the desired attributes had been appended. For raster layers, the sample raster values function was used where the same concept applies as that of the vector layers. A single layer would be the end result, which was exported as a CSV file. Figure 4-2 reflects the process of assigning raster and vector data from various layers to a specific borehole. Figure 4-2: Assignment process of GIS data to a single borehole 52 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 4.2 Data processing Data processing of the acquired data underwent several phases. Firstly, data of each base were processed separately to ensure that the final dataset from a specific source contained no duplicate entries. Subsequently, all three sources were assembled into one single dataset. Further processing took place to ensure that duplicate or similar attributes were discarded. 4.2.1 Data processing - Phase 1 Phase 1 involved an assemblage of data gathered from a source into a single representative dataset. This had to be done for each of the three sources. In the case of the NGA, 11 different datasets needed to be processed for multiple values per entry and assembled into one dataset. The GRIP and GIS files needed to be merged into one dataset as well. 4.2.1.1 NGA Microsoft Access was used to create separate datasets for each of the features within the NGA. Microsoft Excel also has this capability, but the reason for using a database management system such as Microsoft Access was that it supports relational databases. Microsoft Access allows the user to easily run queries on a dataset to calculate the average of the values for a unique entry or other functions such as minimum or maximum values, first or last values, and so on. This proved useful for exported files like those of water levels and water strikes which had multiple entries for the same borehole. The following datasets were identified as containing multiple values; abstractions, depth and diameter, discharge rate, field measurements, lithology, pumping test details, water strike, and water levels. In each instance, the average was calculated to gain a single representative value for each borehole. This however proved difficult for data such as lithology. In the case of static water level data, the data type is numeric. The number of water levels observed for a borehole is inconsequential, because any number of observed water levels can be averaged to gain a single representative value of the static water level. The case for the lithology and log data were not found to be a matter of numerical data which can be averaged, since it consists of character data or text strings. Furthermore, each borehole had a differing length of log entries, where one borehole could for example have five lithology bands noted and another twenty or more, whereas each lithology occurred at a different depth. This proved difficult to convert to a logical value for an algorithm to process and was therefore omitted from this study. 53 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Although the other datasets did not need to have any queries run on them, they were also added to Microsoft Access for the ease of exporting one single dataset. After averaging all the necessary attributes, the geosite was used as the primary key whereby all the other attributes would be returned and exported into a single set that would be representative of the NGA set. 4.2.1.2 GRIP The GRIP data acquisition generated separate datasets based upon district municipality, H-area, and quaternary catchments. By using Microsoft Excel, each of the files were imported into the same spreadsheet. The remove duplicates function was performed on all the imported data to ensure that no duplicate entries were present. Upon verifying this, the GRIP dataset consisted of 25 431 unique entries. 4.2.1.3 GIS No processing was needed for the GIS data during this phase due to the fact that all duplicate boreholes were removed before importing into GIS. 4.2.2 Data processing - Phase 2 Phase 2 involved an assemblage of the three datasets from Phase 1 into one single dataset. Data from all three sources were consolidated into a spreadsheet upon which further processing was done. Elimination of similar features that occurred between the different datasets was necessary. This process was repeated for all instances of similar features. Furthermore, columns were examined for missing values, and those that indicated a poor completeness factor were removed, as there were too little data to use meaningfully in the data-mining process. Some variables were found to be numerical in nature, such as elevation, mean annual precipitation (MAP), depth, and so on. Other variables are categorical or have a text value which must be assigned a number. However, these categories or assigned indexes must ideally not be interpreted as a numerical value but, rather, as a factor value. Care had to be taken during the assembling of scripts to ensure these values were assigned as factor data types. 4.2.3 Data processing - Phase 3 The aim of this study, as indicated, is to classify relationships between the relevant geological setting and borehole parameters such as water level and yield. Therefore, Phase 3 included the creation of a dataset for each dependent variable, such as water level and average yield of water 54 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa strike. For each dataset, the independent variable was placed in the last column for ease of use in the model scripts. A wide variety of variables were available to choose from for the final dataset of each independent variable that was predicted, but keeping the aim of the study in mind, only parameters relevant to the geological setting were considered. Both regression and classification analysis were conducted for comparative reasons. Regression uses continuous data and classification requires the variable to be categorised. Therefore, the dependent variables were placed in appropriate classes. For classification, the dependant variable had to be a class. Therefore, categorising the variable into distinct classes was required. This was done for water levels and for water strike yields. Water levels were tested in terms of unit metres above mean sea level (mamsl), which ranged from approximately 50 mamsl to 2600 mamsl. This is a very broad range, and the classes had to be of a lower resolution, such as 50 m and 100 m. Yield was classed by using the same classes as those used in the hydrogeological map series, that is, five classes that were divided as reflected in Table 4-1 below. Table 4-1: Assigned yield classes Yield range Class 0.0 – 0.1 l/s A 0.1 – 0.5 l/s B 0.5 – 2.0 l/s C 2.0 – 5.0 l/s D > 5.0 l/s E The yield classes were chosen in this manner with a view to the distribution in each class. If equal interval classes had been chosen, most of the yield value would have fallen into one class, as depicted in Figure 4-3. The maximum yield class that was considered was > 5 l/s, since this was the maximum yield class expressed in the Geohydrological Map Series for groundwater occurrence. 55 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Yield classes based on hydrogeological Yield classes - 10l/s increments Yield classes - 5l/s increments map series 6000 5000 2500 4500 5000 4000 2000 3500 4000 3000 1500 3000 2500 2000 1000 2000 1500 500 1000 1000 500 0 0 0 A B C D E A B C D E A B C D E F G H I J K Yield Class Yield Class Yield Class Figure 4-3: Distribution of yield values in different size classes 56 Chané de Bruyn, M.Sc. Dissertation Count Count Count Centre for Water Sciences and Management – North-West University, South Africa 4.3 Computer methods The data processing phase resulted in a final dataset – with a few variations for the classification algorithms - which would be used in all models. Five different algorithms were used to analyse and predict the dependent variables. The algorithms used were classified in terms of regression or classification or both. SVM and DT were found to be both regression and classification algorithms. Figure 4-4 below indicates the algorithms and their categorisation along with the R library used in the scripts. Figure 4-4: Types of machine-learning algorithms and the R libraries used in each R is a statistical language, and it was used to compile scripts and build all the models that are mentioned in Figure 4-4. RStudio was the integrated development environment (IDE) used to develop the models and the scripts were sourced from SuperDataScience (2020). The specific scripts used for each algorithm will be laid out in Annexure C – Model Scripts along with plots or other results that had been generated. 4.4 Algorithms A set of five algorithms was used to compare the accuracy of each and to determine which algorithm best suites the aim of this study, namely: • Decision trees • Baysian classifiers • K-Nearest neighbour • Support vector machines • Linear regression. 57 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Fuzzy logic and ANN were omitted from the methodology as elucidated by the literature review. According to extant literature, fuzzy logic uses immense amounts of memory to generate a rule- base, which increases exponentially for each independent variable in the database. Due to the size of the initial dataset to be used, this method was not suitable in this study. The same issue cropped up in the case of ANNs, while they also produce limited (if any) insight into the system studied. Algorithms were run with the datasets split to an 8:2 ratio to ensure an adequate number of data observations to train the models on. Therefore, 80% of the dataset was chosen at random for the training set, and the remaining 20% used to test or validate the model. Measurements of accuracy or error used here, as discussed in terms of the literature review (Section 0, included RMSE, MAE, MAPE, and Pearson correlation for regression models. For classification models, confusion matrices were used to illustrate the results of the model. The Kappa value and other metrics could be calculated by using the resultant matrices. The testing focused on static water levels and yield. The following generated sections indicated the model performance for each algorithm (for both static water level and yield), and briefly discusses how each model were set up within the parameters of the algorithms. Regression and classification models with the best performance were applied to three case studies. The procedure for compiling the scripts started with the use of the entire dataset and the elimination of parameters that seemed to have a neutral to negative effect on accuracy. It was expected that this would result in parameters that were the drivers behind the independent variable that was being investigated. 4.4.1 Static Water Level 4.4.1.1 Regression Starting off with the dataset containing all the available parameters, it was immediately apparent that the regression models predicted with high rates of accuracy. This was to be expected, since groundwater levels follow the natural topography and, therefore, elevation dominates the predictions. Omitting the elevation resulted in a lower Pearson correlation, validating the preceding statement. Nonetheless, the algorithms performed well without the elevation data. 58 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 4.4.1.1.1 Multiple Linear Regression The multiple linear regression (MLR) model is very simple to create without having to tweak additional arguments. The dataset was used and the dependant variable was set although, if refinement was desired, arguments such as weights, contrast, and offset were used as options. 4.4.1.1.2 Support Vector Regression For support vector regression (SVR) in terms of the e1072 library, two types of regression could be performed, namely nu and eps. The nu regressions performed marginally better than those of the eps regression. Furthermore, the kernel could be specified as linear, polynomial, radial basis, or sigmoid. The linear kernel performed better than the others. 4.4.1.1.3 Decision Tree Regression The rpart model facilitated control of the algorithm details such as minsplit, which is the minimum number of observations in a node before a split can be tried. The minimum split for the model was designated to start from 2, but there was no noticeable difference between a small and a large split. 4.4.1.1.4 Random Forest Regression The algorithm for random forest regression (RFR) has the option for selecting the number of trees. Multiple runs were conducted by using different totals of trees. A selection of 100 trees resulted in the best performing model. Consider that the larger the number of trees, the longer the model runtime is. 4.4.1.1.5 Regression model selection While initially omitting elevation, other variables were used singularly to establish the way in which the model reacted to variables such as geological parameters, transmissivity and storativity values, recharge, and annual precipitation. This one-on-one approach made it clear which variable had the greatest effect on the prediction. By using the variable with the greatest influence, other variables were added until the best correlation was attained. Variables that led to the best model performance in addition to elevation were mean annual precipitation, storativity grid values, and five sequential lithologies. Only these parameters were used in all four models to ensure that they could be compared on the basis of the same data. The final accuracy metrics are summarised in Table 4-2 reflecting that SVR was the model with the best performance and MLR 59 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa the second best. If elevation is omitted, RFR is the best performing model. For comparison purposes, the results were also compared with a Bayesian interpolator (Tripol1) and, in this case, the SVR performed better. Table 4-2: Water level regression model performance metrics Elevation used Elevation omitted Pear Regression MAPE Pearson son RMSE MAE RMSE MAPE MAE Algorithm (%) (%) (%) Multiple Linear 99.90 14.27 1.25 10.39 70.17 229.06 22.62 170.44 Support Vector 99.90 14.41 1.22 10.04 69.97 231.99 22.68 166.32 Decision Tree 97.49 72.23 6.53 50.12 77.42 205.32 20.11 148.04 Random Forest 98.96 48.25 2.72 19.56 90.16 142.59 12.35 89.34 Bayesian Interpolation 99.84 23.24 1.60 15.03 - - - - 4.4.1.1.6 Comparison with established geohydrological software The Bayesian interpolator in Tripol, which interpolates water levels based on elevation data, was also used to interpolate water levels. This was done as a measure for comparing the results of the machine learning models to those of established methods. The Bayesian estimation was used, where all its parameters were left at default. The Bayesian interpolation only uses coordinates, elevation, and water level elevation. No other parameters can be added. It should be noted that the interpolation results are only valid when high correlations between elevation and water levels exist. 4.4.1.2 Classification The parameters established during the regression model building phase were used to build the classification models in order to measure their performance in terms of the same set of parameters. The classification accuracy was excellent, albeit that, if a strong parameter such as elevation was omitted, some algorithms managed to predict results with fair accuracy. 1 Tripol is an interpolation application that performs Inverse Distance, Kriging, and Bayesian interpolations. 60 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 4.4.1.2.1 K-Nearest neighbour classification Use of the K-NN algorithm is fairly straightforward: the number of neighbours was the only significant argument to be tweaked. The number of neighbours (k) was iterated starting from 1 while increasing k with every iteration. It was found that k = 1 performed the best. 4.4.1.2.2 Support vector classification As discussed in Section 0 above, SVM has different types of classification and kernel types. C- classification performed better than nu-classification when combined with a linear kernel type. The algorithm has numerous arguments to fine-tune the model, but arguments are sometimes dependent on the type of classification and kernel. 4.4.1.2.3 Naive Bayes classification The naive Bayes classifier is also straightforward to use. Minimum arguments were to set the dataset as x and the class to be predicted as y. 4.4.1.2.4 Decision-tree classification The C4.5 algorithm was used in terms of the RWeka library by using the J48 classification tree learner. J48 has very few arguments to tweak, but it performs better than the rpart or ctree algorithms that were also tested. Therefore, model creation is simple. 4.4.1.2.5 Random-forest classification The randomForest algorithm has various arguments that can be tweaked to find the best- performing model. The number of trees used to give the best performance were 100 along with a node size of 1, which is recommended for classification. 4.4.1.2.6 Classification model selection As stated, only the parameters established to be critical were used for building the models. The accuracy rates were nearly perfect, and elevation was the primary driver. For comparison, the elevation was omitted from models to evaluate the extent to which the performance would change. SVM and Naive Bayes struggled in this regard, whereas the other three algorithms handled the omitted elevation relatively well. 61 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa The final accuracy metrics of the confusion matrices are summarised in Table 4-3. Random-forest classification (RFC) was found to be the best performing model, with decision-trees ranking at a close second. RFC was expected to outperform the latter, since it is an ensemble method that combines the outcomes of multiple decision trees. With regard to the elevation omitted, K-NN performed the best and decision-trees again ranked at a close second. Table 4-3: Water level classification model performance metrics Elevation used Elevation omitted Correctly Kappa Strength of Correctly Kappa Strength of Regression Algorithm Classified (%) value Agreement Classified (%) value Agreement K-Nearest Neighbour 89.94 0.89 Almost Perfect 67.21 0.64 Substantial Support Vector Machines 88.77 0.88 Almost Perfect 36.77 0.30 Fair Naive Bayes 83.86 0.82 Almost Perfect 31.98 0.26 Fair Decision Tree 91.50 0.91 Almost Perfect 66.40 0.63 Substantial Random Forest 91.52 0.91 Almost Perfect 58.75 0.55 Moderate 4.4.2 Average water strike yield The models created for water level predictions were also used to predict yield. The same approach was taken, where all available parameters were used and tested on a one-on-one basis. The most notable parameters were those obtained during pumping tests, namely transmissivity, storage coefficient, and specific capacity. Other parameters that seemed to influence the prediction were recharge, mean annual precipitation and runoff, baseflow per quaternary, lithology, and the count of water strikes present in the borehole. Interpolated transmissivity and storativity values were also tested, as opposed to the pumping test parameters, since many boreholes with yield data did not necessarily include these detailed parameters. In the case of these interpolated values, the model performance severely degraded, reinforcing the importance of the parameters that were established during pumping tests and their relation to yield. 4.4.2.1 Regression and model selection Table 4-4 summarises the accuracy metrics obtained from the regression models. Random forest regression was found to perform best showing a fair Pearson correlation of 63% and the lowest 62 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa RMSE and MAE values. The accuracy metrics for the interpolated pumping test values have severely degraded in performance. Table 4-4: Yield regression model performance metrics Pumping test parameters Interpolated S and T values Regression Pearson MAPE Pearson RMSE MAE RMSE MAPE MAE Algorithm (%) (%) (%) Multiple Linear 32.59 3.93 - 2.85 20.50 4.10 - 3.07 Support Vector 32.88 3.95 - 2.37 17.22 3.82 - 2.60 Decision Tree 57.24 3.51 - 2.19 18.86 4.09 - 3.10 Random Forest 62.91 3.44 - 2.01 22.54 4.69 - 3.23 4.4.2.2 Classification and model selection Table 4-5 presents the accuracy metrics for the classification models. The random-forest algorithm again showed the best performance for yield classification, with a classification accuracy of 59% and a moderate Kappa value. Table 4-5: Yield classification model performance metrics Pumping test parameters Interpolated S and T values Correctly Kappa Strength of Correctly Kappa Strength of Regression Algorithm Classified (%) value Agreement Classified (%) value Agreement K-Nearest Neighbour 49.94 0.29 Fair 38.95 0.13 Slight Support Vector Machines 52.13 0.31 Fair 36.66 0.10 Slight Naive Bayes 43.78 0.20 Slight 30.44 0.07 Slight Decision Tree 55.65 0.38 Fair 36.33 0.12 Slight Random Forest 58.76 0.42 Moderate 41.49 0.18 Slight 63 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 4.5 Assumptions and limitations The following assumptions and limitation are noted for this study: • The data gathered from the various databases are assumed to be comprehensive enough to achieve the objectives as presented in Chapter 1. • The study is limited by disregarding temporal aspects of data, so that the study is not seasonally bound. This is due to the fact that limited temporal parameters are available. • Only the most prevalent algorithms and associated libraries were considered for this study. • Algorithm arguments were generally used with default values and only adjustments of major arguments were considered. 64 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa CHAPTER 5: CASE STUDIES In order to test the methodology, it was applied to three case studies. It was expected that its validity would be apparent when the result of each case study was compared with the actual observed water levels. Study areas were chosen according to Vegter groundwater regions. These regions were delineated based on lithostratigraphic units and geological structures (see Nell & van Huyssteen, 2014), given that geology is a critical driver behind the groundwater characteristics found in an area (Dennis & Dennis, 2020). Therefore, each region has similar geohydrological responses, and are ideal to use as the main selection criteria, as they data within each region is assumed to be representative of the drivers within that specific area. A total of 64 groundwater regions occur in South Africa. In order to test the validity of the methodology, two areas with an abundance of data were used as well as an area with sparse data. Table 5-1 indicates the chosen groundwater regions and their respective number of boreholes as present in each groundwater region, along with the total of boreholes with water level data and yield data. Table 5-1: Borehole data distribution for chosen Vegter regions Boreholes with static Boreholes with yield Groundwater Region Total boreholes water level data data Lowveld 17 744 5 848 11 096 Eastern Bushveld Complex 11 656 3 156 7 225 Taung-Prieska Belt 1645 575 738 Each case study will be discussed separately. 5.1 Lowveld case study The first case study was conducted within the Lowveld groundwater region in view of the large amounts of data available. The locality of the region is depicted in Figure 5-1. 65 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 5-1: Locality map of the Lowveld groundwater region 66 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 5.1.1 Background The Lowveld groundwater region spans from the Limpopo Province to Mpumalanga. It covers an area of 35 462 km2 and comprises of 17 744 boreholes, of which 11 096 has water strikes and yield data, while 5 848 has static water level data (Figure 5-2). It should be taken into account that water level data are available for 7 476 boreholes, but the conditions under which the water levels have been captured are not all static. A majority of the water levels were captured during pumping tests at drawdown and recovery periods. These water levels were omitted from analysis as they are not representative of the naturally occurring water levels and would skew results. Useful information that could be applied from pumping test analysis included transmissivity, storage coefficient, and specific capacity estimations (see Figure 5-3). The density of the different borehole distributions is displayed in Table 5-2. Table 5-2: Borehole density for the Lowveld region Borehole with specific data Total boreholes Density (boreholes/km2) All boreholes 17 744 0.50 Boreholes with water level 7 476 0.21 Boreholes with static water level 5 848 0.16 Boreholes with yield 11 096 0.31 Boreholes with transmissivity 2 162 0.06 Boreholes with storage 2 260 0.06 Boreholes with Specific Capacity 2 276 0.06 Each borehole with static water level data has on average two water level entries. There are, however, boreholes with numerous entries. Most notably is borehole 2329BB00004, which contains 1091 static water level entries spanning over a 126-month period. The change in water level data for this borehole is displayed in Figure 5-4 and it was found that the borehole experiences a maximum drawdown of 3 m. Overall, static water level entries span a time period from 1950/03/21 to 2018/10/12, amounting to approximately 68 years of data, whereas the data are not continuous. 67 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 5-2: Borehole distribution in the Lowveld region – static water levels and yield 68 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 5-3: Borehole distribution in the Lowveld region – pumping test parameters 69 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Water Level over a period of approximately 10 years for borehole 2329BB00004 0 0.5 1 1.5 2 2.5 3 3.5 Date Figure 5-4: Time series water levels for borehole 2329BB00004 70 Chané de Bruyn, M.Sc. Dissertation Water Level (mbgl) 1985-09-02 1985-10-31 1985-12-21 1986-02-21 1986-05-22 1986-08-21 1986-10-26 1986-12-01 1987-01-02 1987-02-04 1987-04-02 1987-05-22 1987-08-13 1987-11-27 1987-12-27 1988-06-29 1988-10-24 1989-01-04 1989-02-25 1989-06-05 1989-10-22 1989-11-16 1989-12-06 1990-01-03 1990-02-01 1990-02-28 1990-03-27 1990-04-29 1990-09-11 1993-04-08 1993-08-07 1993-10-07 1993-11-02 1993-11-08 1993-11-26 1993-12-17 1994-01-05 1994-01-23 1994-03-02 1994-03-10 1994-03-19 1994-03-28 1994-04-11 1994-04-23 1994-05-03 1994-05-16 1994-06-19 1994-07-17 1994-07-22 1994-08-01 1994-08-20 1994-09-17 1994-09-25 1994-10-03 1994-10-17 1994-10-26 1994-11-09 1994-11-16 1994-11-28 1994-12-12 1995-01-24 1995-01-30 1995-02-04 1996-02-15 1996-03-12 Centre for Water Sciences and Management – North-West University, South Africa 5.1.2 Water Level Predictions Groundwater levels generally follow the topography of the area. The elevation of the Lowveld ranges from 142 mamsl in the lowest area towards the south-east of the Lowveld, and highest at 1 878 mamsl north-western towards mountainous area (Figure 5-6). The correlation between water level and elevation can be seen in Figure 5-5. Several boreholes do, however, deviate from the correlation. This could be caused by the aquifer type in which the boreholes are located or anomalies in the environment, such as geology. Lowveld: Static water level vs elevation 1500 R² = 0.9938 1400 1300 1200 1100 1000 900 800 700 600 500 400 300 200 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 Elevation (mamsl) Figure 5-5: Lowveld static water level and elevation correlation 71 Chané de Bruyn, M.Sc. Dissertation Water Level (mamsl) Centre for Water Sciences and Management – North-West University, South Africa Figure 5-6: Lowveld elevation and drainage map 72 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa During the conceptualisation of the methodology, four critical drivers behind water level could be established by using machine learning and data mining. Elevation is the biggest influencer for water level as established by the clear correlation between elevation and water levels. The three other factors include storativity, mean annual precipitation and geology. These four parameters were used to predict static water levels with high accuracy rates. Maps indicating the geospatial distribution of these parameters are presented in Annexure D – Maps. Figure 5-7 below shows water levels predicted by using SVR, as well as interpolated water levels by using the Bayesian interpolation. Although both clearly predict water levels with good accuracy, that of the SVR is the closest to the observed water levels. Lowveld: Observed vs predicted water levels 1800 1600 R² = 0.9859 1400 R² = 0.9999 R² = 0.9938 1200 1000 800 600 400 200 0 0 200 400 600 800 1000 1200 1400 1600 Elevation (mamsl) Observed WL TRIPOL WL Predicted WL Linear (Observed WL) Linear (TRIPOL WL) Linear (Predicted WL) Figure 5-7: Lowveld predicted water level correlation 73 Chané de Bruyn, M.Sc. Dissertation Water Level (mamsl) Centre for Water Sciences and Management – North-West University, South Africa The SVR model predicts numerical static water levels with a Pearson correlation of 99.69% and an RMSE of 13.74. This clearly contrasts with the Bayesian interpolation, which has an RMSE of 25.11, indicating that the SVR model has the better fit. This may be due to the influence of the extra parameters such as storativity, mean annual precipitation, and lithology. The Bayesian interpolation showed stark deviations in the higher elevations of the Lowveld area. It interpolated numerous water levels being well above the elevation, instead of below the surface. The SVR model results tended to stay close to the trend line, and it could be assumed that elevation was the primary predictor in the model. Figure 5-8 illustrates the water level elevation, observed water levels, and predicted water levels per entry of the test set. A section of the graph has been magnified due to the density of the data in order to examine the way in which different models performed. In the highlighted section, the variance of the Bayesian interpolation can be observed. The water level was predicted to be above the surface level on numerous occasions, whereas the SVR model water level predictions were never predicted to be above surface level. There were, however, instances where the observed water level was noted to be above the surface. This could have been caused by an error in the database. Alternatively, water levels could have been measured with respect to the borehole casing, which is situated well above ground level, while no correction was done for casing length above ground. 74 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Lowveld: Water level predictions 1800 1700 1700 1600 1600 1500 1500 1400 1400 1300 1300 1200 1100 1200 1000 1100 900 1000 800 900 700 800 1100 1110 1120 1130 1140 1150 1160 1170 700 Index 600 500 400 300 200 100 0 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 Index Elevation Observed WL TRIPOL WL Predicted WL Figure 5-8: Lowveld numerical water level predictions 75 Chané de Bruyn, M.Sc. Dissertation Water level (mamsl) Water Level (mamsl) Centre for Water Sciences and Management – North-West University, South Africa With regard to classification, the random-forest model predicted classes of 100 m intervals with an accuracy of 91.78% and a Kappa value of 0.90, which was considered perfect (Figure 5-9). For comparative purposes, the elevation was omitted in another run of the model to determine the extent to which the model would cope in the absence of this critical parameter. The model performed relatively well, although a significant decrease in performance was noted with an accuracy of 60.10% and a Kappa value of 0.49, which is moderate. The conclusion can be made that, although elevation is a primary driver, the parameters of storativity, mean annual precipitation, and five subordinate lithologies are also key parameters to be considered in unexplored areas. OBSERVED CLASS 1168 A B C D E F G H I J K L M N O P R Tot CE UA A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 N/A N/A B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 N/A N/A C 0 0 15 2 0 0 0 0 0 0 0 0 0 0 0 0 0 17 12% 88% D 0 0 1 120 13 0 0 0 0 0 0 0 0 0 0 0 0 134 10% 90% E 0 0 0 5 270 11 0 0 0 0 0 0 0 0 0 0 0 286 6% 94% F 0 0 0 0 12 308 15 0 0 0 0 0 0 0 0 0 0 335 8% 92% G 0 0 0 0 0 6 190 4 0 0 0 0 0 0 0 0 0 200 5% 95% H 0 0 0 0 0 0 1 59 3 0 0 0 0 0 0 0 0 63 6% 94% I 0 0 0 0 0 0 0 5 58 1 0 0 0 0 0 0 0 64 9% 91% J 0 0 0 0 0 1 0 0 4 28 1 0 0 0 0 0 0 34 18% 82% K 0 0 0 0 0 0 0 0 0 5 17 0 0 0 0 0 0 22 23% 77% L 0 0 0 0 0 0 0 0 1 0 3 3 0 0 0 0 0 7 57% 43% M 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 2 50% 50% N 0 0 0 0 1 0 0 0 0 0 0 0 0 3 0 0 0 4 25% 75% O 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 N/A N/A P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 N/A N/A R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 N/A N/A Tot 0 0 16 127 296 326 206 68 66 34 22 3 1 3 0 0 0 1072 OE N/A N/A 6% 6% 9% 6% 8% 13% 12% 18% 23% 0% 0% 0% N/A N/A N/A PA N/A N/A 94% 94% 91% 94% 92% 87% 88% 82% 77% 100% 100% 100% N/A N/A N/A OCA 91.78% po 91.78% K 0.90 Perfect pe 0.19244567 Figure 5-9: Lowveld water level classification prediction 76 Chané de Bruyn, M.Sc. Dissertation PREDICTED CLASS Centre for Water Sciences and Management – North-West University, South Africa 5.1.3 Yield predictions The yield parameters were more challenging to establish than those of the water levels. The parameters that did seem to have the most influence on the yield were transmissivity, storage coefficients, and specific capacity (especially those established during the pumping test of the borehole), recharge, mean annual precipitation, mean annual runoff, quaternary baseflow, surface lithology, and subordinate lithologies. Starting off with the parameters gained from pump-tests such transmissivity, storage, and specific capacity, and by using a random-forest regression model, the Pearson correlation was 56.23% and the RMSE 2.93. Figure 5-10 illustrates the observed versus predicted yields. Another script was run while omitting the pump-test parameters and using interpolated transmissivity and storage values instead, so as to observe the extent to which the model would predict successfully around interpolated data. A substantially greater number of observations were available for the model to use, since all the boreholes with yield could be included, instead of only those that had yield as well as pump-test parameters. Although the RMSE was roughly the same at 2.69, the Pearson correlation was only 30.68%. Lowveld: Observed yield vs Predicted yield 30 25 20 15 10 5 0 0 50 100 150 200 250 Observation Index Observed Yield Predicted Yield Figure 5-10: Lowveld predicted yield 77 Chané de Bruyn, M.Sc. Dissertation Yield Centre for Water Sciences and Management – North-West University, South Africa The same thought process was followed for yield classification. By using the pump-test parameters, the random-forest classification model predicted results with an accuracy of 57.36% and a Kappa value of 0.40, which was considered to be fair, as presented in Figure 5-11. With regard to the use of interpolated transmissivity and storage values, the classification accuracy increased to 66.32%, whereas the Kappa value decreased to 0.18, which is slight. Therefore, the model clearly performed better when information of a greater accuracy representative of the environment was used, which was to be expected. OBSERVED CLASS 401 A B C D E Tot CE UA A 0 0 3 1 0 4 100% 0% B 0 13 20 3 0 36 64% 36% C 0 8 94 26 5 133 29% 71% D 0 0 34 59 28 121 51% 49% E 0 0 7 36 64 107 40% 60% Tot 0 21 158 125 97 230 OE N/A 38% 41% 53% 34% PA N/A 62% 59% 47% 66% OCA 57.36% po 57% K 0.40 Fair pe 0.29 Figure 5-11: Lowveld yield classification confusion matrix 5.2 Eastern Bushveld Complex Case study The second case study was conducted in the Eastern Bushveld Complex groundwater region, which also enjoys large amounts of data. The locality of the region is depicted in Figure 5-12. 5.2.1 Background The Eastern Bushveld Complex stretches across three provinces, namely Limpopo, Gauteng, and Mpumalanga. The largest portion of the region is situated within the Limpopo Province south of 78 Chané de Bruyn, M.Sc. Dissertation PREDICTED CLASS Centre for Water Sciences and Management – North-West University, South Africa Polokwane. It has an area of approximately 16 807 km2 and comprises 11 656 boreholes, of which 7 225 enjoys water strike and yield data. Water level data are available for 4 007 boreholes, but only 3 156 boreholes have static water level data (Figure 5-13). Data from pumping test analysis include specific capacity, storage coefficient, and transmissivity. Boreholes that have these data available are indicated in Figure 5-14. Table 5-3 summarises the borehole distribution regarding specific data and density. Table 5-3: Borehole density for the Eastern Bushveld Complex region Borehole with specific data Total boreholes Density (boreholes/km2) All boreholes 11 656 0.69 Boreholes with water level 4 007 0.24 Boreholes with static water level 3 156 0.19 Boreholes with yield 7 225 0.43 Boreholes with transmissivity 1 041 0.06 Boreholes with storage 1 013 0.06 Boreholes with specific capacity 1 045 0.06 On average, more than half of the boreholes with static water levels have one or two noted water levels. Some boreholes show a significantly greater number of static water level entries, but only a few have more than ten entries. The borehole with the most static water level entries in the Eastern Bushveld Complex is 2429BDC0001 with 204 entries. However, all 204 entries were noted during a three-day period. The same is true for 2429DBA001, which has 180 entries. The fact that these water levels were noted as static may be erroneous and could potentially have been based on a pumping test. The change in water levels for 2429BDC0001 are displayed in Figure 5-15. Regarding all static water level entries for the region, the data spans a time period from 1911/08/12 to 2018/09/04, amounting to approximately 107 years. 79 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 5-12: Locality map of the Eastern Bushveld Complex groundwater region 80 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 5-13: Borehole distribution in the Eastern Bushveld Complex region – static water levels and yield 81 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 5-14: Borehole distribution in the Eastern Bushveld Complex region – pumping test parameters 82 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Water Level over a period of approximately 10 years for borehole 2329BB00004 0 5 10 15 20 25 30 35 40 45 11/06/1998 11/06/1998 11/06/1998 11/06/1998 11/06/1998 11/06/1998 11/06/1998 11/06/1998 12/06/1998 12/06/1998 13/06/1998 07:59 08:18 08:40 09:02 11:07 12:03 13:01 13:50 07:00 12:30 03:30 Date Figure 5-15: Time series water levels for borehole 2429BDC0001 83 Chané de Bruyn, M.Sc. Dissertation Water Level (mbgl) Centre for Water Sciences and Management – North-West University, South Africa 5.2.2 Water level predictions Groundwater levels strongly correlate with topography, as depicted in Figure 5-16. Various boreholes deviated in the lower to mid elevations. The elevation of the region ranges from approximately 600 mamsl in the lower areas in the north-east of the region to 2 099 mamsl in the higher mountainous areas towards the west. Figure 5-17 shows the elevation in tandem with rivers and quaternary catchments. Eastern Bushveld Complex: Static water level correlation 1800 R² = 0.9971 1700 1600 1500 1400 1300 1200 1100 1000 900 800 700 600 500 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 Elevation (mamsl) Figure 5-16: Eastern Bushveld Complex static water level correlation 84 Chané de Bruyn, M.Sc. Dissertation Water Level (mamsl) Centre for Water Sciences and Management – North-West University, South Africa Figure 5-17: Eastern Bushveld Complex elevation and drainage map 85 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa By means of the methodology employed, elevation was found to be the crucial driver behind water levels. Storativity, mean annual precipitation, and geology were also major drivers. By using these four parameters, static water levels for the Eastern Bushveld Complex were predicted at relatively high accuracy rates. Refer to Annexure D – Maps, which indicates the geospatial distribution of these parameters. Figure 5-18 reflects the SVR algorithm and Bayesian interpolation predictions alongside the observed static water levels. Both SVR and the Bayesian interpolation predicted water levels were very close to those of the elevation, whereas the observed water levels deviated from the trend line to a greater extent. The SVR model and the Bayesian interpolation performed nearly identically: SVR performed marginally better. Eastern Bushveld Complex: Observed vs predicted water levels 1800 1700 R² = 0.9971 1600 R² = 0.9997 R² = 0.9998 1500 1400 1300 1200 1100 1000 900 800 700 600 500 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 Elevation (mamsl) Observed WL TRIPOL WL Predicted WL Linear (Observed WL) Linear (TRIPOL WL) Linear (Predicted WL) Figure 5-18: Eastern Bushveld Complex predicted water level correlation 86 Chané de Bruyn, M.Sc. Dissertation Water Level (mamsl) Centre for Water Sciences and Management – North-West University, South Africa The SVR model predicted numeric static water levels with a Pearson correlation of 99.86% and a RMSE of 13.96. Only a subtle difference occurred between predictions of the SVR model (Figure 5-19) and those of the Bayesian interpolation, which had a Pearson correlation of 99.85% and an RMSE of 14.64, therefore rendering the SVR model slightly better. This indicates a stronger correlation between water level and elevation than the other parameters, given that the Bayesian interpolation used only elevation as a parameter. Therefore, it could be presumed that elevation is the primary driver behind water levels in the Eastern Bushveld Complex. 87 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Eastern Bushveld Complex: Water level predictions 1800 1100 1700 1600 1500 1000 1400 1300 1200 900 200 225 250 275 300 1100 Index 1000 900 800 700 600 500 0 50 100 150 200 250 300 350 400 450 500 550 600 650 Index Elevation Observed WL TRIPOL WL Predicted WL Figure 5-19: Eastern Bushveld Complex predicted water level prediction correlation 88 Chané de Bruyn, M.Sc. Dissertation Water level (mamsl) Water level (mamsl) Centre for Water Sciences and Management – North-West University, South Africa Random-forest classification and water level class intervals of 100 m showed an accuracy level of 90.48% and a Kappa value of 0.89, indicating a perfect prediction (Figure 5-20). As in the case of the Lowveld case study, a second run of the model was conducted and elevation was omitted. The performance of the model degraded markedly, lapsing to an accuracy level of only 56.98% and a Kappa value of 0.51, which corresponds to a moderate prediction. The model showed a 33.5% decline in accuracy, whereas the Lowveld study only showed a 17.55% decline in accuracy. This supports to the hypothesis that elevation is a major driver for the Eastern Bushveld Complex water levels. OBSERVED CLASS 630 A E F G H I J K L M N O P Q R Tot CE UA A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 N/A N/A E 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 N/A N/A F 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 100% 0% G 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 3 100% 0% H 0 0 0 0 55 8 0 0 0 0 0 0 0 0 0 63 13% 87% I 0 0 0 0 3 96 3 0 0 0 0 0 0 0 0 102 6% 94% J 0 0 0 0 0 3 105 5 0 0 0 0 0 0 0 113 7% 93% K 0 0 0 0 0 0 4 71 4 0 0 0 0 0 0 79 10% 90% L 0 0 0 0 0 0 0 3 36 3 0 0 0 0 0 42 14% 86% M 0 0 0 0 0 0 0 0 0 56 4 0 0 0 0 60 7% 93% N 0 0 0 0 0 0 0 0 0 4 54 3 0 0 0 61 11% 89% O 0 0 0 0 0 0 0 0 0 0 0 53 2 0 0 55 4% 96% P 0 0 0 0 0 0 0 0 0 0 0 1 32 0 0 33 3% 97% Q 0 0 0 0 0 0 0 0 0 0 0 0 3 11 0 14 21% 79% R 0 0 0 0 0 0 0 0 0 0 0 0 0 3 1 4 75% 25% Tot 0 0 0 0 62 107 112 79 40 63 58 57 37 14 1 570 OE N/A N/A N/A N/A 11% 10% 6% 10% 10% 11% 7% 7% 14% 21% 0% PA N/A N/A N/A N/A 89% 90% 94% 90% 90% 89% 93% 93% 86% 79% 100% OCA 90.48% po 90% K 0.89 Perfect pe 0.12 Figure 5-20: Eastern Bushveld Complex water level classification confusion matrix 89 Chané de Bruyn, M.Sc. Dissertation PREDICTED CLASS Centre for Water Sciences and Management – North-West University, South Africa 5.2.3 Yield predictions As mentioned, the parameters for yield predictions were challenging to establish with the available data. The resulting parameter used included pump-test-established parameters such as transmissivity, storage coefficient, and specific capacity, recharge, mean annual precipitation, mean annual runoff, quaternary level baseflow, and geology. The random-forest regression model predicted numerical yields with a Pearson correlation of 72.09% and an RMSE of 3.31. Figure 5-21 illustrates the predicted yields versus the observed yields. Eastern Bushveld Complex: Observed vs predicted yield 30 25 20 15 10 5 0 Observed Yield Predicted Yield Figure 5-21: Eastern Bushveld Complex predicted yield The random-forest classification model could predict yield classes at an accuracy of 59.69% and had a Kappa value of 0.44, which is considered to be only moderate, as depicted in Figure 5-22. 90 Chané de Bruyn, M.Sc. Dissertation Yield Centre for Water Sciences and Management – North-West University, South Africa OBSERVED CLASS 191 A B C D E Tot CE UA A 1 0 3 1 3 8 88% 13% B 0 7 19 1 0 27 74% 26% C 0 4 52 11 3 70 26% 74% D 0 0 12 20 12 44 55% 45% E 0 0 1 7 34 42 19% 81% Tot 1 11 87 40 52 114 OE 100% 36% 40% 50% 35% PA 0% 64% 60% 50% 65% OCA 59.69% po 60% K 0.44 Moderate pe 0.28 Figure 5-22: Eastern Bushveld Complex yield classification confusion matrix 5.3 Taung-Prieska Belt case study The third case study was conducted in the Taung-Prieska Belt, also termed Dry Harts-Lower Vaal- Orange Lowland due to rivers that occur in the region. This area was chosen to contrast to the previous two areas, as it contains significantly less data and is located in a more arid region. The location of the region is depicted in Figure 5-23. 91 Chané de Bruyn, M.Sc. Dissertation PREDICTED CLASS Centre for Water Sciences and Management – North-West University, South Africa Figure 5-23: Locality map of the Taung-Prieska Belt groundwater region 92 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 5.3.1 Background The larger part of the Taung-Prieska Belt is located in the Northern Cape with a small portion of the northern section situated in the North West Province. The region has an estimated area of 19 206 km2 and contains 1 645 boreholes. Only 575 of these enjoy static water level data and 738 have yield data (Figure 5-24). There are no pumping test details available for any borehole within the region. Table 5-4 summarises the borehole distribution regarding specific data and the density. Table 5-4: Borehole density for the Taung-Prieska Belt region Borehole with specific data Total boreholes Density (boreholes/km2) All boreholes 1 645 0.09 Boreholes with water level 795 0.04 Boreholes with static water level 575 0.03 Boreholes with yield 738 0.04 Boreholes with transmissivity 0 0 Boreholes with storage 0 0 Boreholes with specific capacity 0 0 Only one water level is noted for the majority of boreholes (more than 80%). There are, however, boreholes with significant numbers of entry. Seven boreholes have entries that reach well above a thousand. The boreholes with the most entries is 2624DC00033 with 8 402 entries, spanning a time period of approximately 46 years. In total, 4 327 individual days of data is captured. The distribution and change in static water levels are depicted in Figure 5-25. Regarding all static water level entries for the region, the data spans a time period from 1913/11/05 to 2015/04/16, amounting to approximately 101 years. 93 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 5-24: Borehole distribution in the Taung-Prieska Belt region – static water levels and yield 94 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Water level over a period of approximately 46 years for borehole 2624DC00033 0 2 4 6 8 10 12 14 Date Figure 5-25: Time series water levels for borehole 2624DC00033 95 Chané de Bruyn, M.Sc. Dissertation Water Level (mbgl) 01/06/1956 01/06/1957 01/06/1958 01/06/1959 01/06/1960 01/06/1961 01/06/1962 01/06/1963 01/06/1964 01/06/1965 01/06/1966 01/06/1967 01/06/1968 01/06/1969 01/06/1970 01/06/1971 01/06/1972 01/06/1973 01/06/1974 01/06/1975 01/06/1976 01/06/1977 01/06/1978 01/06/1979 01/06/1980 01/06/1981 01/06/1982 01/06/1983 01/06/1984 01/06/1985 01/06/1986 01/06/1987 01/06/1988 01/06/1989 01/06/1990 01/06/1991 01/06/1992 01/06/1993 01/06/1994 01/06/1995 01/06/1996 01/06/1997 01/06/1998 01/06/1999 01/06/2000 01/06/2001 01/06/2002 Centre for Water Sciences and Management – North-West University, South Africa 5.3.2 Water level predictions Groundwater levels strongly correspond with the topography of the landscape, as illustrated in Figure 5-26. The elevation in the region ranges from 1 322 mamsl to 912 mamsl in the west where the Orange river flows out of the region. Figure 5-27 shows the elevation along with rivers and quaternary catchments. Taung-Prieska Belt: Static water level correlation 1250 R² = 0.972 1200 1150 1100 1050 1000 950 900 900 950 1000 1050 1100 1150 1200 1250 Elevation (mamsl) Figure 5-26: Taung-Prieska Belt static water level correlation 96 Chané de Bruyn, M.Sc. Dissertation Water Level (mamsl) Centre for Water Sciences and Management – North-West University, South Africa Figure 5-27: Taung-Prieska Belt elevation and drainage map 97 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa By implementing the present methodology and by using all four critical parameters, water levels could be predicted with fair accuracy. The SVR model and the Bayesian interpolation predicted relatively uniform results with some variation. The results are indicted in Figure 5-28. Annexure D – Maps indicate the geospatial distribution of the mentioned parameters. Taung-Prieska Belt: Static water level vs predicted water level 1250 R² = 0.972 1200 R² = 0.9924 1150 R² = 0.9964 1100 1050 1000 950 900 900 950 1000 1050 1100 1150 1200 1250 Elevation (mamsl) Observed WL TRIPOL WL Predicted WL Linear (Observed WL) Linear (TRIPOL WL) Linear (Predicted WL) Figure 5-28: Taung-Prieska Belt predicted water level correlation By using SVR, the regression model produced predictions with a Pearson correlation of 98.45% and an RMSE of 13.47. Only minor differences occurred between the results of the SVR model and those of the Bayesian interpolation (Figure 5-29). The Bayesian interpolation had a Pearson correlation of 98.10%, and an RMSE of 15.37. Given this small difference in results, it can be presumed that elevation is the primary parameter for predictions. 98 Chané de Bruyn, M.Sc. Dissertation Water Level (mamsl) Centre for Water Sciences and Management – North-West University, South Africa Taung-Prieska Belt: Water level predictions 1300 1200 1100 1000 900 0 10 20 30 40 50 60 70 80 90 100 110 120 Index Elevation Observed WL Predicted WL TRIPOL WL Figure 5-29: Taung-Prieska Belt water level prediction correlation 99 Chané de Bruyn, M.Sc. Dissertation Water Level (mamsl) Centre for Water Sciences and Management – North-West University, South Africa The random-forest classification model classified water levels on the basis of 100 m intervals with an accuracy of 87.83% and a Kappa value of 0.80, which is considered substantial (Figure 5-30). A second run was conducted to assess how well the model would perform with the critical parameter elevation omitted from the dataset. The model performed well, with an accuracy of 75.65%, which is regarded to be moderate, and a Kappa value of 0.61. This possibly established that the other parameters were of equal importance to elevation for a class-based water level. OBSERVED CLASS 115 I J K L M Tot CE UA I 0 0 0 0 0 0 N/A N/A J 0 5 2 0 0 7 29% 71% K 0 0 40 2 0 42 5% 95% L 0 0 4 49 1 54 9% 91% M 0 0 0 5 7 12 42% 58% Tot 0 5 46 56 8 101 OE N/A 0% 13% 13% 13% PA N/A 100% 87% 88% 88% OCA 87.83% po 88% K 0.80 Substantial pe 0.38 Figure 5-30: Taung-Prieska Belt water level classification confusion matrix 5.3.3 Yield predictions The methodology established parameters which possibly influenced the yield, including pump- test parameters such as transmissivity, storage coefficient and specific capacity, recharge, mean annual precipitation, mean annual runoff, quaternary level baseflow, geology, and the number of water strikes present. No pump-test parameter data were available for the region and these had to be omitted from the model. 100 Chané de Bruyn, M.Sc. Dissertation PREDICTED CLASS Centre for Water Sciences and Management – North-West University, South Africa The results were unexpectedly good considering the lack of such critical parameters. The random-forest regression model predicted numerical yields with a Pearson correlation of 58.43% and an RMSE of 0.44. Figure 5-31 illustrates the yield predictions versus observed yields. Taung-Prieska Belt: Observed yield vs Predicted yield 3.5 3 2.5 2 1.5 1 0.5 0 0 10 20 30 40 50 60 Index Observed Yield Predicted Yield Figure 5-31: Taung-Prieska Belt predicted yield The random-forest classification model predicted yield classes at an accuracy level of 77.61% and had a Kappa value of 0.57, which is considered moderate. The confusion matrix is displayed in Figure 5-32. 101 Chané de Bruyn, M.Sc. Dissertation Yield Centre for Water Sciences and Management – North-West University, South Africa OBSERVED CLASS 67 A B C D E Tot CE UA A 41 3 1 0 0 45 N/A N/A B 2 6 3 0 0 11 45% 55% C 0 4 4 1 0 9 56% 44% D 0 0 1 1 0 2 50% 50% E 0 0 0 0 0 0 N/A N/A Tot 43 13 9 2 0 52 OE 95% 54% 56% 50% N/A PA 5% 46% 44% 50% N/A OCA 77.61% po 78% K 0.57 Moderate pe 0.48 Figure 5-32: Taung-Prieska Belt yield classification confusion matrix 102 Chané de Bruyn, M.Sc. Dissertation PREDICTED CLASS Centre for Water Sciences and Management – North-West University, South Africa CHAPTER 6: RESULTS AND DISCUSSION This chapter critically evaluates the results obtained in the three case studies conducted, the latter as discussed in the preceding chapter. The present chapter discusses whether the hypothesis has been proved. 6.1 Water level modelling The methodology used here established that SVR was the best-performing algorithm to use for modelling static water levels whereas, for classification, the random-forest approach performed best. These were used to model the static water levels of three groundwater regions of which the results are presented in Table 6-1. Table 6-1: Static water level model results obtained from case studies Regression Classification Case Study Area Pearson Classification Kappa RMSE Correlation Accuracy Coefficient Lowveld 99.69% 13.74 91.78% 0.90 Eastern Bushveld Complex 99.86% 13.96 90.48% 0.89 Taung-Prieska Belt 98.45% 13.47 87.83% 0.80 The accuracy metrics for the models in each study area all occur in approximately the same order, despite the fact that the Taung-Prieska Belt had far less data available than the others. This indicates that the parameters established during the methodology could be used as indicators in unexplored areas in order to characterise the water table. It was observed during the case studies that elevation is the primary driver behind water table occurrence. Geology also factors into the water level, which is a parameter that was established during the methodology. The other parameters are MAP, which is a driving force behind groundwater recharge, and the storativity coefficient. Consider that water levels within groundwater databases are skewed. This engendered the high correlation with elevation, which holds true for unconfined aquifers, that is, those that are not confined by an impermeable layer. Uncased boreholes drilled in confined aquifers tend to mimic the behaviour of surrounding boreholes that are located in unconfined aquifers. Consequently, an equilibrium is reached in the system that skews the data. Ultimately, the data in the national 103 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa databases do not amount to a perfect representation of the system, because boreholes were not drilled and cased-off to ensure the integrity of the system and study the setting. Another set of issues that arises from the national groundwater dataset, specifically the NGA, is known artefacts. Geographic positions captured within the NGA are not necessarily accurately taken in the field. Many of the older boreholes were captured before the advent of GPS, and the centroid of the farm portion was used as the geographic location. Therefore, any data from the spatial datasets do not necessarily represent the setting in which the borehole is situated. Despite these issues, water levels are still modelled with high rates of accuracy. Nonetheless, these matters should be kept in mind in future. 104 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 6.2 Yield modelling Modelling the yield and establishing primary drivers of this parameter proved to be more challenging than in the case of the water levels. The random-forest algorithm performed best with regard to numerical yield value modelling and classification of yield. The results are indicated in Table 6-2. Table 6-2: Yield model results obtained from case studies Regression Classification Case Study Area Pearson Classification Kappa RMSE Correlation Accuracy Coefficient Lowveld 56.23% 2.93 57.36% 0.40 Eastern Bushveld Complex 72.09% 3.31 59.69% 0.44 Taung-Prieska Belt 58.43% 0.44 77.61% 0.57 The results of yield modelling are not as uniform as those of the static water levels. The Lowveld gave the most yield data, but does not have higher accuracy rates than those of the other two case studies. In fact, the Taung-Prieska Belt yield classification predicted fairly well for an area with exponentially less data compared to the other areas. Therefore, it might be assumed that, in contrast to static water level modelling, yield modelling does not suffer a ‘one size fits all’ model. The model should be adjusted to the regional setting. Although parameters have been established that could be important drivers, such as transmissivity, storage coefficient, specific capacity, geology, and recharge and baseflow, these cannot be considered as the primary drivers for any and all regions. Yield was limited by the fact that it is strongly dependent on geology, while the available lithological logs could not be used in the analysis, as described. Furthermore, borehole yields were averaged for the purpose of analysis, which could also have skewed the data. Although isolated model results seem to model yield with a considerable degree of accuracy, the overall result suggests that further investigation is needed to establish the primary drivers behind yield. 105 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa CHAPTER 7: CONCLUSIONS AND RECOMMENDATIONS The aim of this study was to mine the groundwater databases of South Africa and make use of data-driven modelling through machine learning, so as to classify relationships between borehole parameters such as static water level and yield and the surrounding settings. The model was compiled with the aim of utilising it for different regions across South Africa, and not to limit it to one area. Three case studies were conducted across South Africa to test the validity of the methodology proposed. A single dataset was created by using three different groundwater information sources, namely the NGA, the GRIP, and geospatial data. This dataset was then used to model static water levels and borehole yield by using five different machine-learning algorithms. During the modelling phase, certain parameters were identified that could influence borehole parameters. The parameter with the most influence was elevation, which was to be expected due to the distinct correlation between groundwater level and topographic elevation. Second most important were mean annual precipitation (MAP), storativity (as gridded values from GIS datasets), and five sequential lithologies. On evaluation of the model results and considering the relevant extant literature that has been reviewed here, the following conclusions are made: • The foundations of a well-performing model are input data. The quality of the data will considerably impact the results of a model. Even though large quantities of data were available during this study, the quality of the data remains unknown. • Borehole-specific geology could not be used in the modelling phase as it was too complex to gain a representative value which could be used by the algorithms. This was a critical dataset that could have influenced the results of the models. • Not all algorithms are equally suited to a single task. Naive Bayes and K-Nearest neighbour classification do not manage high-dimension data well. This was established during the study, as these, especially naive Bayes, consistently underperformed. • Although data-driven models are able to model the dependant variable of the system with a reasonable degree of accuracy, the drivers of the system are not immediately apparent. A trial-and-error approach must be taken in order to establish the drivers of a system, and even then, the interrelationships between the independent variables are not known. • Static water levels could be modelled with significant accuracy, but borehole yield drivers are still uncertain and require further research. 106 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa The machine learning models created for this study, have been adjusted for the selected parameters and was designed with the aim of generalisation. Although the three case studies represent different groundwater regions and tested the applicability of the models for different scenarios, South Africa has many different and diverse regions. The true generalisation of the models is still to be tested in other regions of the country. For example, the most prominent parameter used in the model is elevation and has been proved to be a determining factor in the regions tested. Yet, the dolomitic aquifers of the North West Province is notorious for their complexity in simulating understood processes. Groundwater levels are very different in these dolomite or karst aquifers and do not necessarily follow the topography. As per the literature review, there exists a hesitancy in the water resource sector to make use of data-driven models as opposed to the widely accepted process-based models. This hesitance may be in part due to the lack in competence of computer science skills. A similar challenge was faced during this study; therefore, it is recommended that members of the water resource community be exposed to these topics conventionally considered outside their field. Data quality and quantity has been a central topic in this study. Section 3 discussed the available data for South African geohydrological databases and the quality of their data. It may be concluded that certain regions, such as the Limpopo Province, enjoys extensive and relatively good quality data, where regions in, for example, the Northern Cape, may not have publicly available data on which to conduct through studies. On the basis of these, the following recommendations can be made: • Data quality is more important than data quantity. Good quality data should be used to refine the models. • The models created used the algorithms in their most basic form. Arguments for each algorithm can be tweaked to improve calibrate models. • Consideration should be given as to how complex data such as lithological logs can be represented in the dataset to be usable for the model. Finally, the hypothesis that applying data mining and machine-learning techniques on borehole data will improve a geohydrological characterisation of unexplored areas has tested true. 107 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa BIBLIOGRAPHY Aggarwal, C.C. 2015. Data mining: The textbook. Cham: Springer International Publishing. Akossou, A.Y.J. & Palm, R. 2013. Impact of data structure on the estimation R-square and adjusted R-square in linear regression. International Journal of Mathematics and Computation 20(3):84-93. Alaliyat, S. 2008. Video–based fall detection in elderly’s houses. Gjøvik: Gjøvik University College. (Thesis – MSc). Allwright, A., Witthueser, K., Cobbing, J., Mallory, S. & Sawunyama, T. 2013. Development of a groundwater resource assessment methodology for South Africa: Towards a holistic approach. Water Research Commission Report No. 2048/1/13. Arabameri, A., Roy, J., Saha, S., Blaschke, T., Ghorbanzadeh, O. & Tien Bui, D. 2019. Application of probabilistic and machine learning models for groundwater potentiality mapping in Damghan sedimentary plain, Iran. Remote sensing 11(24):3015-3049. Aranibar, L.A.Q. 1994. Learning fuzzy logic from examples. Athens, OH: Ohio University. (Thesis – MSc). Babovic, V. 2005. Data mining in hydrology. Hydrological Processes 19:1511-1515. Bagaria, J. 2019. Set theory. In: Stanford encyclopedia of philosophy. https://plato.stanford.edu/archives/spr2020/entries/set-theory/ Date of access: 26 Aug. 2020. Batini, C. & Scannapieco, M. 2016. Data and information quality – dimensions, principles and techniques. In: Carey, M.J. & Ceri, S., eds. Data-centric systems and applications. Cham: Springer International. pp. 5-7. Bougher, B.B. 2009. Machine learning applications to geophysical data analysis. Vancouver: The University of British Columbia. (Thesis – MSc). Bramer, M. 2016. Principles of data mining. 3rd ed. London: Springer. Brownlee, J. 2021. Regression metrics for machine learning. https://machinelearningmastery.com/regression-metrics-for-machine-learning/ Date of access: 20 Feb. 2023. 108 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Caté, A., Perozzi, L., Gloaguen, E. & Blouin, M. 2017. Machine learning as a tool for geologists. The leading edge 36(3):215-219. Cichosz, P. 2015. Data mining algorithms: Explained using R. Hoboken, NJ: Wiley. Dennis, R. & Dennis, I. 2020. Geo-statistical analysis and sub-delineation of all vegter regions. North-West University, K5/2745 Deliverable 6. Devi, G.K., Ganasri, B.P. & Dwarakish, G.S. 2015. A review on hydrologicl models. Aquatic Procedia 4:1001-1007. DFFE. 2022. GIS data downloads [Datasets]. DFFE. https://egis.environment.gov.za/data_egis/data_download/current DFFE. 2021. South African national land-cover 2020 accuracy report. Department of Forestry, Fisheries and Environment Public Release Report version 1.0.4. Diez, P. 2018. Smart wheelchairs and brain-computer interfaces: Mobile assistive technologies. Cambridge, MA: Academic Press. DWA. 2009. Review of GRA1, GRA2 and international assessment methodologies. Department of Water Affairs Report No. P RSA 000/00/11609/6. DWS (Department of Water and Sanitation). 2020 National groundwater archive (NGA) stored borehole distribution. https://www.dws.gov.za/Groundwater/data/boreholedist.aspx Date of access: 9 Nov. 2020. Elefteriadou, L. 2014. Mathematical and empirical models. In: An Introduction to traffic flow theory. New York, NY: Springer. pp. 129-135. Freeze, R.A. & Cherry, J.A. 1979. Groundwater. Englewood Cliffs, NJ: Prentice-Hall. Gaaloul, N., Eslamian, S. & Ostad-Ali-Askari, K. 2018. Boreholes. In: Bobrowsky, P.T. & Marker, B., eds. Encyclopedia of engineering geology. Cham: Springer. pp. 68-73. García, S., Luengo, J. & Herrera, F. 2015. Data preprocessing in data mining. Vol. 72. Cham, Switzerland: Springer International Publishing. Gardner, S.A. 1992. Spelling errors in online databased: What the technical communicator should know. Technica Communication 39(1):50-53. 109 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Gaur, P. 2012. Neural networks in data mining. International Journal of Electronics and Computer Science Engineering 1(3):1449-1453. Goltz, M. & Huang, J. 2017. Analytical modeling of solute transport in groundwater: Using models to understand the effect of natural processes on containment fate and transport. Hoboken, NJ: Wiley. GRIP (Groundwater Resource Information Project). s.a. About the GRIP project. http://griplimpopo.co.za/about/ Date of access: 11 Nov. 2020. Grus, J. 2015. Data science from scratch. Sebastopol, CA: O’Rielly Media. Hand, D.J. 2013. Data mining. In: El-Shaarawi, A.H. & Piegorsch, W.W., eds. Encyclopedia of Environmetrics. Hoboken, NJ: John Wiley & Sons, Ltd. pp. 1-4. Hand, D.J., Mannila, H. & Smyth, P. 2001. Principles of data mining. Cambridge, MA: MIT Press. Jing, H., He, X., Tian, Y., Lancia, M., Cao, G., Crivellari, A., … Zheng, C. 2022. Comparison and interpretation of data-driven models for simulating site-specific human-impacted groundwater dynamics in the North China plain. Journal of Hydrology 616:128751. Joyce, J. 2019. Bayes’ theorem. In: Standford encyclopedia of philosophy. https://plato.stanford.edu/archives/spr2019/entries/bayes-theorem/ Date of access: 3 Mar. 2020. Kapitanova, K, Son, S.H., Kang, K.D. 2012. Using fuzzy logic for robust event detection in wireless sensor networks. Ad Hoc Networks 10(4):709-722. Kenda, K., Čerin, M., Bogataj, M., Senožetnik, M., Klemen, K., Pergar, P., Laspidou, C. & Mladenić, D. 2018. Groundwater modeling with machine learning techniques: Ljubljana Polje aquifer. Proceedings 2(11):697-704. Khan, A.N., Kim, B.W., Rizwan, A., Ahmad, R., Iqbal, N., Kim, K. & Kim, D.H. 2023. A new method for determination of optimal borehole drilling location considering drillist cost minimization and sustaible groundwater management. ACS Omega 2023(8):10806-10821. Kim, J.H. & Jackson, R.B. 2012. A global analysis of groundwater recharge for vegetation, climate, and soils. Vadose Zone Journal 11(1):159-174. 110 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Kohavi, R. & Becker, B. 1996. Dataset for: Adult data set [Dataset]. UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets/adult Kumar, C.P. 2019. An overview of commonly used groundwater modelling software. International Journal of Advanced Research in Science, Engineering and Technology 6(1):7854- 7865. Landis, J.R. & Koch, G.G. 1977. The measurement of observer agreement for categorical data. Biometrics 33(1):159-174. Larose, D.T. 2005. Discovering knowledge in data: An introduction to data mining. Hoboken, NJ: Wiley. Larose, C.D. & Larose, D.T. 2019. Data science using python and R. Hoboken, NJ: Wiley. Lerner, D.N. & Harris, B. 2009. The relationship between land use and groundwater resources and quality. Land Use Policy 26(1):s265-s273. Maliva, R.G. 2016. Aquifer characterization techniques. Cham: Springer. Meinzer, O.E. 1934. The history and development of ground-water hydrology. Journal of the Washington Academy of Sciences 24(1):6-32. Melville, P. & Sindhwani, V. 2017. Recommender systems. In: Sammut, C. & Webb, G.I., eds. Encyclopedia of machine learning and data mining. New York: Springer. pp. 1047. Mijwel, M.M. 2018. Artificial neural networks – advantages and disadvantages. https://www.linkedin.com/pulse/artificial-neural-networks-advantages-disadvantages-maad-m- mijwel/ Date of access: 18 Mar. 2023. Mitchell-Guthrie, P. 2014. Looking backwards, looking forwards: SAS, data mining, and machine learning [Blog post]. https://blogs.sas.com/content/subconsciousmusings/2014/08/22/ looking-backwards-looking-forwards-sas-data-mining-and-machine-learning/ Date of access: 19 Oct. 2021. Müller, B., Reinhardt, J. & Strickland, M.T. 1995. Neural networks: An introduction. 2nd ed. New York, NY: Springer. 111 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Montgomery, D.C., Peck, E.A. & Vining, G.G. 2012. Introduction to linear regression analysis. 5th ed. Hoboken, NJ: Wiley. Neelamegam, S. & Ramaraj, E. 2013. Classification algorithm in data mining: An overview. International Journal of P2P Network Trends and Technology 3(5):1-5. Nell, J.P. & van Huyssteen, C.W. 2014. Geology and groundwater regions to quantify primary salinity, sodicity and alkalinity in South Africa. South African Journal of Plant and Soil 31(3):127- 135. NGA (National Groundwater Archive). s.a.(a) Data disclaimer. https://www.dws.gov.za/NGANet/Resources/Docs/Disclaimer.htm Date of access: 9 Nov. 2020. NGA (National Groundwater Archive). s.a.(b) Glossary. https://www.dws.gov.za/NGANet/Resources/Docs/Glossary.htm Date of access: 9 Nov. 2020. NGA (National Groundwater Archive). s.a.(c) Site map. https://www.dws.gov.za/NGANet/Resources/docs/SiteMap.htm Date of access: 9 Nov. 2020. NGA (National Groundwater Archive). s.a.(d) About us. https://www.dws.gov.za/NGANet/Resources/Docs/About%20Us.htm Date of access: 9 Nov. 2020. Noble, W.S. 2006. What is a support vector machine? Nature Biotechnology 24(12):1565-1567. Oliveira, P., Rodrigues, F. & Henriques, P.R. 2005. A formal definition of data quality problems. In: International Conference on Information Quality (MIT ICIQ Conference), Cambridge. https://dblp.org/rec/conf/iq/OliveiraRH05 Date of access: 31 Aug. 2020. Oyebode, O.K., Adeyemo, J.A. & Otieno, F.A.O. 2015. Comparison of two data-driven modelling techniques for long-term streamflow prediction using limited datasets. Journal of the South African Institution of Civil Engineering 57(3):9-17. Oyebode, O., Otieno, F. & Adeyemo, J. 2014. Review of three data-driven modelling techniques for hydrological modelling and forecasting. Fresenius Environmental Bulletin 23(7):1443-1454. Pipino, L.L., Lee, Y.W. & Wang, R.Y. 2002. Data quality assessment. Communications of the ACM 45(4):211-218. 112 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa QGIS Development Team. 2021. QGIS Desktop (Version 3.20.3) [Software]. Available at: https://www.qgis.org/en/site/forusers/download.html Rojas, R. 1996. Neural networks: A systematic introduction. New York, NY: Springer. Rosli, M.M., Tempero, E. & Luxton-Reilly, A. 2016. What is in our datasets? Describing a structure of datasets. In: Gedeon, T., eds. Conference proceedings. ACSW ’16: Australasian Computer Science Week Multiconference, Canberra, Australia. New York: Association for Computing Machinery. pp. 1-10. Rosli, M.M., Tempero, E. & Luxton-Reilly, A. 2018. Evaluating the quality of datasets in software engineering. Advanced Science Letters 24(10):7232-7239. RSA. 1998. National Water Act. In: Government Gazette. Pretoria, Republic of South Africa. Russell, S.T. & Norvig, P. 2010. Artificial intelligence: A modern approach. 3rd ed. Upper Saddle River, NJ: Prentice-Hall. Saha, S. 2018. What is the c4.5 algorithm and how does it work? https://towardsdatascience.com/what-is-the-c4-5-algorithm-and-how-does-it-work- 2b971a9e7db0 Date of access: 18 Mar. 2023. Sahoo, S., Russo, T.A., Elliott, J. & Foster, I. 2017. Machine learning algorithms for modeling groundwater level changes in agricultural regions of the U.S. Water Resource Research 53:3878- 3895. Sammut, C. & Webb, G.I., eds. 2017. Encyclopedia of machine learning and data mining. Boston, MA: Springer. Santner, T.J. & Duffy, D.E. 1989. The statistical analysis of discrete data. New York, NY: Springer Science+Business Media. Shirmohammadi, B., Vafakhah, M., Moosavi, V. & Moghaddamnia, A. 2013. Application of several data-driven techniques for predicting groundwater level. Water Resources Management 27:419-432. Sirsat, M. 2019. What is confusion matrix and advanced classification metrics? https://manisha- sirsat.blogspot.com/2019/04/confusion-matrix.html Date of access: 1 Apr. 2020. 113 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Sivakumar, B. & Berndtsson, R. 2010. Advances in data-based approaches for hydrologic modeling and forecasting. Singapore: World Scientific Publishing. Solomatine, D.P. & Ostfeld, A. 2008. Data-driven modellling: Some past experiences and new approaches. Journal of Hydroinformatics 10(1):3-22. Sun, J., Hu, L., Li, D., Sun, K. & Yang, Z. 2022. Data-driven models for accurate groundwater level prediction and their practical significance in groundwater management. Journal of Hydrology 608:127630. SuperDataScience. 2020. Machine learning A-Z: Download codes and datasets [scripts]. https://www.superdatascience.com/pages/machine-learning Taylor, C.J. & Alley, W.M. 2001. Ground-water-level monitoring and the importance of long-term water-level data. U.S. Geological Survey Circular No. 1217. Taylor, R.G., Koussis, A.T. & Tindimugaya, C. 2009. Groundwater and climate in Africa – a review. Hydrological Sciences Journal 54(4):655-664. Tehrany, M.S., Pradhan, B. & Jebur M.N. 2013. Spatial prediction of flood susceptible areas using rule based decision tree (DT) and novel ensemble bivariate and multivariate statistical models in GIS. Journal of Hydrology 504:69-79. Teng, X. & Gong, Y. 2018. Research on application of machine learning in data mining. In IOP conference series: Materials science and engineering, 392(6). Bristol: IOP Publishing. Provost, A.M., Reilly, T.E., Harbaugh, A.W. & Pollock, D.W. 2009. U.S. geological survey groundwater modeling software: Making sense of a complex natural resource. U.S. Geological Fact Sheet No. 2009-3105. Villholth, K.G. & Giordano, M. 2007. Groundwater use in a global perspective – can it be managed? In: Villholth, K.G. & Giordano, M., eds. The agricultural groundwater revolution: Opportunities and threats to development. Oxfordshire: Wallingford. pp. 393-401. WRC (Water Research Commission). 2012. GIS Maps [Geospatial dataset]. Water resources of South Africa, 2012 study. https://www.waterresourceswr2012.co.za/resource-centre/ Wheater, H., Sorooshian, S. & Sharma, K.D., eds. 2007. Hydrological modelling in arid and semi- arid areas. New York, NY: Cambridge University Press. 114 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Wu, S. 2020. 3 Best metrics to evaluate regression model? https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model- 418ca481755b Date of access: 20 Feb. 2023. Xu, Y. & Beekman, H.E. 2019. Review: Groundwater recharge estimation in arid and semi-arid Southern Africa. Hydrogeology Journal 27:929-943. Yan, X. & Su, X.G. 2009. Linear regression analysis: Theory and computing. Singapore: World Scientific. Zadeh, L.A. 1988. Fuzzy Logic. Computer, 21(4):83-93. Zaki, M.J. & Meira Jr, W. 2014. Data mining and analysis: fundamental concepts and algorithms. Cambridge, UK: Cambridge University Press. Zhu, M., Wang, J., Yang, X., Zhang, Y., Zhang, L., Ren, H., … Ye, L. 2022. A review of the application of machine learning in water quality evaluation. Eco-Environment & Health 1(2):107- 116. 115 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa ANNEXURES 8.1 Annexure A – NGA database Table 8-1: NGA available features for export Geosites Operational Geosite Construction Field Depth & Openings & Equipment Downhole Pumping Test Water Levels Abstraction Discharge Rate Reference Other Numbers Casing Fill Materials Developments Piezometer Linked to Bulk Water Strike Yield Test Recommendati Site Visits Owners Lithology Information Completion Measurements Diameter Screens Installed Geophysics Details Meter ons Construction Construction Measurement Piezometer Other Number Measurement Casing Column Casing Column Casing Column Development Piezometer Data Owner Of Pumping Test Yield Test Start Pumping Test Descriptor Data Owner Completion Reference Point Meter Type Reference Type Monitoring Type Completion Logging Date Visit Date Owner Name Date & Time Number Type Date Number Number Number Methods Number Linked Geosite Start Date Date Start Date Designator Date Date Starting Starting Yield Construction Reference Measurement Measurement Library Report Casing Collar Chemical Types Piezometer Identifier Of Pumping Test Yield Test Start Pumping Test Descriptor Identifier Measurement Other Number Data Source Depth To Top Depth To Top Installation Date Measurement Logging Unit Visit Reason Address Completion Height Date & Time Date & Time Number Height Used Purpose Linked Geosite Start Time Time Start Time Name Date & Time Method Date Ending Starting Ending Starting Construction Measuring Reporting Reporting Observed Depth To Depth To Development Decommissione Groundwater Logging Pumping Test Recommendatio Reporting Geosite Type Measurement Measurement Measurement Report Name Depth To Top Start Date Test Type Contact Details Event Date Completion Method Institution Institution Casing Bottom Bottom Date d Date Occurrence Company End Date n Date Institution Date & Time Date & Time Date & Time Date Ending Ending Source Of Reporting Construction Water Level Depth To Number Of Duration Of Depth To Water Strike Logging Pumping Test Static Water Reporting Mine Type Measurement Discharge Rate Measurement Report Date Assignor Depth To Top Fill Type Data Source End Date Recommendatio Institution Depth To Top Cost Status Bottom Openings Development Bottom Type Contractor End Time Level Institution Date & Time Date & Time n Address Penetration Logging Constant Yield Reporting Construction Drawdown Measurement Consultant's Depth To Opening Piezometer Reporting Purpose Depth To Confidential Data Source Discharge Type Information Gravel Pack Finish Type Seepage Value Company Test Method Test Total Institution Visit Date Method Period Depth Report Number Bottom Method Height Institution Indicator Bottom Available Address Duration Hours Contact Number Constant Yield Depth From Logging Reference Construction Reporting Discharge Electrical Test Total Protection Zone Recovery Period Located At Depth Qualifier Casing Material Opening Width Casing / Lining Pump Type Depth To Top Company Pump Type Site Visitor Lithology Name Datum Company Institution Method Conductivity Duration Up Gradient Collar Contact Number Minutes Paper Trace Coordinate Construction Piezometer Hour Meter Geology Other Casing Piezometer Depth To Pump Depth To Depth To Pump Protection Zone Site Visitor Data Source Ph Class Diameter Opening Length Reference Analysis Method Primary Colour Method Contractor Number Reading Indicator Material Length Intake Bottom Intake Down Gradient Address Number Construction Conversion Coordinate GPS Measurement Reporting Opencast Mine Opening Pump Power Total Blow Yield Data Cassette / Reporting Other Analysis Water Quality Site Visitor Company Factor / Ph Value Inner Diameter Installed Date Colour Qualifier Accuracy Date & Time Institution Length Diameter Source Value Video Number Institution Methods Class Contact Number Contact Details Constant Value Construction Starting Opencast Mine Screen Decommissione Pump Power Contribution Measurement Testing Specific Chemical Secondary Elevation Company Measurement Quantity Temperature Outer Diameter Width Manufacturer d Date Rating Value Type Company Capacity Suitability Colour Contact Number Date & Time Ending Seepage Elevation Power Meter Opencast Mine Deepest Interval Screen Pump Special Testing Composition Drilling Fluid Measurement HCO3 Inner Diameter Indicator For Transmissivity Pump Type Method Reading Depth Closed Specification Manufacturer Measurements Contractor Qualifier Date & Time Contribution Water Meter Underground Testing Elevation GPS Collective Pump Serial Depth To Pump Additives Water Level Data Source Mine Shaft Diameter Outer Diameter Company Storativity Fabric Qualifier Accuracy Reading(Measur Number Intake Depth Address ement) Testing Elevation Additional Water Meter Reporting Lining Collar Piezometer Pump Riser Data Source Tunnel Shape Company Specific Yield Duty Cycle Fabric Attribute Reference Point Additives Reading Institution Height Material Main Material Contact Number Other Water Meter Other Recommended Reporting Tunnel Cross- Pump Riser Leakage / Farm Name Construction Measurement Lining Material Piezometer Analysis Method Abstraction Material Type Institution Section Diameter Drainage Factor Method Reason Material Yield Latest Water Protection Level Measurement Other Lining Reporting Other Analysis Hydraulic Farm Number Tunnel Length Meter Type Recovery Period Formation Type Method Measurement Method Material Institution Methods Resistance Only Water Levels Other Protection Measurement Drain Casing Meter Serial Specific Hydraulic Operational Loss Town (with or without Method Status Length Number Capacity Conductivity Period Percentage pumping tests) Abstraction Collector Supplying Water Quality Piezometer Portion (with or without Transmissivity Loss Reason Diameter Company Class Number pumping tests) Supplying Water Level Other Loss Village Arm Diameter Storativity Step Number Contractor Drawdown Reason Supplying Water Level Geosite Status Arm Number Company Specific Yield Step Duration Measurement Texture Qualifier Address Frequency Date When Supplying Leakage / Average Monitoring Hardness Status Was Arm Length Company Drainage Factor Discharge Rate Period Qualifier Observed Contact Number Measurement Starting Geosite Bulk Meter Hydraulic Date & Time Particle Shape Status Date Indicator Resistance Type 116 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Water Meter Actual Ending Geosite Maximum Hydraulic Measurement Sorting Status Date Measurement Conductivity Date Value Rainfall Actual Reporting Water Quality Boulder Monitoring Measurement Institution Class Percentage Equipment Type Time Rainfall Other Cobble Map Number Elapsed Time Equipment Type Percentage Electronic Data Water Level Pebble Province Logger Type Percentage Other Electronic Registration Granule Data Logger Water Level District Percentage Type Electronic Data Quaternary Water Level Sand Logger Drainage Region Status Percentage Manufacturer Other Electronic Site Selector Data Logger Discharge Rate Silt Percentage Manufacturer Electronic Data Water Logger Serial Quantity Clay Percentage Consumer Number Measurement Degree Of Comment Method Weathering Other Degree Of Latitude Measurement Fracturing Method Water Level Longitude Monitoring Equipment Type Water Management Area Municipal District Old Municipal District New Surface Geology Hydrogeological Region Water User Association Geomorphology Land Cover Taste Of Water Intended Geosite Purpose Observed/Actua l Water Uses Drain Type Mine Groundwater Extraction Method Mine Status Mined Commodity Sinkhole Type Sinkhole Cause Sinkhole Probable Trigger Sinkhole Classification Spring Seasonality Regime Spring Geomorphologic al/Geological Type Spring Classification Through Discharge Rate Tunnel Purpose Tunnel Usage 117 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Table 8-2: Column completeness results for the NGA Attribute Column Completeness Identifier 1 Geosite Type 1 Latitude 1 Longitude 1 Data Owner 1 Borehole Depth 0.6851 Borehole Diameter 0.6851 Lithology Name 0.3839 Water Strike Type 0.3528 Depth to Top (Water Strike) 0.3522 Water Level 0.3241 Discharge Rate 0.2174 Seepage Value (Water Strike) 0.2058 Weathering Degree (Lithology) 0.1184 Electrical Conductivity 0.0467 Depth To Bottom (Water Strike) 0.0220 Fracturing Degree (Lithology) 0.0193 Abstraction Quantity 0.0183 pH Value 0.0126 pH Class 0.0105 Temperature 0.0095 Groundwater Occurrence 0.0057 HCO3 0.0001 118 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Table 8-3: Column completeness results for the GRIP Attribute Column Completeness GRIP site ID number 1 GRIP borehole number 1 H Area 1 Quaternary catchment area 1 Farm name 1 Farm number 1 Province 1 District municipality 1 Local municipality 1 Settlement name 1 Settlement ID 1 Longitude 1 Latitude 1 Power 0.9997 Comment 0.9995 Equipment 0.9977 Borehole depth 0.5239 Water level 0.4056 Water level date taken 0.4056 Depth to pump intake 0.3274 Discharge rate 0.3273 Duty cycle 0.3273 Daily abstractions 0.3273 Quality 0.3164 Alternative borehole number 2 0.1376 Regional borehole number 0.0554 Alternative borehole number 1 0.0479 Alternative settlement name 0.0104 119 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 8.2 Annexure B – GRIP database example GRIP Quaternary Regional Alternative Alternative GRIP Site ID H Farm District Settlement Settlement borehole Catchment borehole Borehole Borehole Farm Name Province Local Municipality number Area Number Municipality Name ID Number area number Number 1 Number 2 H01- H01-2125 H01 B52G - G45030 - KOPPIEKRAAL LPKS475 Limpopo Capricorn Lepele-Nkumpi Serobaneng 1851 2429BDV0003 H01- H01-2318 H01 B52G - - - MOLSGAT LPKS439 Limpopo Capricorn Lepele-Nkumpi Masite 1147 2429BCV0014 H02- “Greater H02-0907 H02 B52B - - - GROBLERSVREDE LPKS844 Limpopo Makhuduthamaga Ga-Ratau 438 2429DDV0015 Sekhukhune” H02- “GELUKS “Greater H02-1150 H02 B52B - - - LPKS000 Limpopo Makhuduthamaga Manganeng 1022 2429DBN0060 LOCATION” Sekhukhune” Depth to Daily Alternative Longitude Latitude Borehole Waterlevel Water level Discharge Duty cycle pump intake Abstraction Equipment Power Quality Comment Settlement name [WGS84] [WGS84] depth [m] [mbgl] date taken rate [l/s] [hours] [m] [m3/day] “No - 29.60052 -24.23661 125 17.95 2001-10-02 66 3 24 259.2 “No equipment” - TESTED power” “Submersible “Electric “CLASS - 29.67855 -24.31084 37 12.62 2003-02-01 24 0.7 24 60.48 TESTED pump” motor” 0” “Momo-type “CLASS - 29.95428 -24.78826 64.2 9.51 2005-06-23 24 0.5 24 43.2 Diesel TESTED pump” 2” “CLASS - 29.97279 -24.6857 47.72 19.66 2006-07-04 36 0.5 24 43.2 Handpump Hand TESTED 3” 120 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 8.3 Annexure C – Model Scripts Scripts presented are those used for water level prediction. Those of yield prediction are exactly the same, only the input dataset and dependent variable differ. 8.3.1 Regression # Multiple Linear Regression # Importing the dataset dataset = read.csv('Method_WL_mamsl.csv') str(dataset) # Removing columns from imported dataset dataset = subset(dataset, select = -c(WL_100m)) # dataset = subset(dataset, select = -c(Elevation)) str(dataset) # Numerical to factor dataset$LITHO_1 = factor(dataset$LITHO_1) dataset$LITHO_2 = factor(dataset$LITHO_2) dataset$LITHO_3 = factor(dataset$LITHO_3) dataset$LITHO_4 = factor(dataset$LITHO_4) dataset$LITHO_5 = factor(dataset$LITHO_5) str(dataset) # Splitting the dataset into the Training set and Test set # install.packages('caTools') library(caTools) set.seed(1234) split = sample.split(dataset$WL_mamsl, SplitRatio = 0.8) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE) # Fitting Multiple Linear Regression to the Training set regressor = lm(formula = WL_mamsl ~ ., data = training_set) # Predicting the Test set results y_pred = predict(regressor, newdata = test_set) # Plot results plot(test_set$WL_mamsl, type = "l", col = "red") lines(y_pred, col = "grey") lines(moving_average, col = "blue") # Correlation Pearson = cor(test_set$WL_mamsl, y_pred, method = c("pearson")) RMSE = sqrt(mean((test_set$WL_mamsl - y_pred)^2)) MAPE = MAPE(y_pred,test_set$WL_mamsl) MAE = MAE(test_set$WL_mamsl,y_pred) print(Pearson*100) print(RMSE) print(MAPE) print(MAE) # Significance value of features summary(regressor) 121 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa # Support Vector Regression # Importing the dataset dataset = read.csv('Method_WL_mamsl.csv') str(dataset) # Removing columns from imported dataset dataset = subset(dataset, select = -c(WL_100m)) # dataset = subset(dataset, select = -c(Elevation)) str(dataset) # Numerical to factor dataset$LITHO_1 = factor(dataset$LITHO_1) dataset$LITHO_2 = factor(dataset$LITHO_2) dataset$LITHO_3 = factor(dataset$LITHO_3) dataset$LITHO_4 = factor(dataset$LITHO_4) dataset$LITHO_5 = factor(dataset$LITHO_5) str(dataset) # Splitting the dataset into the Training set and Test set library(caTools) set.seed(123) split = sample.split(dataset$WL_mamsl, SplitRatio = 0.8) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE) # Fitting Support Vector Regression to the Training set library(e1071) regressor_L_nu = svm(formula = WL_mamsl ~ ., data = training_set, type = 'nu-regression', kernel = 'linear') y_pred_L_nu = predict(regressor_L_nu, newdata = test_set) # Plot results plot(test_set$WL_mamsl, type = "l", col = "red") lines(y_pred_L_nu, col = "grey") lines(moving_average, col = "blue") # Correlation Pearson_L_nu = cor(test_set$WL_mamsl, y_pred_L_nu, method = c("pearson")) RMSE_L_nu = sqrt(mean((test_set$WL_mamsl - y_pred_L_nu)^2)) MAPE = MAPE(y_pred_L_nu,test_set$WL_mamsl) MAE = MAE(test_set$WL_mamsl,y_pred_L_nu) print(Pearson_L_nu*100) print(RMSE_L_nu) print(MAPE*100) print(MAE) # Significance value of features summary(regressor_L_nu) 122 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa # Decision Tree Regression # Importing the dataset dataset = read.csv('Method_WL_mamsl.csv') str(dataset) # Removing columns from imported dataset dataset = subset(dataset, select = -c(WL_100m)) dataset = subset(dataset, select = -c(Elevation)) str(dataset) # Numerical to factor dataset$LITHO_1 = factor(dataset$LITHO_1) dataset$LITHO_2 = factor(dataset$LITHO_2) dataset$LITHO_3 = factor(dataset$LITHO_3) dataset$LITHO_4 = factor(dataset$LITHO_4) dataset$LITHO_5 = factor(dataset$LITHO_5) str(dataset) # Splitting the dataset into the Training set and Test set # install.packages('caTools') library(caTools) set.seed(123) split = sample.split(dataset$WL_mamsl, SplitRatio = 0.8) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE) # Fitting Decision Tree Regression to the Training set # install.packages('rpart') library(rpart) regressor = rpart(formula = WL_mamsl ~ ., data = training_set, control = rpart.control(minsplit = 2)) # Predicting a new result with Decision Tree Regression y_pred = predict(regressor, newdata = test_set) # Plot results plot(test_set$WL_mamsl, type = "l", col = "red") lines(y_pred, col = "grey") lines(moving_average, col = "blue") # Correlation Pearson = cor(test_set$WL_mamsl, y_pred, method = c("pearson")) RMSE = sqrt(mean((test_set$WL_mamsl - y_pred)^2)) MAPE = MAPE(y_pred,test_set$WL_mamsl) MAE = MAE(test_set$WL_mamsl,y_pred) print(Pearson*100) print(RMSE) print(MAPE*100) print(MAE) # Significance value of features summary(regressor) 123 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa # Random Forest Regression # Importing the dataset dataset = read.csv('Method_WL_mamsl.csv') str(dataset) # Removing columns from imported dataset dataset = subset(dataset, select = -c(WL_100m)) dataset = subset(dataset, select = -c(Elevation)) str(dataset) # Numerical to factor dataset$LITHO_1 = factor(dataset$LITHO_1) dataset$LITHO_2 = factor(dataset$LITHO_2) dataset$LITHO_3 = factor(dataset$LITHO_3) dataset$LITHO_4 = factor(dataset$LITHO_4) dataset$LITHO_5 = factor(dataset$LITHO_5) str(dataset) # Splitting the dataset into the Training set and Test set # install.packages('caTools') library(caTools) set.seed(123) split = sample.split(dataset$WL_mamsl, SplitRatio = 0.8) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE) # Fitting Random Forest Regression to the Training set # install.packages('randomForest') library(randomForest) set.seed(123) regressor = randomForest(formula = WL_mamsl ~ ., data = training_set, ntree = 100, keep.forest = TRUE, importance = TRUE) # Predicting a new result with Random Forest Regression y_pred = predict(regressor, newdata = test_set) # Get variable importance from the model fit print(regressor) importance(regressor) varImpPlot(regressor) # Plot results plot(test_set$WL_mamsl, type = "l", col = "red") lines(y_pred, col = "grey") lines(moving_average, col = "blue") # Correlation Pearson = cor(test_set$WL_mamsl, y_pred, method = c("pearson")) RMSE = sqrt(mean((test_set$WL_mamsl - y_pred)^2)) MAPE = MAPE(y_pred,test_set$WL_mamsl) MAE = MAE(test_set$WL_mamsl,y_pred) print(Pearson*100) print(RMSE) print(MAPE*100) print(MAE) 124 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 8.3.2 Classification # K-Nearest Neighbors (K-NN) # Importing the dataset dataset = read.csv('Method_WL_mamsl.csv') str(dataset) # Removing columns from imported dataset dataset = subset(dataset, select = -c(WL_mamsl)) # dataset = subset(dataset, select = -c(Elevation)) str(dataset) # Numerical to factor dataset$LITHO_1 = factor(dataset$LITHO_1) dataset$LITHO_2 = factor(dataset$LITHO_2) dataset$LITHO_3 = factor(dataset$LITHO_3) dataset$LITHO_4 = factor(dataset$LITHO_4) dataset$LITHO_5 = factor(dataset$LITHO_5) dataset$WL_100m = factor(dataset$WL_100m) str(dataset) # Splitting the dataset into the Training set and Test set # install.packages('caTools') library(caTools) set.seed(123) split = sample.split(dataset$WL_100m, SplitRatio = 0.8) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE) # Fitting K-NN to the Training set and Predicting the Test set results library(class) y_pred = knn(train = training_set[, -9], test = test_set[, -9], cl = training_set[, 9], k = 1) # Making the Confusion Matrix cm = table(test_set[, 9], y_pred) print(cm) Misclassification = 1 - sum(diag(cm))/sum(cm) Correctly_Classified = 1 - Misclassification print(Misclassification*100) print(Correctly_Classified*100) # Calculating Kappa value diagonal.counts = diag(cm) N = sum(cm) row.marginal.props = rowSums(cm)/N col.marginal.props = colSums(cm)/N Po = sum(diagonal.counts)/N Pe = sum(row.marginal.props*col.marginal.props) k = (Po - Pe)/(1 - Pe) k 125 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa # Support Vector Machine (SVM) # Importing the dataset dataset = read.csv('Method_WL_mamsl.csv') str(dataset) # Removing columns from imported dataset dataset = subset(dataset, select = -c(WL_mamsl)) dataset = subset(dataset, select = -c(Elevation)) str(dataset) # Numerical to factor dataset$LITHO_1 = factor(dataset$LITHO_1) dataset$LITHO_2 = factor(dataset$LITHO_2) dataset$LITHO_3 = factor(dataset$LITHO_3) dataset$LITHO_4 = factor(dataset$LITHO_4) dataset$LITHO_5 = factor(dataset$LITHO_5) dataset$WL_100m = factor(dataset$WL_100m) str(dataset) # Splitting the dataset into the Training set and Test set # install.packages('caTools') library(caTools) set.seed(123) split = sample.split(dataset$WL_100m, SplitRatio = 0.8) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE) # Fitting SVM to the Training set # install.packages('e1071') library(e1071) classifier = svm(formula = WL_100m ~ ., data = training_set, type = 'C-classification', kernel = 'linear') # Predicting the Test set results y_pred = predict(classifier, newdata = test_set[-8]) # Making the Confusion Matrix cm = table(test_set[, 8], y_pred) print(cm) Misclassification = 1 - sum(diag(cm))/sum(cm) Correctly_Classified = 1 - Misclassification print(Misclassification*100) print(Correctly_Classified*100) # Calculating Kappa value diagonal.counts = diag(cm) N = sum(cm) row.marginal.props = rowSums(cm)/N col.marginal.props = colSums(cm)/N Po = sum(diagonal.counts)/N Pe = sum(row.marginal.props*col.marginal.props) k = (Po - Pe)/(1 - Pe) k 126 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa # Naive Bayes # Importing the dataset dataset = read.csv('Method_WL_mamsl.csv') str(dataset) # Removing columns from imported dataset dataset = subset(dataset, select = -c(WL_mamsl)) dataset = subset(dataset, select = -c(Elevation)) str(dataset) # Numerical to factor dataset$LITHO_1 = factor(dataset$LITHO_1) dataset$LITHO_2 = factor(dataset$LITHO_2) dataset$LITHO_3 = factor(dataset$LITHO_3) dataset$LITHO_4 = factor(dataset$LITHO_4) dataset$LITHO_5 = factor(dataset$LITHO_5) dataset$WL_100m = factor(dataset$WL_100m) str(dataset) # Splitting the dataset into the Training set and Test set # install.packages('caTools') library(caTools) set.seed(123) split = sample.split(dataset$WL_100m, SplitRatio = 0.8) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE) # Fitting naiveBayes to the Training set # install.packages('e1071') library(e1071) classifier = naiveBayes(x = training_set[, -8], y = training_set$WL_100m) # Predicting the Test set results y_pred = predict(classifier, newdata = test_set[-8]) # Making the Confusion Matrix cm = table(test_set[, 8], y_pred) print(cm) Misclassification = 1 - sum(diag(cm))/sum(cm) Correctly_Classified = 1 - Misclassification print(Misclassification*100) print(Correctly_Classified*100) # Calculating Kappa value diagonal.counts = diag(cm) N = sum(cm) row.marginal.props = rowSums(cm)/N col.marginal.props = colSums(cm)/N Po = sum(diagonal.counts)/N Pe = sum(row.marginal.props*col.marginal.props) k = (Po - Pe)/(1 - Pe) k 127 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa # Decision Tree Classification # Importing the dataset dataset = read.csv('Method_WL_mamsl.csv') str(dataset) # Removing columns from imported dataset dataset = subset(dataset, select = -c(WL_mamsl)) dataset = subset(dataset, select = -c(Elevation)) str(dataset) # Numerical to factor dataset$LITHO_1 = factor(dataset$LITHO_1) dataset$LITHO_2 = factor(dataset$LITHO_2) dataset$LITHO_3 = factor(dataset$LITHO_3) dataset$LITHO_4 = factor(dataset$LITHO_4) dataset$LITHO_5 = factor(dataset$LITHO_5) dataset$WL_100m = factor(dataset$WL_100m) str(dataset) # Splitting the dataset into the Training set and Test set # install.packages('caTools') library(caTools) set.seed(123) split = sample.split(dataset$WL_100m, SplitRatio = 0.8) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE) # Fitting DT to the Training set library(RWeka) classifier = J48(formula = WL_100m ~ ., data = training_set) summary(classifier) # Predicting the Test set results y_pred = predict(classifier, newdata = test_set[-8]) # Making the Confusion Matrix cm = table(test_set[, 8], y_pred) print(cm) summary(cm) Misclassification = 1 - sum(diag(cm))/sum(cm) Correctly_Classified = 1 - Misclassification print(Misclassification*100) print(Correctly_Classified*100) # Calculating Kappa value diagonal.counts = diag(cm) N = sum(cm) row.marginal.props = rowSums(cm)/N col.marginal.props = colSums(cm)/N Po = sum(diagonal.counts)/N Pe = sum(row.marginal.props*col.marginal.props) k = (Po - Pe)/(1 - Pe) k 128 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa # Random Forest Classification # Importing the dataset dataset = read.csv('Method_WL_mamsl.csv') str(dataset) # Removing columns from imported dataset dataset = subset(dataset, select = -c(WL_mamsl)) # dataset = subset(dataset, select = -c(Elevation)) str(dataset) # Numerical to factor dataset$LITHO_1 = factor(dataset$LITHO_1) dataset$LITHO_2 = factor(dataset$LITHO_2) dataset$LITHO_3 = factor(dataset$LITHO_3) dataset$LITHO_4 = factor(dataset$LITHO_4) dataset$LITHO_5 = factor(dataset$LITHO_5) dataset$WL_100m = factor(dataset$WL_100m) str(dataset) # Splitting the dataset into the Training set and Test set # install.packages('caTools') library(caTools) set.seed(123) #12345 split = sample.split(dataset$WL_100m, SplitRatio = 0.8) training_set = subset(dataset, split == TRUE) test_set = subset(dataset, split == FALSE) # Fitting Random Forest Classification to the Training set # install.packages('randomForest') library(randomForest) classifier = randomForest(x = training_set[, -9], y = training_set$WL_100m, ntree = 100, keep.forest = TRUE, importance = TRUE, nodesize = 5) # Predicting the Test set results y_pred = predict(classifier, newdata = test_set[-9]) # Making the Confusion Matrix cm = table(test_set[, 9], y_pred) print(cm) Misclassification = 1 - sum(diag(cm))/sum(cm) Correctly_Classified = 1 - Misclassification print(Misclassification*100) print(Correctly_Classified*100) # Calculating Kappa value diagonal.counts = diag(cm) N = sum(cm) row.marginal.props = rowSums(cm)/N col.marginal.props = colSums(cm)/N Po = sum(diagonal.counts)/N Pe = sum(row.marginal.props*col.marginal.props) k = (Po - Pe)/(1 - Pe) k 129 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa 8.4 Annexure D – Maps Figure 8-1: Eastern Bushveld Complex – Baseflow 130 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-2: Eastern Bushveld Complex - Lithology 131 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-3: Eastern Bushveld Complex – Geology 132 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-4: Eastern Bushveld Complex - Precipitation 133 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-5: Eastern Bushveld Complex - Recharge 134 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-6: Eastern Bushveld Complex - Runoff 135 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-7: Eastern Bushveld Complex - Storativity 136 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-8: Lowveld - Baseflow 137 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-9: Lowveld - Lithology 138 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-10: Lowveld - Geology 139 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-11: Lowveld - Precipitation 140 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-12: Lowveld - Recharge 141 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-13: Lowveld - Runoff 142 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-14: Lowveld - Storativity 143 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-15: Taung-Prieska Belt - Baseflow 144 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-16: Taung-Prieska Belt - Lithology 145 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-17: Taung-Prieska Belt - Geology 146 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-18: Taung-Prieska Belt - Precipitation 147 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-19: Taung-Prieska Belt - Recharge 148 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-20: Taung-Prieska Belt - Runoff 149 Chané de Bruyn, M.Sc. Dissertation Centre for Water Sciences and Management – North-West University, South Africa Figure 8-21: Taung-Prieska Belt - Storativity 150 Chané de Bruyn, M.Sc. Dissertation