Using model performance to assess representativeness of data for model development and calibration
Abstract
This thesis proposes a novel methodology for assessing the representativeness of external or
pooled data used by banks in the development and calibration of regulatory models. No formal
methodology for assessing representativeness currently exists, which motivates this research.
In this thesis, we provide a review of existing regulatory literature to identify the requirements that
need to be considered when assessing representativeness. We emphasise that both qualitative
and quantitative aspects need to be considered to ensure a comprehensive analysis.
Our proposed methodology is designed to assess the representativeness of external data by
utilising model performance as a metric. The methodology is applied to two case studies to
demonstrate its effectiveness. In the first case study, we investigate whether a pooled data source
from Global Credit Data (GCD) is representative when considering the enrichment of internal data
with pooled data in the development of a regulatory loss given default (LGD) model. The second
case study differs from the first by illustrating which other countries in the pooled data set could
be representative when enriching internal data during the development of an LGD model.
To validate the effectiveness of our methodology, we compared it with the Multivariate Prediction
Accuracy Index (MPAI). Using these case studies as examples, our proposed methodology
provides users with a generalised framework to identify subsets of the external data that are
representative of their country's or bank's data. This makes our methodology broadly
applicable for banks to assess the representativeness of external data before utilising it in their
regulatory model development and calibration process.
The methodology is not without shortcomings. We have applied our methodology using a
linear model and mean squared error (MSE) as the performance measure, but it could also
be investigated whether the methodology delivers similar performance when a different type
of model (e.g. logistic regression) or a different performance measure (e.g. Gini coefficient) is
used. This study did not address the validation of external data's representativeness in the
absence of internal data. Therefore, it presents an intriguing opportunity for future research to
explore how a financial institution can accomplish this task.
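
The general idea of using model performance as a representativeness metric, with a linear model and MSE as mentioned above, can be illustrated with a minimal sketch. This is not the procedure developed in the thesis. It assumes one simple reading of the idea: fit a linear model on internal data alone and on internal data enriched with external data, then compare the MSE of both on an internal hold-out sample. The function names (`fit_linear`, `mse`, `performance_change`) are hypothetical, and the data are synthetic rather than GCD data.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared error of a fitted linear model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((A @ coef - y) ** 2))

def performance_change(X_tr, y_tr, X_te, y_te, X_ext, y_ext):
    """Return (MSE of internal-only model, MSE of enriched model),
    both measured on the internal hold-out sample (assumed design)."""
    base = fit_linear(X_tr, y_tr)
    enriched = fit_linear(np.vstack([X_tr, X_ext]),
                          np.concatenate([y_tr, y_ext]))
    return mse(base, X_te, y_te), mse(enriched, X_te, y_te)

# Synthetic illustration (assumed data-generating processes, not real LGD data).
rng = np.random.default_rng(0)

def simulate(n, intercept, slope):
    X = rng.uniform(0.0, 1.0, size=(n, 1))
    y = intercept + slope * X[:, 0] + rng.normal(0.0, 0.05, size=n)
    return X, y

X_tr, y_tr = simulate(200, 0.2, 0.5)      # internal development sample
X_te, y_te = simulate(200, 0.2, 0.5)      # internal hold-out sample
X_sim, y_sim = simulate(1000, 0.2, 0.5)   # external data, same relationship
X_dis, y_dis = simulate(1000, 0.8, -0.5)  # external data, different relationship

base_mse, similar_mse = performance_change(X_tr, y_tr, X_te, y_te, X_sim, y_sim)
_, dissimilar_mse = performance_change(X_tr, y_tr, X_te, y_te, X_dis, y_dis)

print(f"internal only:        {base_mse:.4f}")
print(f"enriched, similar:    {similar_mse:.4f}")
print(f"enriched, dissimilar: {dissimilar_mse:.4f}")
```

Under these assumptions, enrichment with external data that follows a different relationship should degrade hold-out MSE markedly, while enrichment with similar data should leave it roughly unchanged. The size of degradation that would count as non-representative is a separate judgement that this sketch does not address.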