

By Brain John Jr Aboze

Model validation is a crucial component for understanding, mitigating and evaluating the inherent and external risks associated with a model. It makes machine learning development and deployment more reliable and productive, as it guides model quality assurance and control. The importance of model validation cannot be overemphasized, especially when a malfunction of the model has a significant impact on the entire machine learning application. Before we delve into model validation, let’s get a basic understanding of model quality assurance and control.


According to the Supervisory Guidance on Model Risk Management (SR 11-7) issued by the Federal Reserve, model validation is “the set of processes and activities intended to verify that models are performing as expected, in line with their design objectives and business uses. Effective validation helps ensure that models are sound. It also identifies potential limitations and assumptions, and assesses their possible impact.”

Just as an experiment that works in the laboratory is not guaranteed to work in the real world, an ML model that works in development is not guaranteed to work in production. Production environments differ from development environments because of model risks such as data and concept drift, low training data quality, malicious attacks, runtime framework bugs, model training and evaluation bugs, model explainability issues, fairness appraisal issues, etc. Model validation simply means examining models to ensure that they are risk-free (at least to an acceptable level) before putting them into production. It can be seen as a model risk evaluation exercise. It aids in understanding the model’s worst-case scenario, its impact and the associated business risks (financial, legal, safety, ethical and reputational).

Let’s take a look at the classical machine learning pipeline against the machine learning pipeline with validation.

Classical machine learning pipeline


Source: Author

The classical pipeline shows the basic sequence of machine learning operations used in both research and practice before the emergence of MLOps.

Machine learning pipeline with validation


Source: Author

This approach ensures that models are validated before being pushed to production and that their compliance is monitored while they are in the production environment. The extra validation step examines the model’s predictions on an unseen dataset to assess its predictive capacity and performance, which strengthens confidence in the reliability of its output. We know that machine learning models can behave unexpectedly when given unexpected input data or subjected to malicious attacks. The validation phase tests models to detect vulnerabilities and evaluate their robustness, thus determining whether the trained model is trustworthy.

Model validation can be carried out in two ways:

  1. In-sample validation: this strategy validates models on the same dataset that is used to train the model, which is also known as internal validation.
  2. Out-of-sample validation: this strategy validates models using a new (external) dataset that was not used in the model training phase, which is also known as external validation.

The first strategy tends to give overly optimistic results: because the model has already seen the data, it cannot reveal overfitting and says little about the model’s generalization capacity. The second strategy subjects the model to some level of stress scenarios and to new, independent data instances in order to evaluate its robustness. However, it may introduce a problem known as “leaderboard likelihood”, which is caused by exposure of the validation/test labels while trying to make optimizations.
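The gap between the two strategies can be illustrated with a minimal sketch. It assumes scikit-learn and a synthetic dataset from `make_classification`; the decision tree and every parameter value are illustrative choices, not part of the loan use case discussed later.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained decision tree can memorize its training data
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# In-sample validation: score on the data used for training
in_sample_acc = model.score(X_train, y_train)
# Out-of-sample validation: score on data the model has never seen
out_of_sample_acc = model.score(X_test, y_test)

print(f"in-sample accuracy:     {in_sample_acc:.3f}")
print(f"out-of-sample accuracy: {out_of_sample_acc:.3f}")
```

The in-sample score looks near perfect because the tree has memorized the noisy labels, while the out-of-sample score gives a more honest picture of how the model generalizes.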

Validating models before promoting them into the production environment can take two forms:

  1. Offline evaluation: This evaluation mechanism uses historical data to measure performance metrics (accuracy, recall, root mean square error, AUC-ROC, etc.) that reflect the business objectives. The model must be evaluated on an independent dataset, that is, data the model has not seen during training but that follows the same probability distribution as the training set. We often start with a single dataset, so to obtain a new data sample for validation we must resample the data to simulate having new data. The main resampling techniques used in offline evaluation are hold-out validation, cross-validation, and bootstrapping. Because they evaluate multiple subsets and distributions of the data, these techniques also help control underfitting and overfitting and, combined with model explainability techniques, help reveal bias. Let us look at these techniques more closely.

    a. Holdout: This is the most straightforward resampling procedure for assessing how a model will perform on an independent test dataset. The data is randomly split into two sets, ‘train’ and ‘validation’. Generally, the train split is larger than the validation split; a common ratio is 70:30. A larger training share gives the model more data to learn from, while a larger validation share isolates more data from training and gives a more reliable performance estimate. The train set is used to train the model, and the validation set is used to evaluate it (serving as an independent test set). This technique can easily be implemented with sklearn’s train_test_split. Despite its ease of understanding and implementation, a purely random split is not well suited to an imbalanced dataset, because the minority class may be poorly represented in one of the splits.

    b. Cross-validation: This resampling technique trains and tests your model on multiple train-test splits, so that different data subsets are used for training and testing in each iteration. Cross-validation combines (averages) the validation results across the train-test splits to estimate the model’s performance. This reduces variability and gives better insight into how well the model generalizes to an independent dataset. The description and implementation of the various cross-validation techniques can be found in sklearn’s documentation. The most widely used technique is k-fold cross-validation, where the dataset is split equally into k partitions (folds). One of the k folds is selected as the validation set, and the remaining k-1 folds are used as training data. The process repeats k times, until each fold has served as the validation set once with the remaining folds as the training set. Let’s explain this visually, using k=4.

    Source: Author

    c. Bootstrapping: This is a resampling technique that samples a dataset with replacement, which means that data instances are selected at random, duplicates are allowed, and some original instances may be omitted entirely. The new subset created by this procedure is the bootstrapped dataset, and it contains the same number of data instances as the original dataset. The bootstrapped dataset introduces variation relative to the original dataset and is mostly used when an explicit test set is not available, as the “surrogate” data (the bootstrapped dataset) serves as a hypothetical test set. The description and implementation of the bootstrap technique can be found in sklearn’s documentation.

    Source: Author

    Offline evaluation helps deploy the best possible model given what is known at training time. However, the results of offline evaluation should not be held as absolute truth, since the model is only as good as its training data, and a large shift in the data distribution (distribution drift) in the production environment may have a severe impact on the model. In retraining scenarios, model promotion decisions (substituting new candidate models) are made by comparing these offline metrics with those of the existing baseline/production models.

  2. Online evaluation: Online evaluation extends model evaluation to live data, measuring and monitoring performance metrics on the deployed model as it interacts with real users. It helps monitor the model for degradation, which can have various causes: malicious actors adapting to the model’s behavior, changes in products or policies that affect customers’ behavior, or simply a world that evolves, which shows up as data drift. Tracking model performance metrics (RMSE, ROC, precision, etc.) over time gives insight into model stability and checks whether the model’s predictions are consistent with the ground truth. In addition to model performance metrics, online metrics also cover business metrics and KPIs, such as customer lifetime value and customer behavior, which may not be available in historical data. Another important online metric is return on investment (ROI); this metric is not straightforward to implement and is not possible for some models. The ROI metric evaluates the difference between the expenses of the model and the revenue obtained from it. Here, the expenses of the model cover computational, operational, and maintenance costs. These online metrics help establish causal relationships between a model and user outcomes. A technique that helps observe this causal relationship is A/B testing, which is based on statistical hypothesis testing. An A/B test must track the number of data points that affect the model, such as users, items, etc. It compares two models on two random samples of the data population (one serving as the control group and the other as the test group) over a specific time window. Online evaluation thus measures the model’s performance on live data over time and its impact on the business, as the metrics are aligned with the business objectives.
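The three offline resampling techniques described under item 1 (holdout, k-fold cross-validation and bootstrapping) can be sketched as follows. This is a minimal illustration that assumes scikit-learn, a synthetic dataset and a logistic regression model; none of these come from the loan use case. For the bootstrap, evaluating on the rows left out of the resample ("out-of-bag" rows) is one common convention chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.utils import resample

# Synthetic data and a simple model (illustrative assumptions)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)

# a. Holdout: one 70:30 split; stratify=y preserves the class proportions
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
holdout_acc = model.fit(X_tr, y_tr).score(X_val, y_val)

# b. k-fold cross-validation with k=4: each fold is the validation set once
kfold = KFold(n_splits=4, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=kfold)

# c. Bootstrapping: draw row indices with replacement, same size as the data
indices = np.arange(len(X))
boot_idx = resample(indices, replace=True, n_samples=len(indices), random_state=42)
# Rows never drawn ("out-of-bag") act as the evaluation set in this sketch
oob_idx = np.setdiff1d(indices, boot_idx)
boot_acc = model.fit(X[boot_idx], y[boot_idx]).score(X[oob_idx], y[oob_idx])

print(f"holdout accuracy:        {holdout_acc:.3f}")
print(f"4-fold mean accuracy:    {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
print(f"bootstrap (OOB) accuracy: {boot_acc:.3f}")
```

The cross-validation result is reported as a mean with a spread, which is what makes it less sensitive to a single lucky or unlucky split than the holdout estimate.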


Proper experiment tracking, version control, and metadata management for all these models can guide the development of rollback and promotion strategies and pipelines. In practice, however, we often see considerable discrepancies between the offline and online performance of models. The ultimate goal is to have production models that improve over time and provide the best results. Degraded models have to be replaced by a better candidate model. The new candidate should perform better than, or at least similarly to, the production model on offline and online evaluation metrics. However, a candidate model can also replace a production model because of lower complexity, better fairness appraisals or better model explainability.
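The A/B comparison between a production model and a candidate model mentioned under online evaluation rests on a statistical hypothesis test. The sketch below uses a two-sided two-proportion z-test, which is one common choice and not necessarily what a given platform uses; the function name and all counts are hypothetical and only illustrate the arithmetic.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two success rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: control group served by the production model (A),
# test group served by the candidate model (B), over the same time window
z, p_value = two_proportion_z_test(successes_a=1000, n_a=10_000,
                                   successes_b=1100, n_b=10_000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("The difference between the two models is statistically significant.")
else:
    print("No statistically significant difference was detected.")
```

The significance threshold of 0.05 is a convention, and the sample size per group should be fixed before the experiment starts rather than adjusted while watching the results.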

Model validation with Deepchecks

This article won’t focus on A/B testing or on resampling techniques (holdout, cross-validation or bootstrap). Online methods such as A/B testing work in production and can’t provide controls or checks before models are promoted to production. Resampling techniques help subset and rearrange data to examine the model’s generalization capacity, and they are comprehensively covered here. We would rather discuss deep model validation checks (inspection parameters) using Deepchecks (no pun intended). Deepchecks is an open-source Python package that provides a collection of inspections/checks that return notifications and reports about data and model validation and give insights into data-related and model-related issues with minimal effort. The notifications are subject to inspection conditions that yield outcomes such as pass, fail and warning, along with results.

Deepchecks is readily applied in the research phase and can be characterized as a collection of offline techniques for data validation, for finding potential methodological problems, and for model validation and analysis.

Model validation with Deepchecks

Source: When Should You Use Deepchecks

Getting started with Deepchecks

Installation guide:

Using pip
pip install deepchecks -U --user

Using conda
conda install -c conda-forge deepchecks

Depending on the phase we are validating, we will need a supported model (such as sklearn, XGBoost, LightGBM or CatBoost) that we wish to validate. The various model validation checks provided by Deepchecks will be explained using a loan default prediction use case that can be found on my GitHub repository here.

Use Case: Loan defaults prediction
The goal of the use case is to build a machine learning model that predicts loan defaulters based on certain variables present in the dataset. The dataset can be found on Kaggle here and the entire project on GitHub here.

The preprocessing steps are covered in the project notebook. We will delve straight into the model validation section.

# Creating deepchecks object instance
from deepchecks import Dataset
# Categorical feature columns (recommended, not required)
cat_features = ['sub_grade', 'verification_status', 'purpose',
                'initial_list_status', 'application_type', 'home_ownership']
# Join features and labels into a single DataFrame per split
ds_train = X_train.merge(y_train, left_index=True, right_index=True)
ds_test = X_test.merge(y_test, left_index=True, right_index=True)
# Wrap each DataFrame in a deepchecks Dataset, naming the label column
ds_train = Dataset(ds_train, label="loan_status", cat_features=cat_features)
ds_test = Dataset(ds_test, label="loan_status", cat_features=cat_features)

This simple code block creates data objects using the Deepchecks Dataset class. The Dataset objects take in the data, the label column, and the categorical features (recommended, not required). Next, we will instantiate the machine learning algorithms we plan to use and fit them on the training data.

# Random forest
from sklearn.ensemble import RandomForestClassifier
# LightGBM
from lightgbm import LGBMClassifier
# Model fitting function
def model_fit(clf):
    # Assumed body (missing in the original): fit on the training split
    # and return the fitted estimator
    clf.fit(X_train, y_train)
    return clf
# For Random forest
rf_clf = RandomForestClassifier()
rf_clf = model_fit(rf_clf)
# For LightGBM
lgb_clf = LGBMClassifier()
lgb_clf = model_fit(lgb_clf)

Next, we will evaluate both models’ performance using the Deepchecks multi-model performance report as follows:

from deepchecks.checks.performance import MultiModelPerformanceReport
MultiModelPerformanceReport().run(ds_train, ds_test, [rf_clf, lgb_clf])


This plot summarizes the F1 score, precision and recall of both models on the test data. Special attention should be given to class 0 (charged off), which represents loan defaults. Understanding the performance on this class can guide business decisions on whether to deny loan requests, increase the interest rate, reduce the loan amount or even extend the repayment terms for future clients with similar patterns.
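The per-class metrics in that report can also be computed directly with scikit-learn, which makes it easier to see what each number means. The labels below are made up for illustration, and treating class 1 as "fully paid" is an assumption; only class 0 (charged off) is named in this article.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels: 0 = charged off (default), 1 = fully paid (assumed)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# One value per class, in the order given by `labels`
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1]
)

for cls, p, r, f, s in zip([0, 1], precision, recall, f1, support):
    print(f"class {cls}: precision={p:.3f} recall={r:.3f} f1={f:.3f} support={s}")
```

Recall on class 0 is the share of actual defaulters the model catches, and precision on class 0 is the share of predicted defaulters that really defaulted; which one matters more depends on the relative cost of a missed default versus a wrongly denied loan.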

Let’s analyze the model performance more deeply with the ROC curve, which plots the TPR (True Positive Rate) against the FPR (False Positive Rate) and summarizes all the possible confusion matrices across classification thresholds. The AUC score measures the area under this curve; the higher the score, the better the model’s power to classify both classes accurately.

ROC is a probability curve and AUC represents the degree or measure of separability. It tells how capable a model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at assessing creditworthiness, which helps reduce credit default risk.

from deepchecks.checks.performance import RocReport
# For Random Forest
check = RocReport()
check.run(ds_test, rf_clf)

Receiver operating characteristic for binary data

# For LightGBM
check = RocReport()
check.run(ds_test, lgb_clf)

From both plots, you can see that LightGBM had a better AUC score on the test data.
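To make the ROC and AUC quantities concrete, here is a minimal sketch that computes them with scikit-learn on four made-up predictions. The labels and scores are illustrative only and are unrelated to the loan dataset.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up ground truth and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (FPR, TPR) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# AUC is the area under that curve
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation; here one of the four positive/negative pairs is ranked in the wrong order, which is why the score falls short of 1.0.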

Predicting the probability of an observation belonging to each possible class can be more convenient than predicting class values. Predicting class probabilities allows more flexibility: it guides decisions on how to interpret the predictions, presents them with their uncertainty and provides more nuanced ways to evaluate the model’s skill. The calibration curve shows how well the classifier’s probabilistic predictions reflect the actual probabilities in the real world. The Brier score may be used to assess how well a classifier is calibrated: the lower the Brier score for a set of predictions, the better calibrated the predictions are. Let’s look at the calibration plots of both models.

# Calibration score for Random forest
from deepchecks.checks import CalibrationScore
check = CalibrationScore()
check.run(ds_test, rf_clf)

Calibration plots

# Calibration score for LightGBM
check = CalibrationScore()
check.run(ds_test, lgb_clf)

Comparing the Brier scores, LightGBM has the lower score and hence is better calibrated.
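The Brier score is simply the mean squared difference between the predicted probabilities and the actual outcomes. The sketch below computes it both with scikit-learn’s `brier_score_loss` and by hand, on made-up predictions that are unrelated to the loan models.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Made-up outcomes and two sets of predicted probabilities for class 1
y_true = np.array([0, 1, 1, 0])
y_prob_good = np.array([0.1, 0.9, 0.8, 0.3])   # confident and mostly right
y_prob_poor = np.array([0.6, 0.4, 0.5, 0.7])   # leaning the wrong way

brier_good = brier_score_loss(y_true, y_prob_good)
brier_poor = brier_score_loss(y_true, y_prob_poor)

# The same quantity computed directly: mean of squared errors
brier_manual = np.mean((y_prob_good - y_true) ** 2)

print(f"Brier score (good): {brier_good:.4f}")
print(f"Brier score (poor): {brier_poor:.4f}")
print(f"Manual computation: {brier_manual:.4f}")
```

A score of 0 means every probability matched the outcome exactly, so lower is better when comparing two models on the same data.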

Model inference time is another important metric; the check calculates the average model inference time for one sample, in seconds. Let’s quickly estimate that:

# Checking Model Inference time
from deepchecks.checks.methodology import ModelInferenceTime
# For Random forest
check = ModelInferenceTime()
check.run(ds_test, rf_clf)
# For LightGBM
check = ModelInferenceTime()
check.run(ds_test, lgb_clf)

The average inference times for one sample (in seconds) for the random forest and LightGBM models are 0.00021382 and 0.00000466, respectively.
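A per-sample inference time of this kind can be approximated by hand, which helps when interpreting the numbers. The following is a rough sketch and not Deepchecks’ implementation: the helper `average_inference_time`, the synthetic data and the model settings are all assumptions made for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def average_inference_time(model, X, n_repeats=5):
    """Average wall-clock seconds per sample over repeated batch predictions."""
    start = time.perf_counter()
    for _ in range(n_repeats):
        model.predict(X)
    elapsed = time.perf_counter() - start
    return elapsed / (n_repeats * len(X))

# Synthetic data and a small forest (illustrative assumptions)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

seconds_per_sample = average_inference_time(model, X)
print(f"average inference time per sample: {seconds_per_sample:.8f} s")
```

Timings measured this way depend on the hardware, the batch size and the system load, so they are best used to compare models under the same conditions rather than as absolute figures.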


Model validation helps reduce operational costs and supports model quality assurance, stability, flexibility, and scalability. In this article, Deepchecks, an open-source library with a suite of validation metrics that can be implemented with very little code, was used for model validation. The Deepchecks validation suites cut across four major categories: