
Navigating Data Drift: Safeguarding Machine Learning Applications in the Banking Industry

VALID Systems Jan 14, 2025 5:08:50 PM

At VALID, we take pride in providing consistent results for our clients and partners. Our machine learning models are a crucial component of generating those results. One of the ways we ensure our models are performing their best is to monitor for data drift. Simply put, data drift is a measurable change in the distribution of the data. At times, these shifts can lead to incorrect predictions from models.

Real World Data Drift Impact   

A striking example of data drift's real-world impact came in 2021. Zillow had turned the AI behind its "Zestimate" into "Zillow Offers," a service that bought homes directly from consumers to resell on the open market. Launched in 2017, the service saw strong growth and profitability, especially during the pandemic-driven housing boom. Once the market started to plateau, the models responsible for generating offers failed to adjust to the new reality. In 2021, Zillow began buying houses for more than they were worth and was unable to resell them at a profit. By the end of the year, the losses from the remaining inventory of houses exceeded half a billion dollars.

While the Zillow case involves more red flags than data drift alone, drift can't be ignored as a major factor contributing to the loss. The post-mortems point to a forecasting model, used in conjunction with the tried-and-true Zestimate, that was unable to adapt to the changing market. As model owners, we must accept that our predictions will reflect the reality of the data used to train them.

Data Drift at VALID Systems 

In the banking space that VALID operates in, we have seen some unique shifts in data over the last few years. As was well documented in the media, the rise of account balances during the pandemic and their fall back to normal levels showed up in our data. Another model input experiencing change has been the average check amount within the deposit population, and the trend of larger check amounts has shown no sign of slowing down.

 

 

Figure 1: The change in average check amount within the check deposit population 

Our Decision Science team receives a comprehensive data drift report each month. The report tracks the Population Stability Index (PSI) across the dimensions that rank highest in our models' feature importance. We calculate this for transactions scored by each of our production models, using the percent of total items alongside the PSI to illustrate the shifts. The rate of returned items is also included so that we can understand whether a shift has also contributed to a potential increase in losses. A population shift without a loss impact only bears watching, but a shift with a loss increase kicks off a series of analyses and model retraining. A PSI of 0.10 (10%) or less is commonly used as the threshold for considering the data stable. That number guides our formatting choices, but we display every stratification for review.
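The escalation rule above (watch a shift on its own; investigate and retrain when a shift coincides with a loss increase) can be sketched as a small decision function. The names and signature here are hypothetical, not VALID's actual implementation:

```python
def drift_action(psi: float, return_rate_change: float,
                 psi_threshold: float = 0.10) -> str:
    """Return the monitoring action for one feature stratification.

    psi: Population Stability Index for the stratification.
    return_rate_change: change in the returned-item rate (a proxy for losses).
    """
    if psi <= psi_threshold:
        return "stable"
    if return_rate_change <= 0:
        return "watch"                # population shifted, but losses did not rise
    return "analyze_and_retrain"      # shift accompanied by a loss increase

print(drift_action(0.05, 0.00))   # stable
print(drift_action(0.15, -0.01))  # watch
print(drift_action(0.15, 0.02))   # analyze_and_retrain
```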

Building Data Drift Reporting in Sigma 

VALID’s BI tool of choice is from Sigma Computing. In the last year and a half, we have made tremendous strides in providing self-service analytics to the VALID team with the help of Sigma. Even with some data sources containing over 3 billion rows, Sigma can efficiently generate the reports we need to give insights into the data that helps us run our business. 

Structuring the Data to Measure Drift 

To measure the drift present in the data, we first need to structure the data appropriately. The goal is to obtain two sets of data to compare. The first set is the "Observed" population: the data we are looking for change in, which in our case is the most recent complete month. The second set is the "Expected" population: the six full months prior to the observed items. With our two populations set, we can then calculate the drift between them.
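The Observed/Expected split can be illustrated with a short pandas sketch. The column names and sample figures below are assumptions for illustration, not VALID's schema:

```python
import pandas as pd

# Seven completed months of monthly summaries (values are made up).
df = pd.DataFrame({
    "month": pd.period_range("2024-01", periods=7, freq="M"),
    "items_eligible": [1000, 1100, 1050, 1200, 1150, 1250, 1400],
})

# "Observed" = the most recent complete month; "Expected" = the six before it.
observed = df[df["month"] == df["month"].max()]
expected = df[df["month"] < df["month"].max()]

print(len(expected), len(observed))  # 6 1
```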


Figure 2: An early mockup of how the data could be structured in Sigma 

 

Calculating the Population Stability Index 

By calculating the Population Stability Index (PSI), we're able to measure how much a variable's distribution has changed between our 'Observed' and 'Expected' groups: Observed is the percent of total eligible items seen in the most recent complete month, and Expected is the average of the remaining six months' eligible items, also turned into a percent of total. PSI is then calculated using the following formula in Sigma:

 

 

Figure 3: The PSI calculation in Sigma 
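For readers outside Sigma, the standard PSI formula sums, over each bin of the variable, the difference between the Observed and Expected shares times the log of their ratio. A minimal Python equivalent (assuming each set of shares sums to 1.0):

```python
from math import log

def psi(expected_pct, observed_pct):
    """Population Stability Index across matching bins.

    expected_pct / observed_pct: each bin's share of total eligible items.
    """
    return sum((o - e) * log(o / e)
               for e, o in zip(expected_pct, observed_pct))

expected = [0.25, 0.25, 0.25, 0.25]
observed = [0.30, 0.25, 0.25, 0.20]
print(round(psi(expected, observed), 4))  # 0.0203 -- well under the 0.10 threshold
```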

 

Overcoming Challenges Faced Along the Way  

Achieving this within our reporting took some ingenuity. After limiting our data to the seven most recently completed months, we needed a way to dynamically determine which data belonged to our Expected group and which fell into the Observed. To accomplish this, we created a pivot table with our desired variable(s) on the Y-axis, a distinct Month of Reporting Date column on the X-axis, and the sum of Items Eligible as its values. We then leveraged a Cumulative Count function on the distinct months to get a clean 1-7 designation we could use in the next step.
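The cumulative-count trick can be mimicked in pandas. This is an illustrative sketch with assumed names, not Sigma's internals: sort the distinct months, number them 1 through 7, and let the rank drive the group label:

```python
import pandas as pd

df = pd.DataFrame({
    "month": pd.period_range("2024-01", periods=7, freq="M"),
    "items_eligible": [1000, 1100, 1050, 1200, 1150, 1250, 1400],
})

# Equivalent of a Cumulative Count over the distinct months: a clean 1-7 rank.
df = df.sort_values("month").reset_index(drop=True)
df["month_rank"] = range(1, len(df) + 1)

# Rank 7 (the latest month) is "Observed"; everything else is "Expected".
df["group"] = df["month_rank"].map(lambda r: "Observed" if r == 7 else "Expected")
```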

 

Figure 4: Early data prep 

Next, we created a secondary pivot table off the one we had just built. However, instead of having the distinct months as columns, we created a field that designates everything with a Cumulative Count value of 7 as 'Observed' and all others as 'Expected'. With our groups set, we could transform our Items Eligible counts into a percent of total (remembering to first average the Items Eligible for the Expected group); the next step was calculating PSI. Due to the nature of our data, it was not as simple as plugging fields into the logic shared above. However, since we had our data nicely structured into two separate columns, we were able to rely on the Lag function to determine PSI. The logic became: ([% of Eligible Items] - Lag([% of Eligible Items])) * Ln([% of Eligible Items] / Lag([% of Eligible Items])). This left us with a pivot table displaying the % of Total Eligible Items, by variable, across the Expected and Observed timeframes. The last task was cleaning up the view for reporting.
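Because the Lag function reaches back exactly one column (from Observed to Expected), the per-bin term it produces matches the standard PSI term. A pandas sketch of the same arithmetic, with assumed band names and made-up shares:

```python
import pandas as pd
from math import log

# Each row is one stratification (bin) of the variable; the two columns
# mirror the pivot table's Expected and Observed percent-of-total values.
bins = pd.DataFrame({
    "check_amount_band": ["<$100", "$100-500", "$500-2k", ">$2k"],
    "expected_pct": [0.40, 0.30, 0.20, 0.10],
    "observed_pct": [0.35, 0.30, 0.22, 0.13],
})

# (Observed - Lag(Observed)) * Ln(Observed / Lag(Observed)), where the
# lagged column is simply the Expected column.
bins["psi_term"] = (
    (bins["observed_pct"] - bins["expected_pct"])
    * (bins["observed_pct"] / bins["expected_pct"]).map(log)
)
total_psi = bins["psi_term"].sum()
```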

 

Figure 5: Further grouping of the data 

Because of the Lag function, a column of nulls appears under the Expected group. We would anticipate these, given there is no column for the Lag function to reference prior to the Expected group; however, the column adds no value to the desired results and takes up space. Instead, we can build one more pivot table off this latest one, and by creating some columns referencing the dynamic grouping calculations, we can provide a final view like the example below:

 

 

Figure 6: Glimpse into a summarized view  

 

 

The Final Report  

Lastly, adding a bit of color to this view helps readers quickly spot areas to review. For this report we opted to highlight denser populations with a darker shade of blue and created a custom conditional formatting rule for the PSI column to alert users to any instance of PSI above the acceptable 10% threshold.


Figure 7: The entire view of the final report. 

 

With our ML scores impacting millions of risk decisions per month, we take our data very seriously. This data drift report is one small example of the effort we put into monitoring our data at large. We also understand that quick, convenient access to our data is crucial to managing our business. With the help of tools like Sigma and a passionate team, we can quickly identify and address shifting data.