Challenges in Building Production-Grade Machine Learning Models

Marco Barsacchi

Senior Data Scientist

Marco Barsacchi was awarded a PhD in Smart Computing from the Universities of Florence, Pisa, and Siena after earning a first-class master's degree in Biomedical Engineering from the University of Pisa. At Featurespace, he works as a Senior Data Scientist, designing and deploying machine learning models for customers. He also contributed to the invention of Automated Deep Behavioural Networks. In his spare time, Marco enjoys hiking, wildlife photography, reading books, and playing the odd blues on his electric guitar.

Introduction

Data scientists around the world create machine learning models that are deployed in production and used for a wide variety of tasks. We often treat these models as infallible but, just like any other system, they can fail catastrophically and potentially cause severe damage. Failures come in many different forms. Sometimes they are obvious: a system could grind to a halt, greatly hampering the ability to process payments, or it could still be up and running but start producing meaningless scores. Sometimes they are more subtle but no less important: an algorithm wrongly declining a mortgage application can have life-changing consequences for the applicant. Because of those implications, building a production-grade machine learning model involves balancing a set of often-conflicting requirements. These requirements differ considerably from those you face when designing a model to tackle a Kaggle challenge.

At Featurespace, we deploy machine learning models that risk-score financial events in real time and protect millions of customers across the globe. When our models are integrated into a production system, they are expected to work reliably (and effectively) for years, with minimal user intervention. Because of that, designing a machine learning model that scores financial events predictably and stably poses several unique challenges. In the rest of this article, I will discuss some of those challenges and how they are addressed at Featurespace.

Latency

As a first requirement, every machine learning system we deploy must be able to provide a risk score with minimal latency (of the order of tens of milliseconds). Every time a payment is processed, a decision must be made as quickly as possible, with minimal friction for the end user. Delaying the response might ultimately mean that the customer gives up on the payment, causing annoyance for the customer and a loss of revenue for the seller. Or it might mean having to accept or decline a transaction without relying on the risk score, defeating the purpose of having a risk score in the first place. Furthermore, even if a few hundred milliseconds for the whole payment might not seem too restrictive, multiple systems from multiple parties might be running in the payment processing chain, and each of them should take no more than a few tens of milliseconds to run. Achieving low latency requires some fine engineering as well as careful feature design. From the very start, a real-time model requires design choices that target its raison d’être.
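One way to keep an eye on this requirement during development is to time the scoring call itself and compare a high percentile against the budget. The sketch below is a minimal illustration, not a description of Featurespace's tooling: `score_fn`, the choice of the 99th percentile, and the budget value are all assumptions made for the example.

```python
import math
import time


def measure_latency_ms(score_fn, events):
    """Return the wall-clock latency of score_fn for each event, in milliseconds."""
    latencies = []
    for event in events:
        start = time.perf_counter()
        score_fn(event)  # hypothetical scoring call; the result is ignored here
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(latencies_ms, budget_ms, q=99):
    """True if the q-th percentile latency does not exceed the budget."""
    return percentile(latencies_ms, q) <= budget_ms
```

Checking a tail percentile rather than the mean is a deliberate choice in this sketch: a model that is fast on average but occasionally slow still delays some payments.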

Stability and Robustness

Additionally, machine learning models deployed into production must be extremely stable over time. When processing millions of financial events per day, and when affecting the daily lives of people, consistency is paramount. A machine learning model whose score distribution fluctuates wildly over time causes a lot of headaches for the model governance board, regardless of whether it provides world-leading performance. As a customer, you expect your card to behave reliably; imagine if all your payments were unexpectedly declined. Furthermore, financial institutions tend to budget for a certain number of alerts per day. Huge fluctuations that are not related to underlying fraud attacks are unwelcome and could put a lot of pressure on fraud teams and their ability to investigate alerts promptly. We can pursue stability in multiple ways; again, prudent feature design is paramount to ensuring that the model is stable. For example, stateful features that aggregate over longer time windows tend to be more stable, although they take longer to mature. Additionally, features that track erratic behaviours are suitable candidates for exclusion from the feature set altogether.
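To make "a score distribution that fluctuates over time" measurable, one option is to compare the scores from a reference period with those from a later period. The sketch below uses the Population Stability Index (PSI), a metric commonly used in credit-risk and fraud monitoring; it is my choice for illustration, and the article does not say which metric Featurespace uses. The bin edges and the small floor `eps` are assumptions.

```python
import bisect
import math


def bin_fractions(scores, edges, eps=1e-6):
    """Fraction of scores falling in each bin defined by sorted edges.

    Scores outside [edges[0], edges[-1]] are ignored; the last bin includes
    its upper edge. Fractions are floored at eps to keep the logarithm finite.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for s in scores:
        if s < edges[0] or s > edges[-1]:
            continue
        idx = min(bisect.bisect_right(edges, s) - 1, n_bins - 1)
        counts[idx] += 1
    total = max(1, sum(counts))
    return [max(c / total, eps) for c in counts]


def population_stability_index(reference, current, edges):
    """PSI = sum over bins of (cur - ref) * ln(cur / ref); 0 means identical."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Every term in the sum is non-negative, so the index grows as the two distributions drift apart. Any alerting threshold on it is a judgment call to be agreed with the model governance board.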

Finally, testing is key: you would rather spend two more weeks testing than discover that the first live retrain wrecked your score distribution. It is essential to train your model multiple times on different time periods. Ideally, the model performance and feature importances are stable across runs and time periods. If they are not, you should investigate what changed and amend the features if necessary.
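As a rough illustration of the check described above, suppose each training run on a different time period produces a mapping from feature name to importance. The sketch below flags features whose importance varies a lot across runs, using the coefficient of variation. The input layout, the non-negative importances, and the `max_cv` threshold are assumptions for the example, not a prescribed procedure.

```python
import statistics


def importance_stability(importances_by_period, max_cv=0.5):
    """Summarise how stable each feature's importance is across training runs.

    importances_by_period maps a period label to a dict of
    feature name -> non-negative importance. A feature missing from a run
    is treated as having importance 0.0 in that run.
    """
    runs = list(importances_by_period.values())
    features = sorted({name for run in runs for name in run})
    report = {}
    for name in features:
        values = [run.get(name, 0.0) for run in runs]
        mean = statistics.fmean(values)
        spread = statistics.pstdev(values)
        # Coefficient of variation; a feature that is always 0 counts as stable.
        cv = spread / mean if mean > 0 else 0.0
        report[name] = {"mean": mean, "cv": cv, "stable": cv <= max_cv}
    return report
```

A feature flagged as unstable is not necessarily a bad feature, but it is a good place to start looking when performance differs between periods.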

The closely related concept of robustness is also extremely important. If stability refers to a model behaving predictably under “expected inputs”, robustness refers to a model’s resilience when the inputs are changed. Machine learning researchers and practitioners tend to define the concept of robustness slightly differently. Broadly speaking, a computer system is robust if it can cope with errors during execution. In machine learning, robustness is defined in two ways: as a small train/test discrepancy (which is also related to the concept of generalization) and as robustness to perturbations. The former is usually well addressed by every machine learning practitioner. Addressing the latter, however, requires a bit more work. Defining the type of perturbations you want your model to be robust against is paramount here. After all, even if some models can be more robust than others, no model can be arbitrarily robust against arbitrary noise, unless it always outputs a constant prediction. As such, the first step is understanding what kind of perturbations you expect your model to be affected by. You can then measure their effect on performance and redesign your features if needed. At Featurespace, we devote a considerable amount of time to investigating the robustness of our models.
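The "define the perturbation, then measure its effect" loop can be sketched in a few lines. In the example below, events are plain dictionaries and `score_fn` is a hypothetical scoring function; the two perturbations (a missing field and multiplicative noise on a numeric field) are assumptions chosen for illustration. The sketch measures the shift in the score; with labelled data, the same loop could compare a performance metric before and after the perturbation instead.

```python
import random


def score_shift_under_perturbation(score_fn, events, perturb, seed=0):
    """Mean and maximum absolute score change when each event is perturbed."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    shifts = []
    for event in events:
        perturbed = perturb(dict(event), rng)  # copy: the original stays intact
        shifts.append(abs(score_fn(perturbed) - score_fn(event)))
    return {"mean_shift": sum(shifts) / len(shifts), "max_shift": max(shifts)}


def drop_field(field):
    """Perturbation that simulates a field missing from the incoming event."""
    def _perturb(event, rng):
        event.pop(field, None)
        return event
    return _perturb


def jitter_field(field, relative_noise):
    """Perturbation that scales a numeric field by up to +/- relative_noise."""
    def _perturb(event, rng):
        if field in event:
            event[field] *= 1.0 + rng.uniform(-relative_noise, relative_noise)
        return event
    return _perturb
```

Keeping each perturbation as a small named function makes the list of perturbations you care about explicit, which is the first step the paragraph above calls for.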

Explainability

The last key aspect I want to consider is that of explainability. Let us take a brief detour first: although the machine learning community often uses “explainability” and “interpretability” interchangeably, I believe it is useful to distinguish between the two. An interpretable machine learning model is one where we can fundamentally understand how it arrived at a specific decision. Think of it as a set of rules: you can clearly understand why a particular rule triggered, even if you do not know why a threshold was set the way it is. Interpretability is almost invariably tied to the structure of the model, i.e., it is a passive characteristic. Explainability instead sets a different goal. An explainable model will tell you why a particular decision was made, but not how the model arrived at that decision. Explainability relies on a model being able to clarify or explain its reasoning process. Often, an explainable model is all we need. When declining a payment, knowing that it was declined because the velocity was high and the beneficiary was unexpected is enough to justify the decision, even if we do not know all the internal steps taken by the model, or even if the explanation is not entirely accurate.

Most fraud models benefit from being explainable, even if it is not always a strict regulatory requirement. Although machine learning researchers have given us a variety of tools for adding explainability to an existing model (SHAP being the most notable one), these tools do not come for free. Computing explanations often adds latency, so explainability might need to be factored in from the start.
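To give a flavour of what an explanation such as "the velocity was high and the beneficiary was unexpected" can look like in code, here is a minimal sketch for the simplest case, a linear score. Each feature's contribution is its weight multiplied by its deviation from a baseline value (for a linear model with independent features, this coincides with the SHAP value). The feature names, weights, and baselines are hypothetical, and this is not a description of how Featurespace's models produce explanations.

```python
def reason_codes(weights, baseline, event, top_k=2):
    """Return up to top_k features that pushed a linear score up the most.

    weights:  feature name -> coefficient of the linear score
    baseline: feature name -> typical (e.g. average) value of the feature
    event:    feature name -> value observed for the event being explained
    """
    contributions = {
        name: weights[name] * (event[name] - baseline[name])
        for name in weights
    }
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    # Only features that increased the score are reported as reasons.
    return [(name, value) for name, value in ranked[:top_k] if value > 0]
```

For a linear score this costs one multiplication per feature, so it barely affects latency. For more complex models, computing attributions is usually far more expensive, which is why the cost needs to be considered early.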

Complexity

When building production-grade models, simplicity is key. In engineering, we always try to strike a balance between complexity and performance. Complex tools require more maintenance and are more prone to failure. While more complex models tend to have an edge when it comes to analytical performance, they usually come with an increased maintenance burden. Simplicity significantly helps in achieving some of the other goals as well (a simple model tends to have lower latency, be more explainable, and be more stable over time). Now, this is not about always using logistic regression over more complex models; after all, we have some fancy new models at Featurespace. Instead, this is about resisting the call to always go blindly for the most advanced model without carefully considering all the implications first. What I am calling for is striking a fine balance between simplicity and complexity, depending on the problem at hand.

Conclusion

At Featurespace, we have been building production-grade real-time models for years; it all comes naturally. And yet, if you pause and consider the problem, you can see how much effort goes into addressing all the above issues. It is by no means a simple feat. It comes with years of experience, and a few unexpected failures from which we have learnt a lot. Everything we do is about making sure our models are not only performant but also stable and dependable over time, in line with Featurespace’s goal of making the world a safer place to transact.