Transfer Learning for Cold-Start Binary Classification Problems

Michael Varley

Lead Data Scientist

Michael is a Lead Data Scientist at Featurespace, responsible for artfully guiding customers through the design and deployment of their machine learning systems whilst also leading one of our teams of Data Scientists. He has worked on multiple deployments and proofs of concept for key customers and prospects since joining in 2018, covering fraud and anti-money laundering for tier 1 banks, payment processors and card networks worldwide. When you can pull Michael away from the keyboard, he's an avid skier and can be found in the Alps during most holidays.

Introduction

Consider a generic scenario in which an organisation has a new event-driven data stream coming online; this could be a novel source of web traffic or a new stream of payments. This problem may be regarded as a complete data ‘cold-start’ scenario, insofar as the organisation has never seen this data stream previously. Nonetheless, from the moment the data stream is turned on, the organisation wishes to make a prediction about a target variable associated with these events. It may be assumed that the target variable is not included in the data stream and that the event will only be labelled at a later stage of the data processing pipeline. For example, an organisation with a new stream of web traffic might wish to determine whether messages within that stream are generated by a bot. Similarly, a bank with a new source of transaction data might wish to identify fraudulent transactions. In either case, the ‘ground truth’ is not known at the point when the prediction needs to be made; however, based on the prediction made by the model, the organisation can then take appropriate action even in the absence of the ‘ground truth’.

According to conventional wisdom, the only viable solution to this problem would be to manually implement business logic (i.e. rules) to make predictions based on prior domain knowledge about the new data stream. A rules-based implementation would nonetheless have several disadvantages: depending on the domain, rules can be significantly less performant than models, and they provide discrete predictions of the outcome rather than a continuous probabilistic ‘score’. Implementing a machine learning model based on the data stream is clearly not possible due to the lack of historical data. However, suppose the organisation in question has access to similar data streams which share some characteristics with the new data stream. It may be possible to exploit such data streams in order to design a more efficacious prediction system at the point of the data cold-start via transfer learning.

This article outlines the details of just such a scenario, in which Featurespace implemented a transfer-learned model for a novel data stream by utilising similar datasets. It focuses less on numerical results or low-level technical descriptions of the end product and more on the high-level challenges and considerations which a data scientist should address when tackling this type of problem. The aim is to outline a framework for feature selection and a general modelling approach for transfer-learned models.

Problem Statement

In 2020, a prospective customer approached us with a challenge. The customer in question is a payment processor which provides services to a number of large financial institutions. However, prior to their initial engagement with Featurespace they had focused exclusively on processing card-present (CP) payments. With the growth of the ecommerce market, they were now looking to expand their services to include the processing of card-not-present (CNP) transactions. However, CNP transactions are significantly riskier than CP transactions, and consequently our customer wanted to start providing a fraud detection solution to their member banks as part of the CNP processing service.

The customer insisted that a policy-driven rules-based approach with binary outcome predictions would not be a viable solution here, since they did not wish to issue outright accept/decline recommendations on transactions. Rather, they wanted to provide an indication of the overall risk associated with a given transaction using a score produced by a model and then let their own customers decide on the most appropriate course of action in accordance with their risk-appetite. Consequently, only a model-based approach with a continuous output indicating transaction riskiness would be satisfactory.

Detailed Problem Overview

Our customer’s plan was to pursue a phased rollout of their processing solution within the CNP market, whereby they would initially onboard lower risk traffic before rolling the solution out to increasingly high-risk sectors of the market. The progress of this rollout would not only be governed by the underlying risk factors present in the transaction stream but also by the sales process. Sales of this product were made at the level of payment gateways; a typical gateway will provide payment processing services for a suite of merchants and, therefore, once a gateway was onboarded to the solution it would likely onboard a significant number of eligible merchants to the system simultaneously. Clearly, this would result in systematic step changes in the composition of the data portfolio as more traffic was onboarded.

The customer wanted the model to be in place at the beginning of this rollout process; therefore, the model had to be effective from day 1 with a complete data cold-start and remain performant as the underlying data composition evolved. Particularly during the initial period after the model was put live, there was a presupposition that the number of labels available would not be sufficient to allow for regular retraining of the live model, on account of the low data volumes and low-risk character of the transactions. However, there would be the opportunity to optimise the solution for the specific dataset once a sufficient number of labels had been accrued.

Our Approach

The solution was intended to be a single model serving the entire cohort of card-issuing banks to which the payment processor provided services. The payment processor had not processed any relevant CNP data previously across the entire portfolio. However, we were granted access to historical data from three analogous institutions, hereafter referred to as the offline datasets. For the purposes of this problem, it can be assumed that the available data elements in the offline historical datasets are consistent with the data elements that will be available to the live data stream.

Therefore, whilst we were not in a position to build a model on data from the entire portfolio, we nonetheless had access to representative data from institutions that could be considered to form a subsection of it.

We took a ‘leave one dataset out’ approach to validation. Specifically, the performance of the model was measured by training a classifier using feature vectors from two of the offline datasets and then validating on the third; this procedure was repeated for each of the three possible choices of held-out dataset. The model which generalised best to the ‘held out’ datasets was considered best-in-class. Our underlying assumption was that models that transferred well to other datasets would similarly transfer well to the live system.
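The validation loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the production setup: the dataset names, the toy single-feature data and the `fit_midpoint_threshold` and `accuracy` helpers are all hypothetical stand-ins for a real feature pipeline, classifier and metric.

```python
def leave_one_dataset_out(datasets, fit, score):
    """Train on all datasets except one, validate on the one held out.

    datasets: {name: (X, y)} where X is a list of feature vectors and
    y is a list of binary labels. Returns {held_out_name: score}.
    """
    results = {}
    for held_out in datasets:
        train_X, train_y = [], []
        for name, (X, y) in datasets.items():
            if name != held_out:
                train_X.extend(X)
                train_y.extend(y)
        model = fit(train_X, train_y)
        test_X, test_y = datasets[held_out]
        results[held_out] = score(model, test_X, test_y)
    return results


# Hypothetical stand-in model: a threshold halfway between the class means
# of a single feature. A real model would replace both helpers below.
def fit_midpoint_threshold(X, y):
    class0 = [x[0] for x, label in zip(X, y) if label == 0]
    class1 = [x[0] for x, label in zip(X, y) if label == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2.0


def accuracy(threshold, X, y):
    predictions = [1 if x[0] > threshold else 0 for x in X]
    return sum(p == label for p, label in zip(predictions, y)) / len(y)


# Toy data for three hypothetical offline datasets.
offline = {
    "bank_a": ([[1], [2], [8], [9]], [0, 0, 1, 1]),
    "bank_b": ([[0], [3], [7], [10]], [0, 0, 1, 1]),
    "bank_c": ([[2], [4], [6], [9]], [0, 0, 1, 1]),
}
results = leave_one_dataset_out(offline, fit_midpoint_threshold, accuracy)
```

Each entry in `results` is the score obtained on a dataset the model never saw during training, which is the quantity used here as a proxy for performance on the live stream.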

As might be expected, the transferred models did not perform as well as models with training data drawn from the same dataset, but the discrepancy in performance was pleasingly small. At a fixed false positive rate of 0.5%, transferred models caught 63% to 85% as much fraud as otherwise identical models (i.e. models with the same features and hyperparameters) where the training set and test set were drawn from the same data source.
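For readers who want to reproduce this kind of comparison, the metric can be sketched as follows. The function name `recall_at_fpr`, the threshold convention (strictly greater than the cut-off) and the toy scores are assumptions made for illustration; the toy example uses a 25% false positive budget purely so that the arithmetic is visible on eight transactions, whereas the figures quoted above were measured at 0.5%.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Fraction of positives (fraud) scored above the highest threshold
    that flags at most max_fpr of the negatives (genuine events)."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative label")
    allowed = int(max_fpr * len(negatives))  # false positives we may spend
    if allowed >= len(negatives):
        threshold = float("-inf")
    else:
        # Flagging strictly above this score yields at most `allowed` false positives.
        threshold = negatives[allowed]
    return sum(1 for s in positives if s > threshold) / len(positives)


labels = [0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical scores from a model trained and tested on the same source...
in_domain_scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.8, 0.9]
# ...and from an otherwise identical model trained on other datasets.
transferred_scores = [0.1, 0.2, 0.4, 0.7, 0.3, 0.5, 0.8, 0.9]

in_domain_recall = recall_at_fpr(in_domain_scores, labels, max_fpr=0.25)
transferred_recall = recall_at_fpr(transferred_scores, labels, max_fpr=0.25)
relative_recall = transferred_recall / in_domain_recall
```

Here `relative_recall` plays the role of the ‘as much fraud’ ratio: the share of the in-domain model's detections that the transferred model retains at the same false positive rate.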

Feature Selection

The first consideration when building a transfer-learned model is the feature selection process. A set of features which is close to optimal in a ‘traditional’ machine learning scenario (i.e. one where the training set and test set are drawn from the same data source) is unlikely to be optimal in a setup where the test set comes from a different data source to the training set. This is due to the potential for systematic feature discrepancies between the two data sources. When analysing this, it is helpful to consider two categories of feature in the context of a binary classification problem. Each feature has two class-conditional probability density functions (PDFs), one for each value of the binary target variable (i.e. 1 and 0). Domain-consistent features are those where both of these PDFs are consistent across all datasets in the same domain; domain-variable features are those where the PDFs vary significantly from one dataset to another.

To take an example from fraud detection models, card velocity (i.e. the rate at which a particular credit or debit card is making purchases) is approximately domain-consistent: in almost all data sets high velocity is a good indicator of fraud; furthermore, the distribution of the number of transactions made by cardholders over a specified amount of time is unlikely to vary drastically between datasets drawn from different banks. By contrast, the transaction amount is somewhat variable as a feature across the domain: in some datasets, high transaction values may be strong indicators of fraudulent behaviour, whilst in others the predominant fraud modality may be low-value ‘tester’ transactions.

Given the importance of establishing the consistency of information provided by a given feature across domains, transfer-learning problems are best tackled by building the model on more than one external offline dataset. This is particularly true if the discrepancies between the offline datasets themselves are considered analogous to the discrepancies between the offline datasets and the new live stream of data. Having access to two or more different offline datasets permits the model designer to compare PDFs for every candidate feature across the different datasets and for both values of the target variable, and thereby select only domain-consistent features for the model. Feature mean-scaling, selective binning and more bespoke feature transformations can help ensure that PDFs align across different datasets, even if the raw feature exhibits some variation across the domain.
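One way to make the PDF comparison concrete is to bin each candidate feature on shared bin edges, separately for each dataset and each value of the target, and then measure how far apart the resulting histograms are. The sketch below uses total variation distance as that measure; this choice, the function names, the bin edges and the toy values are all assumptions for illustration, not a description of the method used in the project.

```python
from itertools import combinations


def histogram(values, edges):
    """Normalised histogram on shared bin edges; values beyond the last
    edge fall into the final bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def worst_class_conditional_distance(samples, edges):
    """samples: {dataset: {label: [feature values]}}.

    Returns the largest distance between any pair of datasets for either
    value of the target. Small values suggest a domain-consistent feature.
    """
    worst = 0.0
    for a, b in combinations(samples, 2):
        for label in (0, 1):
            distance = total_variation(
                histogram(samples[a][label], edges),
                histogram(samples[b][label], edges),
            )
            worst = max(worst, distance)
    return worst


# Toy card velocity: fraud is high-velocity in both hypothetical banks.
velocity = {
    "bank_a": {0: [1, 1, 2, 2], 1: [5, 6, 6, 7]},
    "bank_b": {0: [1, 2, 2, 1], 1: [6, 5, 7, 6]},
}
# Toy amounts: high-value fraud in one bank, low-value testers in the other.
amount = {
    "bank_a": {0: [10, 20, 30, 40], 1: [500, 600, 700, 800]},
    "bank_b": {0: [10, 20, 30, 40], 1: [1, 2, 1, 2]},
}
velocity_gap = worst_class_conditional_distance(velocity, edges=[0, 3, 10])
amount_gap = worst_class_conditional_distance(amount, edges=[0, 100, 1000])
```

In this toy setup the velocity feature would be retained and the raw amount feature would be rejected or transformed, mirroring the card velocity and transaction amount examples discussed earlier.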

Data Scaling and Feature Design

Another facet of the data rollout process of which the Featurespace data scientists were cognisant when building the model was that the extent of visibility within the data portfolio would change substantially over time. As more traffic was onboarded, the number of merchants within the portfolio would increase, and the proportion of transactions for any given cardholder that would be visible to the model would likewise rise. Therefore, features ideally had to be designed to be as robust as possible to visibility changes in the data portfolio (i.e. changes in data volume should not impact the distribution of the feature). The mean-scaling strategy mentioned in the context of feature selection has the added benefit of making features more robust to such changes, especially if the mean-scaling is implemented as a time-windowed rolling mean of the unadjusted feature value.
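A time-windowed rolling mean of this kind can be sketched as below. The function name, the choice to include the current event in its own window and the toy event stream are assumptions for illustration; a streaming system would typically maintain this state incrementally in much the same way.

```python
from collections import deque


def rolling_mean_scale(events, window):
    """Divide each raw feature value by the mean of the raw values seen in
    the trailing time window (t - window, t], including the current event.

    events: iterable of (timestamp, raw_value), sorted by timestamp.
    """
    buffer = deque()
    running_total = 0.0
    scaled = []
    for timestamp, value in events:
        buffer.append((timestamp, value))
        running_total += value
        # Evict events that have aged out of the look-back window.
        while buffer[0][0] <= timestamp - window:
            _, old_value = buffer.popleft()
            running_total -= old_value
        mean = running_total / len(buffer)
        scaled.append(value / mean if mean else 0.0)
    return scaled


# Toy stream: the raw feature level doubles at t = 3, as might happen when
# newly onboarded traffic changes what the model can see.
stream = [(0, 10.0), (1, 10.0), (2, 10.0), (3, 20.0), (4, 20.0), (5, 20.0)]
scaled = rolling_mean_scale(stream, window=2)
```

The scaled feature sits at 1.0 before the step change, spikes briefly while the window contains a mixture of old and new levels, and returns to 1.0 once the window has rolled past the change.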

Additionally, stateful features designed to capture the short-term behaviour of cards within the dataset can be presented to the model as a ratio with respect to longer-term behavioural indicators of that same card. This normalisation process also aims to reduce shifts in feature distribution which might occur as a result of changing dataset composition. When a volume increase across the data portfolio produces a shift in the ‘numerator’ short-term behavioural indicator, the ‘denominator’ longer-term behavioural indicator will likewise adjust once the look-back window has elapsed, so that the feature distribution is unaffected in the long term. There will still be some impact in the short term, while the short-term behavioural state has burned in but the longer-term one has not; however, this can be mitigated by ensuring that the look-back period of the long-term stateful component of the feature is not excessive.
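As a rough illustration of such a ratio feature, the sketch below compares a card's transaction rate over a short window with its rate over a longer window. The function name, the window lengths (in hours) and the toy timestamps are hypothetical; the point is only that doubling the visible volume for a card leaves the ratio unchanged once both windows have burned in.

```python
from bisect import bisect_right


def velocity_ratio(timestamps, now, short_window, long_window):
    """Short-term transaction rate divided by long-term rate for one card.

    timestamps: sorted event times for the card. Each rate counts events
    in (now - window, now] and divides by the window length.
    """
    def rate(window):
        first = bisect_right(timestamps, now - window)
        last = bisect_right(timestamps, now)
        return (last - first) / window

    long_rate = rate(long_window)
    if long_rate == 0:
        return 0.0  # no history yet: fall back to a neutral default
    return rate(short_window) / long_rate


# A card seen once per hour for 24 hours...
partial_view = list(range(1, 25))
# ...and the same card once twice as many of its transactions are visible.
fuller_view = sorted(partial_view + partial_view)

ratio_partial = velocity_ratio(partial_view, now=24, short_window=4, long_window=24)
ratio_fuller = velocity_ratio(fuller_view, now=24, short_window=4, long_window=24)
```

Both ratios come out equal even though the raw short-term counts differ by a factor of two, which is the stability property the normalisation is intended to provide.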

Classifier Selection

Fraud labels are usually readily available for card-fraud detection problems on large datasets; therefore, it’s often possible to build models with relatively complex classifier types such as tree ensembles. However, when looking for a classifier that readily transfers across multiple data sets, a high level of complexity is not always a bonus. Complex classifiers are more likely to overfit to the characteristics of one particular data set. We found that for transfer-learning problems, simple linear classifiers exhibited similar performance to more complex classifiers; moreover, such a classifier would require a smaller number of fraud labels for a model retrain after go-live. Therefore, we chose to implement a simple logistic regression classifier for this particular model.
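To show how little machinery a classifier of this type involves, here is a dependency-free sketch of logistic regression trained by batch gradient descent. It is purely illustrative: the learning rate, epoch count, function names and toy data are assumptions, and in practice an established library implementation would normally be used instead.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(X, y, learning_rate=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def risk_score(model, features):
    """Continuous score between 0 and 1 rather than an accept/decline decision."""
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


# Toy single-feature training set, e.g. a short-to-long-term velocity ratio.
train_X = [[0.2], [0.5], [0.9], [2.5], [3.0], [3.5]]
train_y = [0, 0, 0, 1, 1, 1]
model = fit_logistic_regression(train_X, train_y)
low_risk = risk_score(model, [0.5])
high_risk = risk_score(model, [3.0])
```

The model has one weight per feature plus a bias, so comparatively few fraud labels are needed to re-estimate it, and its output is a continuous score of the kind the customer required.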

Conclusion

The above framework provides a summary of Featurespace’s considerations when developing a transfer-learned model for a novel data stream with changing data visibility. It’s easy to imagine such a framework finding application in a number of domains which may involve data cold-starts and where a model-based scoring framework is required from inception. This approach shows that the requirement to have fully representative historical data from which to build a model can perhaps be circumvented in scenarios where analogous data sources are available and where the ways in which data can vary between data sets are well understood. As organisations look to diversify away from simple rules-based systems, transfer learning offers the potential for organisations to introduce advanced prediction strategies earlier in their technology rollout processes than would normally be achievable with a standard machine-learning approach.