Generating and disseminating high-quality and secure synthetic data

Fraud and Financial Crime trends in the UK

The Financial Conduct Authority (FCA) in the UK recently issued a call for input on synthetic data to support financial services innovation, in line with its evolving data strategy.

Unauthorised financial fraud losses across payment cards, remote banking and cheques reached £783.8 million in the UK in 2021, and Authorised Push Payment (APP) scams caused a further £479 million in losses, bringing total fraud losses to £1.26 billion. Scams, fraud and money laundering continue to threaten the UK economy and society. Innovation is crucial to getting ahead of the organised criminal networks committing these crimes.

Many financial institutions and innovative fintechs are held back by the restricted data sets on which they can develop their fraud and financial crime prevention models and strategies. With higher-volume data sets on which to build and train models, including high-quality synthetic data, innovators like Featurespace can further improve fraud catch rates and false positive ratios for fraud and money laundering. Currently, Featurespace’s machine learning inventions are able to prevent 75% of fraud attacks and reduce false positive alerts by 75%. With additional data it could be possible to improve on that further, preventing even more fraud and financial crime in the UK.

What is synthetic data?

Synthetic data is privacy-preserving artificial information created to serve as substitute data for developers and businesses. In fraud and financial crime prevention, it takes the form of sequences of synthetic transactions associated with mock financial entities, such as card holders or merchants. Good-quality synthetic data encodes the true temporal statistical properties that describe the transaction behaviour of diverse types of entities, as well as common fraud and money laundering patterns. Synthetic data can therefore be used to build and test fraud prevention systems that can later be deployed in production.
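To make this concrete, below is a minimal, purely illustrative sketch in Python of what a sequence of synthetic transactions for a mock card holder could look like. The field names, distributions and fraud rate are assumptions chosen for illustration, not a description of any production generator.

```python
import random
from datetime import datetime, timedelta

# Hypothetical illustration only: a toy generator of synthetic card
# transactions for a mock entity. Real synthetic data generation would
# model temporal and behavioural statistics learned from real data.

MERCHANT_CATEGORIES = ["grocery", "fuel", "online_retail", "travel", "restaurant"]

def generate_entity_transactions(entity_id, n, seed=0):
    """Produce a time-ordered sequence of mock transactions for one card holder."""
    rng = random.Random(seed)
    ts = datetime(2022, 1, 1)
    transactions = []
    for _ in range(n):
        ts += timedelta(hours=rng.expovariate(1 / 12))  # irregular gaps between spends
        transactions.append({
            "entity_id": entity_id,                      # mock card holder, not a real person
            "timestamp": ts.isoformat(),
            "merchant_category": rng.choice(MERCHANT_CATEGORIES),
            "amount_gbp": round(rng.lognormvariate(3.0, 1.0), 2),  # right-skewed spend amounts
            "is_fraud": rng.random() < 0.01,             # rare, randomly injected fraud label
        })
    return transactions

if __name__ == "__main__":
    for txn in generate_entity_transactions("mock_cardholder_001", n=5):
        print(txn)
```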

Additionally, AI systems capable of producing high-quality synthetic data can help set up balanced study designs, through oversampling of under-represented strata or imputation of missing data, as well as enabling causal reasoning and supporting transfer learning across applications. For instance, data augmentation and oversampling can yield larger data sets, which in combination with other data science innovations can be used to accelerate neural network training, meaning models learn more quickly to accurately identify bad behaviour. My colleague Piotr Skalski has written an article explaining how this kind of model training works.
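As a simple illustration of the oversampling idea, the sketch below randomly duplicates rows of a rare (fraud) class until the training set is balanced. This is a deliberately naive example under assumed data; in practice generative models or more sophisticated augmentation techniques would be used, and the function and variable names here are hypothetical.

```python
import numpy as np

# Hypothetical illustration only: naive random oversampling of a rare
# (fraud) class so a model sees a more balanced training set.

def oversample_minority(X, y, seed=0):
    """Duplicate minority-class rows at random until classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()

    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=deficit, replace=True)

    X_balanced = np.vstack([X, X[extra_idx]])
    y_balanced = np.concatenate([y, y[extra_idx]])
    return X_balanced, y_balanced

# Example: 990 genuine vs 10 fraudulent rows become 990 vs 990.
X = np.random.rand(1000, 5)
y = np.array([0] * 990 + [1] * 10)
X_bal, y_bal = oversample_minority(X, y)
print(np.bincount(y_bal))  # [990 990]
```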

Gartner predicts that by 2024, 60% of the data used for the creation of AI and analytics projects will be synthetically generated.

How is synthetic data being used in financial services?

Financial services data is (quite rightly) highly regulated, because of the sensitive nature and volume of Personally Identifiable Information (PII) within these data sets. But this regulation does make it difficult to share data between financial institutions and those trying to innovate and solve the challenges of fraud and financial crime. Synthetic data is not tied to a real individual and does not contain any real PII. Although there are still security considerations, it can be more easily shared for collaboration on these major industry challenges.

The emphasis on collaboration has been developing in UK financial services for some time, thanks in part to open banking regulation that enshrines the technical capabilities to share data. There are now examples of financial institutions creating programmes and pilots for synthetic data. HSBC has initiated a Synthetic Data as a Service project. The bank is working with The Alan Turing Institute on generating synthetic datasets that preserve both the utility and privacy of its data. The generated synthetic data would enable faster innovation and support collaboration with the wider ecosystem, including fintech start-ups, academics, and technology partners. HSBC has recognised the need for collaboration in the generation phase. Synthetic data is a relatively new area within data science in financial services, meaning that in-house teams will need to develop their expertise, particularly in partnership with academic experts.

Watch Professor Mark Girolami, Chief Scientist at The Alan Turing Institute, speaking at our Cambridge Sessions event in June 2022.

HSBC states that as part of its pilot, it currently has several synthetic data generation tools for structured and unstructured data, a conceptual architecture with data flows and approvals, and an API for Synthetic Data as a Service users (to upload real data, get information on the service’s capabilities, and report on privacy and utility evaluation). The intended outcome is to launch the Synthetic Data as a Service pilot on a Platform as a Service (PaaS) in 2022.
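By way of illustration of what a "utility evaluation" might involve, the sketch below compares the distribution of one numeric field (such as transaction amount) in real versus synthetic data using a two-sample Kolmogorov–Smirnov test. This is an assumed, simplified example for explanation only, not a description of HSBC's actual tooling; real pipelines combine many utility and privacy metrics.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical illustration only: one crude "utility" check a synthetic data
# service might report, comparing the distribution of a numeric field in
# real vs synthetic data.

rng = np.random.default_rng(42)
real_amounts = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)        # stand-in for real data
synthetic_amounts = rng.lognormal(mean=3.1, sigma=0.95, size=10_000)  # stand-in for generated data

statistic, p_value = ks_2samp(real_amounts, synthetic_amounts)
print(f"KS statistic: {statistic:.3f} (closer to 0 means the distributions are more similar)")
```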

How could the FCA ensure high-quality synthetic data?

The FCA call for input specifically requested comments on the role of a central body such as the FCA in gathering, generating, disseminating, and even hosting synthetic data. As a central regulating body, the FCA has the ability to create frameworks to ensure the quality of synthetic data, whether at the financial institutions themselves or in a public utility version of Synthetic Data as a Service, managed and hosted by the FCA. There are quality benefits to creating a synthetic data set from the complete market set of real data, and to limiting the sharing of sensitive data to exchanges between the FCA and individual financial institutions. This approach could overcome some of the data security challenges that have so far held back the financial services industry and deliver the benefits of training fraud and financial crime prevention models on large synthetic datasets.

Featurespace was delighted to contribute its expertise in the field of synthetic data to the FCA; you can download the full submission below.