Technical note
4 minute read

Building privacy-preserving federated learning to help fight financial crime

Money laundering and other financial crimes add up to a roughly $2 trillion hit to the global economy every year, according to some estimates. Only a small fraction of these illegal transactions are ever detected as such, which means that only about 1% of illicit assets end up being frozen by law enforcement authorities.

One way to improve the detection of suspicious transactions is through more systematic data sharing among banks and other institutions within the international financial system, such as the payment network (PN). Applying machine learning and AI techniques like federated learning to shared data can provide a great deal of insight into tracking the money trail of criminal activity of all sorts. However, there are good reasons for financial organizations to be reluctant to openly share transaction data with one another. Some of these concerns are regulatory; others stem from the fear of losing competitive advantage.

This is where our recent research combining a set of cryptographic techniques comes in. We developed a solution that enables federated learning over a large set of shared data while preserving data privacy, with the goal of improving the detection of anomalies in suspicious financial transactions. We submitted this solution to the U.S. Privacy Enhancing Technologies (PETs) Prize Challenge, announced by the U.S. and U.K. governments in July 2022. Our solution, developed by a team across several IBM Research labs, won second place in the challenge's first phase.

The PETs Prize Challenge

Effective detection of financial crimes requires collaboration among multiple entities that own a diverse set of data. For example, when training over a set of money transfers, the payment network holds the details of the transfers and banks hold account information. Although trust among these entities is limited by regulation and competition, they are all aligned in wanting to improve the detection of suspicious transactions. Federated learning (FL), in particular vertical federated learning (VFL), enables entities to collaboratively train an automated anomaly detection model. However, in the case of international financial transactions, the data is partitioned both vertically and horizontally and, as a result, existing VFL approaches cannot be used in a plug-and-play manner.
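To make the hybrid partitioning concrete, here is a small illustrative sketch of how the data might be laid out (the table layouts and column names are our own assumptions, not the challenge's actual schema): the PN holds transaction-level columns, while account attributes are split across banks and are never visible to the PN.

```python
import pandas as pd

# Hypothetical payment-network view: one row per cross-border transfer.
pn_transactions = pd.DataFrame({
    "tx_id":         ["t1", "t2", "t3"],
    "sender_acct":   ["a1", "a2", "a9"],
    "receiver_acct": ["b7", "b3", "b1"],
    "amount_usd":    [980.0, 12500.0, 310.0],
})

# Hypothetical bank views: account attributes are partitioned horizontally
# across banks (each bank only knows its own customers) and vertically with
# respect to the PN (the PN never sees these columns).
bank_a_accounts = pd.DataFrame({
    "acct_id":  ["a1", "a2", "a9"],
    "kyc_flag": [0, 1, 0],
})
bank_b_accounts = pd.DataFrame({
    "acct_id":  ["b1", "b3", "b7"],
    "kyc_flag": [0, 0, 1],
})
```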

The challenge was to develop a privacy-preserving technology to train a model in this hybrid setting. The challenge comprises three phases: a concept paper proposal, solution development, and red-team attacks on the solution. The first phase ended in November 2022. In this phase, the participants needed to submit a theoretical solution that privately trains a model. The participants' solutions were scored for their security as well as their feasibility and scalability. In the next phases of the competition, the participants will submit an implementation of their solution, which will be tested on data that was not shared with them beforehand.

IBM’s solution

Our submission was developed by a team of researchers with expertise in fields as diverse as homomorphic encryption, differential privacy, privacy-preserving machine learning, federated learning, robustness against adversarial attacks, and graph algorithms.

The proposed solution, Private Vertical FL for Anti-Money Laundering (PV4AML), is a holistic approach that combines several cryptographic, privacy, and machine learning techniques to generate a random forest with the help of an aggregator.

Overview of the PV4AML system consisting of a private training and a private inference process.

Model Architecture

The proposed solution enables a payment network and banks to collaboratively train an ensemble model, in particular a random forest, without learning anything about each other’s private datasets. Choosing an ensemble model enables the team to take advantage of the well-known properties of ensembles to reduce variance and increase accuracy. Conventionally, a random forest consists of greedy decision trees, where features in a tree are chosen greedily using some judiciously defined criterion, such as information gain. The solution proposes to train a random forest consisting of random decision trees (RDT). In a random decision tree, features for the tree nodes are chosen at random instead of using a selection criterion. The structure of a random decision tree is built independently of the training data. The training data is used only to determine labels associated with the leaf nodes of the tree.
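To make the distinction concrete, the sketch below builds a random decision tree over binary features: the split features are drawn at random without looking at any data, and the training set is used only afterwards to accumulate label counts at the leaves. The function names and the binary-feature assumption are ours, purely for illustration.

```python
import random

def build_random_tree(n_features, depth):
    """Build the tree structure by picking split features at random;
    no training data is consulted at this stage."""
    if depth == 0:
        return {"leaf": True, "counts": [0, 0]}  # label counts filled in later
    return {
        "leaf": False,
        "feature": random.randrange(n_features),  # random split feature
        "left": build_random_tree(n_features, depth - 1),
        "right": build_random_tree(n_features, depth - 1),
    }

def route(tree, x):
    """Follow a sample with binary features down to its leaf."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] == 0 else tree["right"]
    return tree

def fit_leaf_labels(tree, X, y):
    """The only data-dependent step: accumulate label counts per leaf."""
    for x, label in zip(X, y):
        route(tree, x)["counts"][label] += 1

# Toy usage: a depth-3 tree over 4 binary features and a few labeled samples.
tree = build_random_tree(n_features=4, depth=3)
fit_leaf_labels(tree, X=[[0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]], y=[0, 1, 0])
```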

Private Feature Engineering

The proposed solution allows the payment network (PN) and each bank to locally engineer complex features. Incorporating statistical features of transaction graphs, including attributes of account nodes and their neighborhoods, can significantly boost the accuracy of the trained model. The PN side applies a pipeline of proven graph-based financial crime detection techniques to the PN data and feeds these results into an ensemble of privacy-preserving decision trees to incorporate the influence of the bank data without exposing the latter to the PN (or vice versa). The features extracted by a participant remain locally at the participant, and both the training and inference protocols are designed to preserve the privacy of the features from the other participants, including the aggregator.
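As a rough illustration of this kind of local feature engineering (the specific statistics below are our own examples, not the pipeline used in the submission), each party can derive per-account neighborhood features from the transaction graph it already holds, without sharing any raw data:

```python
import pandas as pd

# Hypothetical PN-side transaction table (illustrative columns only).
tx = pd.DataFrame({
    "sender_acct":   ["a1", "a1", "a2", "b7"],
    "receiver_acct": ["b7", "b3", "b7", "a2"],
    "amount_usd":    [980.0, 50.0, 12500.0, 400.0],
})

# Simple graph-derived statistics per sending account, computed entirely
# locally: out-degree, total amount sent, and distinct counterparties.
out_features = tx.groupby("sender_acct").agg(
    out_degree=("receiver_acct", "size"),
    total_sent=("amount_usd", "sum"),
    distinct_receivers=("receiver_acct", "nunique"),
)
print(out_features)
```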

Privacy-preserving training

A benefit of using RDTs is that the tree structures can be built independently of the training data. For ease of exposition, the PN builds the tree structures. The most challenging part of the training process is to (privately) compute the label for each leaf node, which may depend on both PN and bank features. The team proposed a novel protocol based on homomorphic encryption (HE) that enables the PN and banks to collaborate in computing the labels of leaf nodes. At the end of this protocol, the PN learns nothing about any bank's account dataset, and no bank learns anything about the PN's transaction dataset or the other banks' account datasets.
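The actual protocol involves multiple banks and an aggregator, but the basic pattern can be sketched with a toy example using additive homomorphic encryption (here the Paillier scheme from the open-source `phe` package, which may differ from the HE scheme used in the submission): a bank encrypts its per-transaction label bits, the PN homomorphically accumulates them into the leaves that its own features route each transaction to, and only the key holder can decrypt the per-leaf counts.

```python
# Toy sketch of accumulating leaf-label counts under additive HE.
# Requires the `phe` package (pip install phe); names are illustrative.
from phe import paillier

# The bank generates a keypair and encrypts one label bit per transaction
# (1 = suspicious according to bank-side information, 0 = not).
pubkey, privkey = paillier.generate_paillier_keypair(n_length=1024)
bank_label_bits = {"t1": 0, "t2": 1, "t3": 0}
enc_labels = {tx: pubkey.encrypt(bit) for tx, bit in bank_label_bits.items()}

# The PN knows which leaf each transaction reaches (from PN-side features)
# and adds the encrypted bits into per-leaf accumulators without decrypting.
leaf_of_tx = {"t1": 0, "t2": 2, "t3": 2}
enc_leaf_counts = {leaf: pubkey.encrypt(0) for leaf in set(leaf_of_tx.values())}
for tx_id, leaf in leaf_of_tx.items():
    enc_leaf_counts[leaf] = enc_leaf_counts[leaf] + enc_labels[tx_id]

# Only the key holder can recover the aggregated counts per leaf.
print({leaf: privkey.decrypt(c) for leaf, c in enc_leaf_counts.items()})
```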

Protecting against inference-time attacks

To protect against inference-time attacks, we incorporate differential privacy by building on the techniques presented here, wherein each bank adds calibrated Laplace noise when computing the counts of 'red' labels at leaf nodes under HE.
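As a rough illustration of the noise calibration (the epsilon and sensitivity values below are placeholders, and in the actual protocol the noisy count is computed under HE rather than in the clear), Laplace noise with scale sensitivity/epsilon is added to a leaf count before it is released:

```python
import numpy as np

def dp_noisy_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise calibrated to the count's sensitivity.

    If a single account changes a leaf count by at most `sensitivity`,
    noise with scale sensitivity/epsilon gives epsilon-differential
    privacy for that count. All values here are illustrative.
    """
    if rng is None:
        rng = np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a bank perturbs the count of 'red' labels at one leaf node.
print(dp_noisy_count(true_count=17, epsilon=0.5))
```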

Conclusion

We believe our unique combination of privacy techniques, scalability, and extensibility to new features makes our solution compelling for real-world deployment. We are continuing to develop our approach and plan to submit results to a top privacy conference.

This work was carried out by a multi-lab IBM Research team with the following members: Nathalie Baracaldo Angel, Nir Drucker, Naoise Holohan, Keith Houck, Swanand Kadhe, Ryo Kawahara, Alan King, Eyal Kushnir, Heiko Ludwig, Ambrish Rawat, Hayim Shaul, Mikio Takeuchi and Yi Zhou.