How Latin dinosaur names are protecting financial data

Over the last decade, financial institutions have been digitising their internal processes, moving away from paper-based filing and legacy applications to modern, cloud-based services that help them stay compliant and extract more value from their data. However, security is crucial, and data protection and privacy have become a major concern for users.

In their quest for the best solutions to mitigate such issues and comply with regulations like the General Data Protection Regulation (GDPR), CIOs have started piloting and implementing new cryptographic innovations such as homomorphic encryption and pseudonymisation. Today we talk with Michael Osborne, a Security Researcher at IBM’s Zurich Research Laboratory, to discover why these techniques are relevant for the financial industry.

What are some of the techniques banks are using to analyse data to look for fraud, while still remaining GDPR compliant?

When it comes to analysing data for fraud, pseudonymisation is a legitimate processing technique within GDPR. However, and this is the key thing with GDPR, the processing must be carried out in such a way that the risk to subjects is acceptable.

As a result, banks are left with a dilemma. On the one hand, if they do not have sufficient analytics and technologies for fraud detection (as required, for example, by anti-money laundering (AML) legislation), they can be fined. On the other hand, if they do it in such a way that there is a breach and a risk to data subjects, they can again be fined, this time under GDPR. Moreover, banks don't want to give away information about their clients, even if they know that they would benefit from collaborative insights.

So, at the end of the day, it’s all about how you can do it.

These are the situations where things like pseudonymisation, fully homomorphic encryption (FHE) and tokenization really come into their own.

Fully homomorphic encryption enables data to be shared or brought together in encrypted form and used to generate insights without ever having to decrypt it. A rough analogy is the way we processed photographs before digital. Before smartphones, photographers would need to put their camera in a black bag and remove the film from the camera to develop it. FHE is the black bag and the data is the photos.
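To make the idea of computing on data that stays encrypted more concrete, here is a minimal sketch in Python. It uses a toy version of the Paillier scheme, which is only additively homomorphic (it can add encrypted numbers), so it is a simpler relative of FHE rather than FHE itself. The tiny primes and all function names are assumptions chosen for illustration, and nothing here is secure or specific to IBM's technology.

```python
import math
import secrets

def keygen(p=1789, q=1861):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid shortcut because we fix g = n + 1
    return n, (lam, mu)

def encrypt(n, message):
    """Encrypt an integer 0 <= message < n under the public modulus n."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(n, private_key, ciphertext):
    """Recover the plaintext with the private key."""
    lam, mu = private_key
    u = pow(ciphertext, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the hidden plaintexts."""
    return (c1 * c2) % (n * n)

# Two amounts are encrypted, summed while encrypted, and only the total is decrypted.
n, private_key = keygen()
encrypted_total = add_encrypted(n, encrypt(n, 1200), encrypt(n, 345))
total = decrypt(n, private_key, encrypted_total)
```

The party doing the addition never sees 1200 or 345; only the key holder can open the result, much like the film that only leaves the black bag once it is developed.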

Another set of technologies that enables similar processes is called secure multi-party computation. Here the data isn’t actually shared directly; instead, the parties use cryptographic protocols to exchange only bits and pieces of information about it. Overall, with both techniques we are able to achieve the same results, but without bringing the data together.
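One simple building block of secure multi-party computation is additive secret sharing, sketched below in Python. It is only an illustration of the general idea of exchanging "bits and pieces": the modulus, the function names and the three-bank scenario are assumptions, and real protocols add authentication and protection against misbehaving parties.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus; all arithmetic is done modulo this value

def share(value, n_parties):
    """Split a value into n random-looking shares that sum to it (mod MODULUS)."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_values):
    """Compute the sum of all parties' values without pooling the raw values."""
    n = len(private_values)
    # Each party splits its own value and would send share j to party j.
    distributed = [share(value, n) for value in private_values]
    # Each party adds up the shares it received; a single share reveals nothing.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Only the partial sums are published and combined.
    return sum(partial_sums) % MODULUS

# Three hypothetical banks learn their combined number of fraud cases.
combined = secure_sum([3, 5, 7])
```

Each bank only ever sees random-looking shares from the others, yet the combined figure is exact.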

The reason fully homomorphic encryption and secure multi-party computation aren’t pervasive yet is that they are around 1000x slower than standard encryption. To put that into context, we are talking about minutes instead of fractions of a second. For situations where you do some analytics and get the results a little bit later, it's not a problem. But if you need to do things before a payment happens, then obviously that's a very time-dependent use case and it's not ideal to use those technologies just yet.

If you want to bring data together, for example, to generate insights over different data sets within your organisation, pseudonymisation and tokenization are a much more efficient way to do it. For example, if 10 competing banks want to share fraud data, without revealing any client information, you can use these technologies to de-identify the data without losing its utility.

As mentioned above, another concept operating in the area of data encryption is homomorphic encryption. Where can we apply it?

For the analytics use cases, where say you want to look for fraud patterns over a few years, fully homomorphic encryption applies very well.

Now, for example, let’s say you want to take out a loan for a nice home in the Swiss Alps and you would like to apply online to various lenders. These lenders will ask about your nationality and home savings, but you are concerned about revealing this to too many vendors. A solution to this problem is known as a zero knowledge proof, where you prove something without actually revealing too much detail. Going back to my example, the proof would confirm that you are a citizen of a country where the bank is licensed to conduct business.
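As a rough illustration of proving something without revealing it, the Python sketch below implements a toy, non-interactive Schnorr proof. It does not prove citizenship; it shows the simpler building block of proving that you know a secret number behind a public value without disclosing that number. The group parameters are deliberately tiny and insecure, and every name here is an assumption made for the example.

```python
import hashlib
import secrets

# Toy group: P = 2*Q + 1 with P and Q prime, and G generates the subgroup of order Q.
P, Q, G = 2039, 1019, 4

def challenge(public_value, commitment):
    """Derive the challenge by hashing the public data (Fiat-Shamir heuristic)."""
    data = f"{G}|{public_value}|{commitment}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q

def prove(secret):
    """Return (public_value, commitment, response) without exposing the secret."""
    public_value = pow(G, secret, P)
    nonce = secrets.randbelow(Q)
    commitment = pow(G, nonce, P)
    c = challenge(public_value, commitment)
    response = (nonce + c * secret) % Q
    return public_value, commitment, response

def verify(public_value, commitment, response):
    """Check G^response == commitment * public_value^challenge (mod P)."""
    c = challenge(public_value, commitment)
    return pow(G, response, P) == (commitment * pow(public_value, c, P)) % P

# The verifier is convinced the prover knows the secret, but never sees it.
public_value, commitment, response = prove(secret=123)
accepted = verify(public_value, commitment, response)
```

Proving a statement about an attribute, such as "my nationality is on the lender's approved list", needs more machinery than this, but it rests on the same principle.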

Zero knowledge proofs can be used within smart contracts, for instance in supply chain blockchains, when one user doesn’t want everybody to see specific information within a smart contract. The user can also hide the business logic by encrypting those smart contracts with fully homomorphic encryption.

Please share with our readers your experience of implementing pseudonymisation for some of IBM’s customers. What were the challenges of developing such projects?

The pseudonymisation that we have created started several years ago in the realm of tokenization. Tokenization is used in the payment card industry to obfuscate the card number in data processing. Our team has extended the concepts around pseudonymisation using lots of different cryptographic techniques so that it can be used not just for bank account numbers, but across all sorts of data. You can de-identify data using these tokens or pseudonyms, but they all remain linkable and, as a result, you don't lose any data utility.
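The property described here, pseudonyms that hide the original value yet stay linkable, can be sketched with a keyed hash (HMAC). This is one common way to build deterministic tokens and is not necessarily how IBM's technology works; the key, field names and sample records below are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-held-by-the-data-owner"  # hypothetical; keep it out of the data set

def pseudonym(value, field):
    """Map a value to a stable token; the same input always yields the same token."""
    message = f"{field}:{value}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:16]

def deidentify(rows):
    """Replace the account number in every record with its pseudonym."""
    return [{**row, "account": pseudonym(row["account"], "account")} for row in rows]

payments = [
    {"account": "12345678", "amount": 250},
    {"account": "87654321", "amount": 90},
]
alerts = [{"account": "12345678", "reason": "unusual country"}]

safe_payments = deidentify(payments)
safe_alerts = deidentify(alerts)

# The two tables can still be joined on the pseudonym, so the analysis keeps its value.
flagged = {alert["account"] for alert in safe_alerts}
linked = [payment for payment in safe_payments if payment["account"] in flagged]
```

Because the mapping is deterministic under one key, the join finds the same payment it would have found on the raw account numbers.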

In terms of use cases, we worked with Rabobank for a number of years and we learned a lot about these techniques because they pseudonymise very complex applications. For example, their payment infrastructure consists of many different technologies, and if they want to test anything they need to produce vast amounts of test data, because under GDPR you are not actually allowed to use data that has not been de-identified for testing.

One of the challenges while working with our clients was finding out whether something had been pseudonymised or not. Sometimes there are quality problems with data, or you often have multiple formats in the same field when data sets get merged. On top of this, we need to be able to deconstruct the data into its individual parts before we apply any of the pseudonymisation techniques.

Also, data might not be clean because we have what we call “nested databases”. Another challenge in the banking environment is the shortcuts that developers use to solve different situations. In fact, most pseudonymisation projects have failed simply because the existing tools are not able to cope with real-world problems, and it’s essentially our cooperation with the clients on the projects that drives viable solutions.

How about lessons learned? Can you share some success stories, as well as ways to overcome challenges, based on your experience?

While working within these complex environments, we’ve discovered how data sets link together and can conclude that the data that we have to handle is very complex.

I will use the example of the IBAN number, which consists of a bank sort code, an account number, a country code and the checksum. If you pseudonymise one field, meaning that you change any bits of the IBAN number, then the account number must be pseudonymised in the same way in other tables in the banking system. There is all sorts of logic in these data sets and we need to preserve the semantics of these fields, which can be complicated.
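The IBAN case can be sketched in Python as follows. The example replaces the account number with a deterministic pseudonym, keeps the country and bank codes, and recomputes the standard mod-97 check digits so that the result still passes validation. It assumes a Dutch-style layout (two-letter country, two check digits, four-character bank code, then the account number); the key and function names are hypothetical and this is not a description of the tooling used at Rabobank.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key"  # hypothetical key

def _mod97(iban):
    """IBAN check: move the first four characters to the end, map A=10..Z=35, take mod 97."""
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97

def is_valid_iban(iban):
    return _mod97(iban) == 1

def pseudonymise_account(account):
    """Deterministically map an account number to another one of the same length."""
    mac = hmac.new(SECRET_KEY, account.encode(), hashlib.sha256).digest()
    return str(int.from_bytes(mac, "big") % 10 ** len(account)).zfill(len(account))

def pseudonymise_iban(iban):
    """Swap in the pseudonymised account number and repair the checksum."""
    country, bank, account = iban[:2], iban[4:8], iban[8:]
    fake_account = pseudonymise_account(account)
    check = 98 - _mod97(f"{country}00{bank}{fake_account}")
    return f"{country}{check:02d}{bank}{fake_account}"

original = "NL91ABNA0417164300"  # widely published example IBAN
masked = pseudonymise_iban(original)
```

Because `pseudonymise_account` is deterministic, a separate table that stores only the account number can be masked with the same function and will still line up with the masked IBAN.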

For instance, to sort out these complexities, Rabobank uses distinctive Latin flower names or, more recently, Latin dinosaur names to pseudonymise data, and it is very easy to spot data that should have had these flower or dinosaur names but doesn’t. Moreover, IBM has created dictionaries, which are gigabytes large, to enable banks and companies to use the right cryptographic techniques to always jump to the right pseudonym, the right flower name for the right field, consistently across many fields.
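A minimal sketch of this dictionary idea in Python: a keyed hash selects an entry from a list of Latin names, so the same input always lands on the same name, and any value that is not in the dictionary stands out as not yet pseudonymised. The six-entry list, the key and the function names are assumptions for illustration; a list this small would produce many collisions, which a far larger dictionary avoids.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key"  # hypothetical key

DINOSAURS = [
    "Tyrannosaurus rex",
    "Triceratops horridus",
    "Velociraptor mongoliensis",
    "Stegosaurus stenops",
    "Diplodocus carnegii",
    "Allosaurus fragilis",
]

def to_dinosaur(value, field):
    """Pick a dictionary entry deterministically from the keyed hash of the value."""
    mac = hmac.new(SECRET_KEY, f"{field}:{value}".encode(), hashlib.sha256).digest()
    return DINOSAURS[int.from_bytes(mac, "big") % len(DINOSAURS)]

def find_unpseudonymised(values):
    """Return every value that is not a dictionary name, i.e. likely real data."""
    known = set(DINOSAURS)
    return [value for value in values if value not in known]

column = [to_dinosaur("Jansen", "surname"), "De Vries", to_dinosaur("Bakker", "surname")]
leaks = find_unpseudonymised(column)
```

Here `leaks` contains only "De Vries", the one surname that slipped through without being replaced.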

Interestingly, these real-world problems and solutions essentially drive forward our innovation, in the sense that you don't come across these things in an academic environment. And to conclude, regarding the technologies we’ve discussed, if you have large volumes of data, pseudonymisation is perfect. If you have low volumes of very complex data, then there might be a more interesting case for fully homomorphic encryption.

About Michael Osborne

Mr Osborne currently leads the security and privacy activities at the IBM Research Lab in Rüschlikon, Switzerland, and is the worldwide lead for IBM Q Security and Encryption. His current focus includes leading the IBM Research division’s Quantum Safe Cryptography efforts to develop and standardise quantum-resistant technology and transferring this technology to IBM’s products and services. A second area includes the development of cryptographic pseudonym security technology that enables data to be protected while it is being used. This includes the work at Truata to create a GDPR-compliant data trust for extracting business insights from anonymised data.

About IBM Research

IBM Research is one of the world’s largest and most influential corporate research labs, with more than 3,000 researchers in 12 labs located across six continents. We play the long game, investing now in tomorrow’s breakthroughs. Our scientists are charting the future of artificial intelligence, breakthroughs like quantum computing, how blockchain will reshape the enterprise and much more.