Nationwide unlocks rapid innovation with synthetic data

Published on 16 Dec 2020.

Case study: Hazy synthetic data cuts time for Nationwide to evaluate innovation partners from six months to three days

Nationwide Building Society is pioneering rapid innovation in the financial sector by increasingly collaborating with vendors in the burgeoning fintech ecosystem in order to remain competitive. A major hurdle for Nationwide in getting the full value out of this initiative is provisioning representative data to third parties, because of strict governance controls around its raw data assets. This process can take six months or more to clear, which stifles the pace of innovation.

Hazy successfully delivered a safe synthetic transaction data generator to Nationwide Building Society that cut the data provisioning process down to three days.

Hazy synthetic data preserves real customer behaviour without exposing any customer information or transactions, making the synthetic data 100 percent safe and fully representative of the real data. This means it can be used as a drop-in replacement for real data in existing workflows, allowing Nationwide to unlock third-party collaborations and increase the pace of innovation.

Summary

The banking industry is rapidly changing and the need to innovate has never been more critical to remain competitive. Transformation is being driven by factors such as new entrants, ease of switching, and digital banking services. In addition, technological developments in other sectors are increasing expectations from customers about how they manage their money. However, the way banks and building societies are innovating is also changing: in-house innovation teams are increasingly working with third parties to innovate faster by capitalising on their specialised expertise and applications. This approach offers huge potential for banks and building societies to tap into additional capability faster and more cost-effectively but faces a significant challenge: the ability to share data.

Sharing data for analysis is an operational requirement for financial institutions, enabling them to gain insights that directly support business imperatives including innovation, fraud detection and credit risk. Such insights enable in-house teams and third parties to build, shape and deliver propositions derived from understanding customer behaviours based on their transactions.

Data sharing is quite rightly subject to strict governance, security, regulation and legislation.

Third-party integrations and interfaces are also mandated by UK Open Banking and the second Payment Services Directive (PSD2).

In some cases, these regulations make sharing data across regional borders or organisations impossible which would otherwise allow for even greater insights.

Synthetic data is a new paradigm for sharing information safely and responsibly for innovation in financial services. Hazy synthetic data uses artificial intelligence and machine learning to create synthetic data from securely held customer data which does not leave its protected environment. Hazy’s software extracts the statistical information and relationships within the data; the resulting synthetic data contains none of the original data, so it cannot be traced back in any way to the source, meaning that customers are 100 percent protected. Synthetic data can therefore be used by internal teams and third parties freely and safely to analyse and validate commercial innovations quickly.

Nationwide Building Society and Hazy have worked together to address these challenges head-on and have removed three major barriers to sharing transactions with third-party partners more safely and quickly:

  1. Generating synthetic data that preserves the statistical characteristics of the original data sufficiently for behavioural analysis of current account transactions
  2. Substantially reducing the time and cost of creating safe data from months to days
  3. Sharing synthetic data via the cloud without risk

This is the first time that synthetic data has been proven to preserve the time-sensitive nuances of customer banking transactions in a form that can be shared safely with external parties in a production environment.

It is also a transformational play for Nationwide in proving that synthetic data is sufficiently representative of real data to increase their speed to innovation and sets a benchmark for driving data agility and eliminating security concerns for sharing data.

The Challenge

Companies face multiple challenges to sharing data. Key amongst these is the ability to transfer the patterns of consumer behaviour in the data needed to feed the analytics they want to run, without the need to transfer the real data. Another important challenge is that Nationwide wants to do this without forcing their analytical partners to ingest an entirely new type of data structure. In other words, they want a drop-in replacement for the real data that has the same schema and properties.
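
One way to picture the drop-in requirement is a simple schema check: the synthetic file should expose the same columns, in the same order, with the same basic types as the real one. The sketch below is illustrative only. The column names, the sample rows and the type-inference rule are assumptions made for this example, not details of Nationwide's data or of Hazy's software.

```python
import csv
import io


def infer_type(value: str) -> str:
    """Classify a CSV cell as 'int', 'float' or 'str'."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "str"


def schema_of(csv_text: str) -> list[tuple[str, str]]:
    """Return (column name, inferred type) pairs, in column order,
    using the first data row of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    first_row = next(reader)
    return [(column, infer_type(value)) for column, value in first_row.items()]


# Hypothetical columns and values, for illustration only.
real_csv = "account_id,date,amount\n1001,2020-01-31,-42.50\n"
synthetic_csv = "account_id,date,amount\n7,2020-02-03,-17.25\n"

# True when the synthetic file can stand in for the real one structurally.
same_schema = schema_of(real_csv) == schema_of(synthetic_csv)
print(same_schema)
```

A check like this says nothing about whether the values are realistic; it only confirms that a partner's existing ingestion code would not need to change.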

Techniques such as data masking and data anonymisation, which have until now been used to protect the privacy of customers’ data, have known weaknesses.

A further challenge is that the time it takes to create masked data depends on complexity and size. There are also limits to the quality and utility of the output compared to production data. This is an industry challenge as the whole process can take six months or more which significantly reduces the capacity to innovate and collaborate with third parties.

Synthetic data addresses these challenges, enabling Nationwide to obtain representative, reusable customer transaction data that contains no personally identifiable information and can be shared with third parties for validation of their capabilities and for innovation exercises. Such capability enables Nationwide to properly assess third-party technologies without exposing the Building Society to risk or requiring a lengthy governance process to obtain data.

The Solution

Hazy synthetic data is sufficiently representative of real transaction data and preserves the signal — the statistical properties of the original data — and can be used as a drop-in replacement for real data.

Hazy synthetic data preserves the statistical properties and patterns of consumer transactional behaviour without any of the privacy concerns.

This signal is required to analyse how customers manage their money which, in most cases, is very similar: making sure bills are paid and their account stays in credit each month. However, transaction behaviours that fall outside of the norm may indicate fraud or pivotal events such as unemployment. Identifying these behaviours requires high statistical fidelity within that signal and depends critically on when transactions occur. These are known as signatures: characteristics of behaviour that lead to specific outcomes.

Hazy’s synthetic data generation software was trained on a dataset of 30 million customer transactions from an 18-month period, representing more than 8,800 customers with nearly 10,000 accounts between them. Once trained, the resulting synthetic data model can be used to generate synthetic datasets of arbitrary size on demand.
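
The "train once, generate any size" idea can be sketched with a deliberately tiny stand-in model. This is not Hazy's generator, which is far richer and models whole transaction sequences. The example below assumes a single column of positive amounts and fits only two summary statistics (the mean and standard deviation of the log-amounts), then samples new values from them. The training values, function names and the log-normal choice are all assumptions made for illustration.

```python
import math
import random
import statistics


def fit(amounts: list[float]) -> tuple[float, float]:
    """Summarise positive amounts by the mean and standard deviation
    of their logarithms. Only these two numbers are kept."""
    logs = [math.log(a) for a in amounts]
    return statistics.mean(logs), statistics.stdev(logs)


def generate(model: tuple[float, float], n: int, seed: int = 0) -> list[float]:
    """Draw n synthetic amounts from the fitted log-normal distribution."""
    mu, sigma = model
    rng = random.Random(seed)
    return [round(rng.lognormvariate(mu, sigma), 2) for _ in range(n)]


# Made-up spending amounts standing in for real data.
training_amounts = [12.50, 40.00, 7.99, 120.00, 55.25, 9.10]
model = fit(training_amounts)

# The same fitted model serves requests of any size.
small = generate(model, 5)
large = generate(model, 10_000)
print(len(small), len(large))
```

Because the model holds only summary statistics, the size of the output is independent of the size of the training set, which is the property the paragraph above describes.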

Within this dataset there is a rich variety of characteristics and patterns of behaviour that need to be learned in order to produce a synthetic dataset that is fit for purpose. Here are a few key examples for illustration:

To verify that the synthetic data successfully preserved these characteristics, a battery of metrics from the Hazy toolkit was used.

Having set out the performance metrics for success, the Hazy team fine-tuned the synthetic data generator model and used Hazy’s interactive visualisation tool to enable direct A/B comparison of real and synthetic data.
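
To give a flavour of what a real-versus-synthetic comparison can look like, here is a minimal sketch of one generic similarity measure, histogram intersection, applied to a single numeric column. It is offered as an illustration of the idea only and is not claimed to be one of the metrics in the Hazy toolkit. The bin edges and sample values are invented for the example.

```python
def histogram(values: list[float], edges: list[float]) -> list[float]:
    """Return the fraction of values in each bin [edges[i], edges[i+1]).
    Values outside the edges are ignored; at least one value must fall
    inside them."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def histogram_similarity(real: list[float], synthetic: list[float],
                         edges: list[float]) -> float:
    """Histogram intersection: 1.0 when the binned distributions match,
    0.0 when they do not overlap at all."""
    p = histogram(real, edges)
    q = histogram(synthetic, edges)
    return sum(min(a, b) for a, b in zip(p, q))


edges = [0, 10, 20, 30, 40]
real = [5, 15, 25, 35]
close_match = [6, 14, 26, 36]    # same spread across bins
poor_match = [35, 36, 37, 38]    # everything piled into the last bin

print(histogram_similarity(real, close_match, edges))
print(histogram_similarity(real, poor_match, edges))
```

A single-column measure like this cannot capture timing or dependencies between transactions, which is why a battery of complementary metrics is needed for sequential data.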

The results were outstanding and indicate, we believe for the first time, a new level of signal preservation in sequential synthetic data. The graph below shows an example of a realistic synthetic account history.

Example of account balance for a synthetic customer. Synthetic transaction data is indistinguishable from real data.

The second proof point was demonstrating how the time to create securely share-able data could be reduced from months to days. A detailed business analysis was undertaken comparing the current process of data masking to the new process of synthetic data generation, with a focus on reducing time. Then, Nationwide went further to evaluate the complete end-to-end process from identifying a use case and onboarding of third parties through to the third proof point of sharing synthetic data with them for analysis. This puts synthetic data at the heart of the workflow as shown below.

Previous approach:

  1. Use case identified, 3rd party engaged and scope agreed
  2. Engage with procurement & controls communities to shape engagement
  3. Engage business to source data & design data set
  4. Secure resource to extract, cleanse and anonymise data as appropriate
  5. Create dataset and find physical method of sharing with 3rd party

Elapsed time: approx 6 months depending on requirements

Process with synthetic data:

  1. Use case identified, 3rd party engaged and scope agreed
  2. Reusable one page data agreement issued to 3rd party
  3. Generator run in Azure to create dataset, output CSV saved in Azure storage blob
  4. Shared access link to dataset created with expiry time (after 2 days)
  5. Generator output parameters are logged in Excel for audit trail
  6. Links shared with 3rd party so they can access data

Elapsed time: approx 2–3 days

Generating data for third parties workflow
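
The bookkeeping side of the synthetic data process (a link that expires after two days and a logged record of each generator run) can be sketched as follows. This is a minimal illustration, not Nationwide's implementation. The field names, the `seed` parameter, the dataset name and the helper functions are all assumptions, and the creation of the Azure shared access link itself is deliberately left out. The record is written as CSV, which Excel can open.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

LINK_LIFETIME = timedelta(days=2)  # expiry stated in the workflow


def audit_record(dataset_name: str, n_rows: int, seed: int,
                 created: datetime) -> dict:
    """Build one audit-trail row for a generated dataset and its share link.
    The choice of fields is hypothetical."""
    return {
        "dataset": dataset_name,
        "rows": n_rows,
        "seed": seed,
        "created_utc": created.isoformat(),
        "link_expires_utc": (created + LINK_LIFETIME).isoformat(),
    }


def link_is_valid(record: dict, now: datetime) -> bool:
    """A share link is usable only before its recorded expiry time."""
    return now < datetime.fromisoformat(record["link_expires_utc"])


created = datetime(2020, 12, 16, 9, 0, tzinfo=timezone.utc)
record = audit_record("synthetic_transactions.csv", 100_000, seed=42,
                      created=created)

# Serialise the record as CSV so it can be kept alongside other audit rows.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
audit_csv = buffer.getvalue()
print(audit_csv)
```

Recording the expiry time next to the generator parameters means an auditor can later establish both what was shared and for how long it was reachable.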

The Benefits

The headline benefit from our collaboration is that Nationwide can now innovate faster. Using Hazy synthetic data enables Nationwide to preserve the behavioural and temporal characteristics of production data to rapidly create and provision representative synthetic data sets for third parties to perform analysis. The full set of measurable benefits that using synthetic data has created include:

Combined, these represent a significant step-change in delivery of value to Nationwide in terms of speed, security and costs.

Conclusion

Nationwide and Hazy have solved the historic challenge of creating safe, representative, compliant data to share with third parties so that, through analysis, they can generate meaningful actionable insights into account behaviours which directly enhances the speed of innovation.

Rapid innovation is a vital driver to remain competitive in an evolving landscape that is being transformed by unforeseen changes to the economy, aggressive new entrants, new regulations such as Open Banking, and challenger FinTech platform strategies. The Hazy synthetic data solution has shown it can meet these challenges head on.


This post was originally written by Dr. Alexander Mikhalev, entitled “Hazy: Synthetic Data to fuel Rapid Innovation” on the Nationwide Technology Blog.

For the technical background to this project, please read our Head of Data Science Armando Vieira's research “Generating Synthetic Sequential Data using GANs” published in Towards Artificial Intelligence.
