Case study: Hazy synthetic data cuts time for Nationwide to evaluate innovation partners from six months to three days
Nationwide Building Society is pioneering rapid innovation in the financial sector by increasingly collaborating with vendors in the burgeoning fintech ecosystem in order to remain competitive. A major hurdle to getting the full value out of this initiative is provisioning representative data to third parties, because of the strict governance controls around Nationwide's raw data assets. Clearing this process can take six months or more, which stifles the pace of innovation.
Hazy successfully delivered a safe synthetic transaction data generator to Nationwide Building Society that cut the data provisioning process down to three days.
Hazy synthetic data preserves real customer behaviour without exposing any customer information or transactions, making the synthetic data 100 percent safe and fully representative of the real data. This means it can be used as a drop-in replacement for real data in existing workflows, allowing Nationwide to unlock third-party collaborations and increase the pace of innovation.
The banking industry is rapidly changing and the need to innovate has never been more critical to remain competitive. Transformation is being driven by factors such as new entrants, ease of switching, and digital banking services. In addition, technological developments in other sectors are increasing expectations from customers about how they manage their money. However, the way banks and building societies are innovating is also changing: in-house innovation teams are increasingly working with third parties to innovate faster by capitalising on their specialised expertise and applications. This approach offers huge potential for banks and building societies to tap into additional capability faster and more cost-effectively but faces a significant challenge: the ability to share data.
Sharing data for analysis is an operational requirement for financial institutions, enabling them to gain insights that directly support business imperatives including innovation, fraud detection and credit risk. Such insights enable in-house teams and third parties to build, shape and deliver propositions derived from understanding customer behaviours based on their transactions.
Data sharing is quite rightly subject to strict governance, security, regulation and legislation such as:
- European Union General Data Protection Regulation (GDPR)
- Payment Card Industry Data Security Standard (PCI-DSS)
- California Privacy Rights Act of 2020 (CPRA)
- UK Data Protection Act (DPA)
In some cases, these regulations make it impossible to share data across regional borders or between organisations, even where doing so would allow for greater insights.
Synthetic data is a new paradigm for sharing information safely and responsibly for innovation in financial services. Hazy synthetic data uses artificial intelligence and machine learning to create synthetic data from securely held customer data, which does not leave its protected environment. Hazy’s software extracts the statistical information and relationships within the data, but the output contains none of the original data and so cannot be traced back in any way to the source, meaning that customers are 100 percent protected. Synthetic data can therefore be used by internal teams and third parties freely and safely to analyse and validate commercial innovations quickly.
Nationwide Building Society and Hazy have worked together to address these challenges head-on and removed three major barriers to sharing transactions safely and faster with third party partners:
- Generating synthetic data that preserves the statistical characteristics of the original data sufficiently for behavioural analysis of current account transactions
- Substantially reducing the time and cost of creating safe data from months to days
- Sharing synthetic data via the cloud without risk
This is the first time that synthetic data has been proven to preserve the time-sensitive nuances of customer banking transactions that can be shared safely with external parties in a production environment.
It is also a transformational play for Nationwide: it proves that synthetic data is sufficiently representative of real data to increase their speed to innovation, and it sets a benchmark for driving data agility and eliminating security concerns around sharing data.
Companies face multiple challenges to sharing data. Key amongst these is the ability to transfer the patterns of consumer behaviour in the data needed to feed the analytics they want to run, without the need to transfer the real data. Another important challenge is that Nationwide wants to do this without forcing their analytical partners to ingest an entirely new type of data structure. In other words, they want a drop-in replacement for the real data that has the same schema and properties.
Techniques such as data masking and data anonymisation, which have until now been used to protect the privacy of customers’ data, have known weaknesses including:
- Not preserving key statistical relationships in the original data
- Being a slow and resource-intensive process
- The risk of being reverse engineered to reveal original data (e.g. through linkage attacks)
A further challenge is that the time it takes to create masked data depends on complexity and size. There are also limits to the quality and utility of the output compared to production data. This is an industry challenge as the whole process can take six months or more which significantly reduces the capacity to innovate and collaborate with third parties.
Synthetic data addresses these challenges, enabling Nationwide to obtain representative, reusable customer transaction data that contains no personally identifiable information, which can be shared with third parties for validation of their capabilities and innovation exercises. Such capability enables Nationwide to generate a proper assessment of third party technologies without exposing the Building Society to risk or requiring a lengthy governance process to obtain data.
Hazy synthetic data is sufficiently representative of real transaction data and preserves the signal — the statistical properties of the original data — and can be used as a drop-in replacement for real data.
Hazy synthetic data preserves the statistical properties and patterns of consumer transactional behaviour without any of the privacy concerns.
This signal is required to analyse how customers manage their money which, in most cases, is very similar: making sure bills are paid and their account stays in credit each month. However, transaction behaviours that fall outside of the norm may indicate fraud or pivotal events such as unemployment. Identifying these behaviours requires high statistical fidelity within that signal and depends critically on the time at which transactions occur. These are known as signatures — characteristics of behaviour that lead to specific outcomes.
Hazy synthetic data generation software was trained on a dataset of 30 million customer transactions from an 18-month period, representing more than 8,800 customers with nearly 10,000 accounts between them. Once trained, the resulting synthetic data model can be used to generate synthetic datasets of arbitrary size on demand.
Within this dataset there is a rich variety of characteristics and patterns of behaviour that need to be learned in order to produce a synthetic dataset that is fit for purpose. Here are a few key examples for illustration:
- Different types of customer with varying numbers of accounts
- Different types of account, i.e. credit card, savings and current accounts exhibiting different behaviours
- Transactions with a wide range of values, ranging from buying a coffee to paying for a family holiday
- Transactions with a wide variety of merchants and other recipients, such as convenience stores or local government authorities (for council tax payments)
- Sequential behaviour: transactions which recur every month such as rent or salary, transactions which follow each other in quick succession, etc.
To verify that the synthetic data successfully preserved these characteristics, a battery of metrics from the Hazy toolkit was used:
- Probability distributions over key attributes, such as transaction amounts, running balance and initial and final balance
- Codependency between pairs of attributes
- Quality of classification tasks, such as classifying behavioural patterns into life events
- Importance given to various features when using machine learning to predict attributes such as merchant category from other attributes
- Autocorrelation of transaction time series: how an entire time series correlates with itself as it is progressively shifted in time
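To make two of these metrics concrete, here is a minimal sketch in Python with NumPy. It is not the Hazy toolkit and does not use Nationwide data: the function names, the histogram-overlap score and the toy "daily spend" series with a monthly spike are all assumptions chosen for illustration. It shows one simple way to compare the distribution of a single attribute and one way to compute the autocorrelation of a time series, so that a real and a synthetic series can be compared lag by lag.

```python
import numpy as np


def histogram_overlap(real, synth, bins=20):
    """Overlap of two normalised histograms on shared bins.

    1.0 means the binned distributions are identical, 0.0 means disjoint.
    (Illustrative similarity score; assumes the values are not all equal.)
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())


def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (assumes a non-constant series)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


def toy_daily_spend(rng, n_days=360):
    """Hypothetical daily spend: small random purchases plus a payment every 30 days."""
    days = np.arange(n_days)
    spend = rng.gamma(shape=2.0, scale=15.0, size=n_days)
    spend[days % 30 == 0] += 800.0  # stand-in for a recurring rent payment
    return spend


rng = np.random.default_rng(0)
real = toy_daily_spend(rng)   # stands in for a real account history
synth = toy_daily_spend(rng)  # stands in for a synthetic account history

acf_real = autocorrelation(real, max_lag=35)
acf_synth = autocorrelation(synth, max_lag=35)

print("histogram overlap:", round(histogram_overlap(real, synth), 3))
print("lag-30 autocorrelation (real, synth):",
      round(acf_real[30], 3), round(acf_synth[30], 3))
```

In this toy setup both series show a strong autocorrelation peak at a lag of 30 days, which is the kind of recurring monthly pattern a sequential synthetic data generator would need to reproduce. A real evaluation would run such checks per account and per attribute rather than on a single series.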
Having set out the performance metrics for success, the Hazy team fine-tuned the synthetic data generator model and used Hazy’s interactive visualisation tool to enable direct A/B comparison of real and synthetic data.
The results were outstanding and indicate, we believe for the first time, a new level of signal preservation in sequential synthetic data. The graph below shows an example of a realistic synthetic account history.
The second proof point was demonstrating how the time to create securely share-able data could be reduced from months to days. A detailed business analysis was undertaken comparing the current process of data masking to the new process of synthetic data generation, with a focus on reducing time. Then, Nationwide went further to evaluate the complete end-to-end process from identifying a use case and onboarding of third parties through to the third proof point of sharing synthetic data with them for analysis. This puts synthetic data at the heart of the workflow as shown below.
The headline benefit from our collaboration is that Nationwide can now innovate faster. Using Hazy synthetic data enables Nationwide to preserve the behavioural and temporal characteristics of production data and to rapidly create and provision representative synthetic data sets for third parties to analyse. The full set of measurable benefits that using synthetic data has created includes:
- Reducing the time to create and share safe data from months to days
- Increasing the throughput of innovation projects per year
- Reducing the people time required to prepare data
- Reducing Nationwide’s risk of data leakage
- Making the process of sharing data with external parties faster, safer and trackable
- Building Hazy synthetic data generation into the end-to-end workflow of third party onboarding through the standardisation of contractual processes and governance
Combined, these represent a significant step-change in delivery of value to Nationwide in terms of speed, security and costs.
Nationwide and Hazy have solved the historic challenge of creating safe, representative, compliant data to share with third parties so that, through analysis, they can generate meaningful actionable insights into account behaviours which directly enhances the speed of innovation.
Rapid innovation is a vital driver to remain competitive in an evolving landscape that is being transformed by unforeseen changes to the economy, aggressive new entrants, new regulations such as Open Banking, and challenger FinTech platform strategies. The Hazy synthetic data solution has shown it can meet these challenges head on.
This post was originally written by Dr. Alexander Mikhalev and published as “Hazy: Synthetic Data to fuel Rapid Innovation” on the Nationwide Technology Blog.
For the technical background to this project, please read our Head of Data Science Armando Vieira's research “Generating Synthetic Sequential Data using GANs” published in Towards Artificial Intelligence.