Case study: Cloud enablement at a Tier 1 bank

By Harry Keen on 16 Mar 2019.

Hazy helped a tier 1 bank that was blocked by governance restrictions run analytics workloads on AWS. This enabled the use of advanced tools and high compute, and reduced the time it took to provision data to the cloud from months to hours.

Problem

Hazy’s customer was blocked from moving a data analytics workload to AWS by data security and data privacy governance restrictions. The customer operates in the heavily regulated financial industry and has governance processes in place to protect its customers’ information, protect the bank’s reputation and meet data compliance regulations. However, the data science team needs statistically representative data to build, test and evaluate modelling approaches and to iterate quickly through new ideas.

Possible solutions

The team considered using raw customer data and going through the live data exception request process. The process outlined by the security team and privacy board was going to take months to complete and was deemed too long to make best use of the team’s capabilities.

The team also considered running pipeline test exercises on dummy data in parallel with the live data exception request process to try to speed things up. However, the dummy data they could make with schema-based tools was not representative enough: it lacked the referential integrity, the dirtiness and the inherent statistical properties of real data.

They also considered using anonymisation techniques but were aware that they would lose too much of the data’s utility and its “look and feel”, so would not be able to use it as a drop-in replacement for real data. The security team was also aware of

Hazy solution

To solve this problem, they used Hazy to provision safe synthetic data that preserved the nuances of the real data: statistical equivalence, referential integrity and sequential transaction information. As the synthetic data contained no real information and was generated using pre-approved minimum differential privacy guidelines, the security team was able to sign off the cloud usage with this data immediately.

The team used Hazy to train a synthetic data generator on real data on premises, moved the generator to their cloud environment using Hazy’s integrated infrastructure and then generated statistically equivalent synthetic data in the cloud.
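
To make the train, move and generate flow concrete, here is a minimal Python sketch. It is not Hazy’s API or method. The function names (train_generator, export_generator, generate), the column names and the sample values are all hypothetical. The toy "generator" is just a per-column mean and standard deviation, so it applies no differential privacy and does not preserve correlations or referential integrity. It only illustrates the shape of the workflow, where a fitted model artefact rather than the raw rows is what crosses into the cloud environment.

```python
import json
import random
import statistics


def train_generator(rows):
    """Fit a toy generator: per-column mean and standard deviation.

    Assumption: every column is numeric. A real generator would model
    joint structure and apply privacy controls; this one does neither.
    """
    columns = list(rows[0].keys())
    return {
        col: {
            "mean": statistics.mean([r[col] for r in rows]),
            "stdev": statistics.stdev([r[col] for r in rows]),
        }
        for col in columns
    }


def export_generator(params):
    """Serialise the fitted parameters so they can be moved elsewhere."""
    return json.dumps(params)


def generate(serialized, n, seed=None):
    """Load the serialised generator and sample n synthetic rows."""
    params = json.loads(serialized)
    rng = random.Random(seed)
    return [
        {col: rng.gauss(p["mean"], p["stdev"]) for col, p in params.items()}
        for _ in range(n)
    ]


# Step 1 (on premises): train on real data. Values here are made up.
real_rows = [
    {"balance": 1200.0, "monthly_spend": 310.0},
    {"balance": 450.0, "monthly_spend": 120.0},
    {"balance": 8300.0, "monthly_spend": 940.0},
    {"balance": 2750.0, "monthly_spend": 505.0},
    {"balance": 975.0, "monthly_spend": 260.0},
]
artefact = export_generator(train_generator(real_rows))

# Step 2: move only the serialised generator to the cloud environment.
# Step 3 (in the cloud): generate synthetic rows from the artefact.
synthetic_rows = generate(artefact, n=1000, seed=0)
```

In this sketch the only thing handed to the cloud step is the artefact string, which holds summary parameters rather than individual records. Summary statistics alone are not a privacy guarantee, which is why the case study relies on pre-approved differential privacy guidelines rather than anything this simple.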

Benefits

This meant the data science team could run modelling exercises using advanced cloud tooling as if they were using the real data. Provisioning data to the cloud took 1 hour, compared with the months the live data exception request process would have taken, saving our customer significant resources.

