The state of synthetic data
Like every new technology, synthetic data has benefited and suffered from hype over the last few years. Having founded our business six years ago, we’ve experienced this first hand.
I sometimes get asked by prospects and customers who’ve used the technology: “What does it take to get synthetic data fully implemented across the business?” That question comes from understanding both how slow and risky data provisioning is in the enterprise and the potential synthetic data has to solve that problem.
Synthetic data use cases span widely, including simulation, adaptation, augmentation and fairness, yet none of these applications is as developed as the privacy use case. Privacy-enhancing synthetic data for the tabular data domain is now seen as a production-ready solution and has been proven to work very well on distinct use cases. However, as with any new technology, a number of barriers remain to be overcome before the products built with it will truly scale.
The core technology challenges are largely solved; the early adopters have proven viability and are now willing to act as reference customers. The next set of challenges is about proving the product can scale in an enterprise-wide setting, and that is going to take a combined effort from technical, security, privacy, legal and management teams inside businesses, as well as regulators outside them, to get this technology to the next stage.
The following are my own opinions about the state of the synthetic data market, the state of the technology, how large enterprises are using it today and the challenges they face in adopting it more widely. Some of these opinions are backed up and shared by analysts and other observers of the sector; some are more anecdotal, drawn from our own experience working with pioneering enterprise customers at Hazy.
State of the technology
The benefits and limitations of the core algorithms used to generate synthetic data, from GANs to variational autoencoders, have been well studied and well understood for some time now. The same is true of the quality synthetic data can achieve, the privacy it can provide, and the metrics used to measure both, where there has been some consolidation between providers and research bodies around metrics standards. Further research will only provide incremental improvements; the current technology is already good enough to provide significant value for customers if packaged and delivered correctly.
The technological standard for any synthetic data product worth its salt today is to use generative AI techniques to produce high-quality, statistically representative and private synthetic data. The trade-off between privacy and utility should be manageable via a tunable dial in the product, backed by a differential privacy mechanism. If it’s not using an explicit privacy mechanism (e.g., differential privacy), the synthetic data is not private. Period.
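To make that tunable dial concrete, here is a minimal sketch of the classic Laplace mechanism applied to a single statistic. This is an illustration of the principle, not Hazy’s implementation: real platforms spend a privacy budget inside model training rather than on one query, and the dataset and function names here are invented for the example. The epsilon parameter is the dial: lower means stronger privacy and noisier output.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism.

    epsilon is the privacy dial: smaller epsilon adds more noise
    (stronger privacy, lower utility); larger epsilon adds less.
    """
    values = np.clip(values, lower, upper)       # bound any one record's influence
    sensitivity = (upper - lower) / len(values)  # max change from altering one record
    return values.mean() + np.random.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
salaries = rng.normal(50_000, 10_000, size=10_000)  # stand-in "sensitive" column

for eps in (0.1, 1.0, 10.0):  # turning the dial from more private to more useful
    print(f"epsilon={eps:>4}: dp mean = {dp_mean(salaries, 0, 150_000, eps):,.2f}")
```

Run it a few times: at epsilon = 0.1 the reported mean jumps around noticeably, while at 10 it sits close to the true value. That is the privacy-utility trade-off made visible.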
A mature synthetic data platform should be able to connect to large, complex, structured databases, from single tables to multi-table structures at the scale of hundreds, if not thousands, of tables, and automatically detect and maintain referential integrity while doing so. It should be able to train a generative model efficiently on often limited (likely CPU-only) on-premise enterprise infrastructure, because most enterprises that need this product will not share their data externally (that’s the problem synthetic data is here to solve).
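As a hedged sketch of what “maintaining referential integrity” means in practice, the example below uses the by-construction approach: generate fresh surrogate keys for the parent table and sample child foreign keys only from those generated keys, so no child row can dangle. The table and column names are invented, and random draws stand in for a trained generative model to keep it short.

```python
import numpy as np
import pandas as pd

def synthesize_with_integrity(n_customers, n_orders, seed=0):
    """Toy parent/child synthesis where referential integrity holds by construction."""
    rng = np.random.default_rng(seed)

    # Parent table: fresh surrogate keys, never real customer IDs.
    customers = pd.DataFrame({
        "customer_id": np.arange(n_customers),
        "segment": rng.choice(["retail", "sme", "corporate"], n_customers),
    })

    # Child table: foreign keys sampled only from the generated parent keys.
    orders = pd.DataFrame({
        "order_id": np.arange(n_orders),
        "customer_id": rng.choice(customers["customer_id"].to_numpy(), n_orders),
        "amount": rng.gamma(2.0, 150.0, n_orders).round(2),
    })
    return customers, orders

customers, orders = synthesize_with_integrity(1_000, 5_000)
assert orders["customer_id"].isin(customers["customer_id"]).all()  # integrity holds
```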
The platform should also be deployable in cloud environments and ideally accommodate a modular, hybrid setup, allowing seamless cloud migration and air-gapped separation of training and generation processes so that no real data ever needs to leave a secure production environment. A container-based deployment approach should be available to allow cost-efficient spikes in compute usage from training and generation runs.
Lastly, it must be usable by non-technical users, not only to download and consume synthetic data but also to connect, configure and train synthetic data generators against sensitive raw datasets in a variety of common formats. This is a particularly thorny UI/UX problem to solve when connecting and configuring thousands of tables, and it should be solved with great UI/UX and lots of automation, including automatic detection of primary/foreign key relationships, business logic and data types.
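To give a flavour of that automation, here is a toy heuristic for proposing key relationships, assuming in-memory pandas tables (an assumption made for illustration): unique, non-null columns are primary key candidates, and a column whose values are contained in another table’s candidate key is flagged as a possible foreign key. A production platform would layer on richer signals such as column names, types and cardinality.

```python
import pandas as pd

def propose_keys(tables: dict[str, pd.DataFrame]):
    """Heuristic key discovery: uniqueness suggests primary keys,
    inclusion dependencies suggest foreign keys."""
    pk_candidates = {
        (name, col)
        for name, df in tables.items()
        for col in df.columns
        if df[col].notna().all() and df[col].is_unique
    }
    fk_candidates = [
        (name, col, pk_table, pk_col)
        for name, df in tables.items()
        for col in df.columns
        for pk_table, pk_col in pk_candidates
        if pk_table != name
        and df[col].dropna().isin(tables[pk_table][pk_col]).all()
    ]
    return pk_candidates, fk_candidates

tables = {
    "customers": pd.DataFrame({"customer_id": [1, 2, 3]}),
    "orders": pd.DataFrame({"order_id": [10, 11], "customer_id": [1, 3]}),
}
print(propose_keys(tables))  # proposes orders.customer_id -> customers.customer_id
```

Heuristics like this over-propose on small samples (any unique column looks like a key), which is exactly why the UI needs to let users confirm or correct what the automation suggests.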
Those are just the baseline features; anything short of that and the platform is not really technically ready to scale.
How and where synthetic data is being used today
The main applications for synthetic data lie in industries that:
- Are heavily regulated
- Have businesses built on trust (so regardless of regulation will not risk leaking sensitive data)
- Have valuable data sources that contain large amounts of sensitive personal/confidential information
- Have complex legacy tech infrastructure, often through historical acquisitions
- Have the know-how, buy-in and budget to extract the value from those data sources through mature analytics/development functions.
- Have an existential pressure to leverage those data assets and innovate (think banking, where the presence of challenger startups forces incumbents to improve their services quickly)
The top three sectors are therefore finance, healthcare and telco.
The use cases for synthetic data in large enterprises are also well understood.
- Testing
- Analytics
- Data science
- Collaborating with or testing third parties
- Monetisation
Today we see a huge appetite for testing and software development, as this is a bread-and-butter use case in every large organisation. All the other use cases are growing fast.
Remaining challenges
So if the technology is there and the use cases are understood, what are the remaining challenges? Synthetic data has been well proven on individual use cases or within individual departments but has yet to be truly scaled across an enterprise. At Hazy we’re working with customers that are going on that journey, but still have a couple of key steps to go. We think the remaining challenges now lie in regulation, vision and execution.
Regulation
Many legal teams internally at our customers and externally at independent law firms agree synthetic data generated using differential privacy falls outside the scope of GDPR.
Many of the businesses we speak to wish the regulators had a clear position on the technology.
Many data regulators have been researching this technology themselves or with independent third-party experts such as the ATI, and so understand it well.
Our advice to regulators is clear: the technology works, the companies you regulate are already using it, and it produces private synthetic data (so long as differential privacy is used in the process) that can be leveraged to benefit society and end consumers. So, now is the time for data and industry-specific regulators to seize the initiative. This is a real-life example of how regulators can quickly and easily provide guidance to unleash innovation. The first-mover regulators on this technology will look the smartest. A good example of this is the FCA in the UK.
Vision
A word of warning, and we’ve learnt this the hard way: don’t start implementing synthetic data unless you have the vision for it to be an enterprise-wide, systemic solution that sits as a layer between your data lake and your users.
Synthetic data ultimately needs to become a core part of your non-production data provisioning strategy. You will prove value with smaller contained deployments, but the real value to your business is in full enterprise-wide adoption.
It can be scary to be one of the pioneers in your industry advocating a new technology, but with the right roadmap and well-thought-out milestones we’ve proven it’s possible for an enterprise to go on that journey, taking small steps in a cost-efficient manner. We’ve built the industry’s first Synthetic Data Maturity Model that clearly outlines these steps based on our first-hand experience. There are now large organisations that are well on their way through this model and can see the clear steps to achieving enterprise-wide adoption.
Execution
Delivering synthetic data in a large organisation means connecting to sensitive source datasets, and therefore often battling through the very data provisioning problem you’re trying to solve. At the core of all true synthetic data products are generative AI algorithms that need configuring. Both these factors mean deploying synthetic data on complex enterprise datasets can be time-consuming, with a steep learning curve for the team involved. However, these types of execution challenges are not unique to synthetic data, and the good news is that most enterprises are good at solving them. Some gotchas we’ve seen include:
- Delivery resource commitment - getting the right people in the organisation committed to delivering the first proof point to build momentum. Those people tend to include CDOs and Enterprise Architects.
- Sponsorship - as synthetic data is a horizontal, enterprise-wide solution, we’ve found engaging CDOs, CTOs, Chief Enterprise Architects and CISOs early in the process to build a long-term strategic plan is critical to getting the buy-in you need to eventually realise maximum value from synthetic data.
- Defined roadmap - plan for the first proof point to be a valuable, production-grade use case, and have a clear, agreed plan to roll out beyond it as soon as possible.
Conclusion
Mature synthetic data solutions are technically ready to scale within enterprises. We’re seeing a clear groundswell of support from enterprises and regulators committing to synthetic data as a viable privacy-enhancing technology that delivers value across a range of well-understood applications.
There are some outstanding challenges to overcome, but that is happening right now. At Hazy we’re lucky enough to be right at the leading edge of solving the hardest of these challenges. Some of my proudest moments at Hazy have been listening to customers rave about the quality and experience of our customer success team who are helping them on this scaling journey.
If you’d like to learn more about how we’re doing this, get in touch.