How synthetic data bridges data science and traditional engineering

By Armando Vieira on 16 Mar 2020

Traditional software engineering and machine learning (or data science) development are two very different disciplines. In the modern enterprise, the two work in parallel. To achieve shared business objectives, IT and data science have to better understand each other and how each handles data during development and in production. Synthetic data is one way to let both move fast and safely to meet customer needs.

What distinguishes the two? The traditional programming approach can be framed as relatively formal and well-defined — you hypothesise how to solve a certain problem and the software development team builds that requested functionality. Based on the data and the program, the software has to perform the analysis exactly the way specified. The (human) programmer knows how to solve the problem, codes the rules, and the machine delivers the answer.

Machine learning is a horse of a totally different colour. This process usually starts with the data and lets the machine work out the best solution. You just specify the goals (in machine learning terms, the loss functions) and, where needed, the boundaries of which data can be used. That's all.
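To make the contrast concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the toy task (flagging a transaction as "large"), the hard-coded threshold of 1000, the made-up labelled data and the function names are all assumptions for illustration. The first function is the traditional approach, where a human writes the rule. The rest is the machine learning approach in miniature, where we only supply data and a loss function and let the machine pick the rule.

```python
# Hypothetical toy task: decide whether a transaction amount is "large".

def is_large_rule(amount: float) -> bool:
    """Traditional approach: the programmer hard-codes the rule."""
    return amount > 1000.0  # threshold chosen by a human (assumed value)


def zero_one_loss(threshold: float, data: list) -> float:
    """The goal: fraction of labelled examples the threshold gets wrong."""
    errors = sum((amount > threshold) != label for amount, label in data)
    return errors / len(data)


def learn_threshold(data: list) -> float:
    """Machine learning approach: try each observed amount as a candidate
    threshold and keep the one that minimises the loss on the data."""
    candidates = sorted(amount for amount, _ in data)
    return min(candidates, key=lambda t: zero_one_loss(t, data))


# Made-up labelled examples: (amount, is_large)
data = [
    (120.0, False), (300.0, False), (450.0, False),
    (700.0, True), (900.0, True), (1500.0, True),
]

learned = learn_threshold(data)
print("learned threshold:", learned)
print("loss of learned rule:", zero_one_loss(learned, data))
print("loss of hard-coded rule:", zero_one_loss(1000.0, data))
```

Note that the learned rule depends entirely on the data it sees. Change the examples and the threshold moves with them, which is exactly why machine learning development cannot be cut off from the data.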

Sadly for humans, almost all the intelligence in artificial intelligence and machine learning comes from letting algorithms learn automatically from data, not from the lines hard-coded by software engineers. Because the data and the loss function are all that is required, machine learning development is not, and can't be, separated from the data environment.

The explorative and scientific nature of machine learning development means that the setup available to data scientists has to be extremely flexible and conceptually different from that of software programming. Data must be easy to access. Nobody walks into the process with predefined notions of exactly which machine learning technique to use because, at the start, every technique needs exploration. The best solution may become clear only after that process has finished.

So, what do these distinct approaches mean in terms of development and production environments? Traditional programming lends itself to a much stricter rule-based process that follows predefined design principles. The starting point for machine learning development, on the other hand, is much more explorative and open-ended, which may have a significant impact on how the development environment needs to be set up.

Some organisations tend to downplay how much the development environment setup affects data science productivity. Big mistake. You simply cannot apply the existing infrastructure setup, which was designed for traditional software development environments, to a data science environment.

Traditional software design is much more restrictive in terms of programming languages and principles. Machine learning gets a lot messier. It can involve several languages, different modules for pre-processing and post-processing, and the constant updating of different libraries and package versions. Machine learning evolves so fast that almost any day can bring a new library version that trains your neural network more efficiently. This is especially true if you work in natural language processing (NLP).

A machine learning development workflow starts and ends with the data, because it's the data that trains the model.

Because machine learning is explorative and constantly changing, it cannot be done in isolation from real data, and data that was extracted months ago no longer counts as real. Machine learning has to be done on the fly on real data.

For this to work, data scientists and data engineers need to have constant access to data, which means that they need a stable data pipeline.

However, in many organisations, such as banks, access to data almost always requires a long procurement process that can take months or even years. By then, the data is stale and utterly useless.

This is where synthetic data comes in. Hazy helps provide a virtualised environment for data scientists to play with a synthetic version of the real data that retains almost all the relevant features of the original without compromising privacy. Hazy enables fast development and deployment of complex machine learning models, with our gold-standard mix of:

cross dataset referential integrity
differential privacy
large multi-table database support
air-gapped on-premise deployment
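To give a feel for what "a synthetic version of the real data" means, here is a deliberately naive sketch in Python. It is not how Hazy generates data, and it offers none of the guarantees listed above (no differential privacy, no referential integrity). It simply fits a normal distribution to one hypothetical numeric column, here a made-up list of monthly salaries, and samples new values that share its mean and spread without copying any real record. The column, the values and the function name are all assumptions for illustration.

```python
import random
import statistics


def synthesise_column(real_values, n, seed=0):
    """Naive synthetic data: fit a normal distribution to a numeric column
    and draw n new values from it. Illustration only, with no privacy
    guarantee of any kind."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Made-up "real" column: monthly salaries
real_salaries = [2100.0, 2500.0, 3200.0, 4100.0, 5000.0, 2800.0, 3600.0, 3000.0]

synthetic_salaries = synthesise_column(real_salaries, n=1000)

print("real mean:     ", round(statistics.mean(real_salaries), 1))
print("synthetic mean:", round(statistics.mean(synthetic_salaries), 1))
```

A data scientist could explore and prototype against `synthetic_salaries` straight away, while the real column stays behind the usual access controls. Real-world generators have to go much further, preserving relationships between columns and tables and bounding what can be inferred about any individual.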

Virtualised production environments keep the development and production environments on the same infrastructure setup. This facilitates machine learning productivity when moving between development and production, with faster and more efficient feedback loops. You also gain a cost efficiency benefit when you don't need to duplicate the infrastructure because both are built on the same data pipeline.

When we find a way to break down the silos between software and data engineers, without any privacy compromise, we can work faster, smarter, and at lower risk.
