AI is hungry for data.

Training and testing machine-learning tools to perform desired tasks consumes huge lakes of data. More data often means better AI.

Yet gathering this data, especially data concerning people’s behavior and transactions, can be risky. For example, in January of this year, the US FTC reached a consent order with a company called Everalbum, a developer of photography apps. The FTC accused Everalbum of deception and unfairness in collecting and retaining facial recognition data to be used as AI training databases. The FTC forced Everalbum not only to delete the pictures in the database but also to cease using the AI program trained on that database. As a result, the entire investment in that AI was rendered useless because the data used to train it was suspect.

Examples abound of privacy infringement in collecting AI training data. Venturebeat writes, “The Royal Free London NHS Foundation Trust, a division of the U.K’s National Health Service based in London, provided Alphabet’s DeepMind with data on 1.6 million patients without their consent. Google — whose health data-sharing partnership with Ascension became the subject of scrutiny in November — abandoned plans to publish scans of chest X-rays over concerns that they contained personally identifiable information. This past summer, Microsoft quietly removed a data set (MS Celeb) with more than 10 million images of people after it was revealed that some weren’t aware they had been included.”

So how do you feed the AI beast on datasets about personal outcomes and transactions without jeopardizing the privacy of the data subjects?  Easy.  Make up the data. The AI may need to learn using transactional information, but the training data doesn’t need to be from real transactions. Fake transactions could work just as well. Gartner recently predicted that within a decade most of the data used in training AI will be artificially generated.

You have seen simulated data in action if you have ever used a flight simulator. Algorithms can create data sets that mimic data gathered in the real world. According to the Nvidia blog, “Donald B. Rubin, a Harvard statistics professor, was helping branches of the U.S. government sort out issues such as an undercount especially of poor people in a census when he hit upon an idea. He described it in a 1993 paper often cited as the birth of synthetic data.” But the rise of AI has accelerated the development of synthetic data.
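To make the idea of mimicking measured data concrete, here is a minimal sketch in Python. It assumes the measured values are roughly bell-shaped, so that a mean and a standard deviation describe them well enough. The function name `fit_and_sample` and the purchase amounts are invented for illustration and do not come from Rubin’s paper or from any vendor’s tooling.

```python
import random
import statistics


def fit_and_sample(measured, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to `measured`.

    Assumption: a normal distribution is a reasonable stand-in for the
    measured data. Real generators use far richer models.
    """
    rng = random.Random(seed)             # seeded so the output is reproducible
    mu = statistics.mean(measured)        # center of the measured data
    sigma = statistics.stdev(measured)    # spread of the measured data
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "measured" purchase amounts, invented for this example.
measured_amounts = [12.5, 40.0, 23.75, 18.2, 31.1, 27.4, 22.0, 35.6]

# Ten thousand synthetic amounts that follow the same overall pattern
# but correspond to no real purchase.
synthetic_amounts = fit_and_sample(measured_amounts, n=10_000)
```

The synthetic values share the average and spread of the measured ones, yet none of them is tied to an actual person or transaction. Note that a model fitted this closely to a very small measured set can still leak information about it, which is one reason production systems are considerably more careful than this sketch.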

The Nvidia blog also observes that generating synthetic data can be much less expensive than purchasing similar captured data, noting “Because synthetic datasets are automatically labeled and can deliberately include rare but crucial corner cases, it’s sometimes better than real-world data.” Manual labeling of unstructured data is time-consuming and expensive. Synthetic data can be labeled as it is created, saving significant resources. Edge cases may not appear in any measured, real-world data set, but they can be built into synthetic data sets. Well-designed algorithms for creating synthetic data sets can keep on generating data, and the data sets themselves can be re-used many times for AI training and testing.
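Both points, labels that come for free and edge cases added on purpose, can be shown in a short Python sketch. Everything in it is an assumption made for illustration: the function name `make_transactions`, the field names, the 2 percent fraud rate, the amount ranges, and the labels given to the two corner cases.

```python
import random


def make_transactions(n, fraud_rate=0.02, seed=0):
    """Generate n fake transactions, each labeled at the moment it is created."""
    rng = random.Random(seed)  # seeded so the same data set can be regenerated
    rows = []
    for i in range(n):
        # The generator decides the label first, so no human labeling is needed.
        is_fraud = rng.random() < fraud_rate
        if is_fraud:
            amount = round(rng.uniform(900.0, 5000.0), 2)    # assumed fraud range
        else:
            amount = round(rng.lognormvariate(3.0, 0.8), 2)  # assumed typical spend
        rows.append({
            "id": i,
            "amount": amount,
            "hour": rng.randrange(24),
            "label": "fraud" if is_fraud else "ok",
        })

    # Corner cases that might never show up in measured data, added on purpose.
    # Their labels are illustrative assumptions, not a rule about real fraud.
    rows.append({"id": n, "amount": 0.0, "hour": 3, "label": "ok"})
    rows.append({"id": n + 1, "amount": 999999.99, "hour": 3, "label": "fraud"})
    return rows


training_rows = make_transactions(1000)
```

Because the generator is seeded, calling `make_transactions(1000)` again reproduces the same data set, and changing `n` or `seed` yields as much fresh, pre-labeled data as training and testing require.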

Given privacy concerns with measured data, healthcare is a field where synthetic data may be exceptionally useful for training machine learning systems. To that end, the U.S. Department of Health and Human Services initiated a synthetic health data challenge to advance the Department’s ambitious effort to create a synthetic health data engine. HHS is interested in developing synthetic data not only for AI training, but also to allow researchers to test analyses and systems before gaining access to the measured clinical data, thus speeding completion of effective research projects. The challenge includes money prizes to be awarded by the National Coordinator for Health Information Technology. HHS intends ultimately to model the medical history of synthetic patients. “The resulting data are free from cost and privacy and security restrictions and have the potential to support a variety of academia, research, industry, and government initiatives.”

We do not have space here to cover all of the applications for synthetic data, but one important use is cloud migration: a company can reduce the risk of pushing sensitive and regulated data into a cloud platform by first moving synthetic data there to build out working networks within the cloud. Also, because real-life testing of robots and drones is expensive and slow, synthetic data can allow developers to test robotics in simulations.

As more legal and business accountability is demanded of AI, and as machine learning systems make more decisions that affect us, we should expect to answer questions about data nutrition.  What data was fed into the system to make this AI work? I expect that eventually entities making or using AI will be expected to produce for public inspection the data diet of their products. And unlike people, for whom a natural diet seems to work best, a synthetic diet may be the best thing for an AI.

It’s likely you will be reading more about synthetic data, especially in the context of training and testing databases. We are seeing the early stages of development, but synthetic data promises to become a dominant source of value to companies in the future. AI developers are learning an important truth: you are what you eat.