You may have never thought about the data sets for training and testing AI, but you should.

Software runs the world.

The coming generation of software will include machine learning, so lawyers and businesses should educate themselves on the aspects of machine learning most likely to create legal, regulatory, or contractual risk.

Machine learning – one of the most common design and programming methods classified as artificial intelligence – feeds data sets into development tools to produce a computer program that can make distinctions and predictions based on what it learned from analyzing the data in those sets. These data sets must be considered in operational planning, business deals, and reputation management.
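To make the idea concrete, here is a minimal sketch of that learn-then-predict loop: a toy nearest-centroid classifier in plain Python. The data, labels, and feature values are invented for illustration; real machine learning tools are far more sophisticated, but the shape is the same, a program derives its behavior from a labeled data set rather than from hand-written rules.

```python
# Toy "machine learning": learn an average point (centroid) per label
# from labeled examples, then predict by finding the nearest centroid.
# All numbers and labels below are invented for illustration.

def train(examples):
    """Learn one centroid per label from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in examples:
        totals = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            totals[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in totals]
            for label, totals in sums.items()}

def predict(centroids, features):
    """Assign the label whose learned centroid is closest to the input."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Training set: (height_cm, weight_kg) pairs with species labels.
training_set = [
    ([25, 4], "cat"), ([23, 5], "cat"),
    ([60, 30], "dog"), ([55, 25], "dog"),
]
model = train(training_set)
print(predict(model, [24, 4]))   # prints "cat"
print(predict(model, [58, 28]))  # prints "dog"
```

Note that everything the "model" knows comes from the training set: if the examples were mislabeled or skewed, the predictions would be too, which is exactly why the composition of these data sets matters.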

The world outside of AI development teams knows very little about training or testing data sets – what they are made of, how they are created, what is chosen to stay in, and what is chosen to stay out. It seems to most people that the magic of AI/machine learning comes from the tools that learn from those sets, not the training data sets themselves.

But appropriate training data sets are crucial to creating correctly functioning AI. One of the problems with such data sets is that they need to be enormous for machines to learn from them. In other words, you or I probably couldn’t stitch together a couple of local databases and make an effective training set. Creating one involves significant work, so it would be better if you could simply use a data set that academics had already pulled together.

The recent removal of an important dataset created and managed by MIT demonstrates a pitfall of enormous collections of information. Last week the school removed from circulation the AI training library called 80 Million Tiny Images, created in 2008 to help produce advanced object detection for machine learning. It is an enormous collection of pictures with descriptive labels. The database can be fed into neural networks, teaching them to associate patterns in the pictures with the descriptions.

According to the Register, “The dataset holds more than 79,300,000 images, scraped from Google Images, arranged in 75,000-odd categories. A smaller version, with 2.2 million images, could be searched and perused online from the website of MIT’s Computer Science and Artificial Intelligence Lab. This visualization, along with the full downloadable database, were removed on Monday” from MIT’s public servers. “The key problem is that the dataset includes, for example, pictures of Black people and monkeys labeled with the N-word; women in bikinis, or holding their children, labeled whores; parts of the anatomy labeled with crude terms, and so on – needlessly linking everyday imagery to slurs and offensive language, and baking prejudice and bias into future AI models.”

MIT scientists admitted that they automatically obtained the images and descriptions in the Tiny Images database from the internet without checking whether offensive pictures or language were pulled into the library. So anyone who has used this set to train a visual AI program has fed a bias problem into the machine. Apparently the entire library was constructed by using code to search the web for images related to a huge list of words, which included derogatory terms.
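The missing safeguard is simple to describe: screen the word list before any scraping happens. The sketch below shows one hedged version of that vetting step, using invented placeholder terms in place of real slurs; a genuine curation pass would rely on vetted term lists and human review, not a two-entry set.

```python
# A sketch of the vetting step the Tiny Images pipeline skipped:
# screening candidate search terms against a blocklist BEFORE images
# are collected. Blocklist entries here are invented placeholders.

BLOCKLIST = {"offensive_term_1", "offensive_term_2"}  # placeholder entries

def vet_labels(labels, blocklist=BLOCKLIST):
    """Split candidate search terms into approved and rejected lists."""
    approved, rejected = [], []
    for label in labels:
        (rejected if label.lower() in blocklist else approved).append(label)
    return approved, rejected

candidates = ["terrier", "offensive_term_1", "bicycle"]
approved, rejected = vet_labels(candidates)
print(approved)  # ['terrier', 'bicycle']
print(rejected)  # ['offensive_term_1']
```

The point is not that a blocklist solves bias, it does not, but that even this trivial check was absent: the scraping code trusted the word list wholesale, so whatever was in the list ended up in the library.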

Further, this library, like many other huge training datasets, captures pictures without permission of either the photographer or the subjects. Privacy, copyright, and rights of publicity are generally not considered by the engineers or academics pulling images, voices, or text to train AI. Such ignorance, intentional or otherwise, can lead to later problems with the programs trained on these datasets.

But many resources can point AI developers to free datasets that may have been started with more careful considerations. Governments provide access through the UK Data Service and the US National Center for Education Statistics, plus a comprehensive visualization of US public data at Data USA. Image datasets serving the same function as the Tiny Images database include Google's Open Images, ImageNet, which is often used to benchmark new visual classification algorithms, and even the Stanford Dogs Dataset, which contains more than 20,000 images of 120 dog breeds.

Online resources provide lists of datasets for every function from building autonomous vehicles to natural language processing, from finance and economics to sentiment analysis for chatbots. There is even a Wikipedia page listing datasets for machine-learning research, which includes curated repositories of datasets.

Of course, if you are not afraid of a little work, you could build your own datasets for your machine learning project. You will need both a training data set and a test data set to measure how well the algorithm learned what you tried to teach it. You will also need a strategy for preprocessing the data to ensure that it is clean, free of bias, and in the right format. There are sites that can walk you through the whole process.
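Those two steps, cleaning and splitting, can be sketched in a few lines of plain Python. The field names and the 80/20 split ratio below are assumptions for illustration; the hashing trick simply keeps the train/test assignment stable across runs, so a record never silently migrates from the test set into the training set.

```python
# A minimal sketch of preparing your own data: basic cleaning
# (drop records with missing labels, drop exact duplicates) followed
# by a deterministic train/test split. Field names and the 80/20
# ratio are illustrative assumptions.
import hashlib

def clean(records):
    """Drop records with missing labels and exact duplicates."""
    seen, cleaned = set(), []
    for rec in records:
        key = (rec.get("text"), rec.get("label"))
        if rec.get("label") is None or key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

def split(records, test_fraction=0.2):
    """Hash each record's text so the split is stable across runs."""
    train, test = [], []
    for rec in records:
        h = int(hashlib.md5(rec["text"].encode()).hexdigest(), 16)
        (test if (h % 100) < test_fraction * 100 else train).append(rec)
    return train, test

data = clean([
    {"text": "good service", "label": "positive"},
    {"text": "good service", "label": "positive"},   # duplicate: dropped
    {"text": "terrible",     "label": "negative"},
    {"text": "no label",     "label": None},         # missing label: dropped
])
train_set, test_set = split(data)
print(len(data), len(train_set), len(test_set))
```

Even a toy pipeline like this forces the questions that matter legally: what was removed, by what rule, and can you reproduce the split later if the model's behavior is challenged.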

Training datasets and testing datasets are of crucial importance in building the next generation of useful narrow AI for all purposes. For lawyers advising in this space, bring your knowledge of AI training and testing to bear on vendor agreements and business contracts so that your clients are protected from surprises in AI projects. We must learn the important business and technical risks associated with datasets to properly advise clients.