Artificial Intelligence (AI) relies on oceans of data. Most people know this.

But many people do not yet understand how data shapes AI before the AI is functional, or how data is used by AI in production. Each kind of data raises its own set of practical, technical and social issues. This lack of understanding can lead people to conflate the data used in AI formation with the data AI uses as it operates.

Lawyers, legislators and regulators are no exception to this misunderstanding, and their narrow view of AI data is damaging the AI law and policy debate.

Those proposing to regulate AI must learn to recognize a distinction between data that we allow to train an AI and data that we allow in current AI applications.

AI that utilizes facial recognition technology collects and processes data about a person’s face for the purposes of identification, authentication, categorization, emotion analysis and other groundbreaking applications. A company developing AI that uses facial recognition feeds the system vast libraries of images and, with the help of machine learning, the system learns to distinguish the images to a nuanced degree. The more information the system is fed, the more accurate it becomes over time. As with any type of training, the more you do, the better you become, and the more complex the task, the more training is required to become proficient. During training, the AI uses any single face only to break down the geometry among the face’s features and compare that geometry with the same measurements of other faces. At that stage it does not identify people or otherwise harm faces, people or reputations.
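The geometric comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how any particular product works: the landmark names, the coordinates, and the helper functions `geometry_signature` and `signature_distance` are all hypothetical, and real systems typically rely on learned representations rather than hand-measured distances.

```python
import math
from itertools import combinations

def geometry_signature(landmarks):
    """Turn named (x, y) landmark points into scale-normalized pairwise distances."""
    names = sorted(landmarks)  # fixed ordering so signatures are comparable
    dists = [math.dist(landmarks[a], landmarks[b]) for a, b in combinations(names, 2)]
    scale = max(dists)  # dividing by the largest distance removes image-size effects
    return [d / scale for d in dists]

def signature_distance(sig_a, sig_b):
    """Euclidean distance between two signatures; smaller means more similar geometry."""
    return math.dist(sig_a, sig_b)

# Hypothetical landmark coordinates, invented purely for illustration.
face_a = {"left_eye": (30.0, 40.0), "right_eye": (70.0, 40.0),
          "nose_tip": (50.0, 60.0), "mouth_center": (50.0, 80.0)}
# The same face photographed at twice the resolution.
face_a_zoomed = {name: (2 * x, 2 * y) for name, (x, y) in face_a.items()}
# A different face with different proportions.
face_b = {"left_eye": (28.0, 42.0), "right_eye": (74.0, 42.0),
          "nose_tip": (51.0, 66.0), "mouth_center": (51.0, 90.0)}

same = signature_distance(geometry_signature(face_a), geometry_signature(face_a_zoomed))
different = signature_distance(geometry_signature(face_a), geometry_signature(face_b))
print(same, different)
```

Note that nothing in this sketch attaches a name or identity to a face. It only measures proportions and compares them, which is the point the paragraph above makes about training data.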

The datasets used to train AI are critical in ensuring that systems make fair and accurate predictions. Policymakers have raised concerns about biases demonstrated in the use of facial recognition. In 2018, the American Civil Liberties Union tested Amazon’s facial recognition tool, “Rekognition,” by comparing the faces of members of Congress against a database of arrest photos. While people of color made up 20% of Congress, they accounted for 39% of the members who were falsely identified as having a prior arrest. In testing other facial recognition technologies, the National Institute of Standards and Technology (“NIST”) found that algorithms have a harder time recognizing people with darker skin, which leads to misidentifications like those seen with Rekognition.
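The two percentages above imply a relative error rate, which a short calculation makes explicit. The sketch below uses only the figures quoted in this article (20% and 39%). The function name `relative_false_match_rate` is hypothetical, and the result assumes every member of Congress was tested exactly once.

```python
def relative_false_match_rate(group_share_of_population, group_share_of_errors):
    """How many times more often the group was falsely matched than everyone else."""
    # Error rate for the group, up to a constant factor shared by both groups.
    group_rate = group_share_of_errors / group_share_of_population
    # Error rate for everyone outside the group, up to the same constant.
    other_rate = (1 - group_share_of_errors) / (1 - group_share_of_population)
    return group_rate / other_rate

# 20% of Congress, 39% of the false matches (figures from the text above).
ratio = relative_false_match_rate(0.20, 0.39)
print(f"{ratio:.2f}")  # prints 2.56
```

Under those assumptions, members of color were falsely matched roughly two and a half times as often as other members.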

Fortunately, AI is becoming more accurate at an unbelievable pace. NIST found that from 2014 to 2018 facial recognition algorithms made significant strides, with 28 algorithms producing more accurate results than the most accurate algorithm in 2014. Furthermore, the most accurate algorithm in 2018 saw a 28-fold reduction in false negatives in comparison to its 2014 test.

Although AI is becoming more accurate, there are still disparities in false negatives and false positives across ethnic groups. AI researchers attribute this to the lack of ethnically diverse datasets available to train facial recognition technology. A UCLA project highlighted that “data tends to be generated narrowly, with many factors contributing to a frequent Caucasian-centric homogeneity of data subjects. These include the composition of minority demographics in cities where research occurs, and other socioeconomic factors that may influence the extent to which non-white subjects appear in western datasets that the researchers wish could have a more global applicability.”
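One way to make such disparities concrete is to compute false positive and false negative rates separately for each demographic group. The following is a minimal sketch under stated assumptions: the record layout, the group labels "A" and "B", and the function `error_rates_by_group` are hypothetical, and the sample records are made-up placeholders, not measurements from NIST or any real system.

```python
from collections import defaultdict

def error_rates_by_group(records):
    """records: iterable of (group, is_true_match, predicted_match) tuples."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for group, is_true_match, predicted_match in records:
        c = counts[group]
        if is_true_match:
            c["pos"] += 1
            if not predicted_match:
                c["fn"] += 1  # a genuine match the system missed
        else:
            c["neg"] += 1
            if predicted_match:
                c["fp"] += 1  # a match the system reported in error
    return {
        group: {
            "false_positive_rate": c["fp"] / c["neg"] if c["neg"] else None,
            "false_negative_rate": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for group, c in counts.items()
    }

# Placeholder records invented for illustration only.
toy_records = [
    ("A", True, True), ("A", True, True), ("A", False, False), ("A", False, False),
    ("B", True, False), ("B", True, True), ("B", False, True), ("B", False, False),
]
rates = error_rates_by_group(toy_records)
print(rates)
```

Comparing the per-group rates side by side, rather than a single overall accuracy figure, is what exposes the kind of gap described above.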

The solution to achieving more accurate, fair and race-neutral AI is diverse, representative datasets. Building them is becoming increasingly difficult given how states are regulating biometric information, as shown by the cases arising under the Illinois Biometric Information Privacy Act. In Vance v. Microsoft Corp.,1 the plaintiff alleged that millions of photographs uploaded to Flickr were made publicly available for developing facial recognition algorithms. IBM used these photos to create a dataset called “Diversity in Faces,” which was used by Microsoft to improve “the fairness and accuracy of its facial recognition products” and which “improve[d] the effectiveness of its facial recognition on a diverse array of faces.” Microsoft moved to dismiss the case because the Flickr users whose faces were in the database had allowed public access to their faces, the data subjects weren’t identified in any way, and their faces weren’t used in a manner that could harm the data subjects. The court refused to dismiss Microsoft from the case on the grounds that the Illinois law demanded that Microsoft request and receive permission from each data subject to use their faces in this manner. While the court’s decision is consistent with a plain reading of the Illinois biometric law, punishing the development of more inclusive and accurate AI training datasets harms society for no apparent gain to anyone, including the Illinois resident data subjects.

State laws that paint with a broad brush are likely to inhibit, or at least substantially slow, the creation of diverse datasets that help train AI to produce race-neutral results. State laws could nevertheless draw a distinction between biometric information used to uniquely identify an individual (AI in application) and biometric information used solely for creating robust, ethnically diverse datasets (AI in training).

This distinction would appropriately differentiate between data used to train AI and data used by AI applications. It would require an individual’s consent for AI applications and would allow AI companies to collect publicly available images, from places where an individual would have no reasonable expectation of privacy, for AI in training. Future use of AI is certain. States need to strike a balance that allows individuals to exercise their privacy rights with respect to current applications of AI while allowing AI to train and develop in order to produce results that are fair to all segments of the population.

1 No. C20-1082JLR, 2021 U.S. Dist. LEXIS 72286 (W.D. Wash. Apr. 14, 2021)