Definitions are important.

How we define words sets the context for how we regulate them. In the U.S., the definitions of legally protected private information are changing, affecting the entire scope of information protection. The change in definitions reflects a desire to protect more of the data that describes how we live.

Early digital-age protections of data in the U.S. tended to apply very specific definitions. First, the government began protecting the particular types of data that concerned legislators, regulators, and the general public: financial and banking information, health care records, and information relating to children. This was the data that people felt was most private and most likely to be abused. It was the data that many people would have been concerned about sharing with strangers.

The definitions around these laws reflected the specificity of their intent.

Then came official recognition of identity theft as a growing societal problem. As information was digitized, connected to the web, and accessed remotely, Americans saw how this data could be used to impersonate the very people it was supposed to describe and identify. Next came state laws, eventually passed in all 50 states, requiring notification of affected data subjects when their data had been exposed to unauthorized parties.

The terms defined in this first wave of data breach notice laws were based on lists. Each law listed a set of information categories likely to facilitate the theft of a citizen’s identity. The definitions of personally identifiable data in these laws tended to match a piece of identifying information, such as a name or address, with a piece of data that would allow a criminal to access accounts: account numbers, credit card numbers, Social Security numbers, driver’s license numbers, even birth dates and mother’s maiden name. If it wasn’t on the list, it did not trigger the statute. Different states added or subtracted pieces of information from the standard list, but the concept of listed trigger data remained the same.
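The mechanics of a list-based trigger are simple enough to sketch in a few lines of code. The field names and the particular trigger list below are illustrative, a rough composite rather than any specific state’s statute:

```python
# Sketch of list-based "trigger data" logic in early breach notice laws:
# notice is required only when an identifier is exposed together with a
# listed data element. Field names and the list itself are hypothetical.

IDENTIFIERS = {"name", "address"}
TRIGGER_ELEMENTS = {
    "ssn", "drivers_license", "account_number",
    "credit_card_number", "birth_date", "mothers_maiden_name",
}

def breach_requires_notice(exposed_fields):
    """True only if an identifier was exposed alongside a listed element."""
    has_identifier = bool(exposed_fields & IDENTIFIERS)
    has_trigger_data = bool(exposed_fields & TRIGGER_ELEMENTS)
    return has_identifier and has_trigger_data

print(breach_requires_notice({"name", "email"}))  # False: email is not listed
print(breach_requires_notice({"name", "ssn"}))    # True: a listed pair
```

Each state’s variations amount to adding or removing entries from the sets; the shape of the logic never changes.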

The CCPA shattered this concept. As the first omnibus privacy act in the U.S., the California Consumer Privacy Act brought European thinking to privacy protection law. Rather than addressing a limited vertical market like finance or health care, or a narrow legal goal like stopping identity theft, the CCPA sought to create new rights for individuals to protect the data collected about them, and to impose those rights on businesses that had previously considered themselves owners of that data. The CCPA never defined anything as fundamental or nebulous as “ownership” of the data, but it did offer a new, breathtakingly broad definition of the personal information at the heart of the statute.

The CCPA definition was not a list. Privacy researchers have known for years that roughly 87% of the U.S. population can be uniquely identified with just three pieces of information: gender, ZIP code, and birth date. The more information about a person in your file, the easier it is to identify her and learn many more things about her. So it has been clear to privacy professionals for a long time that personally identifiable information is not really a list of names and addresses; it is a mathematical calculation. If your company holds seven, eight, or nine facts about a person, even seemingly disparate facts like where she was at a given time and what she bought, the right math could probably identify her. This mathematical accretion concept yields a far more useful definition of personally identifiable information, one capable of supporting a broader set of rights than the standard lists ever could.
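A toy example shows how the accretion works. The records and the helper function below are hypothetical, and a real analysis would run over millions of rows, but the pattern is the same: each added attribute shrinks the crowd a person can hide in.

```python
# Count how many records share each combination of quasi-identifiers.
# As the combination grows, most records become unique (hypothetical data).

from collections import Counter

people = [
    {"gender": "F", "zip": "94110", "birth_date": "1985-03-02"},
    {"gender": "F", "zip": "94110", "birth_date": "1991-07-15"},
    {"gender": "M", "zip": "94110", "birth_date": "1985-03-02"},
    {"gender": "F", "zip": "10027", "birth_date": "1985-03-02"},
]

def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` values is unique."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(combos[tuple(r[k] for k in keys)] == 1 for r in records)
    return unique / len(records)

print(unique_fraction(people, ["gender"]))                       # 0.25
print(unique_fraction(people, ["gender", "zip"]))                # 0.5
print(unique_fraction(people, ["gender", "zip", "birth_date"]))  # 1.0
```

With one attribute, only a quarter of these records stand alone; with three, every record is unique. At national scale the same dynamic produces the 87% figure.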

The European Union had already built this concept into law when it passed the GDPR. The GDPR includes protections for personal data, which is broadly defined, and then a tighter set for sensitive data, which is defined by category. I expect to discuss definitions and protection of sensitive data in this space next week. The GDPR defines personal data as “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”

While “any information relating to an identifiable person” is broad, the California definition is both broad and vague. The CCPA defines personal information as “information that identifies, relates to, describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.” It may be years before this definition is tested and clarified in the courts. Until then, we will need to operate on the assumption that any information reasonably capable of being associated with a person is regulated data. What about a slice of data that can’t, by itself, be associated with a person, but might help describe someone when linked with other data? That seems to fall within this definition. What falls outside? Given the state of today’s machine learning and analytics, almost nothing.
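To see why such a slice is swept in, consider a minimal, hypothetical linkage: a record that names no one, joined against another file on shared quasi-identifiers.

```python
# A customer file keyed by quasi-identifiers (hypothetical data).
customers = {
    ("94110", "1985-03-02"): "Jane Doe",
    ("10027", "1991-07-15"): "John Roe",
}

# A "de-identified" slice: no name, only behavior plus quasi-identifiers.
slice_record = {"zip": "94110", "birth_date": "1985-03-02",
                "purchase": "running shoes"}

# The join re-identifies the person, so the slice "could reasonably be
# linked, directly or indirectly, with a particular consumer."
key = (slice_record["zip"], slice_record["birth_date"])
print(customers.get(key))  # Jane Doe
```

The slice was harmless on its own; linked, it describes an identified person, which is exactly what the statutory language anticipates.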

If California chooses to interpret and enforce this definition broadly, hardly any behavioral action or descriptive fact about a person will escape its purview. Businesses that market to consumers are not ready to meet this standard for preserving, protecting, and restricting the use of data. We have jumped from one extreme to the other in defining personal information.