I have a friend who lives in the United States. She's a therapist, and is part of a non-profit therapist group. The group uses a Google Workspace Business account. Recently, Google decided to automatically turn on Gemini AI for Google Workspace. Google's actions led to Gemini consuming information about patients.
My friend became alarmed when one of her employees was able to ask Gemini for a list of their clients. As a healthcare organization, this group is beholden to US HIPAA (Health Insurance Portability and Accountability Act) standards, which require privacy controls over patient data. It was bad enough that Google introduced this feature with everyone automatically opted in; making the problem even worse, the opt-out controls on the appropriate admin panels didn't arrive until a week after Google's initial rollout.
In my opinion, this is a significant failure on Google's part. We are long past the point when serious technology companies should be "forgetting" about the intersection between data privacy and AI training. Our customers, like Google's, often operate within regulated industries where data privacy cannot be considered an afterthought.
To some extent, I can understand how Google misstepped here. When more data is available, the AI is better trained, and that makes it more powerful. There's a non-trivial pressure for AI vendors like Google to acquire as much data as possible.
This view is very much in line with modern "dataist" thought from writers like David Brooks and Yuval Noah Harari: that we can imagine the world as an endless stream of data, just waiting to be mined for information.
It's a new gold rush, and like the old gold rush, the rules are a bit "wild west."
But it's not like there are no rules. In the example above, HIPAA clearly applied. And yet that didn't give Google pause. That's a problem.
It's additionally true that many countries - including Canada - are moving toward GDPR (General Data Protection Regulation)-like privacy legislation that is increasingly stringent about the use of certain data:
- The use of data outside of the purpose or context in which it was provided is being limited, and generally minimized.
- Automated decision making based on data analysis is being limited.
- Sensitive data - for example, HIPAA-protected health data, as in the scenario I described above - is subject to much stronger controls.
Increasingly we're seeing that as organizations get more and more zealous about acquiring and mining data, individuals are getting more and more wary. And that collective wariness sometimes results in new legislation and greater privacy controls.
That tension has been playing out in several recent high-profile cases. And we're seeing some broad themes:
- Unfettered usage of data is undermining faith in social platforms. When it emerged that Cambridge Analytica was using extreme data mining to shape public perception ahead of an election, the public was troubled. Many people learned that the whimsical act of filling out a fun online quiz - say, to determine which Golden Girl is most like them - helped political strategists identify what kinds of propaganda those quiz-takers were most susceptible to. When the truth came to light, most people were creeped out to learn that they had been active participants in their own manipulation: that they had provided data with one benign intention, and that the data was being used in a very different context, and to very different ends. The case remains a creepy stain on social media platforms like Facebook.
- Bias in our training data is resulting in biased systems, and that erodes public trust. Amazon, for example, created an AI-trained hiring system that was biased against hiring women. Magnifying and entrenching that bias created a public relations fallout for the company. IBM had a similar experience with training a facial recognition system based on Flickr photo data. The end result was a failed product and egg on their face.
- Groups and individuals are increasingly pushing back against the use of their data. GitHub Copilot has been trained on countless open source repositories - freely offered and apparently open for use. But open source licenses aren't completely unfettered: they contain requirements that GitHub Copilot doesn't meet. That misuse has resulted in a fairly significant lawsuit. I suspect that tools like DALL-E - trained on unlicensed artworks - will attract similar lawsuits.
I hope that you, like me, see these cases as cautionary tales. And that you are motivated to try to avoid pitfalls like the above. Avoiding those pitfalls requires companies to seriously grapple with issues of data privacy.
So what does that look like?
If you are training your own models:
- Implement Privacy by Design: Integrate privacy principles into the design of the system itself. Familiarize yourself with the most stringent privacy-by-design requirements, such as the European regulatory proposals on AI, and embrace them ab initio.
- Use Data Appropriately: Ensure that data is used only for the purpose, and in the context, for which it was provided, in keeping with the principles outlined above.
- Be Transparent: When collecting the data, be transparent about the use and purpose of the data.
- Apply Like-for-like Protection: Your model holds a derived form of the data. As a result, it must be protected with at least the same controls and reporting that are required for the data itself. In the example I started with, if HIPAA data goes in, then the entire model must be HIPAA-protected. That's not trivial.
If you're acquiring an AI service or model from a 3rd party provider:
- Demand Transparency: Ensure vendors are transparent about their data collection practices, and that data provenance is accurately tracked.
- Mandate Protection: Ensure the vendor has required protection in place for the type of data being handled, both for the source data and the models.
- Resolve Liability: Read the fine print, and ensure that the vendor will protect you appropriately from liabilities (for example, IP infringement liabilities) arising from the use of their AI models. Many AI vendors catering to the enterprise market (Google, Cohere, Anthropic, Amazon, etc.) offer some form of liability protection, but the level of protection varies widely between vendors.
It's also true that as organizations have embraced data privacy, new approaches to handling it have emerged. We've moved beyond the black-and-white choice of "do we use the data or not?"
Consider the following technologies:
- Federated Learning is an approach that provides a degree of data privacy while still supporting an aggregated training goal. Training occurs against segregated datasets on different servers, and only model updates - never the raw data - are centralized.
- Differential Privacy is an approach that protects individuals by adding calibrated statistical noise to the raw data, or to results computed from it. We can still produce useful aggregates while hiding the contribution of any single record in the set (see the sketch after this list).
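To make the differential privacy idea concrete, here's a minimal sketch of the classic Laplace mechanism applied to a simple count query. The dataset, the query, and the epsilon values are all illustrative assumptions, not anyone's production setup:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical sensitive data: ages of individuals in a private dataset.
ages = np.array([34, 45, 29, 52, 41, 38, 60, 27])

def dp_count_over_40(data, epsilon=1.0):
    """Return a differentially private count of records with age > 40.

    The count has sensitivity 1 (adding or removing one person changes it
    by at most 1), so Laplace noise with scale 1/epsilon yields epsilon-DP.
    """
    true_count = np.sum(data > 40)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

print("True count:", int(np.sum(ages > 40)))                          # exact, private value
print("DP count  :", round(dp_count_over_40(ages, epsilon=0.5), 2))   # noisy, shareable value
```

The smaller the epsilon, the more noise is added and the stronger the privacy guarantee - at the cost of accuracy.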
These two methods have also been shown to work well together, offering a high degree of protection both for the privacy of the data being used and for the model itself. Sherpa.ai, for example, has an implementation of this combined approach; a rough sketch of the idea follows below.
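Here's that rough sketch of how the two techniques can be combined, assuming a toy linear model and three hypothetical "clinic" datasets. Each client clips and noises its local update before sharing it, and the server only ever aggregates those updates - it never sees a raw record. This is a simplification of the general pattern, not a description of Sherpa.ai's (or any vendor's) actual implementation:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical setup: three clinics each hold their own features (x) and
# outcomes (y). The raw rows never leave the clinic that owns them.
def make_client_data(n):
    x = rng.normal(size=(n, 3))
    y = x @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=n)
    return x, y

clients = [make_client_data(n) for n in (40, 55, 30)]

def local_update(weights, x, y, lr=0.1, clip=1.0, noise_scale=0.05):
    """One local gradient step on a linear model; the gradient is clipped
    (to bound its sensitivity) and Gaussian noise is added before the
    update ever leaves the client - a crude stand-in for DP-SGD."""
    grad = 2 * x.T @ (x @ weights - y) / len(y)                    # mean-squared-error gradient
    grad = grad / max(1.0, np.linalg.norm(grad) / clip)            # clip the gradient norm
    grad = grad + rng.normal(scale=noise_scale, size=grad.shape)   # add DP noise
    return weights - lr * grad

# Federated averaging: the server only ever sees noisy local models.
weights = np.zeros(3)
for _ in range(50):
    local_models = [local_update(weights, x, y) for x, y in clients]
    weights = np.mean(local_models, axis=0)                        # aggregate on the server

print("Learned weights:", np.round(weights, 2))                    # close to [0.5, -1.0, 2.0]
```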
Here's another interesting solution:
- Homomorphic Encryption is an approach in which computations are carried out on ciphertext, generating an encrypted result that, when decrypted, matches the result of the same operations performed on the plaintext. It could prove valuable for training AI directly on encrypted data, without ever having to decrypt it. The approach is still computationally expensive, but it is promising.
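To illustrate the homomorphic property itself, here's a deliberately tiny - and completely insecure - toy version of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts produces a ciphertext of the sum of the plaintexts. The parameters are illustrative assumptions; real deployments use large keys and hardened libraries.

```python
import math
import random

# Toy Paillier keypair with tiny, insecure parameters.
# (Uses math.lcm and three-argument pow, available in Python 3.9+.)
p, q = 61, 53                 # real systems use primes hundreds of digits long
n = p * q                     # public modulus
n_sq = n * n
g = n + 1                     # standard simple choice of generator
lam = math.lcm(p - 1, q - 1)  # private: Carmichael function of n
mu = pow(lam, -1, n)          # private: modular inverse of lambda mod n

def encrypt(m):
    """Encrypt a plaintext integer 0 <= m < n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    """Decrypt a ciphertext back to its plaintext."""
    x = pow(c, lam, n_sq)
    return (((x - 1) // n) * mu) % n

a, b = 17, 25
ca, cb = encrypt(a), encrypt(b)

# The "computation" happens entirely on ciphertexts: multiplying them
# corresponds to adding the underlying plaintexts.
c_sum = (ca * cb) % n_sq

print(decrypt(c_sum))   # prints 42, i.e. a + b, recovered without ever touching the plaintexts
```

Fully homomorphic schemes generalize this idea to richer computations, which is what makes training on encrypted data conceivable - though today it comes with a substantial performance cost.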
To wrap it up: data privacy can't be ignored. Governments are bringing forward legislation; standards bodies are pushing for ethical practices; and even individuals are pushing back on the unfettered use of data that they produce. It's not the wild west anymore, and ignoring privacy is going to lead to a world of pain.
But the good news is that new privacy-enabling technologies are starting to provide meaningful options about how privacy is implemented for your AI projects. There's no better time than now to engage with this topic.