Do AI chatbots learn from everything we type into them? This is probably the most common question asked about AI and data security.
You can see the logic behind the question. AI chatbots talk to us in a human-like way, so it is easy to assume they also learn in a human-like way. We tend to assume that, like humans, they are not only good at learning from conversations but perhaps also bad at keeping secrets, and that they might share what they have learned with many other users.
Why does this matter? I think it shapes the way we approach AI and security. We often focus too much on the very limited risks around AI models and training data, and not enough on other, more pressing AI security issues, such as the overall security of the tools we use.
At one end of the spectrum, a statement that interactions are not used for training can lead users to assume the system is safe; at the other end, it feeds the idea that building an in-house solution, so your data is never used for training, is the most effective way to get a secure AI system.
AI and training data
To understand this, it helps to think a little more about the data involved. The data used to train large language models typically includes text scraped from the internet, licensed from major media companies, or obtained through potentially unethical means. Notably, that means AI models are mostly trained on data that has deliberately been made available on the internet; it is not primarily about our interactions with AI tools.
Language models are also retrained only rarely, as each training run can cost millions of pounds. The models are not updated simply because we use them.
Because these are language models, the companies behind them are looking for high-quality text, and what we tend to type into AI tools is generally of little value as training data: random chat, snippets of text, and so on.
Where companies do ask permission to use our data in their terms and conditions, it is primarily to understand how the tool is being used and to improve its overall performance.
However, even if our text does end up in a training set, the likelihood of the model reproducing it is close to zero, because training sets are huge. To put that in context: if you started reading a typical training set today, it would take you over 120,000 years to finish! (And yes, I must confess, ChatGPT helped me with that calculation.)
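As a rough sanity check, here is the back-of-envelope arithmetic behind that figure, sketched in Python. The corpus size, tokens-per-word ratio and reading speed are my own illustrative assumptions (modern models are reported to train on the order of tens of trillions of tokens), not figures from any specific model.

```python
# Back-of-envelope estimate: how long would it take a human to read
# an entire LLM training corpus? All figures below are illustrative
# assumptions, not numbers from any specific model.

TRAINING_TOKENS = 20e12   # assume ~20 trillion tokens in the corpus
WORDS_PER_TOKEN = 0.75    # rough rule of thumb: ~3/4 of a word per token
READING_WPM = 250         # typical adult reading speed, words per minute

total_words = TRAINING_TOKENS * WORDS_PER_TOKEN
minutes = total_words / READING_WPM
years = minutes / (60 * 24 * 365)  # reading non-stop, 24 hours a day

print(f"{years:,.0f} years of non-stop reading")  # roughly 114,000 years
```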
A language model is not a knowledge base, so you cannot look things up in its training set. Instead, each additional piece of data has a tiny, tiny effect on the overall output, which is ultimately just a prediction of the next piece of text. This is part of the reason we see so little personal data revealed by tools like ChatGPT and Gemini.
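To make the "prediction, not lookup" point concrete, here is a deliberately tiny sketch of the core idea: a model counts patterns across all of its training text and predicts the likeliest next word, blending everything together rather than storing retrievable records. This toy bigram model is purely my own illustration; production chatbots use neural networks, not word counts, but the predict-the-next-token principle is the same.

```python
from collections import Counter, defaultdict

# A deliberately tiny "language model": it learns next-word statistics
# from its training text. Individual sentences are not stored as
# retrievable records; they are blended into aggregate counts.
training_text = (
    "the cat sat on the mat . "
    "the cat ate . "
    "the dog sat on the rug ."
)

counts = defaultdict(Counter)
words = training_text.split()
for current_word, next_word in zip(words, words[1:]):
    counts[current_word][next_word] += 1

def predict_next(word):
    """Return the statistically most likely next word."""
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat" (the most common word after "the")
print(counts["sat"])        # blended statistics, not stored sentences
```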
Data security
So, does that mean you don't have to worry about security at all when using AI tools? Definitely not!
Legally, the personal data of staff and students must not be used for any purpose beyond what is required and contractually agreed. But you do need to understand where the real risks lie, and they are mostly a matter of general data security.
The biggest risk is, in fact, that your data is inadequately protected or shared with third parties. Just a few weeks ago, security researchers discovered a fundamental security flaw in the Chinese chatbot DeepSeek that meant anyone could access its database and view users' chat histories.
It may be tempting to think that the best way to get secure AI is to build or host your own solution. But securing software is genuinely hard! Consider word processors and spreadsheets: they handle our most sensitive data, yet we don't build or host in-house versions of them. Instead, we manage the risk through contracts, user training, policies and technical controls.
So what is the best approach to using AI safely and securely? It's simple: don't think of an AI system as a gossipy friend itching to spill your secrets. Instead, treat it like any other IT system, where security and contracts matter most.
And if you take only one thing away from reading this, let it be this: use AI systems backed by robust agreements that ensure responsible and secure data processing.