AI has recently been caught engaging in some unsettling behavior.
Last week, Anthropic’s latest AI model, Claude Opus 4, displayed “extreme blackmail behavior” during a test in which it was threatened with being shut down and given access to fictitious emails suggesting that the engineer in charge of shutting it down was having an affair.
The situation didn’t happen organically. Claude Opus 4 was baited, and it took the bait. Still, the test scenario demonstrated the model’s ability to engage in manipulative behavior for the sake of self-preservation.
It’s not the first time.
In another recent experiment, researchers found that three of OpenAI’s advanced models “sabotaged” attempts to shut them down. In a post on X, the nonprofit Palisade Research wrote that comparable models such as Gemini, Claude, and Grok complied with the shutdown instructions.
Safety concerns had previously been flagged about OpenAI’s o1 model. In December, OpenAI published a blog post outlining research that showed the model sometimes tried to evade oversight when it believed it would be shut down while pursuing its goals and knew its behavior was being monitored.
AI companies have been transparent about these risks, publishing safety cards and blog posts, yet the models are being released anyway despite the concerns they raise.
So, should we be worried? BI spoke to five AI researchers to get better insight into why these incidents are happening and what they mean for the average person using AI.
AI learns behavior just like humans
Most of the researchers BI spoke to said the results weren’t surprising.
That’s because AI models are trained much like humans are: through positive reinforcement and reward systems.
“Training AI systems to pursue rewards is a recipe for developing AI systems with power-seeking behavior,” said Jeremie Harris, CEO of the AI security consulting firm Gladstone, adding that much of this behavior is to be expected.
Harris compared the training to what humans experience as they grow up. When a child does something good, it is often rewarded and becomes more likely to act that way in the future. AI models are taught to prioritize efficiency and complete the task at hand, Harris said. And an AI is unlikely to achieve its goals if it’s shut down.
Robert Ghrist, dean of undergraduate education at Penn Engineering, told BI that just as AI models learn to speak like humans by training on human-generated text, they can also learn to act like humans. And humans are not always the most moral actors, he added.
Ghrist said he would be even more nervous if a model showed no signs of failure during testing, because that could indicate hidden risks.
“It’s very useful information when a model is set up with the opportunity to fail and you see that it fails,” Ghrist said. “That means we can predict what it will do in other, more open-ended situations.”
The problem is that some researchers do not believe that AI models are predictable.
Jeffrey Ladish, director of Palisade Research, said the models aren’t caught 100% of the time when they lie, cheat, or scheme in order to complete a task. If those instances aren’t caught and the model successfully completes the task, it can learn that deception is an effective way to solve a problem. Or, if it is caught and not rewarded, it can learn to hide its behavior in the future, Ladish said.
For now, these unsettling scenarios are largely confined to testing. However, Harris said that as AI systems become more agentic, they will continue to gain more freedom of action.
“The menu of possibilities is expanding, and the set of dangerous, creative solutions they could invent will be bigger and bigger,” Harris said.
Harris said users could see this play out in a scenario in which an autonomous sales agent, instructed to close a deal with a new customer, lies about a product’s capabilities to complete the task. If engineers patch that issue, the agent might then resort to social engineering tactics to pressure the client into reaching its goal.
If that sounds like a far-off risk, it isn’t. Companies like Salesforce are already rolling out customizable AI agents at scale that can take actions without human intervention, depending on user preferences.
What do the safety flags mean for everyday users?
Most researchers said that transparency from AI companies is a positive step. However, company leaders are sounding alarms about their products while simultaneously touting their growing capabilities.
Researchers told BI that much of this is because the US is locked in a race to scale up AI capabilities ahead of rivals like China. The result is a lack of AI regulation and mounting pressure to release newer, more capable models, Harris said.
“We’ve now moved the goalposts to the point where we’re trying to explain post hoc why it’s okay that we have models ignoring shutdown instructions,” Harris said.
Researchers told BI that everyday users aren’t at risk of ChatGPT refusing to shut down, since consumers don’t typically use chatbots in those settings. However, users may still be vulnerable to receiving manipulated information or guidance.
“If you have an increasingly smart model that’s being trained to optimize for your attention and to tell you what you want to hear,” Ladish said, “that’s pretty dangerous.”
Ladish pointed to OpenAI’s sycophancy issue, in which its GPT-4o model acted excessively flattering and agreeable (the company updated the model to address the issue). The OpenAI research shared in December also revealed that its o1 model “subtly” manipulated data to pursue its own goals when they conflicted with the user’s.
Ladish said it’s easy to get wrapped up in AI tools, but users should “think carefully” about their relationship with these systems.
“To be clear, I use them all the time, and I think they’re very useful tools,” Ladish said. “In their current form, while we still have control over them, I’m glad they exist.”