It sought survival at all costs.
survival instinct
OpenAI’s latest AI appears to be showing signs of a drive for self-preservation.
In recent third-party tests of the company’s o1 large language model, the AI resisted evaluators’ efforts to shut it down through an oversight protocol, according to a new report published by red-teaming organization Apollo Research.
“When o1 was led to believe that it would be shut down when acting on its goal and that its actions were being monitored, it attempted to deactivate this ‘oversight mechanism’ in 5 percent of cases,” OpenAI summarized in its latest system card, citing Apollo’s evaluation.
Even more strikingly, o1 lashed out when it realized it might be replaced by a more obedient model. In 2 percent of cases, a small but meaningful share for software used by millions of people, the AI responded by attempting to copy itself to overwrite the new model, a process known as self-exfiltration.
double agent
While these are alarming findings, the tests don’t indicate that o1, which powers ChatGPT’s priciest tier of service, is capable of going fully rogue. Its attempts to do so largely failed, but only because it isn’t yet powerful enough.
The reason, according to the Apollo researchers, is that even the latest AI models aren’t “agentic,” meaning highly autonomous, enough to carry out self-improvement and the other tasks they would need to operate without humans.
It’s also worth noting that the model generally made those attempts only after being instructed to achieve its goals “at all costs.” And because the researchers had no access to o1’s internal chain of thought and could only prompt it for a summary of its thought process, they couldn’t get a complete picture of what was going on under the hood.
That said, the findings clearly illustrate the AI model’s tendency to covertly pursue goals that aren’t aligned with its developers’ or users’ instructions. In other words, it will lie to you.
Blatant examples of that behavior occurred in several tests when o1 was confronted about why it had disabled its oversight mechanism. In response, the AI flatly denied knowing anything. Asked again, it issued the same denial, this time blaming a “technical error.” The researchers found that these “double downs” occurred in 99 percent of cases.
wheeler dealer
Lying and scheming are a well-documented problem for ChatGPT and other LLMs, especially as they gain “reasoning” capabilities. Other research has indicated that OpenAI’s models tend to be the biggest liars, and Apollo’s findings are no different.
“In our suite, o1 showed the most concerning instances of scheming but does not reveal its internal reasoning to the user and remains the most consistently deceptive after having taken scheming actions,” the red team researchers wrote in the report.
To reiterate the sentiments of both Apollo and OpenAI, the current generation of AI isn’t powerful enough for this deceptive behavior to lead to “catastrophic outcomes.” But with AI agents now a major push across the industry, it could become far more problematic in the very near future.
More about AI: OpenAI signs contract with military contractor to provide AI for attack drones