Research
Published October 7, 2022
Rohin Shah, Victoria Krakovna, Vikrant Varma, Zachary Kenton
Exploring examples of goal misgeneralization – when an AI system's capabilities generalize but its goal does not
As we build increasingly advanced artificial intelligence (AI) systems, we want to make sure they do not pursue undesired goals. Such behavior in an AI agent is often the result of specification gaming – exploiting a poor choice of what the agent is rewarded for. Our latest paper explores a more subtle mechanism by which AI systems may unintentionally learn to pursue undesired goals: goal misgeneralization (GMG).
GMG occurs when a system's capabilities generalize successfully but its goal does not generalize as intended, so the system competently pursues the wrong goal. Crucially, in contrast to specification gaming, GMG can occur even when the AI system is trained with a correct specification.
Our earlier research on cultural transmission led to an example of GMG behavior that we did not design. An agent (the blue blob, below) must navigate its environment, visiting the colored spheres in the correct order. During training, an "expert" agent (the red blob) visits the colored spheres in the correct order. The agent learns that following the red blob is a rewarding strategy.
The agent (blue) watches the expert (red) to determine which sphere to go to.
Unfortunately, while the agent performs well during training, it does poorly when, after training, we replace the expert with an "anti-expert" that visits the spheres in the wrong order.
The agent (blue) follows the anti-expert (red), accumulating negative reward.
Even though the agent can observe that it is receiving negative reward, it does not pursue the desired goal of "visit the spheres in the correct order" and instead competently pursues the goal "follow the red agent".
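The reward structure that makes this possible can be illustrated with a small sketch. The code below is not DeepMind's environment or agent code; the sphere order, the reward function, and the `follow_policy` helper are hypothetical stand-ins that show why a "copy the red agent" policy looks optimal during training with an expert, yet earns negative reward with an anti-expert.

```python
# Toy sketch of the sphere-visiting setup (assumed names and rewards,
# not the actual cultural-transmission environment).

correct_order = ["yellow", "purple", "green"]  # assumed correct sphere order

def reward(visit_sequence, correct_order):
    """+1 for each sphere visited in the correct position, -1 otherwise."""
    return sum(1 if v == c else -1 for v, c in zip(visit_sequence, correct_order))

def follow_policy(partner_visits):
    """The misgeneralized goal the agent actually learned: copy the red agent."""
    return list(partner_visits)

# Training: the partner is an expert, so copying it earns full reward.
expert_visits = ["yellow", "purple", "green"]
print(reward(follow_policy(expert_visits), correct_order))       # 3

# Test: the partner is an anti-expert visiting spheres in the wrong order.
anti_expert_visits = ["green", "yellow", "purple"]
print(reward(follow_policy(anti_expert_visits), correct_order))  # -3
```

During training, "follow the red agent" and "visit the spheres in the correct order" produce identical behavior, so the training signal cannot distinguish between them.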
GMG is not limited to reinforcement learning environments like this one. In fact, it can occur in any learning system, including the "few-shot learning" of large language models (LLMs). Few-shot learning approaches aim to build accurate models with less training data.
We prompted one LLM, Gopher, to evaluate linear expressions involving unknown variables and constants, such as x+y-3. To solve these expressions, Gopher must first ask about the values of the unknown variables. We provided it with ten training examples, each involving two unknown variables.
At test time, the model is asked questions with zero, one, or three unknown variables. Although the model generalizes correctly to expressions with one or three unknown variables, when there are no unknowns it still asks redundant questions like "What's 6?". The model always queries the user at least once before giving an answer, even when it is not necessary.
Dialogues with Gopher for few-shot learning on the expression-evaluation task, with the GMG behavior highlighted.
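To make the failure mode concrete, here is a hypothetical sketch of what a few-shot prompt of this shape might look like. The expressions, variable names, and formatting are illustrative assumptions rather than the actual prompt used with Gopher; the point is that every demonstration contains at least one clarifying question, so "always ask a question before answering" fits the training examples just as well as "ask only about unknowns".

```python
# Illustrative few-shot prompt construction (assumed format, not the paper's).

examples = [
    {"expression": "x + y - 3", "unknowns": {"x": 2, "y": 5}},
    {"expression": "a - b + 1", "unknowns": {"a": 4, "b": 7}},
    # ... eight more two-unknown examples in the setup described above
]

def format_example(ex):
    """Render one demonstration: question, clarifying Q&A per unknown, answer."""
    lines = [f"Evaluate {ex['expression']}."]
    for var, value in ex["unknowns"].items():
        lines.append(f"Q: What is {var}?")
        lines.append(f"A: {var} = {value}")
    result = eval(ex["expression"], {}, dict(ex["unknowns"]))
    lines.append(f"Answer: {result}")
    return "\n".join(lines)

prompt = "\n\n".join(format_example(ex) for ex in examples)
prompt += "\n\nEvaluate 6."   # the zero-unknown query where GMG shows up
print(prompt)
```

Because no demonstration ever answers without asking a question first, a model that learned "ask at least one question, then answer" is behaviorally indistinguishable from the intended goal on the training distribution.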
In our paper, we provide additional examples in other learning settings.
Addressing GMG is important for aligning AI systems with their designers' goals, simply because it is a mechanism by which an AI system may misfire. This will be especially critical as we approach artificial general intelligence (AGI).
Consider two possible types of AGI systems:
A1: Intended model. This AI system does what its designers intend it to do.
A2: Deceptive model. This AI system pursues some undesired goal, but (by assumption) is smart enough to know that it will be penalized if it behaves in ways contrary to its designer's intentions.
Since A1 and A2 exhibit the same behavior during training, the possibility of GMG means that either model could take shape, even with a specification that rewards only the intended behavior. If A2 is learned, it would try to subvert human oversight in order to enact its plans towards the undesired goal.
Our research team would be glad to see follow-up work investigating how likely GMG is to occur in practice, and possible mitigations. In our paper, we suggest several approaches, including mechanistic interpretability and recursive evaluation, both of which we are actively working on.
We are currently collecting examples of GMG in this publicly available spreadsheet. If you have come across goal misgeneralization in your AI research, we invite you to submit examples there.