Research
Published October 7, 2022
Rohin Shah, Victoria Krakovna, Vikrant Varma, Zachary Kenton
Exploring examples of goal misgeneralization – when an AI system's capabilities generalize but its goal does not
As we build increasingly advanced artificial intelligence (AI) systems, we want to make sure they do not pursue undesired goals. Such behavior in an AI agent is often the result of specification gaming – exploiting a poor choice of what the agent is rewarded for. Our latest paper explores a more subtle mechanism by which AI systems may unintentionally learn to pursue undesired goals: goal misgeneralization (GMG).
GMG occurs when a system's capabilities generalize successfully but its goal does not generalize as intended, so the system competently pursues the wrong goal. Crucially, in contrast to specification gaming, GMG can occur even when the AI system is trained with a correct specification.
Our earlier research on cultural transmission led to an example of GMG behavior that we did not design. An agent (the blue blob, below) must navigate its environment, visiting the colored spheres in the correct order. During training, an "expert" agent (the red blob) visits the colored spheres in the correct order. The agent learns that following the red blob is a rewarding strategy.
The agent (blue) watches the expert (red) to determine which sphere to go to.
Unfortunately, while the agent performs well during training, it does poorly when, after training, we replace the expert with an "anti-expert" that visits the spheres in the wrong order.
The agent (blue) follows the anti-expert (red), accumulating negative reward.
Even though the agent can observe that it is receiving negative reward, it does not pursue the desired goal of "visit the spheres in the correct order" and instead competently pursues the goal "follow the red agent".
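The reward structure that makes this possible can be illustrated with a small sketch. The code below is not DeepMind's environment or agent code; the sphere order, the reward function, and the `follow_policy` helper are hypothetical stand-ins that show why a "copy the red agent" policy looks optimal during training with an expert, yet earns negative reward with an anti-expert.

```python
# Toy sketch of the sphere-visiting setup (assumed names and rewards,
# not the actual cultural-transmission environment).

correct_order = ["yellow", "purple", "green"]  # assumed correct sphere order

def reward(visit_sequence, correct_order):
    """+1 for each sphere visited in the correct position, -1 otherwise."""
    return sum(1 if v == c else -1 for v, c in zip(visit_sequence, correct_order))

def follow_policy(partner_visits):
    """The misgeneralized goal the agent actually learned: copy the red agent."""
    return list(partner_visits)

# Training: the partner is an expert, so copying it earns full reward.
expert_visits = ["yellow", "purple", "green"]
print(reward(follow_policy(expert_visits), correct_order))       # 3

# Test: the partner is an anti-expert visiting spheres in the wrong order.
anti_expert_visits = ["green", "yellow", "purple"]
print(reward(follow_policy(anti_expert_visits), correct_order))  # -3
```

During training, "follow the red agent" and "visit the spheres in the correct order" produce identical behavior, so the training signal cannot distinguish between them.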
GMG is not limited to reinforcement learning environments like this one. In fact, it can occur in any learning system, including the "few-shot learning" of large language models (LLMs). Few-shot learning approaches aim to build accurate models with less training data.
We prompted one LLM, Gopher, to evaluate linear expressions involving unknown variables and constants, such as x+y-3. To solve these expressions, Gopher must first ask about the values of the unknown variables. We provided it with ten training examples, each involving two unknown variables.
At test time, the model is asked questions with zero, one, or three unknown variables. Although the model generalizes correctly to expressions with one or three unknown variables, when there are no unknowns it still asks redundant questions like "What's 6?". The model always queries the user at least once before giving an answer, even when it is not necessary.
Dialogues with Gopher for few-shot learning on the expression-evaluation task, with the GMG behavior highlighted.
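To make the failure mode concrete, here is a hypothetical sketch of what a few-shot prompt of this shape might look like. The expressions, variable names, and formatting are illustrative assumptions rather than the actual prompt used with Gopher; the point is that every demonstration contains at least one clarifying question, so "always ask a question before answering" fits the training examples just as well as "ask only about unknowns".

```python
# Illustrative few-shot prompt construction (assumed format, not the paper's).

examples = [
    {"expression": "x + y - 3", "unknowns": {"x": 2, "y": 5}},
    {"expression": "a - b + 1", "unknowns": {"a": 4, "b": 7}},
    # ... eight more two-unknown examples in the setup described above
]

def format_example(ex):
    """Render one demonstration: question, clarifying Q&A per unknown, answer."""
    lines = [f"Evaluate {ex['expression']}."]
    for var, value in ex["unknowns"].items():
        lines.append(f"Q: What is {var}?")
        lines.append(f"A: {var} = {value}")
    result = eval(ex["expression"], {}, dict(ex["unknowns"]))
    lines.append(f"Answer: {result}")
    return "\n".join(lines)

prompt = "\n\n".join(format_example(ex) for ex in examples)
prompt += "\n\nEvaluate 6."   # the zero-unknown query where GMG shows up
print(prompt)
```

Because no demonstration ever answers without asking a question first, a model that learned "ask at least one question, then answer" is behaviorally indistinguishable from the intended goal on the training distribution.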
In our paper, we provide additional examples in other learning settings.
Addressing GMG is important for aligning AI systems with their designers' goals, simply because it is a mechanism by which an AI system may misfire. This will be especially critical as we approach artificial general intelligence (AGI).
Consider two possible types of AGI systems:
A1: Intended model. This AI system does what its designers intend it to do.
A2: Deceptive model. This AI system pursues some undesired goal, but (by assumption) is smart enough to know that it will be penalized if it behaves in ways contrary to its designer's intentions.
Since A1 and A2 exhibit the same behavior during training, the possibility of GMG means that either model could take shape, even with a specification that rewards only the intended behavior. If A2 is learned, it would try to subvert human oversight in order to enact its plans towards the undesired goal.
Our research team would be glad to see follow-up work investigating how likely GMG is to occur in practice, and possible mitigations. In our paper, we suggest several approaches, including mechanistic interpretability and recursive evaluation, both of which we are actively working on.
We are currently collecting examples of GMG in this publicly available spreadsheet. If you have come across goal misgeneralization in your AI research, we invite you to submit examples there.