Research
Published 18 August 2022
Zachary Kenton, Ramana Kumar, Sebastian Farquhar, Jonathan Richens, Matt MacDermott, Tom Everitt
A new formal definition of agency provides clear principles for causal modeling of AI agents and the incentives they face.
We want to build safe, aligned artificial general intelligence (AGI) systems that pursue the intended goals of their designers. Causal influence diagrams (CIDs) are a way to model decision-making situations that allows us to reason about an agent's incentives. For example, here is the CID for a one-step Markov decision process – a typical framework for decision-making problems.
S1 represents the initial state, A1 the agent's decision (square), and S2 the next state. R2 is the agent's reward/utility (diamond). Solid edges specify causal influence. Dashed edges specify information links – what the agent knows when making its decision.
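To make the structure concrete, here is a minimal sketch of that one-step MDP CID as an annotated directed graph in Python with networkx. The library choice, attribute names, and exact encoding are our own illustration; the paper does not prescribe a particular representation.

```python
# A minimal sketch of the one-step MDP CID as a labelled directed graph.
# This only encodes structure, not any probability distributions.
import networkx as nx

cid = nx.DiGraph()
cid.add_node("S1", kind="chance")    # initial state
cid.add_node("A1", kind="decision")  # agent's decision (square)
cid.add_node("S2", kind="chance")    # next state
cid.add_node("R2", kind="utility")   # reward/utility (diamond)

# Solid edges: causal influence.
cid.add_edge("S1", "S2", kind="causal")
cid.add_edge("A1", "S2", kind="causal")
cid.add_edge("S2", "R2", kind="causal")

# Dashed edge: information link - what the agent observes when deciding.
cid.add_edge("S1", "A1", kind="information")

print(cid.nodes(data=True))
print(cid.edges(data=True))
```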
By relating a training setup to the incentives that shape the agent's behavior, CIDs can illuminate potential risks before an agent is trained and can inspire better agent designs. But how do we know that a CID is an accurate model of the training setup?
Our new paper, Discovering Agents, introduces new ways of tackling these issues, including:
- the first formal causal definition of agents: roughly, agents are systems that would adapt their policy if their actions influenced the world in a different way;
- an algorithm for discovering agents from empirical data;
- a translation between causal models and CIDs; and
- resolving earlier confusions arising from incorrect causal modeling of agents.
When combined, these results provide an additional layer of assurance that no modeling errors have been made. This means you can use CIDs to more confidently analyze agent incentives and safety.
Example: modeling a mouse as an agent
To illustrate our method, consider the following example: a world containing three squares, with a mouse starting in the middle square that chooses to go left or right, reaches its next position, and then potentially gets some cheese. The floor is icy, so the mouse might slip. The cheese might be on the right, but it might also be on the left.
Mouse and cheese environment.
This can be expressed by the following CID:
CID for the mouse. D represents the left/right decision. X is the mouse's new position after taking the action left/right (it might slip and accidentally end up on the other side). U represents whether the mouse gets the cheese or not.
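The toy simulation below shows how D, X and U relate in this environment. The slip probability and cheese-placement probability are made-up values for illustration; this is not code from the paper.

```python
# A toy simulation of the mouse-and-cheese environment. The slip probability
# and the cheese-placement distribution are illustrative assumptions.
import random

def mouse_policy(cheese_right_prob):
    # The policy depends on the cheese *distribution* (a mechanism), not on
    # the realized cheese position: in the CID, D has no object-level parents.
    return "right" if cheese_right_prob >= 0.5 else "left"

def step(decision, slip_prob, cheese_side):
    # X: the mouse's new position; with probability slip_prob it slips and
    # ends up on the opposite side.
    slipped = random.random() < slip_prob
    x = decision if not slipped else ("left" if decision == "right" else "right")
    # U: whether the mouse gets the cheese.
    u = int(x == cheese_side)
    return x, u

slip_prob, cheese_right_prob = 0.2, 0.8          # environment mechanisms
cheese_side = "right" if random.random() < cheese_right_prob else "left"
d = mouse_policy(cheese_right_prob)              # D: the decision
x, u = step(d, slip_prob, cheese_side)           # X and U
print(f"decision={d}, position={x}, got_cheese={bool(u)}")
```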
The intuition that the mouse would choose a different behavior for a different environment setting (iciness, cheese distribution) can be captured by a mechanized causal graph. For each (object-level) variable, there is also a mechanism variable that governs how the variable depends on its parents. Crucially, mechanism variables are allowed to be linked to each other.
This graph contains additional mechanism nodes, in black, representing the mouse's policy and the iciness and cheese distributions.
Mechanized causal graph of the mouse and cheese environment.
The edges between mechanisms represent direct causal influence. The blue edge is a special terminal edge: roughly, terminal edges are mechanism edges A~ → B~ that would still be present even if the object-level variable A were altered so that it had no outgoing edges.
In the example above, since U has no children, its mechanism edge must be terminal. But the mechanism edge X~ → D~ is not terminal: if X were cut off from its child U, the mouse would no longer adapt its decision (because its position would no longer affect whether it gets the cheese).
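As a sketch, the mechanized causal graph for the mouse example can be written down as an annotated directed graph. The trailing-~ node names and the edge attributes are our own conventions for illustration, not notation from the paper.

```python
# A sketch of the mechanized causal graph for the mouse example.
# Each object-level variable (D, X, U) gets a mechanism node (D~, X~, U~);
# the mechanism edges and the "terminal" flags follow the description above.
import networkx as nx

g = nx.DiGraph()

# Object-level chain: decision -> position -> cheese.
g.add_edges_from([("D", "X"), ("X", "U")], level="object")

# Each mechanism governs its object-level variable.
for v in ["D", "X", "U"]:
    g.add_edge(f"{v}~", v, level="mechanism-to-object")

# Mechanism-to-mechanism edges: the mouse's policy D~ depends on the
# iciness mechanism X~ and the cheese-distribution mechanism U~.
g.add_edge("X~", "D~", level="mechanism", terminal=False)
g.add_edge("U~", "D~", level="mechanism", terminal=True)  # the blue terminal edge

terminal_edges = [(a, b) for a, b, attrs in g.edges(data=True) if attrs.get("terminal")]
print("Terminal mechanism edges:", terminal_edges)  # [('U~', 'D~')]
```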
Causal discovery of agents
Causal discovery infers a causal graph from experiments involving interventions. In particular, one can discover an arrow from a variable A to a variable B by experimentally intervening on A and checking whether B responds, even when all other variables are held fixed.
The first algorithm uses this technique to discover a mechanized causal graph.
Algorithm 1 obtains as input intervention data from the system (mouse and cheese environment) and outputs a mechanized causal graph using causal discovery. See the paper for more information.
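The core intervention test is easy to sketch in isolation: intervene on a candidate cause, hold everything else fixed, and check whether the candidate effect responds. The snippet below applies that idea to the mouse example, asking whether the policy mechanism responds to an intervention on the cheese-distribution mechanism. The adapted_policy stand-in and the probability values are assumptions for illustration, not part of Algorithm 1 itself.

```python
# A sketch of the intervention test behind causal discovery: set a candidate
# cause to different values and check whether the candidate effect changes.

def adapted_policy(cheese_right_prob):
    # Stand-in for "re-training" the mouse under a given cheese distribution.
    return "right" if cheese_right_prob >= 0.5 else "left"

def responds(candidate_cause_values, effect_of):
    # Does the downstream quantity change as we intervene on the cause,
    # holding everything else fixed?
    return len({effect_of(v) for v in candidate_cause_values}) > 1

# Intervene on the cheese-distribution mechanism: mostly right vs mostly left.
# The policy changes, so causal discovery would draw the edge U~ -> D~.
print(responds([0.9, 0.1], adapted_policy))  # True
```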
The second algorithm transforms this mechanized causal graph into a game graph.
Algorithm 2 takes as input a mechanized causal graph and maps it to a game graph. An ingoing terminal edge indicates a decision, and an outgoing one indicates a utility.
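The labelling rule in the caption can be sketched directly: read off decisions from ingoing terminal edges and utilities from outgoing ones. The snippet below applies it to the single terminal edge U~ → D~ from the mouse example; the trailing-~ naming convention is ours, and the full Algorithm 2 in the paper also reconstructs the rest of the game graph.

```python
# A sketch of the decision/utility labelling rule used when mapping a
# mechanized causal graph to a game graph: an ingoing terminal edge on a
# variable's mechanism marks a decision, an outgoing one marks a utility.

def classify(terminal_mechanism_edges):
    decisions, utilities = set(), set()
    for src, dst in terminal_mechanism_edges:
        utilities.add(src.rstrip("~"))   # outgoing terminal edge -> utility
        decisions.add(dst.rstrip("~"))   # ingoing terminal edge -> decision
    return decisions, utilities

decisions, utilities = classify([("U~", "D~")])  # mouse example
print("decisions:", decisions)  # {'D'}
print("utilities:", utilities)  # {'U'}
```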
In summary, Algorithm 1 followed by Algorithm 2 allows us to discover agents from causal experiments and express them using CIDs.
The third algorithm converts a game graph into a mechanized causal graph, allowing us to translate between the game and mechanized causal graph representations under some additional assumptions.
Algorithm 3 takes as input a game graph and maps it to a mechanized causal graph. A decision indicates an ingoing terminal edge, and a utility indicates an outgoing terminal edge.
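Going the other way, the caption's convention can also be sketched: starting from a game graph's decisions and utilities, add a mechanism node for every variable and mark utility-to-decision mechanism edges as terminal. This sketch only shows the terminal-edge bookkeeping; the full Algorithm 3 in the paper determines which mechanism edges actually exist under its additional assumptions, which the naive all-pairs linking below does not attempt.

```python
# A sketch of the reverse direction: from a game graph to the skeleton of a
# mechanized causal graph. Only the terminal-edge convention is illustrated;
# naively linking every utility mechanism to every decision mechanism is an
# over-simplification of Algorithm 3.
import networkx as nx

def game_to_mechanized(object_edges, decisions, utilities):
    g = nx.DiGraph()
    g.add_edges_from(object_edges, level="object")
    variables = {v for edge in object_edges for v in edge}
    for v in variables:
        g.add_edge(f"{v}~", v, level="mechanism-to-object")
    for u in utilities:
        for d in decisions:
            g.add_edge(f"{u}~", f"{d}~", level="mechanism", terminal=True)
    return g

mech = game_to_mechanized([("D", "X"), ("X", "U")], decisions={"D"}, utilities={"U"})
print([(a, b) for a, b, attrs in mech.edges(data=True) if attrs.get("terminal")])
# [('U~', 'D~')]
```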
Better safety tools for modeling AI agents
We proposed the first formal causal definition of agents. Grounded in causal discovery, our key insight is that agents are systems that adapt their behavior in response to changes in how their actions influence the world. Indeed, our Algorithms 1 and 2 describe a precise experimental process that can help assess whether a system contains an agent.
Interest in causal modeling of AI systems is growing rapidly, and our research grounds this modeling in causal discovery experiments. Our paper demonstrates the potential of our approach by improving the safety analysis of several example AI systems, and shows that causality is a useful framework for discovering whether an agent is present in a system.
Want to know more? Check out our paper. Feedback and comments are most welcome.