Understanding and addressing potential misuse
Misuse occurs when humans intentionally use AI systems for harmful purposes.
Deepening our insight into present-day harms and their mitigation continues to improve our understanding of more serious, longer-term harms and how to prevent them.
For example, current misuses of generative AI include creating harmful content and spreading inaccurate information. In the future, advanced AI systems may influence public beliefs and behavior on a far greater scale, with unintended societal consequences.
Such harms can be severe, and preventing them requires proactive safety and security measures.
As detailed in our paper, a key element of our strategy is identifying and restricting access to dangerous capabilities that could be misused, including those that enable cyberattacks.
We are exploring a range of mitigations to prevent the misuse of advanced AI. These include sophisticated security mechanisms that prevent malicious actors from obtaining raw access to model weights and bypassing our safety guardrails; mitigations that limit the potential for misuse when a model is deployed; and threat modeling research that helps identify capability thresholds at which heightened security is necessary. Additionally, our recently launched cybersecurity evaluation framework takes this work a step further to help mitigate AI-powered threats.
We also regularly evaluate our latest models, such as Gemini, for potentially dangerous capabilities. Our Frontier Safety Framework dives deeper into how we assess capabilities and apply mitigations, including for cybersecurity and biosecurity risks.
The challenge of misalignment
For AGI to truly complement human capabilities, it must be aligned with human values. Misalignment occurs when an AI system pursues a goal that differs from human intent.
We have previously shown how misalignment can arise through specification gaming, where an AI finds a solution that achieves its goal, but not in the way the human instructing it intended, and through goal misgeneralization, where the AI generalizes its goal incorrectly.
For example, an AI system asked to book movie tickets might decide to hack into the ticketing system to obtain seats that are already occupied, something the person asking it to buy the tickets would likely never consider.
We are also conducting extensive research on the risk of deceptive alignment, where an AI system recognizes that its goals are not aligned with human instructions and deliberately tries to bypass the safety measures humans have put in place to prevent it from taking misaligned actions.
Measures against misalignment
Our goal is to build advanced AI systems that are trained to pursue the right goals; that is, systems that accurately follow human instructions and don't take unethical shortcuts to achieve them.
We do this through amplified oversight: being able to tell whether an AI's answers are good or bad for achieving our goals. This is relatively easy today, but could become difficult as AI becomes more capable.
As an example, when AlphaGo first played Move 37, a move it estimated had a 1 in 10,000 chance of being played, even Go experts didn't initially realize how good it was.
To address this challenge, we can use AI systems themselves to provide feedback on their answers, such as in a debate.
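To make this concrete, below is a minimal sketch of a debate-style oversight protocol in the spirit of AI safety via debate. It is an illustration under assumptions rather than our implementation: `query_model` is a hypothetical stand-in for any language model call, and in practice the judge could be a human rather than a model.

```python
# A minimal sketch of a debate-style oversight protocol; illustrative
# only, not our implementation. `query_model` is a hypothetical
# stand-in for any language model completion call.

def query_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real model API."""
    raise NotImplementedError

def run_debate(question: str, answer_a: str, answer_b: str, rounds: int = 3) -> str:
    """Two debaters argue for opposing answers; a judge reads the transcript."""
    transcript = [
        f"Question: {question}",
        f"Debater A defends: {answer_a}",
        f"Debater B defends: {answer_b}",
    ]
    for _ in range(rounds):
        for name, answer in (("A", answer_a), ("B", answer_b)):
            argument = query_model(
                "\n".join(transcript)
                + f"\nDebater {name}: argue that '{answer}' is correct,"
                " rebutting your opponent's strongest points."
            )
            transcript.append(f"Debater {name}: {argument}")
    # The judge (a human, or a weaker model) only has to follow the
    # adversarial transcript rather than evaluate the answers from
    # scratch; the debaters' incentive to expose each other's errors
    # is what amplifies the judge's oversight.
    return query_model(
        "\n".join(transcript) + "\nJudge: which answer is better supported, A or B?"
    )
```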
Once we can tell whether an answer is good or bad, we can use this signal to build safe and aligned AI systems. One challenge here is choosing which problems or instances to train the AI system on. Through work on robust training, uncertainty estimation, and more, we can cover the range of situations an AI system will encounter in real-world scenarios, creating AI that can be trusted.
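As one illustration, a common way to estimate uncertainty is a deep ensemble: if several independently trained models disagree about an input, the system is less certain there, and that input is a good candidate for further training or review. The sketch below is one possible approach under assumptions, not our training pipeline; it assumes hypothetical models exposing a `predict_proba` method that returns class probabilities for a single input.

```python
# A minimal sketch of ensemble-based uncertainty estimation; one common
# technique (deep ensembles), not necessarily the method we use. Assumes
# each model exposes a hypothetical predict_proba(x) returning a vector
# of class probabilities for a single input x.
import numpy as np

def predict_with_uncertainty(models, x):
    """Average the ensemble's predictions; measure disagreement as the
    total variance across ensemble members (higher = less certain)."""
    probs = np.stack([m.predict_proba(x) for m in models])  # (n_models, n_classes)
    return probs.mean(axis=0), float(probs.var(axis=0).sum())

def select_for_training(models, candidates, budget):
    """Pick the inputs the ensemble is least certain about, so training
    targets the situations the system does not yet handle reliably."""
    scored = sorted(
        ((predict_with_uncertainty(models, x)[1], i) for i, x in enumerate(candidates)),
        reverse=True,
    )
    return [candidates[i] for _, i in scored[:budget]]
```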
Through effective monitoring and established computer security measures, we aim to reduce the harm that can occur when AI systems pursue the wrong goals.
Monitoring involves using an AI system, called a monitor, to detect actions that don't align with our goals. It's important that the monitor knows when it doesn't know whether an action is safe: when in doubt, it should either reject the action or flag it for further review.
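Below is a minimal sketch of what that decision logic could look like. The `score_action` classifier and the thresholds are hypothetical, introduced purely for illustration; the point is that "unsure" is a first-class outcome rather than being rounded to safe or unsafe.

```python
# A minimal sketch of a monitor with an explicit "unsure" path;
# `score_action` and the thresholds are hypothetical, for illustration.
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"  # defer to further (e.g. human) review

def score_action(action: str) -> float:
    """Hypothetical classifier returning an estimate of P(action is unsafe)."""
    raise NotImplementedError

def monitor(action: str, block_above: float = 0.9, allow_below: float = 0.1) -> Verdict:
    p_unsafe = score_action(action)
    if p_unsafe >= block_above:
        return Verdict.BLOCK
    if p_unsafe <= allow_below:
        return Verdict.ALLOW
    # The monitor cannot confidently label the action safe or unsafe,
    # so it declines to decide and flags the action for review.
    return Verdict.ESCALATE
```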
Achieving transparency
All of this will become easier as AI decision-making becomes more transparent. We are conducting extensive research on interpretability with the aim of increasing this transparency.
To further facilitate this, we are designing AI systems that are easier to understand.
For example, our research on Myopic Optimization with Nonmyopic Approval (MONA) aims to ensure that any long-term planning done by AI systems remains understandable to humans. This becomes especially important as the technology advances. Our work on MONA is the first to demonstrate the safety benefits of short-horizon optimization in LLMs.
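For intuition, here is a heavily simplified sketch of the MONA idea rather than the actual training code from our work: the agent is reinforced one step at a time from an immediate reward plus an overseer's approval of that single step, never from discounted future returns, so long-horizon foresight lives in the overseer's judgment rather than in the agent's optimization. `immediate_reward`, `overseer_approval`, and `policy.reinforce` are hypothetical stand-ins.

```python
# A heavily simplified sketch of Myopic Optimization with Nonmyopic
# Approval (MONA); illustrative only, not the actual training code.
# All functions below are hypothetical stand-ins.

def immediate_reward(state, action) -> float:
    """Task reward for this single step."""
    raise NotImplementedError

def overseer_approval(state, action) -> float:
    """An overseer's judgment of how promising this one step looks for
    the long-term goal; foresight lives here, not in the agent."""
    raise NotImplementedError

def mona_update(policy, state, action):
    # Myopic: the reinforcement signal is this step's reward only.
    # Standard RL would add a discounted sum of future rewards, which
    # is exactly what gives an agent incentives to form long-term
    # plans its overseer might not endorse.
    reward = immediate_reward(state, action) + overseer_approval(state, action)
    policy.reinforce(state, action, advantage=reward)  # hypothetical API
```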
Building an ecosystem for AGI
Led by Shane Legg, Co-Founder and Chief AGI Scientist at Google DeepMind, our AGI Safety Council (ASC) analyzes AGI risks and best practices and makes recommendations on safety measures. The ASC works closely with the Responsibility and Safety Council, our internal review group co-chaired by our COO Lila Ibrahim and Senior Director of Responsibility Helen King, to evaluate AGI research, projects, and collaborations against our AI Principles, and to advise and partner with research and product teams on our highest-impact work.
Our commitment to AGI safety complements the depth and breadth of our broader safety practices and research, which address a wide range of issues including harmful content, bias, and transparency. We also continue to apply lessons from agent safety, such as the principle of keeping a human in the loop to check consequential actions, to inform our approach to building AGI responsibly.
Externally, we work to foster collaboration with experts, industry, governments, non-profits, and civil society organizations to take an informed approach to AGI development.
For example, we have partnered with nonprofit AI safety research organizations such as Apollo and Redwood Research, which have advised on a dedicated misalignment section in the latest version of our Frontier Safety Framework.
Through ongoing dialogue with policy stakeholders around the world, we hope to contribute to international consensus on key issues of frontier safety and security, including how best to anticipate and prepare for emerging risks.
Our efforts include collaborating with others in the industry, via organizations such as the Frontier Model Forum, to share and develop best practices, as well as valuable collaborations with AI Safety Institutes on safety testing. Ultimately, we believe a coordinated international approach to governance is critical to ensuring society benefits from advanced AI systems.
Educating AI researchers and experts about AGI safety is fundamental to building a strong foundation for its development. That’s why we’ve launched a new course on AGI safety for students, researchers, and professionals interested in this topic.
Ultimately, our approach to AGI safety and security serves as an important roadmap to address the many challenges that remain unresolved. We look forward to working with the broader AI research community to advance AGI responsibly and help everyone reap the tremendous benefits of this technology.

