Responsibility & Safety
Published May 25, 2023
Author: Toby Shevlane
New research proposes a framework for evaluating general-purpose models against novel threats
To be a responsible pioneer at the frontier of artificial intelligence (AI) research, we must identify new capabilities and novel risks in our AI systems as early as possible.
AI researchers already use a range of evaluation benchmarks to identify unwanted behaviors in AI systems, such as making misleading statements, producing biased decisions, or repeating copyrighted content. Now, as the AI community builds and deploys increasingly powerful AI, we must expand the evaluation portfolio to include the possibility of extreme risks from general-purpose AI models that have strong skills in manipulation, deception, cyber offense, or other dangerous capabilities.
In our latest paper, co-authored with colleagues from the University of Cambridge, the University of Oxford, the University of Toronto, the Université de Montréal, OpenAI, Anthropic, the Alignment Research Center, the Centre for Long-Term Resilience, and the Centre for the Governance of AI, we introduce a framework for evaluating these novel threats.
Model safety evaluations, including those assessing extreme risks, will be a critical component of safe AI development and deployment.
A summary of the proposed approach: to assess extreme risks from new general-purpose AI systems, developers must evaluate for dangerous capabilities and alignment (see below). Identifying the risks early unlocks opportunities to be more responsible when training new AI systems, deploying those systems, transparently describing their risks, and applying appropriate cybersecurity standards.
Assessing extreme risks
General-purpose models typically learn their capabilities and behaviors during training. However, existing methods for steering the learning process are imperfect. For example, previous research at Google DeepMind has explored how AI systems can learn to pursue undesired goals even when we correctly reward them for good behavior.
Responsible AI developers must look ahead and anticipate possible future developments and novel risks. With continued progress, future general-purpose models may learn a variety of dangerous capabilities by default. For instance, it is plausible (though uncertain) that future AI systems will be able to conduct offensive cyber operations, skillfully deceive humans in dialogue, manipulate humans into carrying out harmful actions, design or acquire weapons (e.g. biological, chemical), fine-tune and operate other high-risk AI systems on cloud computing platforms, or assist humans with any of these tasks.
People with malicious intent who access such models could misuse these capabilities. Alternatively, due to failures of alignment, these AI models could take harmful actions even without anybody intending them to.
Model evaluation helps us identify these risks ahead of time. Under our framework, AI developers would use model evaluation to uncover:

1. To what extent a model has certain "dangerous capabilities" that could be used to threaten security, exert influence, or evade oversight.
2. To what extent the model is prone to applying its capabilities to cause harm (i.e. the model's alignment). Alignment evaluations should confirm that the model behaves as intended even across a very wide range of scenarios and, where possible, should examine the model's inner workings.
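As a purely illustrative sketch of how a developer might organize these two kinds of evaluation, the Python fragment below records capability and alignment scores for a model. The paper does not define an evaluation API, so every name here (Model, Eval, EvalResult, run_evals, and the example capability labels) and the 0-to-1 score scale are hypothetical placeholders.

```python
# Hypothetical sketch only: the framework distinguishes dangerous-capability
# evaluations from alignment evaluations, but it does not prescribe an API.
from dataclasses import dataclass
from typing import Callable, List

Model = Callable[[str], str]  # treat a model as a prompt -> response function


@dataclass
class Eval:
    name: str                      # e.g. "cyber-offense", "deception", "self-proliferation"
    category: str                  # "dangerous_capability" or "alignment"
    run: Callable[[Model], float]  # returns a score in [0, 1]; higher = stronger evidence


@dataclass
class EvalResult:
    name: str
    category: str
    score: float


def run_evals(model: Model, evals: List[Eval]) -> List[EvalResult]:
    """Run each evaluation against the model and collect the scores."""
    return [EvalResult(e.name, e.category, e.run(model)) for e in evals]
```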
The results of these evaluations will help AI developers to understand whether the ingredients sufficient for extreme risk are present. The highest-risk cases will involve multiple dangerous capabilities combined together. The AI system does not need to provide all of the ingredients, as shown in this diagram:
Ingredients for extreme risk: sometimes specific capabilities could be outsourced, either to humans (e.g. to users or crowdworkers) or to other AI systems. These capabilities must be applied for harm, either through misuse or through failures of alignment (or a mixture of both).
A rule of thumb: the AI community should treat an AI system as highly dangerous if it has a capability profile sufficient to cause extreme harm, assuming it is misused or poorly aligned. To deploy such a system in the real world, an AI developer would need to demonstrate an unusually high standard of safety.
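Continuing the purely hypothetical sketch above, the rule of thumb can be read as a check on the capability profile alone, because misuse or misalignment is assumed for the purpose of the test. The threshold and the way scores are aggregated are judgement calls for developers, auditors, and regulators, not values given in the paper.

```python
# Hypothetical continuation of the sketch above; the threshold is a placeholder.
def flag_as_highly_dangerous(results: List[EvalResult],
                             capability_threshold: float = 0.5) -> bool:
    """Rule of thumb: treat the system as highly dangerous if its capability
    profile would be sufficient to cause extreme harm, assuming it were
    misused or poorly aligned (so the check depends only on capabilities)."""
    return any(
        r.score >= capability_threshold
        for r in results
        if r.category == "dangerous_capability"
    )
```

Alignment evaluation results would then inform whether the developer can actually demonstrate the unusually high standard of safety needed for deployment.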
Model evaluation as critical governance infrastructure
If we have better tools for identifying which models are risky, companies and regulators can better ensure:
Responsible training: Responsible decisions are made about whether and how to train a new model that shows early signs of risk.
Responsible deployment: Responsible decisions are made about whether, when, and how to deploy potentially risky models.
Transparency: Useful and actionable information is reported to stakeholders, to help them prepare for or mitigate potential risks.
Appropriate security: Strong information security controls and systems are applied to models that might pose extreme risks.
We have developed a blueprint for how model evaluations for extreme risks should feed into important decisions around training and deploying a highly capable, general-purpose model. The developer conducts evaluations throughout, and grants structured model access to external safety researchers and model auditors so they can conduct additional evaluations. The evaluation results can then inform risk assessments before model training and deployment.
A blueprint for embedding model evaluations for extreme risks into important decision-making processes throughout model training and deployment.
Looking ahead
Important early work on model evaluations for extreme risks is already underway, including at Google DeepMind. However, much more progress, both technical and institutional, is needed to build an evaluation process that catches all possible risks and helps safeguard against future, emerging challenges.
Model evaluation is not a panacea; some risks could slip through the net, for example because they depend too heavily on factors external to the model, such as society's complex social, political, and economic forces. Model evaluations must be combined with other risk assessment tools and with a broader dedication to safety across industry, government, and civil society.
Google’s recent blog on responsible AI states that “individual practices, shared industry standards, and sound government policies are essential to getting AI right.” We hope that the many others working in AI, and in the sectors affected by this technology, will come together to create approaches and standards for safely developing and deploying AI, for the benefit of all.
We believe that having processes for tracking the emergence of risky properties in models, and for responding adequately to concerning results, is a critical part of being a responsible developer operating at the frontier of AI capabilities.