Adjusting evaluation for adaptive attacks
Our baseline mitigations showed promise against basic non-adaptive attacks, significantly reducing attack success rates. However, malicious attackers are increasingly using adaptive attacks that are specifically designed to evolve and adapt to the ART to evade defenses under test.
Baseline defenses such as spotlight and self-reflection were successful, but became less effective against adaptive attacks that learned how to cope with and avoid static defensive approaches.
This finding illustrates an important point. Relying on defenses that have only been tested against static attacks provides a false sense of security. To achieve robust security, it is important to evaluate adaptive attacks that evolve in response to potential defenses.
Build inherent resilience through model reinforcement
External defenses and system-level guardrails are important, but so is strengthening the inherent ability of AI models to recognize and ignore malicious instructions embedded in data. This process is called “model reinforcement.”
We fine-tuned Gemini based on a large dataset of realistic scenarios where ART generates effective indirect prompt injections targeting sensitive information. This caused Gemini to ignore the malicious embedded instructions and follow the original user request, thereby providing only the correct and safe response that it was supposed to give. This allows the model to inherently understand how to process compromised information as it evolves over time as part of an adaptive attack.
This model enhancement significantly improved Gemini’s ability to identify and ignore injected instructions, reducing the attack success rate. And importantly, it does not significantly affect the model’s performance on regular tasks.
It is important to note that no model is completely immune to model enhancement. Determined attackers may also discover new vulnerabilities. Therefore, our goal is to make attacks harder, more costly, and more complex for attackers.
Adopting a holistic approach to model security
Protecting AI models from attacks such as indirect prompt injection requires “defense in depth” using multiple layers of protection, including model hardening, input/output checks (like classifiers), and system-level guardrails. Fighting indirect prompted injection is an important way to implement agent security principles and guidelines for responsible agent development.
Protecting advanced AI systems from specific evolving threats, such as indirect prompt injection, is an ongoing process. This requires pursuing continuous and adaptive evaluation, improving existing defenses and exploring new ones, and building resilience inherent in the model itself. By layering defenses and continuously learning, AI assistants like Gemini can continue to be extremely helpful and reliable.
For more information about Gemini’s built-in defenses and recommendations for evaluating the robustness of your model using more difficult and adaptive attacks, see the GDM white paper Lessons from Gemini’s Defenses Against Indirect Prompt Injection.