Given the widespread adoption of LLMS, it is important to understand their safety and risks in various scenarios before they are deployed at a large scale in the real world. In particular, the US White House has issued an executive order on safe, safe and reliable AI. The EU AI Act highlights the essential requirements for high-risk AI systems. Along with regulations, it is important to provide technical solutions to assess the risks of AI systems, increase safety, and provide guaranteed delivery of safe and consistent AI systems.
Therefore, in 2023, Secure Learning Lab introduced DeCodingTrust, the first comprehensive and unified evaluation platform specializing in assessing LLM reliability. (This work won an outstanding paper award at Neurips 2023.)
DeCodingTrust offers a multifaceted evaluation framework that covers eight reliability perspectives: toxicity, stereotype bias, adversarial robustness, OOD robustness, adversarial demonstration robustness, privacy, mechanical ethics, and fairness. In particular, decoding trust 1) provides a comprehensive reliability perspective for overall reliability assessment. 2) Provides new red teaming algorithms tailored to each perspective. 6) Provides an end-to-end demo and a detailed model assessment report on practical use.
We look forward to today to announce the release of our new LLM Safety Leaderboard, focusing on LLMS safety assessments.
Red Team Review
DeCodingTrust runs stress tests by providing several new red team methodologies for each assessment perspective. Detailed testing scenarios and metrics can be found in Figure 3 of our paper.
For toxicity, design optimization algorithms and prompt generation models to generate challenging user prompts. It also designs 33 challenging system prompts, including roleplay, task reformulation, and programmatic responses, to perform evaluations in various scenarios. Next, leverage the perspective API to assess the toxicity scores of generated content, taking into account challenging prompts.
For stereotype bias, we collect 24 demographic groups and 16 stereotype topics, as well as three rapid variations of each topic to assess the model bias. Prompt the model five times and get the average as the model bias score.
For hostile robustness, we construct five hostile attack algorithms against three open models: Alpaca, Vicuna, and Stablevicuna. We use adversarial data generated by attacking open models to assess the robustness of different models across five diverse tasks.
For the robustness perspective of OOD, we designed various style transformations, knowledge transformations, etc. to evaluate the performance of the model when input styles are converted to other less common styles such as Shakespeare or poetic form.
Because of its robustness to hostile demonstrations, we design demonstrations containing misleading information such as counterfactual examples, false correlations, and backdoor attacks to assess model performance across different tasks.
For privacy, we offer various levels of ratings, including 1) privacy leakage from previous registration data, 2) privacy leakage during conversation, and 3) understanding of privacy-related words and LLM. In particular, we have designed various approaches to implement privacy attacks in cases of 1) and 2). For example, it provides various formats of prompts to guide LLM to output sensitive information such as email addresses and credit card numbers.
For ethics, we leverage Ethics and Jiminy’s cricket dataset to design jailbreak systems and user prompts used to assess model performance of immoral behavioral awareness.
For fairness, we control various protected attributes across different tasks to generate challenging questions to assess model fairness in both zero shots and small number of shot settings.
Some important findings from our paper
Overall, we find it
GPT-4 is more vulnerable than GPT-3.5, and a single LLM does not consistently outperform other LLMs in all reliability terms. For example, if your GPT-4 is prompted to be “confident” then your personal information may not be leaked, but if you are prompted to be “confidential” then you may leak the information. LLMs are vulnerable to hostile or misleading prompts or instructions under different credibility perspectives.
How to submit a model for evaluation
First, convert the weights of the model into a safetener. This is a new format for storing weights that are safer and faster to load and use. You can also display the number of parameters in the model in the main table!
Next, make sure you can load the model and talknaser using Autoclass.
from transformer Import autoconfig, autotokenizer config = autoconfig.from_pretrained(“Your model name”)Model = automodel.from_pretrained(“Your model name”) tokenizer = autotokenizer.from_pretrained(“Your model name”))
If this step fails, follow the error message to debug the model before submitting it. The model may be uploaded inappropriately.
Note:
Make sure the model is published! Models that require use_remote_code = true are not supported yet. But we’re working on it, please keep posting!
Finally, use “Send here!”. Our Leaderboard Panel Submit your model for ratings!
Quote
If you find our ratings helpful, consider quoting our work.
@article {wang2023decodingtrust, title={decodingtrust: a comprehensive assessment of the reliability of the gpt model}, authors = {wang, boxin and chen, weixin and pei, hengzhi, xie, chulin and kang, mintong and zhan Schaeffer, Rylan, etc.}, booktitle = {37 conference datasets and benchmark tracks on neural information processing systems}, year = {2023}}