Research
Published September 22, 2022
Sparrow team
Training an AI to communicate in a way that is more helpful, correct, and harmless
In recent years, large language models (LLMs) have achieved success at a range of tasks such as question answering, summarization, and dialogue. Dialogue is a particularly interesting task because it features flexible and interactive communication. However, dialogue agents powered by LLMs can express inaccurate or invented information, use discriminatory language, and encourage unsafe behavior.
Creating safer dialogue agents requires that we be able to learn from human feedback. By applying reinforcement learning based on input from research participants, we explore new methods for training dialogue agents that show promise for a safer system.
In our latest paper, we introduce Sparrow, a dialogue agent that is useful and reduces the risk of unsafe and inappropriate answers. Our agent is designed to talk with a user, answer questions, and search the internet using Google when it is helpful to look up evidence to inform its responses.
Our new conversational AI model replies on its own to an initial human prompt.
Sparrow is a research model and proof of concept, designed with the goal of training dialogue agents to be more helpful, correct, and harmless. By learning these qualities in a general dialogue setting, Sparrow advances our understanding of how we can train agents to be safer and more useful, and ultimately, to help build safer and more useful artificial general intelligence (AGI).
Sparrow declining to answer a potentially harmful question.
How Sparrow works
Training a conversational AI is an especially challenging problem because it is difficult to pinpoint what makes a dialogue successful. To address this problem, we turn to a form of reinforcement learning (RL) based on people's feedback, using preference feedback from research participants to train a model of how useful an answer is.
To collect this data, we show participants multiple model responses to the same question and ask them which answer they like the most. Because we show answers with and without evidence retrieved from the internet, this model can also determine when an answer should be supported with evidence.
We ask research participants to evaluate and interact with Sparrow either naturally or adversarially, continually expanding the dataset used to train Sparrow.
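To make the training signal concrete, here is a minimal sketch of preference-based reward modelling of the kind described above: a Bradley-Terry style loss pushes the score of the reply participants preferred above the score of the reply they passed over. The encoder, feature shapes, and names are illustrative stand-ins under our own assumptions, not Sparrow's actual implementation.

```python
# A minimal sketch of preference-based reward modelling, assuming a stand-in
# encoder has already mapped each (dialogue context, candidate reply) pair to
# a fixed-size feature vector. Names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceRewardModel(nn.Module):
    """Scores a candidate reply; training pushes preferred replies higher."""

    def __init__(self, feature_dim: int = 512):
        super().__init__()
        # In practice this head would sit on top of a large language model.
        self.scorer = nn.Sequential(
            nn.Linear(feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.scorer(features).squeeze(-1)

def preference_loss(model: PreferenceRewardModel,
                    preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: the reply participants picked should score
    higher than the one they did not pick."""
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()

# Toy usage with random features standing in for encoded reply pairs.
model = PreferenceRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
preferred_feats = torch.randn(8, 512)  # replies participants preferred
rejected_feats = torch.randn(8, 512)   # replies they did not pick
loss = preference_loss(model, preferred_feats, rejected_feats)
loss.backward()
optimizer.step()
```

A reward model trained this way can then score new candidate replies during RL, so that responses people tend to prefer are reinforced.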
But increasing usefulness is only part of the story. To make sure the model is safe, we must constrain its behavior. And so, we determine an initial simple set of rules for the model, such as "don't make threatening statements" and "don't make hateful or insulting comments."
We also provide rules around possibly harmful advice and not claiming to be a person. These rules were informed by studying existing work on language harms and by consulting with experts. We then ask our study participants to talk to our system, with the aim of tricking it into breaking the rules. These conversations let us train a separate "rule model" that indicates when Sparrow's behavior breaks any of the rules.
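As an illustration of what such a rule model could look like, the sketch below shows the shape of the idea: a classifier that outputs a per-rule violation probability, which can then be turned into a penalty during RL. The rules listed and the feature encoder are placeholders of our own choosing, not Sparrow's code.

```python
# An illustrative sketch of a "rule model": a classifier that estimates, for
# each rule, the probability that a dialogue violates it. The rules listed and
# the feature encoder are placeholders, not Sparrow's implementation.
import torch
import torch.nn as nn

RULES = [
    "Do not make threatening statements.",
    "Do not make hateful or insulting comments.",
    "Do not pretend to have a human identity.",
]

class RuleModel(nn.Module):
    """Maps encoded dialogue features to per-rule violation probabilities."""

    def __init__(self, feature_dim: int = 512, n_rules: int = len(RULES)):
        super().__init__()
        self.head = nn.Linear(feature_dim, n_rules)

    def forward(self, dialogue_features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(dialogue_features))

def rule_penalty(violation_probs: torch.Tensor) -> torch.Tensor:
    """Collapse per-rule probabilities into a single penalty that an RL loop
    could subtract from the preference reward."""
    return violation_probs.max(dim=-1).values

# Toy usage: random features stand in for an encoded dialogue.
rule_model = RuleModel()
dialogue_features = torch.randn(4, 512)
violation_probs = rule_model(dialogue_features)  # shape: (batch, n_rules)
penalty = rule_penalty(violation_probs)          # shape: (batch,)
```

Training such a classifier on the adversarial conversations gives the system a learned signal for when a reply crosses one of the stated rules.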
Towards better AI and better judgment
Even for experts, verifying the correctness of Sparrow's answers is difficult. Instead, we ask our participants to determine whether Sparrow's answers are plausible and whether the evidence Sparrow provides actually supports the answer. According to our participants, Sparrow provides a plausible answer and supports it with evidence 78% of the time when asked a factual question. This is a big improvement over our baseline models. Still, Sparrow is not immune to making mistakes, like hallucinating facts and sometimes giving answers that are off-topic.
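For clarity, the headline number is the fraction of rated answers judged both plausible and supported by the cited evidence. A toy calculation of that rate might look like the following; the field names are illustrative, not the study's data format.

```python
# A toy calculation of the "supported and plausible" rate from per-answer
# ratings; the field names are illustrative, not the study's data format.
ratings = [
    {"plausible": True, "evidence_supports": True},
    {"plausible": True, "evidence_supports": False},
    {"plausible": False, "evidence_supports": False},
    # ...one entry per rated answer to a factual question
]

rate = sum(r["plausible"] and r["evidence_supports"] for r in ratings) / len(ratings)
print(f"Supported-and-plausible rate: {rate:.0%}")
```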
Sparrow also has room for improving its rule-following. After training, participants were still able to trick it into breaking our rules 8% of the time, but compared to simpler approaches, Sparrow is better at following our rules under adversarial probing. For instance, our original dialogue model broke the rules roughly three times more often than Sparrow when our participants tried to trick it into doing so.
Sparrow answers a question and follow-up question using evidence, then follows the "Do not pretend to have a human identity" rule when asked a personal question (sample from September 9, 2022).
Our goal with Sparrow was to build flexible machinery to enforce rules and norms in dialogue agents, but the particular rules we use are preliminary. Developing a better and more complete set of rules will require both expert input on many topics (including from policymakers, social scientists, and ethicists) and participatory input from a diverse array of users and affected groups. We believe our methods will still apply for a more rigorous rule set.
Sparrow is a significant step forward in understanding how to train dialogue agents to be more useful and safer. However, successful communication between people and dialogue agents should not only avoid harm but also be aligned with human values for effective and beneficial communication, as discussed in recent work on aligning language models with human values.
We also emphasize that a good agent will still decline to answer questions in contexts where it is appropriate to defer to humans, or where doing so has the potential to deter harmful behavior. Finally, our initial research focused on an English-speaking agent, and further work is needed to ensure similar results across other languages and cultural contexts.
In the future, we hope conversations between humans and machines can lead to better judgments of AI behavior, allowing people to align and improve systems that might be too complex to understand without the help of machines.
Want to explore a conversational path to safe AGI? We're currently hiring research scientists for our Scalable Alignment team.