In high-stakes situations, whether in a hospital or on a game show, it is safer to say “I don’t know” than to answer incorrectly. Doctors, game show contestants, and standardized test takers understand this, but most artificial intelligence applications would rather give a potentially wrong answer than admit uncertainty.
Computer scientists at Johns Hopkins University believe they have a solution: a new method that lets AI models spend more time thinking through problems and uses confidence scores to determine when the AI should say “I don’t know” rather than risk a wrong answer.
The research team will present its findings at the 63rd Annual Meeting of the Association for Computational Linguistics, held in Vienna, Austria, from July 27 to August 1.
“Making a system think longer means it will give more correct answers and more incorrect answers.”
William Jurayj
PhD student, Whiting School of Engineering
“It all started when we saw that cutting-edge large language models spend more time thinking to solve harder problems. So we wondered: Can this additional thinking time also help these models determine whether a problem has been solved correctly, so they can report that to the user?” says first author William Jurayj, a PhD student studying computer science who is affiliated with the Whiting School of Engineering’s Center for Language and Speech Processing.
To investigate, the team had large language models generate reasoning chains of different lengths while answering difficult math problems, then measured how chain length affected both the model’s final answer and its confidence in that answer. The researchers accepted a model’s response only when its confidence exceeded a given threshold.
They found that thinking longer generally improves model accuracy and confidence. But even with ample time to deliberate, models can still make wild guesses and give wrong answers. In fact, the researchers found that when they set a high bar for confidence and let models think even longer, the models’ accuracy actually decreased.
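To make the setup concrete, the sketch below shows thresholded answering: a model answers only when its confidence clears a chosen threshold and otherwise abstains. This is an illustration only; `generate_with_confidence`, the thinking budget, and the threshold value are placeholders, not the team’s actual code.

```python
# Minimal sketch of confidence-thresholded answering (illustrative only).
# `generate_with_confidence` is a hypothetical stand-in for running a
# reasoning model with a fixed thinking budget and scoring its own answer.

from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelOutput:
    answer: str
    confidence: float  # in [0, 1], e.g. derived from answer log-probabilities


def generate_with_confidence(question: str, thinking_budget: int) -> ModelOutput:
    """Placeholder: run the model with `thinking_budget` reasoning tokens
    and return its final answer together with a confidence score."""
    raise NotImplementedError


def answer_or_abstain(question: str, thinking_budget: int,
                      threshold: float = 0.9) -> Optional[str]:
    """Return the model's answer only if it is confident enough;
    otherwise return None, i.e. say "I don't know"."""
    output = generate_with_confidence(question, thinking_budget)
    if output.confidence >= threshold:
        return output.answer
    return None  # abstain rather than risk a wrong answer
```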
“This happens because answer accuracy is only part of a system’s performance,” Jurayj explains. “Making a system think longer means it will provide more correct answers and more incorrect answers. In some settings, the extra correct answers are worth the risk. But in other, high-stakes settings, they may not be.”
Motivated by this discovery, the team proposed three “odds” settings with different penalties for wrong answers: exam odds, where there is no penalty for a wrong answer; Jeopardy! odds, where wrong answers are penalized at the same rate that correct answers are rewarded; and high-stakes odds, where a wrong answer is penalized far more heavily than a correct answer is rewarded.
They found that under stricter odds, a model should decline to answer a question if, after spending its computational budget, it is not sufficiently confident in its answer. At higher confidence thresholds, this means more questions go unanswered, but that is not necessarily a bad thing.
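As a rough illustration of how these settings change the decision to answer, the sketch below frames the choice as an expected-score calculation: answer only when the expected reward of a correct answer outweighs the expected penalty of a wrong one. The reward and penalty values are assumptions chosen to mirror the three settings described above, not numbers taken from the paper.

```python
# Illustrative expected-score rule for deciding whether to answer or abstain.
# The reward/penalty values below are assumptions mirroring the three odds
# settings described in the article, not values from the paper itself.

ODDS_SETTINGS = {
    "exam":        {"reward": 1.0, "penalty": 0.0},   # no cost for a wrong answer
    "jeopardy":    {"reward": 1.0, "penalty": 1.0},   # wrong costs as much as right earns
    "high_stakes": {"reward": 1.0, "penalty": 10.0},  # wrong answers are far more costly
}


def should_answer(confidence: float, setting: str) -> bool:
    """Answer only if the expected score of answering beats abstaining (score 0)."""
    odds = ODDS_SETTINGS[setting]
    expected = confidence * odds["reward"] - (1.0 - confidence) * odds["penalty"]
    return expected > 0.0


# Example: a model that is 80% confident should answer under exam or
# Jeopardy!-style odds, but abstain when wrong answers cost 10x as much.
for name in ODDS_SETTINGS:
    print(name, should_answer(0.8, name))
```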
“A student might find it a little frustrating to wait 10 minutes only to learn that she needs to solve the math problem herself because the AI model is not sure of the answer,” Jurayj says. “But in high-stakes settings, that is far preferable to waiting five minutes for an answer that looks correct but is not.”
Now the team is encouraging the broader AI research community to report models’ question-answering performance, including when the models decline to answer, under exam and Jeopardy! odds, so that everyone can benefit from AI with better-calibrated confidence.
“We hope the research community will accept our invitation to report performance in settings with non-zero costs for incorrect answers, as this will naturally motivate the development of better methods for quantifying uncertainty,” Jurayj says.
Additional authors of this work include graduate student Jeffrey Cheng and Benjamin Van Durme.