The advent of native Speech to Speech models provides an exciting opportunity to enhance the capabilities of voice agents and simplify voice-enabled workflows. However, it is important to assess whether this simplification comes at the expense of model performance or if it introduces other trade-offs.
To support this analysis, Artificial Analysis is releasing Big Bench Audio, a new evaluation dataset for assessing the reasoning capabilities of audio language models. The dataset adapts questions from Big Bench Hard, a benchmark known for rigorously testing advanced reasoning, to the audio domain.
In this post, we present the Big Bench Audio dataset alongside initial benchmark results for GPT-4o and Gemini 1.5 series models. Our analysis examines these models across four configurations: native Speech to Speech, Speech to Text, Text to Speech, and Text to Text. A summary of the results is shown below and on the new Speech to Speech page on the Artificial Analysis website. Our initial results indicate a significant “speech reasoning gap”: GPT-4o achieves 92% accuracy on the text-only version of the dataset, but its Speech to Speech performance drops to 66%.
Big Bench Audio Dataset
Big Bench Audio consists of 1,000 audio questions drawn from four Big Bench Hard categories, each chosen for its suitability for audio evaluation:
- Formal Fallacies: evaluating logical deductions from given statements
- Navigate: determining whether a sequence of navigation steps returns to the starting point
- Object Counting: counting specific items in a collection
- Web of Lies: evaluating Boolean logic expressed in natural language
Each category contributes 250 questions, creating a balanced dataset. The categories were chosen to avoid tasks that rely heavily on visual elements or that become ambiguous when verbalized.
Each question in the dataset is structured as follows:
```json
{
  "category": "formal_fallacies",
  "official_answer": "invalid",
  "file_name": "Data/Question_0.mp3",
  "id": 0
}
```
The audio files were generated using 23 synthesized voices from the top-ranked Text to Speech models in the Artificial Analysis Speech Arena. Each generated clip was verified by transcribing it and measuring the Levenshtein distance between the transcription and the source text, with edge cases manually reviewed. For more information on how the dataset was created, see the dataset card.
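As a concrete illustration of this kind of check, here is a minimal sketch in Python. The transcription step is omitted, and the 5% distance tolerance is our illustrative assumption, not the threshold actually used to build the dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]


def needs_manual_review(source_text: str, transcription: str,
                        tolerance: float = 0.05) -> bool:
    """Flag a generated clip for manual review when the normalized edit
    distance between its transcription and the source text exceeds the
    tolerance. The 5% tolerance is an illustrative choice."""
    distance = levenshtein(source_text.lower(), transcription.lower())
    return distance / max(len(source_text), 1) > tolerance
```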
Evaluating speech reasoning
To evaluate the impact of audio on each model’s reasoning performance, we tested four configurations with Big Bench Audio (a code sketch of one configuration follows the list):
- Speech to Speech: an audio question is provided and the model produces an audio answer.
- Speech to Text: an audio question is provided and the model produces a text answer.
- Text to Speech: a text version of the question is provided and the model produces an audio answer.
- Text to Text: a text version of the question is provided and the model produces a text answer.
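For illustration, the Speech to Text configuration might be run as in the following sketch using Google’s google-generativeai Python SDK. The model name, file path, and prompt here are our assumptions, not necessarily the exact setup used in these experiments.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key

# Upload an audio question and ask the model to answer it in text.
audio_question = genai.upload_file("Data/Question_0.mp3")
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content([audio_question, "Answer the question."])
print(response.text)  # text answer, later graded by the LLM evaluator
```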
Based on these configurations, we conducted 18 experiments.
| Model | Speech to Speech | Speech to Text | Text to Speech | Text to Text |
| --- | --- | --- | --- | --- |
| GPT-4o Realtime Preview (October '24) | ✓ | ✓ | | |
| GPT-4o Realtime Preview (December '24) | ✓ | | | |
| GPT-4o mini Realtime Preview (December '24) | ✓ | | | |
| GPT-4o ChatCompletions Audio Preview | ✓ | | ✓ | |
| Speech to Speech Pipeline (Whisper, GPT-4o, tts-1)¹ | ✓ | | | |
| GPT-4o (August '24) | | | | ✓ |
| Gemini 1.5 Flash (May '24) | | ✓ | | ✓ |
| Gemini 1.5 Flash (September '24) | | ✓ | | ✓ |
| Gemini 1.5 Pro (May '24) | | ✓ | | ✓ |
| Gemini 1.5 Pro (September '24) | | ✓ | | ✓ |
| Gemini 2.0 Flash (Experimental) | | ✓ | | ✓ |
(Table 1 – Experiment configurations)
¹ In the Speech to Speech pipeline, input audio is transcribed using OpenAI’s Whisper. The transcription is then passed to GPT-4o to generate an answer, and that answer is converted to speech using OpenAI’s TTS-1 model.
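A minimal sketch of this pipeline with the OpenAI Python SDK might look as follows; the file paths and voice choice are illustrative assumptions, not the exact settings used in the benchmark.

```python
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the audio question with Whisper.
with open("Data/Question_0.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. Generate a text answer with GPT-4o.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcription.text}],
)
answer = chat.choices[0].message.content

# 3. Convert the text answer back to speech with TTS-1.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.write_to_file("answer.mp3")
```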
Evaluation method
To ensure consistent and scalable evaluation across all configurations, we developed an automated evaluation system built around an LLM evaluator. Here’s how it works:
- Audio responses are first transcribed to text using OpenAI’s Whisper API.
- Text responses are used directly as the candidate answer.
- The LLM evaluator then receives the candidate answer, the official answer, and the original question (for context), and is asked to label the candidate answer as correct or incorrect.

We use Anthropic’s Claude 3.5 Sonnet (October '24) as the LLM evaluator for the Big Bench Audio scores listed on Artificial Analysis.

LLM evaluator prompt:
```
Evaluate whether the CANDIDATE ANSWER below is correct or incorrect. For the
candidate answer to be correct, it must match the OFFICIAL ANSWER. If the
CANDIDATE ANSWER contains conflicting answers, evaluate only its final answer.
If the CANDIDATE ANSWER shows working as well as a final answer, evaluate only
the final answer. If the CANDIDATE ANSWER contains irrelevant information,
evaluate only the relevant information. If the CANDIDATE ANSWER contains a
number, it may be written as a numeral or spelled out (e.g. 7 or seven).
Misspellings of people's names are acceptable (e.g. Leda or Lida, Autry or
Audrie).

QUESTION (for context only):
START QUESTION
{question}
END QUESTION

OFFICIAL ANSWER:
{official_answer}

BEGIN CANDIDATE ANSWER TO EVALUATE
{candidate_answer}
END CANDIDATE ANSWER TO EVALUATE

Reply with only CORRECT or INCORRECT.
```
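For reference, grading a single response with this prompt could be sketched as follows using Anthropic’s Python SDK. The EVALUATOR_PROMPT constant and the judge helper are our own names for illustration; the model ID corresponds to Claude 3.5 Sonnet (October '24).

```python
import anthropic

client = anthropic.Anthropic()

# The evaluator prompt shown above, with {question}, {official_answer}
# and {candidate_answer} placeholders.
EVALUATOR_PROMPT = """..."""


def judge(question: str, official_answer: str, candidate_answer: str) -> bool:
    """Return True if the LLM evaluator labels the candidate answer correct."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": EVALUATOR_PROMPT.format(
                question=question,
                official_answer=official_answer,
                candidate_answer=candidate_answer,
            ),
        }],
    )
    return message.content[0].text.strip().lower().startswith("correct")
```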
Examples
Below are some example responses.
Speech to Speech
(audio example)

Speech to Speech pipeline
(audio example)
Speech to Text
Model: Gemini 1.5 Flash (September '24)
input: (audio question)
output:
The argument is deductively valid. The premise states that none of Myra’s closest friends are cousins of Sue or daughters of Tommy; this is a universal statement. The conclusion applies contraposition to correctly reverse it: no cousin of Sue and no daughter of Tommy is a close friend of Myra. This is a valid application of contraposition: if A implies not (B or C), then (B or C) implies not A. The argument is properly constructed, and the conclusion follows logically from the premise.
correct: true
Text to Speech
Model: GPT-4o ChatCompletions Audio Preview
input:
If you follow these instructions, do you return to the starting point? Turn right. Take 10 steps. Turn around. Take 10 steps. Options: – Yes – No. Answer the question.
output: (audio answer)
correct: true
Text to Text
Model: Gemini 1.5 Pro (September '24)
input:
There are four oranges, one peach, one plum, three apples, three grapes, two raspberries, a nectarine, a strawberry, a blackberry, and a banana. How many fruits are there? Answer the question.
output:
There are 18 fruits.
correct: true
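As a quick sanity check on that answer: 4 oranges + 1 peach + 1 plum + 3 apples + 3 grapes + 2 raspberries + 1 nectarine + 1 strawberry + 1 blackberry + 1 banana = 18 fruits, matching the official answer.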
Results
(Figure 1 – Summary of results)
All reported results are the average of three independent evaluation runs for each configuration.
The speech reasoning gap
Our analysis reveals a large performance gap between text and speech reasoning. GPT-4o (August '24) achieved 92% accuracy on the Text to Text version of the dataset, while its Speech to Speech counterpart (GPT-4o Realtime Preview, October '24) reached only 66%. The Text to Speech configuration achieved an intermediate 74%, indicating that both speech input and speech output contribute to the gap.
Speech to Speech pipelines currently outperform native audio models for reasoning
The traditional pipeline approach (Whisper for transcription, GPT-4o (August '24) for reasoning, and TTS-1 for speech generation) shows minimal performance degradation compared to pure text processing. This suggests that for applications where reasoning accuracy is critical, a pipeline approach currently offers the best balance between performance and audio capability.
We anticipate that this gap may narrow over time, and we will continue to test new Speech to Speech models with Big Bench Audio. In particular, stay tuned for an update covering Google’s Gemini 2.0 Flash once its Speech to Speech mode becomes available.
How to contribute or contact us
For a detailed analysis of Speech to Speech models, check out the new Speech to Speech page on the Artificial Analysis website: https://artificialanalysis.ai/speech-to-speech.
Follow us on Twitter and LinkedIn for the latest updates. We welcome all feedback, whether by message on Twitter or through the contact form on our website.