"The real meaning is often hidden in how something is said, not just what is said."
UNSPOKEN is the first bilingual (Chinese-English) benchmark designed to evaluate metaphorical reasoning capabilities in Audio Language Models (ALMs). Unlike traditional transcription-based evaluations, UNSPOKEN challenges models to understand non-literal language by leveraging subtle acoustic cues like prosody, emotional inflection, and phonetic ambiguity.
Current ALMs excel at literal speech understanding but struggle with the nuanced world of metaphors, irony, and cultural references. UNSPOKEN reveals that even state-of-the-art models achieve only 68.9% accuracy, well below the human average of 80.9%.
- 🎯 Audio-Centric Evaluation: Grounded in actual audio, not just transcriptions
- 🌍 Bilingual Coverage: 2,764 validated QA pairs in Chinese and English
- 🧠 Multi-Dimensional Reasoning: Semantic, acoustic, and contextual understanding
- 📊 Fine-Grained Categories: 6 metaphor types (Puns, Cultural Metaphors, Irony, etc.)
- ⚡ Easy Integration: Simple API for evaluating your own ALMs
To ensure the quality and reliability of our benchmark, we adopt a three-step approach, as shown in Figure 3. First, we define metaphorical reasoning in spoken scenarios along three dimensions, establishing a clear framework to guide data collection and annotation. Second, we manually curate metaphorical segments and construct question–answer pairs, enriching distractor options with LLMs to increase task difficulty and diversity; a rigorous human filtering pass then ensures the accuracy and contextual appropriateness of the data. Finally, we categorize the dataset by metaphor type and analyze the characteristics of each category, providing insights for future research on metaphor comprehension in audio language models.
The final version of Unspoken comprises 2,764 validated question–answer pairs. To support fine-grained analysis and enable interpretable benchmarking, we categorize each item into one of six metaphor types: Pun (PU), Cultural Metaphors (CM), Irony and Contrast (IC), Implied Analogy (IA), Foreshadowing and Payoff (FP), and a general Other category. These categories were validated through assessments by domain experts and professional stand-up comedians. This rigorous, human-centric design ensures the ecological validity of the benchmark and offers a structured lens for exploring the cognitive mechanisms underlying metaphor comprehension.
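The six-way typology is what enables per-category analysis. As a sketch of how the labels can be used for a fine-grained accuracy breakdown (the column names `category` and `correct` are assumptions for illustration, not the released CSV's actual schema):

```python
import pandas as pd

# Toy stand-in for per-item evaluation results; column names are assumptions.
df = pd.DataFrame({
    "category": ["PU", "CM", "IC", "PU", "IA"],  # metaphor type codes
    "correct":  [1, 0, 1, 0, 1],                 # 1 = model answered correctly
})

# Mean accuracy per metaphor type
per_type = df.groupby("category")["correct"].mean()
print(per_type)
```

The same groupby pattern extends to per-language or per-prompting-strategy slices once those columns are available.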
```shell
git clone https://github.com/Hongru0306/UNSPOKEN.git
cd unspoken
pip install -r requirements.txt
```

Alternatively, you can set up the environment with conda:

```shell
conda create --name <env_name> --file environment.txt
```

The following code snippet demonstrates the basic evaluation setup:
```python
from eval.model import Qwen2OmniAudio
from eval.utils import run_experiment
import pandas as pd

# Load your model
model = Qwen2OmniAudio(model_path='your-model-path')

# Load dataset
df = pd.read_csv('./final_utf8.csv')

# Run evaluation
run_experiment(model, df, task='direct', path='./sliced_mp3')
### Advanced Prompting Strategies
```python
from eval.prompt import DIRECT_SINGLE_EN, COT_SINGLE_EN, XLT_SINGLE_EN

"""
Three prompting strategies available:
- Direct: Standard question-answering
- Chain-of-Thought: Step-by-step reasoning
- Cross-Lingual Transfer: Language-switching prompts
"""
```

Or run your evaluation with a single bash command:
```shell
python ./eval/main.py --model Qwen2OmniAudio --model_path '<your_model_path>' --task direct --input ./final_utf8.csv --audio_path ./sliced_mp3
```

We provide implementations for popular ALMs:
- Qwen2.5-Omni (`Qwen2OmniAudio`, see lines 273–311)
- GPT-4o Audio (`GPT4oAudioPreview`, see lines 119–167)
- Gemini Audio (`GeminiAudio`, see lines 25–114)
- Qwen2-Audio (`Qwen2Audio`, see lines 231–268)
- GLM-4 Voice (`GLM4Audio`, see lines 169–210)
- [More on schedule ...]
To deeply investigate the capabilities of ALMs in metaphorical reasoning, we conducted extensive evaluations of various mainstream open-source and closed-source models using Unspoken. Notably, we further tested and analyzed these models under four different prompting strategies and four few-shot scenarios, aiming to provide valuable insights for future research in this field. All results reported in Table 2 were obtained under zero-shot conditions without specially designed prompts.
The inclusion of few-shot examples does not enhance model performance; instead, it significantly degrades the performance of models that have undergone instruction tuning, such as Qwen2-Audio-Instruct, Baichuan-Audio-Instruct, and Kimi-Audio-Instruct. This decline can be attributed to two primary factors. First, the models exhibit limited ability to handle long input sequences. Second, instruction-tuned models have likely already acquired robust reasoning capabilities, and the addition of few-shot examples may interfere with the generalized reasoning mechanisms learned during training.
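The sequence-length pressure is easy to see from a back-of-envelope count: each few-shot exemplar contributes an audio clip plus its QA text to the context. All numbers below are illustrative assumptions, not measurements from our experiments:

```python
# Rough context-growth sketch for audio few-shot prompting.
# Both constants are illustrative assumptions, not measured values.
AUDIO_TOKENS_PER_CLIP = 750   # e.g. a ~15 s clip at ~50 audio tokens/s
QA_TEXT_TOKENS = 120          # question + options + gold answer per exemplar

def context_tokens(num_shots: int) -> int:
    # Each shot adds one clip plus its QA text; the query itself adds one more clip.
    return (num_shots + 1) * AUDIO_TOKENS_PER_CLIP + num_shots * QA_TEXT_TOKENS

for shots in (0, 1, 2, 4):
    print(shots, context_tokens(shots))
```

Even at these conservative rates, four shots multiply the audio portion of the context fivefold, which is consistent with the degradation we observe for instruction-tuned models.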
Our detailed error analysis reveals five key failure modes:
### Custom Models

```python
from eval.main import run_experiment
from eval.model import YourCustomModel

model = YourCustomModel()
results = run_experiment(model, dataset, task='cot')  # Chain-of-Thought prompting
```
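To plug in your own ALM, wrap it in a class exposing the inference interface that `run_experiment` expects; the exact required methods are defined in `eval/model.py`. A minimal sketch, assuming a single `generate(audio_path, prompt)` method is sufficient (the method name and signature are assumptions, not the repository's actual contract):

```python
class MyALMWrapper:
    """Hypothetical wrapper; check eval/model.py for the real required interface."""

    def __init__(self, model_path: str):
        self.model_path = model_path  # load your checkpoint/client here

    def generate(self, audio_path: str, prompt: str) -> str:
        # Replace this stub with a real forward pass over the audio + prompt;
        # the benchmark expects a single option letter back.
        return "A"

model = MyALMWrapper(model_path="your-model-path")
answer = model.generate("./sliced_mp3/example.mp3", "Which option best captures the pun?")
print(answer)
```

Mirroring the constructor signature of the built-in wrappers (e.g. `Qwen2OmniAudio(model_path=...)`) keeps the CLI's `--model_path` flag working unchanged.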
### Custom Prompts
```python
from eval.prompt import create_custom_prompt

prompt = create_custom_prompt(strategy="reasoning_chain",
                              language="en",
                              question_type="multiple")
```

If you use UNSPOKEN in your research, please cite our paper:
```bibtex
@inproceedings{xiao2025unspoken,
  title     = {Can Audio Language Models Listen Between the Lines? A Study on Metaphorical Reasoning via Unspoken},
  author    = {Xiao, Hongru and Li, Xiang and Pan, Duyi and Zhang, Longfei and Song, Zhixue and Han, Jiale and Lai, Songning and Chen, Wenshuo and Tang, Jing and Wang, Benyou},
  booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)},
  year      = {2025},
  isbn      = {979-8-4007-2035-2/2025/10},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3746027.3758173},
  location  = {Dublin, Ireland},
  series    = {MM '25}
}
```