diff --git a/data/xml/2025.cl.xml b/data/xml/2025.cl.xml index f3f6702fc9..428f1f264d 100644 --- a/data/xml/2025.cl.xml +++ b/data/xml/2025.cl.xml @@ -115,4 +115,135 @@ lippincott-2025-automatic + + + Computational Linguistics, Volume 51, Issue 2 - June 2025 + MIT Press +
Cambridge, MA
+ June + 2025 + cl + 51 + + + <fixed-case>D</fixed-case>i<fixed-case>B</fixed-case>i<fixed-case>MT</fixed-case>: A Gold Evaluation Benchmark for Studying Lexical Ambiguity in Machine Translation + FedericoMartelli + StefanoPerrella + NiccolòCampolungo + TinaMunda + SvetlaKoeva + CaroleTiberius + RobertoNavigli + 10.1162/coli_a_00541 + Despite the remarkable progress made in the field of Machine Translation (MT), current systems still struggle when translating ambiguous words, especially when these express infrequent meanings. In order to investigate and analyze the impact of lexical ambiguity on automatic translations, several tasks and evaluation benchmarks have been proposed over the course of the last few years. However, work in this research direction suffers from critical shortcomings. Indeed, existing evaluation datasets are not entirely manually curated, which significantly compromises their reliability. Furthermore, current literature fails to provide detailed insights into the nature of the errors produced by models translating ambiguous words, lacking a thorough manual analysis across languages. With a view to overcoming these limitations, we propose Disambiguation Biases in MT (DiBiMT), an entirely manually curated evaluation benchmark for investigating disambiguation biases in eight language combinations and assessing the ability of both commercial and non-commercial systems to handle ambiguous words. We also examine and detail the errors produced by models in this scenario by carrying out a manual error analysis in all language pairs. Additionally, we perform an extensive array of experiments aimed at studying the behavior of models when dealing with ambiguous words. Finally, we show the ineffectiveness of standard MT evaluation settings for assessing the disambiguation capabilities of systems and highlight the need for additional efforts in this research direction and ad-hoc testbeds such as DiBiMT. Our benchmark is available at: https://nlp.uniroma1.it/dibimt/. + 343–413 + 2025.cl-2.1 + martelli-etal-2025-dibimt + + + Train and Constrain: Phonologically Informed Tongue Twister Generation from Topics and Paraphrases + TylerLoakman + ChenTang + ChenghuaLin + 10.1162/coli_a_00544 + Previous work in phonologically and phonetically grounded language generation has mainly focused on domains such as puns and poetry. In this article, we present new work on the generation of English tongue twisters—a form of language that is required to be conditioned on a phoneme level to maximize sound overlap, while maintaining semantic consistency with an input topic or phrase and still being grammatically correct. We present TwisterLister, a pipeline for generating phonologically informed tongue twisters from large language models (LLMs) that we use to generate TwistList 2.0, the largest annotated dataset of tongue twisters to date, consisting of 17k+ examples from a combination of human and LLM authors. Our generation pipeline involves the use of a phonologically constrained vocabulary alongside LLM prompting to generate novel, non-derivative tongue twister examples. We additionally present the results of automatic and human evaluation of smaller models trained on our generated dataset to demonstrate the extent to which phonologically motivated language types can be generated without explicit injection of phonological knowledge. 
Additionally, we introduce a phoneme-aware constrained decoding module (PACD) that can be integrated into an autoregressive language model and demonstrate that this method generates good quality tongue twisters both with and without fine-tuning the underlying language model. We also design and implement a range of automatic metrics for the task of tongue twister generation that is phonologically motivated and captures the unique essence of tongue twisters, primarily based on phonemic edit distance (PED).1 + 415–466 + 2025.cl-2.2 + loakman-etal-2025-train + + + Eliciting and Improving the Causal Reasoning Abilities of Large Language Models with Conditional Statements + XiaoLiu + DaYin + ChenZhang + DongyanZhao + YansongFeng + 10.1162/coli_a_00548 + Causal reasoning, the ability to identify cause-and-effect relationships, is crucial in human thinking. Although large language models (LLMs) succeed in many NLP tasks, it is still challenging for them to conduct complex causal reasoning like abductive reasoning and counterfactual reasoning. Complex causal structures are rarely expressed explicitly in the text, which could make learning them challenging for LLMs. Given the fact that programming code may express causal relations more often and explicitly with conditional statements like if, we want to explore whether large language models of code (Code-LLMs) acquire better causal reasoning abilities, and whether code prompts better describe the causal structure than text prompts. Our experiments show that compared with general-purpose LLMs like Llama-2 and GPT-3, Code-LLMs like CodeLlama and Codex are significantly better in causal reasoning. Code prompts not only work well for Code-LLMs, but also help improve the performance of most general-purpose LLMs. To understand why code prompts are effective, we intervene on the prompts from different aspects, and discover that the programming structure is crucial in code prompt design, while models are more robust towards format perturbations. We further explore whether exposing models with more code with conditional statements aids in enhancing causal reasoning abilities. We finetune LLMs on such code corpus, and find their performance improves when prompted with either code prompts or text prompts.1 + 467–504 + 2025.cl-2.3 + liu-etal-2025-eliciting + + + Investigating Idiomaticity in Word Representations + WeiHe + Tiago KramerVieira + MarcosGarcia + CarolinaScarton + MarcoIdiart + AlineVillavicencio + 10.1162/coli_a_00546 + Idiomatic expressions are an integral part of human languages, often used to express complex ideas in compressed or conventional ways (e.g., eager beaver as a keen and enthusiastic person). However, their interpretations may not be straightforwardly linked to the meanings of their individual components in isolation and this may have an impact for compositional approaches. In this article, we investigate to what extent word representation models are able to go beyond compositional word combinations and capture multiword expression idiomaticity and some of the expected properties related to idiomatic meanings. We focus on noun compounds of varying levels of idiomaticity in two languages (English and Portuguese), presenting a dataset of minimal pairs containing human idiomaticity judgments for each noun compound at both type and token levels, their paraphrases and their occurrences in naturalistic and sense-neutral contexts, totalling 32,200 sentences. 
We propose this set of minimal pairs for evaluating how well a model captures idiomatic meanings, and define a set of fine-grained metrics of Affinity and Scaled Similarity, to determine how sensitive the models are to perturbations that may lead to changes in idiomaticity. Affinity is a comparative measure of the similarity between an experimental item, a target and a potential distractor, and Scaled Similarity incorporates a rescaling factor to magnify the meaningful similarities within the spaces defined by each specific model. The results obtained with a variety of representative and widely used models indicate that, despite superficial indications to the contrary in the form of high similarities, idiomaticity is not yet accurately represented in current models. Moreover, the performance of models with different levels of contextualization suggests that their ability to capture context is not yet able to go beyond more superficial lexical clues provided by the words and to actually incorporate the relevant semantic clues needed for idiomaticity. By proposing model-agnostic measures for assessing the ability of models to capture idiomaticity, this article contributes to determining limitations in the handling of non-compositional structures, which is one of the directions that needs to be considered for more natural, accurate, and robust language understanding. The source code and additional materials related to this paper are available at our GitHub repository.1 + 505–555 + 2025.cl-2.4 + he-etal-2025-investigating + + + Dotless <fixed-case>A</fixed-case>rabic Text for Natural Language Processing + Maged S.Al-Shaibani + IrfanAhmad + 10.1162/coli_a_00535 + This article introduces a novel representation of Arabic text as an alternative approach for Arabic NLP, inspired by the dotless script of ancient Arabic. We explored this representation through extensive analysis on various text corpora, differing in size and domain, and tokenized using multiple tokenization techniques. Furthermore, we examined the information density of this representation and compared it with the standard dotted Arabic text using text entropy analysis. Utilizing parallel corpora, we also drew comparisons between Arabic and English text analysis to gain additional insights. Our investigation extended to various upstream and downstream NLP tasks, including language modeling, text classification, sequence labeling, and machine translation, examining the implications of both the representations. Specifically, we performed seven different downstream tasks using various tokenization schemes comparing the standard dotted text with dotless Arabic text representations. Performance using both the representations was comparable across different tokenizations. However, dotless representation achieves these results with significant reduction in vocabulary sizes, and in some scenarios showing reduction of up to 50%. Additionally, we present a system that restores dots to the dotless Arabic text. This system is useful for tasks that require Arabic texts as output. + 557–598 + 2025.cl-2.5 + al-shaibani-ahmad-2025-dotless + + + <fixed-case>LMLPA</fixed-case>: Language Model Linguistic Personality Assessment + JingyaoZheng + XianWang + SimoHosio + XiaoxianXu + Lik-HangLee + 10.1162/coli_a_00550 + Large language models (LLMs) are increasingly used in everyday life and research. One of the most common use cases is conversational interactions, enabled by the language generation capabilities of LLMs. 
Just as between two humans, a conversation between an LLM-powered entity and a human depends on the personality of the conversants. However, measuring the personality of a given LLM is currently a challenge. This article introduces the Language Model Linguistic Personality Assessment (LMLPA), a system designed to evaluate the linguistic personalities of LLMs. Our system helps to understand LLMs’ language generation capabilities by quantitatively assessing the distinct personality traits reflected in their linguistic outputs. Unlike traditional human-centric psychometrics, the LMLPA adapts a personality assessment questionnaire, specifically the Big Five Inventory, to align with the operational capabilities of LLMs, and also incorporates the findings from previous language-based personality measurement literature. To mitigate sensitivity to the order of options, our questionnaire is designed to be open-ended, resulting in textual answers. Thus, the Artificial Intelligence (AI) rater is needed to transform ambiguous personality information from text responses into clear numerical indicators of personality traits. Utilizing Principal Component Analysis and reliability validation methods, our findings demonstrate that LLMs possess distinct personality traits that can be effectively quantified by the LMLPA. This research contributes to Human-Centered AI and Computational Linguistics, providing a robust framework for future studies to refine AI personality assessments and expand their applications in multiple areas, including education and manufacturing. + 599–640 + 2025.cl-2.6 + zheng-etal-2025-lmlpa + + + Kallini et al. (2024) Do Not Compare Impossible Languages with Constituency-based Ones + TimHunter + 10.1162/coli_a_00554 + A central goal of linguistic theory is to find a precise characterization of the notion “possible human language”, in the form of a computational device that is capable of describing all and only the languages that can be acquired by a typically developing human child. The success of recent large language models (LLMs) in NLP applications arguably raises the possibility that LLMs might be computational devices that meet this goal. This would only be the case if, in addition to succeeding in learning human languages, LLMs struggle to learn “impossible” human languages. Kallini et al. (2024) conducted experiments aiming to test this by training GPT-2 on a variety of synthetic languages, and found that it learns some more successfully than others. They present these asymmetries as support for the idea that LLMs’ inductive biases align with what is regarded as “possible” for human languages, but the most significant comparison has a confound that makes this conclusion unwarranted. + 641–650 + 2025.cl-2.7 + hunter-2025-kallini + + + Language Models and Externalism: A Reply to Mandelkern and <fixed-case>L</fixed-case>inzen + GaryOstertag + 10.1162/coli_a_00551 + Do texts generated by language models (LMs) refer? Mandelkern and Linzen (2024) argue that externalist principles point to an affirmative conclusion. What grounds reference, according to their externalism, is a term’s “natural history”. For example, ‘water’ refers to H2O among English speakers, and not to the phenomenally indistinguishable chemical XYZ, because H2O, and not XYZ, is implicated in the natural history of ‘water’. Appealing to the literature on contrastive explanation, I show that a term’s natural history does not generally ground its referential properties. 
Thus, Mandelkern and Linzen’s quick route to the referentiality of LM-generated texts fails. + 651–659 + 2025.cl-2.8 + ostertag-2025-language + + + <fixed-case>LLM</fixed-case>-based <fixed-case>NLG</fixed-case> Evaluation: Current Status and Challenges + MingqiGao + XinyuHu + XunjianYin + JieRuan + XiaoPu + XiaojunWan + 10.1162/coli_a_00561 + Evaluating natural language generation (NLG) is a vital but challenging problem in natural language processing. Traditional evaluation metrics mainly capturing content (e.g., n-gram) overlap between system outputs and references are far from satisfactory, and large language models (LLMs) such as ChatGPT have demonstrated great potential in NLG evaluation in recent years. Various automatic evaluation methods based on LLMs have been proposed, including metrics derived from LLMs, prompting LLMs, fine-tuning LLMs, and human–LLM collaborative evaluation. In this survey, we first give a taxonomy of LLM-based NLG evaluation methods, and discuss their pros and cons, respectively. Lastly, we discuss several open problems in this area and point out future research directions. + 661–687 + 2025.cl-2.9 + gao-etal-2025-llm + + + Socially Aware Language Technologies: Perspectives and Practices + DiyiYang + DirkHovy + DavidJurgens + BarbaraPlank + 10.1162/coli_a_00556 + Language technologies have advanced substantially, particularly with the introduction of large language models. However, these advancements can exacerbate several issues that models have traditionally faced, including bias, evaluation, and risk. In this perspective piece, we argue that many of these issues share a common core: a lack of awareness of the social factors, interactions, and implications of the social environment in which NLP operates. We call this social awareness. While NLP is improving at addressing linguistic issues, there has been relatively limited progress in incorporating social awareness into models to work in all situations for all users. Integrating social awareness into NLP will improve the naturalness, usefulness, and safety of applications while also opening up new applications. Today, we are only at the start of a new, important era in the field. + 689–703 + 2025.cl-2.10 + yang-etal-2025-socially + +
diff --git a/data/xml/2025.tacl.xml b/data/xml/2025.tacl.xml index 403239f3bc..6ae00058ff 100644 --- a/data/xml/2025.tacl.xml +++ b/data/xml/2025.tacl.xml @@ -105,6 +105,18 @@ 2025.tacl-1.6 ahuja-etal-2025-learning + + Navigating Cultural Chasms: Exploring and Unlocking the Cultural <fixed-case>POV</fixed-case> of Text-To-Image Models + MorVentura + EyalBen-David + AnnaKorhonen + RoiReichart + 10.1162/tacl_a_00732 + Text-To-Image (TTI) models, such as DALL-E and StableDiffusion, have demonstrated remarkable prompt-based image generation capabilities. Multilingual encoders may have a substantial impact on the cultural agency of these models, as language is a conduit of culture. In this study, we explore the cultural perception embedded in TTI models by characterizing culture across three tiers: cultural dimensions, cultural domains, and cultural concepts. Based on this ontology, we derive prompt templates to unlock the cultural knowledge in TTI models, and propose a comprehensive suite of evaluation techniques, including intrinsic evaluations using the CLIP space, extrinsic evaluations with a Visual-Question-Answer models and human assessments, to evaluate the cultural content of TTI-generated images. To bolster our research, we introduce the CulText2I dataset, based on six diverse TTI models and spanning ten languages. Our experiments provide insights regarding Do, What, Which, and How research questions about the nature of cultural encoding in TTI models, paving the way for cross-cultural applications of these models.1 + 142–166 + 2025.tacl-1.10 + ventura-etal-2025-navigating + A Confidence-based Acquisition Model for Self-supervised Active Learning and Label Correction Carel vanNiekerk @@ -151,5 +163,306 @@ 2025.tacl-1.9 strobl-etal-2025-transformers + + Benchmarking Uncertainty Quantification Methods for Large Language Models with <fixed-case>LM</fixed-case>-Polygraph + RomanVashurin + EkaterinaFadeeva + ArtemVazhentsev + LyudmilaRvanova + DaniilVasilev + AkimTsvigun + SergeyPetrakov + RuiXing + AbdelrahmanSadallah + KirillGrishchenkov + AlexanderPanchenko + TimothyBaldwin + PreslavNakov + MaximPanov + ArtemShelmanov + 10.1162/tacl_a_00737 + The rapid proliferation of large language models (LLMs) has stimulated researchers to seek effective and efficient approaches to deal with LLM hallucinations and low-quality outputs. Uncertainty quantification (UQ) is a key element of machine learning applications in dealing with such challenges. However, research to date on UQ for LLMs has been fragmented in terms of techniques and evaluation methodologies. In this work, we address this issue by introducing a novel benchmark that implements a collection of state-of-the-art UQ baselines and offers an environment for controllable and consistent evaluation of novel UQ techniques over various text generation tasks. Our benchmark also supports the assessment of confidence normalization methods in terms of their ability to provide interpretable scores. Using our benchmark, we conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches. + 220–248 + 2025.tacl-1.11 + vashurin-etal-2025-benchmarking + + + Supervised Neural Topic Modeling with Label Alignment + RuihaoChen + HegangChen + YuyinLu + YanghuiRao + ChunjiangZhu + 10.1162/tacl_a_00738 + Neural topic modeling is a scalable automated technique for text data mining. 
In various downstream tasks of topic modeling, it is preferred that the discovered topics well align with labels. However, due to the lack of guidance from labels, unsupervised neural topic models are less powerful in this situation. Existing supervised neural topic models often adopt a label-free prior to generate the latent document-topic distributions and use them to predict the labels and thus achieve label-topic alignment indirectly. Such a mechanism faces the following issues: 1) The label-free prior leads to topics blending the latent patterns of multiple labels; and 2) One is unable to intuitively identify the explicit relationships between labels and the discovered topics. To tackle these problems, we develop a novel supervised neural topic model which utilizes a chain-structured graphical model with a label-conditioned prior. Soft indicators are introduced to explicitly construct the label-topic relationships. To obtain well-organized label-topic relationships, we formalize an entropy-regularized optimal transport problem on the embedding space and model them as the transport plan. Moreover, our proposed method can be flexibly integrated with most existing unsupervised neural topic models. Experimental results on multiple datasets demonstrate that our model can greatly enhance the alignment between labels and topics while maintaining good topic quality. + 249–263 + 2025.tacl-1.12 + chen-etal-2025-supervised + + + From Robustness to Improved Generalization and Calibration in Pre-trained Language Models + JosipJukić + JanŠnajder + 10.1162/tacl_a_00739 + Enforcing representation smoothness in pre-trained language models (PLMs) through Jacobian and Hessian regularization provides an effective approach for enhancing both robustness and generalization. Although such regularization methods have proven effective in computer vision, their application in natural language processing, where PLM inputs are derived from a discrete domain, poses unique challenges. We introduce JacHess, a regularization approach for PLMs that minimizes the norms of the Jacobian and Hessian matrices in intermediate representations, using embeddings as substitutes for discrete token inputs. JacHess supports dual-mode regularization, alternating between fine-tuning with labeled data and regularization with unlabeled data. We evaluate JacHess on the GLUE benchmark and demonstrate that it consistently and significantly improves in-distribution generalization and enhances performance under domain shift. Across diverse PLMs, JacHess outperforms comparable representation-based regularization methods and unregularized fine-tuning, while also improving model calibration. Our findings, coupled with a computationally efficient estimator for the Jacobian and Hessian norms, position JacHess as a robust and widely applicable solution for enhancing PLM performance. + 264–280 + 2025.tacl-1.13 + jukic-snajder-2025-robustness + + + How “Real” is Your Real-Time Simultaneous Speech-to-Text Translation System? + SaraPapi + PeterPolák + DominikMacháček + OndřejBojar + 10.1162/tacl_a_00740 + Simultaneous speech-to-text translation (SimulST) translates source-language speech into target-language text concurrently with the speaker’s speech, ensuring low latency for better user comprehension. Despite its intended application to unbounded speech, most research has focused on human pre-segmented speech, simplifying the task and overlooking significant challenges. 
This narrow focus, coupled with widespread terminological inconsistencies, is limiting the applicability of research outcomes to real-world applications, ultimately hindering progress in the field. Our extensive literature review of 110 papers not only reveals these critical issues in current research but also serves as the foundation for our key contributions. We: 1) define the steps and core components of a SimulST system, proposing a standardized terminology and taxonomy; 2) conduct a thorough analysis of community trends; and 3) offer concrete recommendations and future directions to bridge the gaps in existing literature, from evaluation frameworks to system architectures, for advancing the field towards more realistic and effective SimulST solutions. + 281–313 + 2025.tacl-1.14 + papi-etal-2025-real + + + Self-Rationalization in the Wild: A Large-scale Out-of-Distribution Evaluation on <fixed-case>NLI</fixed-case>-related tasks + JingYang + MaxGlockner + AndersonRocha + IrynaGurevych + 10.1162/tacl_a_00741 + Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data, making it challenging to train models for explainable predictions. To address this, we investigate how to use existing explanation datasets for self-rationalization and evaluate models’ out-of-distribution (OOD) performance. We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods. The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization. For the generated explanation evaluation, we conduct a human study on 13 selected models and study its correlation with the Acceptability score (T5-11B) and three other LLM-based reference-free metrics. Human evaluation shows that the Acceptability score correlates most strongly with human judgments, demonstrating its effectiveness in evaluating free-text explanations. Our findings reveal: 1) few annotated examples effectively adapt models for OOD explanation generation; 2) compared to sample selection strategies, fine-tuning data source has a larger impact on OOD performance; and 3) models with higher label prediction accuracy tend to produce better explanations, as reflected by higher Acceptability scores.1 + 314–342 + 2025.tacl-1.15 + yang-etal-2025-self + + + <fixed-case>DEAR</fixed-case>: Disentangled Event-Agnostic Representation Learning for Early Fake News Detection + XiaoPu + HaoWu + XiuliBi + YuWu + XinboGao + 10.1162/tacl_a_00743 + Detecting fake news early is challenging due to the absence of labeled articles for emerging events in training data. To address this, we propose a Disentangled Event-Agnostic Representation (DEAR) learning approach. Our method begins with a BERT-based adaptive multi-grained semantic encoder that captures hierarchical and comprehensive textual representations of the input news content. To effectively separate latent authenticity-related and event-specific knowledge within the news content, we employ a disentanglement architecture. To further enhance the decoupling effect, we introduce a cross-perturbation mechanism that perturbs authenticity-related representation with the event-specific one, and vice versa, deriving a robust and discerning authenticity-related signal. 
Additionally, we implement a refinement learning scheme to minimize potential interactions between two decoupled representations, ensuring that the authenticity signal remains strong and unaffected by event-specific details. Experimental results demonstrate that our approach effectively mitigates the impact of event-specific influence, outperforming state-of-the-art methods. In particular, it achieves a 6.0% improvement in accuracy on the PHEME dataset over MDDA, a similar approach that decouples latent content and style knowledge, in scenarios involving articles from unseen events different from the topics of the training set. + 343–356 + 2025.tacl-1.16 + pu-etal-2025-dear + + + <fixed-case>LLM</fixed-case> Reading Tea Leaves: Automatically Evaluating Topic Models with Large Language Models + XiaohaoYang + HeZhao + DinhPhung + WrayBuntine + LanDu + 10.1162/tacl_a_00744 + Topic modeling has been a widely used tool for unsupervised text analysis. However, comprehensive evaluations of a topic model remain challenging. Existing evaluation methods are either less comparable across different models (e.g., perplexity) or focus on only one specific aspect of a model (e.g., topic quality or document representation quality) at a time, which is insufficient to reflect the overall model performance. In this paper, we propose WALM (Word Agreement with Language Model), a new evaluation method for topic modeling that considers the semantic quality of document representations and topics in a joint manner, leveraging the power of Large Language Models (LLMs). With extensive experiments involving different types of topic models, WALM is shown to align with human judgment and can serve as a complementary evaluation method to the existing ones, bringing a new perspective to topic modeling. Our software package is available at https://github.com/Xiaohao-Yang/Topic_Model_Evaluation. + 357–375 + 2025.tacl-1.17 + yang-etal-2025-llm + + + The <fixed-case>T</fixed-case>hai <fixed-case>U</fixed-case>niversal <fixed-case>D</fixed-case>ependency Treebank + PanyutSriwirote + Wei QiLeong + CharinPolpanumas + SanthawatThanyawong + William ChandraTjhi + WiroteAroonmanakun + Attapol T.Rutherford + 10.1162/tacl_a_00745 + Automatic dependency parsing of Thai sentences has been underexplored, as evidenced by the lack of large Thai dependency treebanks with complete dependency structures and the lack of a published evaluation of state-of-the-art models, especially transformer-based parsers. In this work, we addressed these gaps by introducing the Thai Universal Dependency Treebank (TUD), a new Thai treebank consisting of 3,627 trees annotated according to the Universal Dependencies (UD) framework. We then benchmarked 92 dependency parsing models that incorporate pretrained transformers on Thai-PUD and our TUD, achieving state-of-the-art results and shedding light on the optimal model components for Thai dependency parsing. Our error analysis of the models also reveals that polyfunctional words, serial verb construction, and lack of rich morphosyntactic features present main challenges for Thai dependency parsing. + 376–391 + 2025.tacl-1.18 + sriwirote-etal-2025-thai + + + Diverse <fixed-case>AI</fixed-case> Feedback For Large Language Model Alignment + TianshuYu + Ting-EnLin + YuchuanWu + MinYang + FeiHuang + YongbinLi + 10.1162/tacl_a_00746 + Recent advances in large language models (LLMs) focus on aligning models with human values to minimize harmful content. 
However, existing methods often rely on a single type of feedback, such as preferences, annotated labels, or critiques, which can lead to overfitting and suboptimal performance. In this paper, we propose Diverse AI Feedback (DAIF), a novel approach that integrates three types of feedback—critique, refinement, and preference—tailored to tasks of varying uncertainty levels. Through an analysis of information gain, we show that critique feedback is most effective for low-uncertainty tasks, refinement feedback for medium-uncertainty tasks, and preference feedback for high-uncertainty tasks. Training with this diversified feedback reduces overfitting and improves alignment. Experimental results across three tasks—question answering, dialog generation, and text summarization—demonstrate that DAIF outperforms traditional methods relying on a single feedback type.1 + 392–407 + 2025.tacl-1.19 + yu-etal-2025-diverse + + + How Much Semantic Information is Available in Large Language Model Tokens? + David A.Haslett + Zhenguang G.Cai + 10.1162/tacl_a_00747 + Large language models segment many words into multiple tokens, and companies that make those models claim that meaningful subword tokens are essential. To investigate whether subword tokens bear meaning, we segmented tens of thousands of words from each of 41 languages according to three generations of GPT tokenizers. We found that words sharing tokens are more semantically similar than expected by chance or expected from length alone, that tokens capture morphological information even when they don’t look like morphemes, and that tokens capture more information than is explained by morphology. In languages that use a script other than the Latin alphabet, GPT-4 tokens are uninformative, but GPT-4o has improved this situation. These results suggest that comparing tokens to morphemes overlooks the wider variety of semantic information available in word form and that standard tokenization methods successfully capture much of that information. + 408–423 + 2025.tacl-1.20 + haslett-cai-2025-much + + + Phonetic Reconstruction of the Consonant System of Middle <fixed-case>C</fixed-case>hinese via Mixed Integer Optimization + XiaoxiLuo + WeiweiSun + 10.1162/tacl_a_00742 + This paper is concerned with phonetic reconstruction of the consonant system of Middle Chinese. We propose to cast the problem as a Mixed Integer Programming problem, which is able to automatically explore homophonic information from ancient rhyme dictionaries and phonetic information from modern Chinese dialects, the descendants of Middle Chinese. Numerical evaluation on a wide range of synthetic and real data demonstrates the effectiveness and robustness of the new method. We apply the method to information from Guǎngyùn and 20 modern Chinese dialects to obtain a new phonetic reconstruction result. A linguistically motivated discussion of this result is also provided.1 + 424–441 + 2025.tacl-1.21 + luo-sun-2025-phonetic + + + Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment + KarelD’Oosterlinck + WinnieXu + ChrisDevelder + ThomasDemeester + AmanpreetSingh + ChristopherPotts + DouweKiela + ShikibMehri + 10.1162/tacl_a_00748 + Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. 
We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code and datasets are available. + 442–460 + 2025.tacl-1.22 + d-oosterlinck-etal-2025-anchored + + + <fixed-case>TANQ</fixed-case>: An Open Domain Dataset of Table Answered Questions + MubasharaAkhtar + ChenxiPang + AndreeaMarzoca + YaseminAltun + Julian MartinEisenschlos + 10.1162/tacl_a_00749 + Language models, potentially augmented with tool usage such as retrieval, are becoming the go-to means of answering questions. Understanding and answering questions in real-world settings often requires retrieving information from different sources, processing and aggregating data to extract insights, and presenting complex findings in form of structured artifacts such as novel tables, charts, or infographics. In this paper, we introduce TANQ,1 the first open-domain question answering dataset where the answers require building tables from information across multiple sources. We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed book setups. Our best-performing baseline, Gemini Flash, reaches an overall F1 score of 60.7, lagging behind human performance by 12.3 points. We analyze baselines’ performance across different dataset attributes such as different skills required for this task, including multi-hop reasoning, math operations, and unit conversions. We further discuss common failures in model-generated answers, suggesting that TANQ is a complex task with many challenges ahead. + 461–480 + 2025.tacl-1.23 + akhtar-etal-2025-tanq + + + Few-Shot Multilingual Open-Domain <fixed-case>QA</fixed-case> from Five Examples + FanJiang + TomDrummond + TrevorCohn + 10.1162/tacl_a_00750 + Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods for underrepresented languages. We introduce a few-shot learning approach to synthesize large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, FsModQA, significantly outperforms existing few-shot and supervised baselines in MLODQA and cross-lingual and monolingual retrieval. 
We further show our method can be extended for effective zero-shot adaptation to new languages through a cross-lingual prompting strategy with only English-supervised data, making it a general and applicable solution for MLODQA tasks without costly large-scale annotation. + 481–504 + 2025.tacl-1.24 + jiang-etal-2025-shot + + + Navigating the Landscape of Hint Generation Research: From the Past to the Future + AnubhavJangra + JamshidMozafari + AdamJatowt + SmarandaMuresan + 10.1162/tacl_a_00751 + Digital education has gained popularity in the last decade, especially after the COVID-19 pandemic. With the improving capabilities of large language models to reason and communicate with users, envisioning intelligent tutoring systems that can facilitate self-learning is not very far-fetched. One integral component to fulfill this vision is the ability to give accurate and effective feedback via hints to scaffold the learning process. In this survey article, we present a comprehensive review of prior research on hint generation, aiming to bridge the gap between research in education and cognitive science, and research in AI and Natural Language Processing. Informed by our findings, we propose a formal definition of the hint generation task, and discuss the roadmap of building an effective hint generation system aligned with the formal definition, including open challenges, future directions and ethical considerations. + 505–528 + 2025.tacl-1.25 + jangra-etal-2025-navigating + + + Know Your Limits: A Survey of Abstention in Large Language Models + BingbingWen + JihanYao + ShangbinFeng + ChenjunXu + YuliaTsvetkov + BillHowe + Lucy LuWang + 10.1162/tacl_a_00754 + Abstention, the refusal of large language models (LLMs) to provide an answer, is increasingly recognized for its potential to mitigate hallucinations and enhance safety in LLM systems. In this survey, we introduce a framework to examine abstention from three perspectives: the query, the model, and human values. We organize the literature on abstention methods, benchmarks, and evaluation metrics using this framework, and discuss merits and limitations of prior work. We further identify and motivate areas for future research, such as whether abstention can be achieved as a meta-capability that transcends specific tasks or domains, and opportunities to optimize abstention abilities in specific contexts. In doing so, we aim to broaden the scope and impact of abstention methodologies in AI systems.1 + 529–556 + 2025.tacl-1.26 + wen-etal-2025-know + + + <fixed-case>T</fixed-case>axo<fixed-case>P</fixed-case>ro: A Plug-In <fixed-case>L</fixed-case>o<fixed-case>RA</fixed-case>-based Cross-Domain Method for Low-Resource Taxonomy Completion + HongyuanXu + YuhangNiu + CiyiLiu + YanlongWen + XiaojieYuan + 10.1162/tacl_a_00755 + Low-resource taxonomy completion aims to automatically insert new concepts into the existing taxonomy, in which only a few in-domain training samples are available. Recent studies have achieved considerable progress by incorporating prior knowledge from pre-trained language models (PLMs). However, these studies tend to overly rely on such knowledge and neglect the shareable knowledge across different taxonomies. In this paper, we propose TaxoPro, a plug-in LoRA-based cross-domain method, that captures shareable knowledge from the high-resource taxonomy to improve PLM-based low-resource taxonomy completion techniques. 
To prevent negative interference between domain-specific and domain-shared knowledge, TaxoPro decomposes cross-domain knowledge into domain-shared and domain-specific components, storing them using low-rank matrices (LoRA). Additionally, TaxoPro employs two auxiliary losses to regulate the flow of shareable knowledge. Experimental results demonstrate that TaxoPro improves PLM-based techniques, achieving state-of-the-art performance in completing low-resource taxonomies. Code is available at https://github.com/cyclexu/TaxoPro. + 557–576 + 2025.tacl-1.27 + xu-etal-2025-taxopro + + + Exploring Practical Gaps in Using Cross Entropy to Implement Maximum Mutual Information Criterion for Rationalization + WeiLiu + ZhiyingDeng + ZhongyuNiu + JunWang + HaozhaoWang + RuixuanLi + 10.1162/tacl_a_00758 + Rationalization is a framework that aims to build self-explanatory NLP models by extracting a subset of human-intelligible pieces of their inputting texts. It involves a cooperative game where a selector selects the most human-intelligible parts of the input as the rationale, followed by a predictor that makes predictions based on these selected rationales. Existing literature uses the cross-entropy between the model’s predictions and the ground-truth labels to measure the informativeness of the selected rationales, guiding the selector to choose better ones. In this study, we first theoretically analyze the objective of rationalization by decomposing it into two parts: the model-agnostic informativeness of the rationale candidates and the predictor’s degree of fit. We then provide various empirical evidence to support that, under this framework, the selector tends to sample from a limited small region, causing the predictor to overfit these localized areas. This results in a significant mismatch between the cross-entropy objective and the informativeness of the rationale candidates, leading to suboptimal solutions. To address this issue, we propose a simple yet effective method that introduces random vicinal1 perturbations to the selected rationale candidates. This approach broadens the predictor’s assessment to a vicinity around the selected rationale candidate. Compared to recent competitive methods, our method significantly improves rationale quality (by up to 6.6%) across six widely used classification datasets. The term “vicinal” is borrowed from vicinal risk minimization (Chapelle et al., 2000); “vicinal” means neighboring or adjacent. + 577–594 + 2025.tacl-1.28 + liu-etal-2025-exploring + + + A Comparative Approach for Auditing Multilingual Phonetic Transcript Archives + FarhanSamir + Emily P.Ahn + ShreyaPrakash + MártonSoskuthy + VeredShwartz + JianZhu + 10.1162/tacl_a_00759 + Curating datasets that span multiple languages is challenging. To make the collection more scalable, researchers often incorporate one or more imperfect classifiers in the process, like language identification models. These models, however, are prone to failure, resulting in some language partitions being unreliable for downstream tasks. We introduce a statistical test, the Preference Proportion Test, for identifying such unreliable partitions. By annotating only 20 samples for a language partition, we are able to identify systematic transcription errors for 10 language partitions in a recent large multilingual transcribed audio archive, X-IPAPack (Zhu et al., 2024). 
We find that filtering these low-quality partitions out when training models for the downstream task of phonetic transcription brings substantial benefits, most notably a 25.7% relative improvement on transcribing recordings in out-of-distribution languages. Our work contributes an effective method for auditing multilingual audio archives.1 + 595–612 + 2025.tacl-1.29 + samir-etal-2025-comparative + + + Understanding Epistemic Language with a Language-augmented <fixed-case>B</fixed-case>ayesian Theory of Mind + LanceYing + TanZhi-Xuan + LionelWong + VikashMansinghka + Joshua B.Tenenbaum + 10.1162/tacl_a_00752 + How do people understand and evaluate claims about others’ beliefs, even though these beliefs cannot be directly observed? In this paper, we introduce a cognitive model of epistemic language interpretation, grounded in Bayesian inferences about other agents’ goals, beliefs, and intentions: a language-augmented Bayesian theory-of-mind (LaBToM). By translating natural language into an epistemic “language-of-thought” with grammar-constrained LLM decoding, then evaluating these translations against the inferences produced by inverting a generative model of rational action and perception, LaBToM captures graded plausibility judgments of epistemic claims. We validate our model in an experiment where participants watch an agent navigate a maze to find keys hidden in boxes needed to reach their goal, then rate sentences about the agent’s beliefs. In contrast with multimodal LLMs (GPT-4o, Gemini Pro) and ablated models, our model correlates highly with human judgments for a wide range of expressions, including modal language, uncertainty expressions, knowledge claims, likelihood comparisons, and attributions of false belief. + 613–637 + 2025.tacl-1.30 + ying-etal-2025-understanding + + + Culturally Aware and Adapted <fixed-case>NLP</fixed-case>: A Taxonomy and a Survey of the State of the Art + Chen CeciliaLiu + IrynaGurevych + AnnaKorhonen + 10.1162/tacl_a_00760 + The surge of interest in culture in NLP has inspired much recent research, but a shared understanding of “culture” remains unclear, making it difficult to evaluate progress in this emerging area. Drawing on prior research in NLP and related fields, we propose a fine-grained taxonomy of elements in culture that can provide a systematic framework for analyzing and understanding research progress. Using the taxonomy, we survey existing resources and methods for culturally aware and adapted NLP, providing an overview of the state of the art and the research gaps that still need to be filled. + 652–689 + 2025.tacl-1.31 + liu-etal-2025-culturally + + + Sense-specific Historical Word Usage Generation + PierluigiCassotti + NinaTahmasebi + 10.1162/tacl_a_00761 + Large-scale sense-annotated corpora are important for a range of tasks but are hard to come by. Dictionaries that record and describe the vocabulary of a language often offer a small set of real-world example sentences for each sense of a word. However, on their own, these sentences are too few to be used as diachronic sense-annotated corpora. We propose a targeted strategy for training and evaluating generative models producing historically and semantically accurate word usages given any word, sense definition, and year triple. Our results demonstrate that fine-tuned models can generate usages with the same properties as real-world example sentences from a reference dictionary. 
Thus the generated usages will be suitable for training and testing computational models where large-scale sense-annotated corpora are needed but currently unavailable. + 690–708 + 2025.tacl-1.32 + cassotti-tahmasebi-2025-sense + + + <fixed-case>NLP</fixed-case> Security and Ethics, in the Wild + HeatherLent + ErickGalinkin + YiyiChen + Jens MyrupPedersen + LeonDerczynski + JohannesBjerva + 10.1162/tacl_a_00762 + As NLP models are used by a growing number of end-users, an area of increasing importance is NLP Security (NLPSec): assessing the vulnerability of models to malicious attacks and developing comprehensive countermeasures against them. While work at the intersection of NLP and cybersecurity has the potential to create safer NLP for all, accidental oversights can result in tangible harm (e.g., breaches of privacy or proliferation of malicious models). In this emerging field, however, the research ethics of NLP have not yet faced many of the long-standing conundrums pertinent to cybersecurity, until now. We thus examine contemporary works across NLPSec, and explore their engagement with cybersecurity’s ethical norms. We identify trends across the literature, ultimately finding alarming gaps on topics like harm minimization and responsible disclosure. To alleviate these concerns, we provide concrete recommendations to help NLP researchers navigate this space more ethically, bridging the gap between traditional cybersecurity and NLP ethics, which we frame as “white hat NLP”. The goal of this work is to help cultivate an intentional culture of ethical research for those working in NLP Security. + 709–743 + 2025.tacl-1.33 + lent-etal-2025-nlp +