diff --git a/bin/add_revision.py b/bin/add_revision.py index 8d15cf554e..22bb6e8f59 100755 --- a/bin/add_revision.py +++ b/bin/add_revision.py @@ -248,7 +248,7 @@ def main(args): repo.git.add(get_xml_file(args.anthology_id)) if repo.is_dirty(index=True, working_tree=True, untracked_files=True): repo.index.commit( - f"Add revision for {args.anthology_id} (closes #{args.issue})" + f"Add {change_type} for {args.anthology_id} (closes #{args.issue})" ) diff --git a/data/xml/2021.tacl.xml b/data/xml/2021.tacl.xml index 61c7cd2da1..b52db02919 100644 --- a/data/xml/2021.tacl.xml +++ b/data/xml/2021.tacl.xml @@ -966,8 +966,11 @@ 10.1162/tacl_a_00419 Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains to be concentrated on resource-rich languages like English. This work focuses on Persian language, one of the widely spoken languages in the world, and yet there are few NLU datasets available for this language. The availability of high-quality evaluation datasets is a necessity for reliable assessment of the progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark in Persian language that includes a range of language understanding tasks—reading comprehension, textual entailment, and so on. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. Additionally, we present the first results on state-of-the-art monolingual and multilingual pre-trained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.1 1147–1162 - 2021.tacl-1.68 + 2021.tacl-1.68 khashabi-etal-2021-parsinlu + + Fix author name + Author info update. What Helps Transformers Recognize Conversational Structure? Importance of Context, Punctuation, and Labels in Dialog Act Recognition diff --git a/data/xml/2023.emnlp.xml b/data/xml/2023.emnlp.xml index c6a6471009..66136c05fb 100644 --- a/data/xml/2023.emnlp.xml +++ b/data/xml/2023.emnlp.xml @@ -11083,13 +11083,14 @@ YoavTulpan 12883-12895 Online social platforms provide a bustling arena for information-sharing and for multi-party discussions. Various frameworks for dialogic discourse parsing were developed and used for the processing of discussions and for predicting the productivity of a dialogue. However, most of these frameworks are not suitable for the analysis of contentious discussions that are commonplace in many online platforms. A novel multi-label scheme for contentious dialog parsing was recently introduced by Zakharov et al. (2021). While the schema is well developed, the computational approach they provide is both naive and inefficient, as a different model (architecture) using a different representation of the input, is trained for each of the 31 tags in the annotation scheme. Moreover, all their models assume full knowledge of label collocations and context, which is unlikely in any realistic setting. In this work, we present a unified model for Non-Convergent Discourse Parsing that does not require any additional input other than the previous dialog utterances. We fine-tuned a RoBERTa backbone, combining embeddings of the utterance, the context and the labels through GRN layers and an asymmetric loss function. 
Overall, our model achieves results comparable with SOTA, without using label collocation and without training a unique architecture/model for each label. Our proposed architecture makes the labeling feasible at large scale, promoting the development of tools that deepen our understanding of discourse dynamics. - 2023.emnlp-main.796 + 2023.emnlp-main.796 tsur-tulpan-2023-deeper 10.18653/v1/2023.emnlp-main.796 We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields diff --git a/data/xml/2023.findings.xml b/data/xml/2023.findings.xml index f5142606a4..b19a2841e7 100644 --- a/data/xml/2023.findings.xml +++ b/data/xml/2023.findings.xml @@ -23608,9 +23608,11 @@ HeuiseokLim 10334-10343 Large language models (LLMs) have demonstrated impressive capabilities in natural language processing. However, their ability to establish causal relationships, particularly in the context of temporal interventions and language hallucinations, remains challenging. This paper presents CReTIHC, a novel dataset designed to test and enhance the causal reasoning abilities of LLMs. The dataset is constructed using a unique approach that incorporates elements of verbal hallucinations and temporal interventions through the reengineering of existing causal inference datasets. This transformation creates complex scenarios that push LLMs to critically evaluate the information presented and identify cause-and-effect relationships. The CReTIHC dataset serves as a pioneering tool for improving LLM’s causal inference capabilities, paving the way for a more nuanced understanding of causal relationships in natural language processing (NLP) tasks. The whole dataset is publicly accessible at: (https://github.com/ChangwooChun/CReTIHC) - 2023.findings-emnlp.693 + 2023.findings-emnlp.693 chun-etal-2023-cretihc 10.18653/v1/2023.findings-emnlp.693 + + Author info update. On the Dimensionality of Sentence Embeddings @@ -27814,9 +27816,11 @@ MichaelElhadad 15164-15172 We call into question the recently popularized method of direct model editing as a means of correcting factual errors in LLM generations. We contrast model editing with three similar but distinct approaches that pursue better defined objectives: (1) retrieval-based architectures, which decouple factual memory from inference and linguistic capabilities embodied in LLMs; (2) concept erasure methods, which aim at preventing systemic bias in generated text; and (3) attribution methods, which aim at grounding generations into identified textual sources. We argue that direct model editing cannot be trusted as a systematic remedy for the disadvantages inherent to LLMs, and while it has proven potential in improving model explainability, it opens risks by reinforcing the notion that models can be trusted for factuality. We call for cautious promotion and application of model editing as part of the LLM deployment process, and for responsibly limiting the use cases of LLMs to those not relying on editing as a critical component. - 2023.findings-emnlp.1012 + 2023.findings-emnlp.1012 pinter-elhadad-2023-emptying 10.18653/v1/2023.findings-emnlp.1012 + + Updates. A Causal View of Entity Bias in (Large) Language Models diff --git a/data/xml/2024.argmining.xml b/data/xml/2024.argmining.xml index 272b37ed81..504691e410 100644 --- a/data/xml/2024.argmining.xml +++ b/data/xml/2024.argmining.xml @@ -174,9 +174,11 @@ IrynaGurevych 130-149 Argument retrieval is the task of finding relevant arguments for a given query. 
While existing approaches rely solely on the semantic alignment of queries and arguments, this first shared task on perspective argument retrieval incorporates perspectives during retrieval, accounting for latent influences in argumentation. We present a novel multilingual dataset covering demographic and socio-cultural (socio) variables, such as age, gender, and political attitude, representing minority and majority groups in society. We distinguish between three scenarios to explore how retrieval systems consider explicitly (in both query and corpus) and implicitly (only in query) formulated perspectives. This paper provides an overview of this shared task and summarizes the results of the six submitted systems. We find substantial challenges in incorporating perspectivism, especially when aiming for personalization based solely on the text of arguments without explicitly providing socio profiles. Moreover, retrieval systems tend to be biased towards the majority group but partially mitigate bias for the female gender. While we bootstrap perspective argument retrieval, further research is essential to optimize retrieval systems to facilitate personalization and reduce polarization. - 2024.argmining-1.14 + 2024.argmining-1.14 falk-etal-2024-overview 10.18653/v1/2024.argmining-1.14 + + Corrected a typo. Sövereign at The Perspective Argument Retrieval Shared Task 2024: Using <fixed-case>LLM</fixed-case>s with Argument Mining diff --git a/data/xml/2024.conll.xml b/data/xml/2024.conll.xml index 2bfaebc3b6..c7fe8bd1e3 100644 --- a/data/xml/2024.conll.xml +++ b/data/xml/2024.conll.xml @@ -209,9 +209,11 @@ YevgeniBerzak 219-230 The effect of surprisal on processing difficulty has been a central topic of investigation in psycholinguistics. Here, we use eyetracking data to examine three language processing regimes that are common in daily life but have not been addressed with respect to this question: information seeking, repeated processing, and the combination of the two. Using standard regime-agnostic surprisal estimates we find that the prediction of surprisal theory regarding the presence of a linear effect of surprisal on processing times extends to these regimes. However, when using surprisal estimates from regime-specific contexts that match the contexts and tasks given to humans, we find that in information seeking, such estimates do not improve the predictive power of processing times compared to standard surprisals. Further, regime-specific contexts yield near zero surprisal estimates with no predictive power for processing times in repeated reading. These findings point to misalignments of task and memory representations between humans and current language models, and question the extent to which such models can be used for estimating cognitively relevant quantities. We further discuss theoretical challenges posed by these results. - 2024.conll-1.17 + 2024.conll-1.17 klein-etal-2024-effect 10.18653/v1/2024.conll-1.17 + + The current PDF is missing the SM (supplementary materials). We provide here the right file that includes the SM. Revisiting Hierarchical Text Classification: Inference and Metrics diff --git a/data/xml/2024.emnlp.xml b/data/xml/2024.emnlp.xml index 7cb3d97434..9cb387885b 100644 --- a/data/xml/2024.emnlp.xml +++ b/data/xml/2024.emnlp.xml @@ -10584,10 +10584,12 @@ David A.CliftonUniversity of Oxford 13696-13710 The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. 
Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs - 2024.emnlp-main.759 + 2024.emnlp-main.759 2024.emnlp-main.759.data.zip liu-etal-2024-large 10.18653/v1/2024.emnlp-main.759 + + Amend the wording of the funding acknowledgement. Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction diff --git a/data/xml/2024.parlaclarin.xml b/data/xml/2024.parlaclarin.xml index 0faac55ae9..5a9222718e 100644 --- a/data/xml/2024.parlaclarin.xml +++ b/data/xml/2024.parlaclarin.xml @@ -108,6 +108,7 @@ 2024.parlaclarin-1.9 2024.parlaclarin-1.9.OptionalSupplementaryMaterial.docx menzel-2024-exploring + 2024.parlaclarin-1.9e1 Quantitative Analysis of Editing in Transcription Process in <fixed-case>J</fixed-case>apanese and <fixed-case>E</fixed-case>uropean Parliaments and its Diachronic Changes diff --git a/data/xml/2024.semeval.xml b/data/xml/2024.semeval.xml index cf82111cd1..a9bfff2608 100644 --- a/data/xml/2024.semeval.xml +++ b/data/xml/2024.semeval.xml @@ -3652,6 +3652,7 @@ jullien-etal-2024-semeval 10.18653/v1/2024.semeval-1.271 <fixed-case>S</fixed-case>em<fixed-case>E</fixed-case>val Task 1: Semantic Textual Relatedness for <fixed-case>A</fixed-case>frican and <fixed-case>A</fixed-case>sian Languages diff --git a/data/xml/2025.acl.xml b/data/xml/2025.acl.xml index cb623fdd14..d315b94ef0 100644 --- a/data/xml/2025.acl.xml +++ b/data/xml/2025.acl.xml @@ -2323,9 +2323,11 @@ QingGuNanjing University 3168-3181 Extracting sentence embeddings from large language models (LLMs) is a promising direction, as LLMs have demonstrated stronger semantic understanding capabilities. 
Previous studies typically focus on prompt engineering to elicit sentence embeddings from LLMs by prompting the model to encode sentence information into the embedding of the last token. However, LLMs are mostly decoder-only models with causal attention and the earlier tokens in the sentence cannot attend to the latter tokens, resulting in biased encoding of sentence information and cascading effects on the final decoded token. To this end, we propose a novel Token Prepending (TP) technique that prepends each layer’s decoded sentence embedding to the beginning of the sentence in the next layer’s input, allowing earlier tokens to attend to the complete sentence information under the causal attention mechanism. The proposed TP technique is a plug-and-play and training-free technique, which means it can be seamlessly integrated with various prompt-based sentence embedding methods and autoregressive LLMs. Extensive experiments on various Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our proposed TP technique can significantly improve the performance of existing prompt-based sentence embedding methods across different LLMs, while incurring negligible additional inference cost. - 2025.acl-long.159 + 2025.acl-long.159 fu-etal-2025-token 10.18653/v1/2025.acl-long.159 + + Added explanations. No Questions are Stupid, but some are Poorly Posed: Understanding Poorly-Posed Information-Seeking Questions @@ -4745,9 +4747,11 @@ EvgenyBurnaevSkolkovo Institute of Science and Technology 6463-6480 Parameter Efficient Fine-Tuning (PEFT) methods have gained popularity and democratized the usage of Large Language Models (LLMs). Recent studies have shown that a small subset of weights significantly impacts performance. Based on this observation, we introduce a novel PEFT method, called Gaussian noise Injected Fine Tuning of Salient Weights (GIFT-SW). Our method updates only salient columns, while injecting Gaussian noise into non-salient ones. To identify these columns, we developed a generalized sensitivity metric that extends and unifies metrics from previous studies. Experiments with LLaMA models demonstrate that GIFT-SW outperforms full fine-tuning and modern PEFT methods under the same computational budget. Moreover, GIFT-SW offers practical advantages to recover performance of models subjected to mixed-precision quantization with keeping salient weights in full precision. - 2025.acl-long.324 + 2025.acl-long.324 zhelnin-etal-2025-gift 10.18653/v1/2025.acl-long.324 + + Updated ack. Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis @@ -4828,9 +4832,12 @@ Marco AntonioStranisci 6625-6639 Canceling is a morally-driven phenomenon that hinders the development of safe social media platforms and contributes to ideological polarization. To address this issue we present the Canceling Attitudes Detection (CADE) dataset, an annotated corpus of canceling incidents aimed at exploring the factors of disagreements in evaluating people’s canceling attitudes on social media. Specifically, we study the impact of annotators’ morality in their perception of canceling, showing that morality is an independent axis for the explanation of disagreement on this phenomenon. Annotator’s judgments heavily depend on the type of controversial events and involved celebrities. This shows the need to develop more event-centric datasets to better understand how harms are perpetrated in social media and to develop more aware technologies for their detection. 
- 2025.acl-long.330 + 2025.acl-long.330 lo-etal-2025-unacceptable 10.18653/v1/2025.acl-long.330 + + The revision corrects some errors in the authors’ affiliations on the first page. Concretely, it corrects affiliations (3), (4), and (5). + Fixed minor error. <fixed-case>F</fixed-case>loor<fixed-case>P</fixed-case>lan-<fixed-case>LL</fixed-case>a<fixed-case>M</fixed-case>a: Aligning Architects’ Feedback and Domain Knowledge in Architectural Floor Plan Generation @@ -5462,9 +5469,11 @@ ZiweiLiuNanyang Technological University 7561-7582 Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model’s capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation. - 2025.acl-long.374 + 2025.acl-long.374 zhang-etal-2025-evaluation 10.18653/v1/2025.acl-long.374 + + Metadata correction. Large Language Models Struggle to Describe the Haystack without Human Help: A Social Science-Inspired Evaluation of Topic Models @@ -6191,7 +6200,7 @@ XingyaoWangAll Hands AI and University of Illinois Urbana-Champaign 8697-8727 Code localization–identifying precisely where in a codebase changes need to be made–is a fundamental yet challenging task in software maintenance. Existing approaches struggle to efficiently navigate complex codebases when identifying relevant code snippets. The challenge lies in bridging natural language problem descriptions with the target code elements, often requiring reasoning across hierarchical structures and multiple dependencies. We introduce LocAgent, a framework that addresses code localization through a graph-guided agent. By parsing codebases into directed heterogeneous graphs, LocAgent creates a lightweight representation that captures code structures and their dependencies, enabling LLM agents to effectively search and locate relevant entities through powerful multi-hop reasoning. Experimental results on real-world benchmarks demonstrate that our approach significantly enhances accuracy in code localization. Notably, our method with the fine-tuned Qwen-2.5-Coder-Instruct-32B model achieves comparable results to SOTA proprietary models at greatly reduced cost (approximately 86% reduction), reaching up to 92.7% accuracy on file-level localization while improving downstream GitHub issue resolution success rates by 12% for multiple attempts (Pass@10). Our code is available at https://github.com/gersteinlab/LocAgent. 
- 2025.acl-long.426 + 2025.acl-long.426 chen-etal-2025-locagent 10.18653/v1/2025.acl-long.426 @@ -7081,9 +7090,11 @@ FeiWuZhejiang University 9887-9908 Large language models (LLMs) are revolutionizing education, with LLM-based agents playing a key role in simulating student behavior. A major challenge in student simulation is modeling the diverse learning patterns of students at various cognitive levels. However, current LLMs, typically trained as “helpful assistants”, target at generating perfect responses. As a result, they struggle to simulate students with diverse cognitive abilities, as they often produce overly advanced answers, missing the natural imperfections that characterize student learning and resulting in unrealistic simulations. To address this issue, we propose a training-free framework for student simulation. We begin by constructing a cognitive prototype for each student using a knowledge graph, which captures their understanding of concepts from past learning records. This prototype is then mapped to new tasks to predict student performance. Next, we simulate student solutions based on these predictions and iteratively refine them using a beam search method to better replicate realistic mistakes. To validate our approach, we construct the Student_100 dataset, consisting of 100 students working on Python programming and 5,000 learning records. Experimental results show that our method consistently outperforms baseline models, achieving 100% improvement in simulation accuracy and realism. - 2025.acl-long.488 + 2025.acl-long.488 wu-etal-2025-embracing 10.18653/v1/2025.acl-long.488 + + Author update. <fixed-case>CADR</fixed-case>eview: Automatically Reviewing <fixed-case>CAD</fixed-case> Programs with Error Detection and Correction @@ -10641,9 +10652,11 @@ SinaZarrießBielefeld University 14956-14975 Communication among humans relies on conversational grounding, allowing interlocutors to reach mutual understanding even when they do not have perfect knowledge and must resolve discrepancies in each other’s beliefs. This paper investigates how large language models (LLMs) manage common ground in cases where they (don’t) possess knowledge, focusing on facts in the political domain where the risk of misinformation and grounding failure is high. We examine LLMs’ ability to answer direct knowledge questions and loaded questions that presuppose misinformation.We evaluate whether loaded questions lead LLMs to engage in active grounding and correct false user beliefs, in connection to their level of knowledge and their political bias.Our findings highlight significant challenges in LLMs’ ability to engage in grounding and reject false user beliefs, raising concerns about their role in mitigating misinformation in political discourse. - 2025.acl-long.728 + 2025.acl-long.728 lachenmaier-etal-2025-llms 10.18653/v1/2025.acl-long.728 + + Updated format. <fixed-case>G</fixed-case>raph<fixed-case>C</fixed-case>heck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking @@ -11753,9 +11766,11 @@ SuhyunKimKyung Hee University 16489-16507 We introduce a novel framework for consolidating multi-turn adversarial “jailbreak” prompts into single-turn queries, significantly reducing the manual overhead required for adversarial testing of large language models (LLMs). While multi-turn human jailbreaks have been shown to yield high attack success rates (ASRs), they demand considerable human effort and time. 
Our proposed Multi-turn-to-Single-turn (M2S) methods—Hyphenize, Numberize, and Pythonize—systematically reformat multi-turn dialogues into structured single-turn prompts. Despite eliminating iterative back-and-forth interactions, these reformatted prompts preserve and often enhance adversarial potency: in extensive evaluations on the Multi-turn Human Jailbreak (MHJ) dataset, M2S methods yield ASRs ranging from 70.6 % to 95.9 % across various state-of-the-art LLMs. Remarkably, our single-turn prompts outperform the original multi-turn attacks by up to 17.5 % in absolute ASR, while reducing token usage by more than half on average. Further analyses reveal that embedding malicious requests in enumerated or code-like structures exploits “contextual blindness,” undermining both native guardrails and external input-output safeguards. By consolidating multi-turn conversations into efficient single-turn prompts, our M2S framework provides a powerful tool for large-scale red-teaming and exposes critical vulnerabilities in contemporary LLM defenses. All code, data, and conversion prompts are available for reproducibility and further investigations: https://github.com/Junuha/M2S_DATA - 2025.acl-long.805 + 2025.acl-long.805 ha-etal-2025-one 10.18653/v1/2025.acl-long.805 + + Title update. <fixed-case>RAE</fixed-case>mo<fixed-case>LLM</fixed-case>: Retrieval Augmented <fixed-case>LLM</fixed-case>s for Cross-Domain Misinformation Detection Using In-Context Learning Based on Emotional Information @@ -14130,7 +14145,7 @@ Jordan LeeBoyd-GraberUniversity of Maryland, College Park 19586-19587 Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams’ timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration. - 2025.acl-long.962 + 2025.acl-long.962 sung-etal-2025-grace 10.18653/v1/2025.acl-long.962 @@ -18347,9 +18362,11 @@ NiranjanBalasubramanianState University of New York, Stony Brook 26039-26057 LLM generated code often contains security issues. We address two key challenges in improving secure code generation. First, obtaining high quality training data covering a broad set of security issues is critical. To address this, we introduce a method for distilling a preference dataset of insecure and secure code pairs from frontier LLMs, along with a security reasoning that explains the issues and the fix. The key idea here is to make use of security knowledge sources to devise a systematic prompting strategy that ensures broad coverage. Second, aligning models to secure code requires focusing on localized regions of code. 
Direct preference optimization methods, like SimPO, are not designed to handle these localized differences and turn out to be ineffective. We address this with a new localized preference optimization algorithm that masks the security related tokens in both the winning (secure) and losing (insecure) responses. To prevent loss in code quality, we also add a regularizer. Evaluations show that both training on our dataset, DiSCo, and the new preference optimization algorithm, LPO, yield substantial reductions in code insecurity while also improving overall code quality. Code and dataset are available at https://github.com/StonyBrookNLP/disco-lpo. - 2025.acl-long.1263 + 2025.acl-long.1263 hasan-etal-2025-teaching 10.18653/v1/2025.acl-long.1263 + + Author name update. Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in <fixed-case>LM</fixed-case>s @@ -20980,9 +20997,11 @@ GhassenKarrayUniversity of Zurich 30005-30031 This article examines LLMs’ ability to correctly label simple inferences with partisan conclusions. For this, we develop a dataset with both formal and material inferences, containing logically equivalent pairs of inferences with conclusions that favor either the political left or the political right. This allows us to focus on political bias as a source of decrease in performance. Our samples are synthetically generated and thus highly controlled, covering both English and German. We assess the performance of 16 configurations of both open and proprietary state-of-the-art LLMs on that dataset, finding generally unreliable performance as well as widespread political bias which, in the case of the English samples, persists throughout our experimental settings. - 2025.acl-long.1450 + 2025.acl-long.1450 gubelmann-karray-2025-assessing 10.18653/v1/2025.acl-long.1450 + + Adding acknowledgements section. <fixed-case>PARME</fixed-case>: Parallel Corpora for Low-Resourced <fixed-case>M</fixed-case>iddle <fixed-case>E</fixed-case>astern Languages @@ -21459,9 +21478,11 @@ MinZhangHarbin Institute of Technology, Shenzhen 30678-30701 Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks. Despite great success, recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs. To study the reason behind these limitations, we propose VGCure, a comprehensive benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs reveal that LVLMs are weak in basic graph understanding and reasoning tasks, particularly those concerning relational or structurally complex information. Based on this observation, we propose a structure-aware fine-tuning framework to enhance LVLMs with structure learning abilities through three self-supervised learning tasks. Experiments validate the effectiveness of our method in improving LVLMs’ performance on fundamental and downstream graph learning tasks, as well as enhancing their robustness against complex visual graphs. - 2025.acl-long.1482 + 2025.acl-long.1482 zhu-etal-2025-benchmarking 10.18653/v1/2025.acl-long.1482 + + Updated the Acknowledgements. 
<fixed-case>ISR</fixed-case>: Self-Refining Referring Expressions for Entity Grounding @@ -21794,9 +21815,11 @@ MinZhangHarbin Institute of Technology, Shenzhen 31156-31171 Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, while the ranking order of output generated by policy changes, their overall distribution remains relatively stable. This stability allows the conversion of the sampling process from the target policy into a computationally efficient re-ranking of preference data. Building on this hypothesis, we propose a new framework that leverages the model’s intrinsic safety judgment capability to extract reward signals, which are then used to calculate label confidence for preference reordering. Extensive experiments and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, remarkably enhancing the safety performance while avoiding about 300x computational overheads. - 2025.acl-long.1504 + 2025.acl-long.1504 qiyuan-etal-2025-efficient 10.18653/v1/2025.acl-long.1504 + + Updated ack. <fixed-case>E</fixed-case>nglish-based acoustic models perform well in the forced alignment of two <fixed-case>E</fixed-case>nglish-based Pacific Creoles @@ -22560,9 +22583,11 @@ YongLi 32400-32423 Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban aerial spaces remain to be explored. We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We manually controlled drones to collect 3D embodied motion video data from real-world cities and simulated environments, resulting in 1.5k video clips. Then we design a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely-used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between different tasks, showing that causal reasoning has a strong correlation with recall, perception, and navigation, while the abilities for counterfactual and associative reasoning exhibit lower correlation with other tasks. We also validate the potential for Sim-to-Real transfer in urban embodiment through fine-tuning. - 2025.acl-long.1558 + 2025.acl-long.1558 zhao-etal-2025-urbanvideo 10.18653/v1/2025.acl-long.1558 + + Added a footnote to clarify affiliations, updated author affiliations, and revised the caption for Figure 2. <fixed-case>HELIOS</fixed-case>: Harmonizing Early Fusion, Late Fusion, and <fixed-case>LLM</fixed-case> Reasoning for Multi-Granular Table-Text Retrieval @@ -23695,9 +23720,11 @@ JonathanMayUniversity of Southern California and USC/ISI 381-413 Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. 
We investigate the use of synthetic data in Fact Verification (FV) and Evidence-based Question Answering (QA) by incrementally replacing human-generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be improved by including as few as 125 human-generated data points. We show that matching the performance gain of a little human data requires an order of magnitude more synthetic data, and then estimate price ratios at which human annotation would be a more cost-effective solution. Our results suggest that even when human annotation at scale is infeasible, there is great value to having a small proportion of the dataset being human-generated. - 2025.acl-short.30 + 2025.acl-short.30 ashok-may-2025-little 10.18653/v1/2025.acl-short.30 + + Added a sponsor. Seeking Rational Demonstrations for Large Language Models: A Domain Generalization Approach to Unsupervised Cross-Domain Keyphrase Generation @@ -24598,9 +24625,11 @@ AiTiAwI2R 22-30 We introduce MERaLiON-AudioLLM, the first general-purpose audio-based large language model designed for multitask learning, with a particular focus on Singlish understanding. Trained on 62 million multimodal instruction samples comprising a total of 260k hours of audio, it exhibits strong generalization across a diverse set of tasks, including—but not limited to—automatic speech recognition, spoken question answering, speech translation, and paralinguistic analysis. Our results show significant improvements in local speech recognition and task-specific understanding, making MERaLiON-AudioLLM a leading solution for region-specific AI applications. An interactive demo has been developed to enable user-friendly interactions, supported by a backend with customized caching and load-balancing mechanisms. We benchmark the model across a broad range of multilingual and multitask scenarios, where it demonstrates competitive performance compared to other open-source models. The demo page, model weights and videos are publically accessible. - 2025.acl-demo.3 + 2025.acl-demo.3 he-etal-2025-meralion 10.18653/v1/2025.acl-demo.3 + + Updated Ack. <fixed-case>N</fixed-case>ame<fixed-case>T</fixed-case>ag 3: A Tool and a Service for Multilingual/Multitagset <fixed-case>NER</fixed-case> diff --git a/data/xml/2025.argmining.xml b/data/xml/2025.argmining.xml index 04961d6bdf..2422fdb783 100644 --- a/data/xml/2025.argmining.xml +++ b/data/xml/2025.argmining.xml @@ -211,9 +211,11 @@ ElsLefever 168-180 Definition generation models trained on dictionary data are generally expected to produce neutral and unbiased output while capturing the contextual nuances. However, previous studies have shown that generated definitions can inherit biases from both the underlying models and the input context. This paper examines the extent to which stance-related bias in argumentative data influences the generated definitions. In particular, we train a model on a slang-based dictionary to explore the feasibility of generating persuasive definitions that concisely reflect opposing parties’ understandings of contested terms. Through this study, we provide new insights into bias propagation in definition generation and its implications for definition generation applications and argument mining. 
- 2025.argmining-1.16 + 2025.argmining-1.16 evgrafova-etal-2025-stance 10.18653/v1/2025.argmining-1.16 + + Various updates. Reproducing the Argument Quality Prediction of Project Debater diff --git a/data/xml/2025.bionlp.xml b/data/xml/2025.bionlp.xml index eadf2861b0..1598bcf199 100644 --- a/data/xml/2025.bionlp.xml +++ b/data/xml/2025.bionlp.xml @@ -155,11 +155,13 @@ RamakanthKavuluruUniversity of Kentucky 101-113 Extractive question answering over clinical text is a crucial need to help deal with the deluge of clinical text generated in hospitals. While encoder models (e.g., BERT) have been popular for this reading comprehension–style question answering task, recently encoder-decoder models (e.g., T5) are on the rise. There is also the emergence of preference optimization techniques to align decoder-only LLMs with human preferences. In this paper, we combine encoder-decoder models with the direct preference optimization (DPO) method for the RadQA radiology question answering task. Our approach achieves a 12–15 F1 point improvement over previous state-of-the-art models. To the best of our knowledge, this effort is the first to show that DPO method also works for reading comprehension via novel heuristics to generate preference data without human inputs. - 2025.bionlp-1.10 + 2025.bionlp-1.10 2025.bionlp-1.10.SupplementaryMaterial.zip 2025.bionlp-1.10.SupplementaryMaterial.txt nahian-kavuluru-2025-radqa 10.18653/v1/2025.bionlp-1.10 + + Codebase link update. Gender-Neutral Large Language Models for Medical Applications: Reducing Bias in <fixed-case>P</fixed-case>ub<fixed-case>M</fixed-case>ed Abstracts @@ -649,9 +651,11 @@ TitipatAchakulvisutDepartment of Biomedical Engineering, Mahidol University 96-103 This paper presents an approach to answering patient-specific medical questions using electronic health record (EHR) grounding with ArchEHR-QA 2025 datasets. We address medical question answering as an alignment problem, focusing on generating responses factually consistent with patient-specific clinical notes through in-context learning techniques. We show that LLM-generated responses, used as few-shot examples with GPT-4.1 and Gemini-2.5-Pro, significantly outperform baseline approaches (overall score = 49.1), achieving strict precision, recall, and F1-micro scores of 60.6, 53.6, and 56.9, respectively, on the ArchEHR-QA 2025 test leaderboard. It achieves textual similarity between answers and essential evidence using BLEU, ROUGE, SARI, BERTScore, AlignScore, and MEDCON scores of 6.0, 32.1, 65.8, 36.4, 64.3, and 43.6, respectively. Our findings highlight the effectiveness of combining EHR grounding with few-shot examples for personalized medical question answering, establishing a promising approach for developing accurate and personalized medical question answering systems. We release our code at https://github.com/biodatlab/archehr-qa-lamar. - 2025.bionlp-share.12 + 2025.bionlp-share.12 yoadsanit-etal-2025-lamar 10.18653/v1/2025.bionlp-share.12 + + Minor fixes. 
Neural at <fixed-case>A</fixed-case>rch<fixed-case>EHR</fixed-case>-<fixed-case>QA</fixed-case> 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering @@ -912,9 +916,11 @@ HongyiXinShanghai Jiao Tong University 275-280 We propose a unified, multi-stage lay summarization pipeline for BioLaySumm 2025 (Subtask 1.1) that (1) selects and summarizes key article sections via BioBART, (2) retrieves K-shot demonstrations using BGE embeddings for in-context Llama 3 8B prompting, (3) applies LoRA adapters to Llama 3 8B for supervised fine-tuning, (4) merges section summaries with a second BioBART pass, and (5) refines outputs through reinforcement learning (PPO & GRPO) using a composite reward of factuality (AlignScore, SummaC), relevance (ROUGE-L, BERTScore), and readability (LENS, FKGL, DCRS, CLI). On PLOS and eLife validation sets, our complete system reduces DCRS from 9.23 to 8.56 and reduces CLI from 12.98 to 12.65, ranking 3rd in readability, and outperforms the llama3 finetune baseline in AlignScore (0.722 to 0.862), ranking 5th in factuality, demonstrating balanced gains across readability, relevance, and factuality. - 2025.bionlp-share.33 + 2025.bionlp-share.33 xu-etal-2025-team 10.18653/v1/2025.bionlp-share.33 + + This revision mainly updated some citations. <fixed-case>V</fixed-case>e<fixed-case>R</fixed-case>ea<fixed-case>F</fixed-case>ine: Iterative Verification Reasoning Refinement <fixed-case>RAG</fixed-case> for Hallucination-Resistant on Open-Ended Clinical <fixed-case>QA</fixed-case> diff --git a/data/xml/2025.coling.xml b/data/xml/2025.coling.xml index 13b13f2baf..74e2e6eaad 100644 --- a/data/xml/2025.coling.xml +++ b/data/xml/2025.coling.xml @@ -10149,10 +10149,11 @@ JuheePark 794–806 In-vehicle speech recognition (IVSR) systems are crucial components of modern automotive interfaces, enabling hands-free control and enhancing user safety. However, traditional IVSR systems often struggle with interpreting user intent accurately due to limitations in contextual understanding and ambiguity resolution, leading to user frustration. This paper introduces LLM ContextBridge, a novel hybrid architecture that integrates Pretrained Language Model-based intent classification with Large Language Models to enhance both command recognition and dialogue management. LLM ContextBridge serves as a seamless bridge between traditional natural language understanding techniques and LLMs, combining the precise intent recognition of conventional NLU with the contextual handling and ambiguity resolution capabilities of LLMs. This approach significantly improves recognition accuracy and user experience, particularly in complex, multi-turn dialogues. Experimental results show notable improvements in task success rates and user satisfaction, demonstrating that LLM ContextBridge can make IVSR systems more intuitive, responsive, and context-aware. - 2025.coling-industry.66 + 2025.coling-industry.66 chun-etal-2025-llm Fixed references. + Updated author info. Neural Document Segmentation Using Weighted Sliding Windows with Transformer Encoders diff --git a/data/xml/2025.conll.xml b/data/xml/2025.conll.xml index cbce8d3a84..b05e21c3e0 100644 --- a/data/xml/2025.conll.xml +++ b/data/xml/2025.conll.xml @@ -308,10 +308,12 @@ NathanSchneiderGeorgetown University 365-376 Construction Grammar hypothesizes that knowledge of a language consists chiefly of knowledge of form–meaning pairs (“constructions”) that include vocabulary, general grammar rules, and even idiosyncratic patterns. 
Recent work has shown that transformer language models represent at least some constructional patterns, including ones where the construction is rare overall. In this work, we probe BERT’s representation of the form and meaning of a minor construction of English, the NPN (noun–preposition–noun) construction—exhibited in such expressions as face to face and day to day—which is known to be polysemous. We construct a benchmark dataset of semantically annotated corpus instances (including distractors that superficially resemble the construction). With this dataset, we train and evaluate probing classifiers. They achieve decent discrimination of the construction from distractors, as well as sense disambiguation among true instances of the construction, revealing that BERT embeddings carry indications of the construction’s semantics.Moreover, artificially permuting the word order of true construction instances causes them to be rejected, indicating sensitivity to matters of form. We conclude that BERT does latently encode at least some knowledge of the NPN construction going beyond a surface syntactic pattern and lexical cues. - 2025.conll-1.24 + 2025.conll-1.24 2025.conll-1.24.software.zip scivetti-schneider-2025-construction 10.18653/v1/2025.conll-1.24 + + Minor updates. Evidence of Generative Syntax in <fixed-case>LLM</fixed-case>s diff --git a/data/xml/2025.depling.xml b/data/xml/2025.depling.xml index 19bd4e81f6..602fa945de 100644 --- a/data/xml/2025.depling.xml +++ b/data/xml/2025.depling.xml @@ -16,8 +16,11 @@ 979-8-89176-290-9 - 2025.depling-1.0 + 2025.depling-1.0 depling-ws-syntaxfest-2025-1 + + Corrected a typo. + Minor removal. A Typology of Non-Projective Patterns in Unas’s and Teti’s Pyramid Texts @@ -54,8 +57,10 @@ LeoWannerBarcelona Supercomputing Center 36-53 While the competence of LLMs to cope with agreement constraints has been widely tested in English, only a very limited number of works deals with morphologically rich(er) languages. In this work, we experiment with 25 mono- and multilingual LLMs, applying them to a collection of more than 5,000 test examples that cover the main agreement phenomena in three Romance languages (Italian, Portuguese, and Spanish) and one Slavic Language (Russian). We identify which of the agreement phenomena are most difficult for which models and challenge some common assumptions of what makes a good model. The test suites into which the test examples are organized are openly available and can be easily adapted to other agreement phenomena and other languages for further research. - 2025.depling-1.4 + 2025.depling-1.4 taboas-garcia-wanner-2025-assessing + + Added acknowledgments. Introducing <fixed-case>KIP</fixed-case>arla Forest: seeds for a <fixed-case>UD</fixed-case> annotation of interactional syntax diff --git a/data/xml/2025.findings.xml b/data/xml/2025.findings.xml index e3c763dd4a..0508734bde 100644 --- a/data/xml/2025.findings.xml +++ b/data/xml/2025.findings.xml @@ -5249,9 +5249,11 @@ NicolasThomesorbonne université 7030-7046 Reinforcement learning (RL) is a promising approach for aligning large language models (LLMs) knowledge with sequential decision-making tasks. However, few studies have thoroughly investigated the impact on LLM agents capabilities of fine-tuning them with RL in a specific environment. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. 
Our findings reveal that the performance of LLMs degrades when faced with prompt formulations different from those used during the RL training phase. Besides, we analyze the source of this sensitivity by examining the model’s internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs. - 2025.findings-naacl.390 + 2025.findings-naacl.390 aissi-etal-2025-reinforcement 10.18653/v1/2025.findings-naacl.390 + + Author detail update. An empirical study of validating synthetic data for formula generation @@ -14021,9 +14023,12 @@ YixuanYuanThe Chinese University of Hong Kong 10345-10359 Large Language Models (LLMs) are transforming healthcare through LLM-based agents that can understand and assist with medical tasks. This survey examines the architectures, applications, and challenges of LLM-based agents in medicine. We analyze key components including system profiles, clinical planning, medical reasoning frameworks, and external capacity enhancement. The survey covers major applications in clinical decision support, medical documentation, training simulations, and healthcare service optimization, along with evaluation frameworks and metrics. While these agents show promise in enhancing healthcare delivery, challenges remain in hallucination management, multimodal integration, implementation, and ethics. We conclude by highlighting future directions in medical reasoning, physical system integration, and training simulations, providing researchers and practitioners with a structured overview of the field’s current state and prospects. - 2025.findings-acl.539 + 2025.findings-acl.539 wang-etal-2025-survey 10.18653/v1/2025.findings-acl.539 + + Corrected a few citations. + Updated citations. Context-Robust Knowledge Editing for Language Models @@ -19455,9 +19460,11 @@ DenizGunduzImperial College London 18189-18204 Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative sampling and the concept of channel simulation from information theory, which aims at simulating a noisy channel using as few bits as possible. This connection allows us to provide an information-theoretic analysis of the speed up that can be achieved by speculative sampling. Leveraging this link, we derive an explicit relation between generation speed-up and the number of tokens k generated by the draft model for large k, which serves as an upper bound for all k. We also propose a novel speculative sampling method via exponential races called ERSS that matches state-of-the-art performance. - 2025.findings-acl.936 + 2025.findings-acl.936 kobus-gunduz-2025-speculative 10.18653/v1/2025.findings-acl.936 + + Added footnote. Going Beyond Your Expectations in Latency Metrics for Simultaneous Speech Translation @@ -25585,9 +25592,11 @@ JonathanMayUniversity of Southern California and USC/ISI 26744-26759 Non-verbal communication (NVC) is an integral part of human language, but it has been overlooked in natural language processing research. Studying NVC in general is challenging because of its high variance in interpretation among individuals and cultures, but mime—the theatrical technique of suggesting intent using only gesture, expression, and movement—is a subset of NVC with much lower human interpretation variance. 
As a gateway for evaluating vision-language models on their understanding of NVC, we propose Mime Identification-based Multimodal Evaluation (MIME), a gesture recognition task built upon a novel corpus of mimed activity comprising 86 unique gestures with a variety of perturbations applied to the avatar, background, and viewpoint for evaluating recognition robustness. We find that both open-weight and API-based vision-language models perform significantly worse than humans at identifying mimed gestures in MIME, motivating the need for increased research for instilling more robust understanding of human actions for VLMs. - 2025.findings-acl.1372 + 2025.findings-acl.1372 cho-etal-2025-vision 10.18653/v1/2025.findings-acl.1372 + + Minor updates. Training Language Model to Critique for Better Refinement diff --git a/data/xml/2025.gebnlp.xml b/data/xml/2025.gebnlp.xml index da4506bbbd..292a04b2cc 100644 --- a/data/xml/2025.gebnlp.xml +++ b/data/xml/2025.gebnlp.xml @@ -64,9 +64,11 @@ MaschaKurpicz-BrikiBFH - Bern University of Applied Sciences 33-51 Bias in Natural Language Processing (NLP) applications has become a critical issue, with many methods developed to measure and mitigate bias in word embeddings and language models. However, most approaches focus on single categories such as gender or ethnicity, neglecting the intersectionality of biases, particularly in non-English languages. This paper addresses these gaps by studying both single-category and intersectional biases in Italian word embeddings and language models. We extend existing bias metrics to Italian, introducing GG-FISE, a novel method for detecting intersectional bias while accounting for grammatical gender. We also adapt the CrowS-Pairs dataset and bias metric to Italian. Through a series of experiments using WEAT, SEAT, and LPBS tests, we identify significant biases along gender and ethnic lines, with particular attention to biases against Romanian and South Asian populations. Our results highlight the need for culturally adapted methods to detect and address biases in multilingual and intersectional contexts. - 2025.gebnlp-1.3 + 2025.gebnlp-1.3 puttick-kurpicz-briki-2025-detecting 10.18653/v1/2025.gebnlp-1.3 + + Minor updates. Power(ful) Associations: Rethinking “Stereotype” for <fixed-case>NLP</fixed-case> diff --git a/data/xml/2025.gem.xml b/data/xml/2025.gem.xml index b7accde96b..993d29dc3e 100644 --- a/data/xml/2025.gem.xml +++ b/data/xml/2025.gem.xml @@ -743,8 +743,10 @@ AkikoAizawaNational Institute of Informatics 973-973 Keyphrase generation refers to the task of producing a set of words or phrases that summarises the content of a document. Continuous efforts have been dedicated to this task over the past few years, spreading across multiple lines of research, such as model architectures, data resources, and use-case scenarios. Yet, the current state of keyphrase generation remains unknown as there has been no attempt to review and analyse previous work. In this paper, we bridge this gap by presenting an analysis of over 50 research papers on keyphrase generation, offering a comprehensive overview of recent progress, limitations, and open challenges. Our findings highlight several critical issues in current evaluation practices, such as the concerning similarity among commonly-used benchmark datasets and inconsistencies in metric calculations leading to overestimated performances. 
Additionally, we address the limited availability of pre-trained models by releasing a strong PLM-based model for keyphrase generation as an effort to facilitate future research. - 2025.gem-1.76 + 2025.gem-1.76 boudin-aizawa-2025-analysis + + Updated the paper. <fixed-case>U</fixed-case>-<fixed-case>MATH</fixed-case>: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models diff --git a/data/xml/2025.iwpt.xml b/data/xml/2025.iwpt.xml index 263bfc42a8..7153f07402 100644 --- a/data/xml/2025.iwpt.xml +++ b/data/xml/2025.iwpt.xml @@ -15,8 +15,10 @@ 979-8-89176-294-7 - 2025.iwpt-1.0 + 2025.iwpt-1.0 iwpt-syntaxfest-2025-1 + + Typo correction. An Efficient Parser for Bounded-Order Product-Free <fixed-case>L</fixed-case>ambek Categorial Grammar via Term Graph diff --git a/data/xml/2025.knowllm.xml b/data/xml/2025.knowllm.xml index 0403fa2474..c1be4e66fb 100644 --- a/data/xml/2025.knowllm.xml +++ b/data/xml/2025.knowllm.xml @@ -174,9 +174,11 @@ ZhengChenHong Kong University of Science and Technology 120-139 Chinese homophones, prevalent in Internet culture, bring rich linguistic twists that are challenging for language models. While native speakers disambiguate them through phonological reasoning and contextual understanding, it remains untested how well LLMs perform on this task and whether LLMs also achieve this via similar reasoning processes or merely through memorization of homophone-original word pairs during training.In this paper, we present HomoP-CN, the first Chinese Internet homophones dataset with systematic perturbations for evaluating LLMs’ homophone restoration capabilities. Using this benchmark, we investigated the influence of semantic, phonological, and graphemic features on LLMs’ restoration accuracy, measured the reliance levels of each model on memorization during restoration through consistency ratios under controlled perturbations, and assessed the effectiveness of various prompting strategies, including contextual cues, pinyin augmentation, few-shot learning, and thought-chain approaches. - 2025.knowllm-1.11 + 2025.knowllm-1.11 ma-etal-2025-reasoning 10.18653/v1/2025.knowllm-1.11 + + Correct author list. Superfluous Instruction: Vulnerabilities Stemming from Task-Specific Superficial Expressions in Instruction Templates diff --git a/data/xml/2025.law.xml b/data/xml/2025.law.xml index 6dd0459955..614cd1b9ca 100644 --- a/data/xml/2025.law.xml +++ b/data/xml/2025.law.xml @@ -278,9 +278,11 @@ NataliaPatiño MazzottiGoethe University Frankfurt 279-284 In this paper, we identify types of uncertainty in interlinear glossed text (IGT) annotation, a common notation for language data in linguistic research. - 2025.law-1.23 + 2025.law-1.23 ionov-patino-mazzotti-2025-addressing 10.18653/v1/2025.law-1.23 + + Corrected errors. Illuminating Logical Fallacies with the <fixed-case>CAMPFIRE</fixed-case> Corpus diff --git a/data/xml/2025.naacl.xml b/data/xml/2025.naacl.xml index feb619dc98..71f1e42c2d 100644 --- a/data/xml/2025.naacl.xml +++ b/data/xml/2025.naacl.xml @@ -1045,9 +1045,11 @@ IsabelleAugensteinUniversity of Copenhagen 1607-1627 Studying human values is instrumental for cross-cultural research, enabling a better understanding of preferences and behaviour of society at large and communities therein. To study the dynamics of communities online, we propose a method to computationally analyse values present on Reddit. Our method allows analysis at scale, complementing survey based approaches. 
We train a value relevance and a value polarity classifier, which we thoroughly evaluate using in-domain and out-of-domain human annotations. Using these, we automatically annotate over nine million posts across 12k subreddits with Schwartz values. Our analysis unveils both previously recorded and novel insights into the values prevalent within various online communities. For instance, we discover a very negative stance towards conformity in the Vegan and AbolishTheMonarchy subreddits. Additionally, our study of geographically specific subreddits highlights the correlation between traditional values and conservative U.S. states. Through our work, we demonstrate how our dataset and method can be used as a complementary tool for qualitative study of online communication. - 2025.naacl-long.77 + 2025.naacl-long.77 borenstein-etal-2025-investigating 10.18653/v1/2025.naacl-long.77 + + Added Ack. Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation @@ -6288,9 +6290,11 @@ LeiLiSchool of Computer Science, Carnegie Mellon University 9077-9090 The lottery ticket hypothesis posits the existence of “winning tickets” within a randomly initialized neural network. Do winning tickets exist for LLMs in fine-tuning scenarios? How can we find such winning tickets? In this paper, we propose KS-Lottery, a method to identify a small subset of LLM parameters highly effective in multilingual fine-tuning. Our key idea is to use Kolmogorov-Smirnov Test to analyze the distribution shift of parameters before and after fine-tuning. We further theoretically prove that KS-Lottery can find the certified winning tickets in the embedding layer, fine-tuning on the found parameters is guaranteed to perform as well as full fine-tuning. Comparing KS-Lottery with other tuning algorithms on translation tasks, the experimental results show that KS-Lottery finds a much smaller set of parameters for fine-tuning while achieving the comparable performance as full fine-tuning LLM. Surprisingly, we find that fine-tuning 18 tokens’ embedding of LLaMA suffices to reach the fine-tuning translation performance . - 2025.naacl-long.458 + 2025.naacl-long.458 yuan-etal-2025-ks 10.18653/v1/2025.naacl-long.458 + + Minor updates. <fixed-case>PA</fixed-case>-<fixed-case>RAG</fixed-case>: <fixed-case>RAG</fixed-case> Alignment via Multi-Perspective Preference Optimization @@ -8656,9 +8660,11 @@ SaptarshiGhoshIndian Institute of Technology Kharagpur 12688-12704 Large language models (LLMs) are increasingly recognized for their exceptional generative capabilities and versatility across various tasks. However, the high inference costs associated with these models have not received adequate attention, particularly when compared to the focus on training costs in existing research. In response to this gap, our study conducts a comprehensive benchmarking of LLM inference energy across a wide range of NLP tasks, where we analyze the impact of different models, tasks, prompts, and system-related factors on inference energy. Specifically, our experiments reveal several interesting insights, including strong correlation of inference energy with output token length and response time. Also, we find that quantization and optimal batch sizes, along with targeted prompt phrases, can significantly reduce energy usage. This study is the first to thoroughly benchmark LLM inference across such a diverse range of aspects, providing insights and offering several recommendations for improving energy efficiency in model deployment.
- 2025.naacl-long.632 + 2025.naacl-long.632 poddar-etal-2025-towards 10.18653/v1/2025.naacl-long.632 + + Add equal contribution note <fixed-case>CSR</fixed-case>-Bench: Benchmarking <fixed-case>LLM</fixed-case> Agents in Deployment of Computer Science Research Repositories diff --git a/data/xml/2025.quasy.xml b/data/xml/2025.quasy.xml index cbcccd726e..f684f0cee3 100644 --- a/data/xml/2025.quasy.xml +++ b/data/xml/2025.quasy.xml @@ -16,8 +16,10 @@ 979-8-89176-293-0 - 2025.quasy-1.0 + 2025.quasy-1.0 quasy-ws-syntaxfest-2025-1 + + Typo correction. Subject-Verb Agreement Alternations in <fixed-case>S</fixed-case>panish Pseudopartitive Constructions: A Corpus Study @@ -54,8 +56,10 @@ SylvainKahaneUniversité Paris Nanterre 26-38 In this paper, we develop a data-driven contrastive framework to extract common and distinctive linguistic descriptions from syntactic treebanks. The extracted contrastive rules are defined by a statistically significant difference in precision and classified as common and distinctive rules across the set of treebanks. We illustrate our method by working on object word order using Universal Dependencies (UD) treebanks in 6 Romance languages: Brazilian Portuguese, Catalan, French, Italian, Romanian and Spanish. We discuss the limitations faced due to inconsistent annotation and the feasibility of conducting contrasting studies using the UD collection. - 2025.quasy-1.5 + 2025.quasy-1.5 herrera-etal-2025-extraction + + Minor fixes. A Quantitative Study of Syntactic Complexity across Genres: Dependency Distance in <fixed-case>E</fixed-case>nglish and <fixed-case>C</fixed-case>hinese diff --git a/data/xml/2025.udw.xml b/data/xml/2025.udw.xml index e3e5eaa7ac..c63d5ade65 100644 --- a/data/xml/2025.udw.xml +++ b/data/xml/2025.udw.xml @@ -16,8 +16,10 @@ 979-8-89176-292-3 - 2025.udw-1.0 + 2025.udw-1.0 udw-ws-2025-1 + + Typo correction. Reference and Modification in <fixed-case>U</fixed-case>niversal <fixed-case>D</fixed-case>ependencies diff --git a/data/xml/2025.woah.xml b/data/xml/2025.woah.xml index 07b1cd1f52..d1ee3eaa46 100644 --- a/data/xml/2025.woah.xml +++ b/data/xml/2025.woah.xml @@ -13,7 +13,7 @@
Vienna, Austria
August 2025 - 2025.woah-1 + 2025.woah-1 woah ws 979-8-89176-105-6 @@ -385,8 +385,7 @@ MarcosGarciaUniversidade de Santiago de Compostela 426-457 Conspiracist narratives posit an omnipotent, evil group causing harm throughout domains. However, modern-day online conspiracism is often more erratic, consisting of loosely connected posts displaying a general anti-establishment attitude pervaded by negative emotions. We gather a dataset of 300 conspiracist and mainstream, Telegram channels in Italian and English and use the automatic extraction of entities and emotion detection to compare structural characteristics of both types of channels. We create a co-occurrence network of entities to analyze how the different types of channels introduce and use them across posts and topics. We find that conspiracist channels are characterized by anger. Moreover, co-occurrence networks of entities appearing in conspiracist channels are more dense. We theorize that this reflects a narrative structure where all actants are pushed into a single domain. Conspiracist channels disproportionately associate the most central group of entities with anger and fear. We do not find evidence that entities in conspiracist narratives occur across more topics. This could indicate an erratic type of online conspiracism where everything can be connected to everything and that is characterized by a high number of entities and high levels of anger. - 2025.woah-1.41 - 2025.woah-1.41.SupplementaryMaterial.zip + 2025.woah-1.41 2025.woah-1.41.SupplementaryMaterial.zip laken-etal-2025-multilingual
diff --git a/data/xml/N13.xml b/data/xml/N13.xml index b9b1a51e96..b230946b36 100644 --- a/data/xml/N13.xml +++ b/data/xml/N13.xml @@ -1176,9 +1176,11 @@ TheresaWilson DavidYarowsky 1010–1019 - N13-1121 + N13-1121 bergsma-etal-2013-broadly
To Link or Not to Link? A Study on End-to-End Tweet Entity Linking diff --git a/data/xml/P19.xml b/data/xml/P19.xml index 31c3570572..0cdab3ee01 100644 --- a/data/xml/P19.xml +++ b/data/xml/P19.xml @@ -6473,11 +6473,13 @@ DavidReitter 5127–5136 We examine the benefits of visual context in training neural language models to perform next-word prediction. A multi-modal neural architecture is introduced that outperform its equivalent trained on language alone with a 2% decrease in perplexity, even when no visual context is available at test. Fine-tuning the embeddings of a pre-trained state-of-the-art bidirectional language model (BERT) in the language modeling framework yields a 3.5% improvement. The advantage for training with visual context when testing without is robust across different languages (English, German and Spanish) and different models (GRU, LSTM, Delta-RNN, as well as those that use BERT embeddings). Thus, language models perform better when they learn like a baby, i.e, in a multi-modal environment. This finding is compatible with the theory of situated cognition: language is inseparable from its physical context. - P19-1506 + P19-1506 P19-1506.Supplementary.pdf Relating Simple Sentence Representations in Deep Neural Networks and the Brain diff --git a/data/yaml/name_variants.yaml b/data/yaml/name_variants.yaml index 1468d7a0bc..064ed1313b 100644 --- a/data/yaml/name_variants.yaml +++ b/data/yaml/name_variants.yaml @@ -23,6 +23,9 @@ - {first: Chalamalasetti, last: Kranti} - canonical: {first: Felicia, last: Körner} id: felicia-koerner + orcid: 0000-0002-4086-5338 + degree: Ludwig Maximilian University of Munich + comment: LMU variants: - {first: Felicia, last: Koerner} - canonical: {first: Pranav, last: A} diff --git a/data/yaml/sigs/sigdial.yaml b/data/yaml/sigs/sigdial.yaml index 01bfa3d963..d0da8faac4 100644 --- a/data/yaml/sigs/sigdial.yaml +++ b/data/yaml/sigs/sigdial.yaml @@ -1,6 +1,6 @@ Name: ACL/ISCA Special Interest Group on Discourse and Dialogue ShortName: SIGDIAL -URL: http://www.aclweb.org/sigdial +URL: https://sigdial.org Meetings: - 2024: - 2024.sigdial-1 # Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue