diff --git a/bin/add_revision.py b/bin/add_revision.py index 8d15cf554e..22bb6e8f59 100755 --- a/bin/add_revision.py +++ b/bin/add_revision.py @@ -248,7 +248,7 @@ def main(args): repo.git.add(get_xml_file(args.anthology_id)) if repo.is_dirty(index=True, working_tree=True, untracked_files=True): repo.index.commit( - f"Add revision for {args.anthology_id} (closes #{args.issue})" + f"Add {change_type} for {args.anthology_id} (closes #{args.issue})" ) diff --git a/data/xml/2021.tacl.xml b/data/xml/2021.tacl.xml index 61c7cd2da1..b52db02919 100644 --- a/data/xml/2021.tacl.xml +++ b/data/xml/2021.tacl.xml @@ -966,8 +966,11 @@ 10.1162/tacl_a_00419 Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains to be concentrated on resource-rich languages like English. This work focuses on Persian language, one of the widely spoken languages in the world, and yet there are few NLU datasets available for this language. The availability of high-quality evaluation datasets is a necessity for reliable assessment of the progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark in Persian language that includes a range of language understanding tasks—reading comprehension, textual entailment, and so on. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. Additionally, we present the first results on state-of-the-art monolingual and multilingual pre-trained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.1 1147–1162 - 2021.tacl-1.68 + 2021.tacl-1.68 khashabi-etal-2021-parsinlu + + Fix author name + Author info update. What Helps Transformers Recognize Conversational Structure? Importance of Context, Punctuation, and Labels in Dialog Act Recognition diff --git a/data/xml/2023.emnlp.xml b/data/xml/2023.emnlp.xml index c6a6471009..66136c05fb 100644 --- a/data/xml/2023.emnlp.xml +++ b/data/xml/2023.emnlp.xml @@ -11083,13 +11083,14 @@ YoavTulpan 12883-12895 Online social platforms provide a bustling arena for information-sharing and for multi-party discussions. Various frameworks for dialogic discourse parsing were developed and used for the processing of discussions and for predicting the productivity of a dialogue. However, most of these frameworks are not suitable for the analysis of contentious discussions that are commonplace in many online platforms. A novel multi-label scheme for contentious dialog parsing was recently introduced by Zakharov et al. (2021). While the schema is well developed, the computational approach they provide is both naive and inefficient, as a different model (architecture) using a different representation of the input, is trained for each of the 31 tags in the annotation scheme. Moreover, all their models assume full knowledge of label collocations and context, which is unlikely in any realistic setting. In this work, we present a unified model for Non-Convergent Discourse Parsing that does not require any additional input other than the previous dialog utterances. We fine-tuned a RoBERTa backbone, combining embeddings of the utterance, the context and the labels through GRN layers and an asymmetric loss function. 
Overall, our model achieves results comparable with SOTA, without using label collocation and without training a unique architecture/model for each label. Our proposed architecture makes the labeling feasible at large scale, promoting the development of tools that deepen our understanding of discourse dynamics. - 2023.emnlp-main.796 + 2023.emnlp-main.796 tsur-tulpan-2023-deeper 10.18653/v1/2023.emnlp-main.796 We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields diff --git a/data/xml/2023.findings.xml b/data/xml/2023.findings.xml index f5142606a4..b19a2841e7 100644 --- a/data/xml/2023.findings.xml +++ b/data/xml/2023.findings.xml @@ -23608,9 +23608,11 @@ HeuiseokLim 10334-10343 Large language models (LLMs) have demonstrated impressive capabilities in natural language processing. However, their ability to establish causal relationships, particularly in the context of temporal interventions and language hallucinations, remains challenging. This paper presents CReTIHC, a novel dataset designed to test and enhance the causal reasoning abilities of LLMs. The dataset is constructed using a unique approach that incorporates elements of verbal hallucinations and temporal interventions through the reengineering of existing causal inference datasets. This transformation creates complex scenarios that push LLMs to critically evaluate the information presented and identify cause-and-effect relationships. The CReTIHC dataset serves as a pioneering tool for improving LLM’s causal inference capabilities, paving the way for a more nuanced understanding of causal relationships in natural language processing (NLP) tasks. The whole dataset is publicly accessible at: (https://github.com/ChangwooChun/CReTIHC) - 2023.findings-emnlp.693 + 2023.findings-emnlp.693 chun-etal-2023-cretihc 10.18653/v1/2023.findings-emnlp.693 + + Author info update. On the Dimensionality of Sentence Embeddings @@ -27814,9 +27816,11 @@ MichaelElhadad 15164-15172 We call into question the recently popularized method of direct model editing as a means of correcting factual errors in LLM generations. We contrast model editing with three similar but distinct approaches that pursue better defined objectives: (1) retrieval-based architectures, which decouple factual memory from inference and linguistic capabilities embodied in LLMs; (2) concept erasure methods, which aim at preventing systemic bias in generated text; and (3) attribution methods, which aim at grounding generations into identified textual sources. We argue that direct model editing cannot be trusted as a systematic remedy for the disadvantages inherent to LLMs, and while it has proven potential in improving model explainability, it opens risks by reinforcing the notion that models can be trusted for factuality. We call for cautious promotion and application of model editing as part of the LLM deployment process, and for responsibly limiting the use cases of LLMs to those not relying on editing as a critical component. - 2023.findings-emnlp.1012 + 2023.findings-emnlp.1012 pinter-elhadad-2023-emptying 10.18653/v1/2023.findings-emnlp.1012 + + Updates. A Causal View of Entity Bias in (Large) Language Models diff --git a/data/xml/2024.argmining.xml b/data/xml/2024.argmining.xml index 272b37ed81..504691e410 100644 --- a/data/xml/2024.argmining.xml +++ b/data/xml/2024.argmining.xml @@ -174,9 +174,11 @@ IrynaGurevych 130-149 Argument retrieval is the task of finding relevant arguments for a given query. 
While existing approaches rely solely on the semantic alignment of queries and arguments, this first shared task on perspective argument retrieval incorporates perspectives during retrieval, accounting for latent influences in argumentation. We present a novel multilingual dataset covering demographic and socio-cultural (socio) variables, such as age, gender, and political attitude, representing minority and majority groups in society. We distinguish between three scenarios to explore how retrieval systems consider explicitly (in both query and corpus) and implicitly (only in query) formulated perspectives. This paper provides an overview of this shared task and summarizes the results of the six submitted systems. We find substantial challenges in incorporating perspectivism, especially when aiming for personalization based solely on the text of arguments without explicitly providing socio profiles. Moreover, retrieval systems tend to be biased towards the majority group but partially mitigate bias for the female gender. While we bootstrap perspective argument retrieval, further research is essential to optimize retrieval systems to facilitate personalization and reduce polarization. - 2024.argmining-1.14 + 2024.argmining-1.14 falk-etal-2024-overview 10.18653/v1/2024.argmining-1.14 + + Corrected a typo. Sövereign at The Perspective Argument Retrieval Shared Task 2024: Using <fixed-case>LLM</fixed-case>s with Argument Mining diff --git a/data/xml/2024.conll.xml b/data/xml/2024.conll.xml index 2bfaebc3b6..c7fe8bd1e3 100644 --- a/data/xml/2024.conll.xml +++ b/data/xml/2024.conll.xml @@ -209,9 +209,11 @@ YevgeniBerzak 219-230 The effect of surprisal on processing difficulty has been a central topic of investigation in psycholinguistics. Here, we use eyetracking data to examine three language processing regimes that are common in daily life but have not been addressed with respect to this question: information seeking, repeated processing, and the combination of the two. Using standard regime-agnostic surprisal estimates we find that the prediction of surprisal theory regarding the presence of a linear effect of surprisal on processing times extends to these regimes. However, when using surprisal estimates from regime-specific contexts that match the contexts and tasks given to humans, we find that in information seeking, such estimates do not improve the predictive power of processing times compared to standard surprisals. Further, regime-specific contexts yield near zero surprisal estimates with no predictive power for processing times in repeated reading. These findings point to misalignments of task and memory representations between humans and current language models, and question the extent to which such models can be used for estimating cognitively relevant quantities. We further discuss theoretical challenges posed by these results. - 2024.conll-1.17 + 2024.conll-1.17 klein-etal-2024-effect 10.18653/v1/2024.conll-1.17 + + The current PDF is missing the SM (supplementary materials). We provide here the right file that includes the SM. Revisiting Hierarchical Text Classification: Inference and Metrics diff --git a/data/xml/2024.emnlp.xml b/data/xml/2024.emnlp.xml index 7cb3d97434..9cb387885b 100644 --- a/data/xml/2024.emnlp.xml +++ b/data/xml/2024.emnlp.xml @@ -10584,10 +10584,12 @@ David A.CliftonUniversity of Oxford 13696-13710 The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. 
Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs - 2024.emnlp-main.759 + 2024.emnlp-main.759 2024.emnlp-main.759.data.zip liu-etal-2024-large 10.18653/v1/2024.emnlp-main.759 + + Amend the wording of the funding acknowledgement. Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction diff --git a/data/xml/2024.parlaclarin.xml b/data/xml/2024.parlaclarin.xml index 0faac55ae9..5a9222718e 100644 --- a/data/xml/2024.parlaclarin.xml +++ b/data/xml/2024.parlaclarin.xml @@ -108,6 +108,7 @@ 2024.parlaclarin-1.9 2024.parlaclarin-1.9.OptionalSupplementaryMaterial.docx menzel-2024-exploring + 2024.parlaclarin-1.9e1 Quantitative Analysis of Editing in Transcription Process in <fixed-case>J</fixed-case>apanese and <fixed-case>E</fixed-case>uropean Parliaments and its Diachronic Changes diff --git a/data/xml/2024.semeval.xml b/data/xml/2024.semeval.xml index cf82111cd1..a9bfff2608 100644 --- a/data/xml/2024.semeval.xml +++ b/data/xml/2024.semeval.xml @@ -3652,6 +3652,7 @@ jullien-etal-2024-semeval 10.18653/v1/2024.semeval-1.271 <fixed-case>S</fixed-case>em<fixed-case>E</fixed-case>val Task 1: Semantic Textual Relatedness for <fixed-case>A</fixed-case>frican and <fixed-case>A</fixed-case>sian Languages diff --git a/data/xml/2025.acl.xml b/data/xml/2025.acl.xml index cb623fdd14..d315b94ef0 100644 --- a/data/xml/2025.acl.xml +++ b/data/xml/2025.acl.xml @@ -2323,9 +2323,11 @@ QingGuNanjing University 3168-3181 Extracting sentence embeddings from large language models (LLMs) is a promising direction, as LLMs have demonstrated stronger semantic understanding capabilities. 
Previous studies typically focus on prompt engineering to elicit sentence embeddings from LLMs by prompting the model to encode sentence information into the embedding of the last token. However, LLMs are mostly decoder-only models with causal attention and the earlier tokens in the sentence cannot attend to the latter tokens, resulting in biased encoding of sentence information and cascading effects on the final decoded token. To this end, we propose a novel Token Prepending (TP) technique that prepends each layer’s decoded sentence embedding to the beginning of the sentence in the next layer’s input, allowing earlier tokens to attend to the complete sentence information under the causal attention mechanism. The proposed TP technique is a plug-and-play and training-free technique, which means it can be seamlessly integrated with various prompt-based sentence embedding methods and autoregressive LLMs. Extensive experiments on various Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our proposed TP technique can significantly improve the performance of existing prompt-based sentence embedding methods across different LLMs, while incurring negligible additional inference cost. - 2025.acl-long.159 + 2025.acl-long.159 fu-etal-2025-token 10.18653/v1/2025.acl-long.159 + + Added explanations. No Questions are Stupid, but some are Poorly Posed: Understanding Poorly-Posed Information-Seeking Questions @@ -4745,9 +4747,11 @@ EvgenyBurnaevSkolkovo Institute of Science and Technology 6463-6480 Parameter Efficient Fine-Tuning (PEFT) methods have gained popularity and democratized the usage of Large Language Models (LLMs). Recent studies have shown that a small subset of weights significantly impacts performance. Based on this observation, we introduce a novel PEFT method, called Gaussian noise Injected Fine Tuning of Salient Weights (GIFT-SW). Our method updates only salient columns, while injecting Gaussian noise into non-salient ones. To identify these columns, we developed a generalized sensitivity metric that extends and unifies metrics from previous studies. Experiments with LLaMA models demonstrate that GIFT-SW outperforms full fine-tuning and modern PEFT methods under the same computational budget. Moreover, GIFT-SW offers practical advantages to recover performance of models subjected to mixed-precision quantization with keeping salient weights in full precision. - 2025.acl-long.324 + 2025.acl-long.324 zhelnin-etal-2025-gift 10.18653/v1/2025.acl-long.324 + + Updated ack. Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis @@ -4828,9 +4832,12 @@ Marco AntonioStranisci 6625-6639 Canceling is a morally-driven phenomenon that hinders the development of safe social media platforms and contributes to ideological polarization. To address this issue we present the Canceling Attitudes Detection (CADE) dataset, an annotated corpus of canceling incidents aimed at exploring the factors of disagreements in evaluating people’s canceling attitudes on social media. Specifically, we study the impact of annotators’ morality in their perception of canceling, showing that morality is an independent axis for the explanation of disagreement on this phenomenon. Annotator’s judgments heavily depend on the type of controversial events and involved celebrities. This shows the need to develop more event-centric datasets to better understand how harms are perpetrated in social media and to develop more aware technologies for their detection. 
- 2025.acl-long.330 + 2025.acl-long.330 lo-etal-2025-unacceptable 10.18653/v1/2025.acl-long.330 + + The revision corrects some errors in the authors’ affiliations on the first page. Concretely, it corrects affiliations (3), (4), and (5). + Fixed minor error. <fixed-case>F</fixed-case>loor<fixed-case>P</fixed-case>lan-<fixed-case>LL</fixed-case>a<fixed-case>M</fixed-case>a: Aligning Architects’ Feedback and Domain Knowledge in Architectural Floor Plan Generation @@ -5462,9 +5469,11 @@ ZiweiLiuNanyang Technological University 7561-7582 Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model’s capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation. - 2025.acl-long.374 + 2025.acl-long.374 zhang-etal-2025-evaluation 10.18653/v1/2025.acl-long.374 + + Metadata correction. Large Language Models Struggle to Describe the Haystack without Human Help: A Social Science-Inspired Evaluation of Topic Models @@ -6191,7 +6200,7 @@ XingyaoWangAll Hands AI and University of Illinois Urbana-Champaign 8697-8727 Code localization–identifying precisely where in a codebase changes need to be made–is a fundamental yet challenging task in software maintenance. Existing approaches struggle to efficiently navigate complex codebases when identifying relevant code snippets. The challenge lies in bridging natural language problem descriptions with the target code elements, often requiring reasoning across hierarchical structures and multiple dependencies. We introduce LocAgent, a framework that addresses code localization through a graph-guided agent. By parsing codebases into directed heterogeneous graphs, LocAgent creates a lightweight representation that captures code structures and their dependencies, enabling LLM agents to effectively search and locate relevant entities through powerful multi-hop reasoning. Experimental results on real-world benchmarks demonstrate that our approach significantly enhances accuracy in code localization. Notably, our method with the fine-tuned Qwen-2.5-Coder-Instruct-32B model achieves comparable results to SOTA proprietary models at greatly reduced cost (approximately 86% reduction), reaching up to 92.7% accuracy on file-level localization while improving downstream GitHub issue resolution success rates by 12% for multiple attempts (Pass@10). Our code is available at https://github.com/gersteinlab/LocAgent. 
- 2025.acl-long.426 + 2025.acl-long.426 chen-etal-2025-locagent 10.18653/v1/2025.acl-long.426 @@ -7081,9 +7090,11 @@ FeiWuZhejiang University 9887-9908 Large language models (LLMs) are revolutionizing education, with LLM-based agents playing a key role in simulating student behavior. A major challenge in student simulation is modeling the diverse learning patterns of students at various cognitive levels. However, current LLMs, typically trained as “helpful assistants”, target at generating perfect responses. As a result, they struggle to simulate students with diverse cognitive abilities, as they often produce overly advanced answers, missing the natural imperfections that characterize student learning and resulting in unrealistic simulations. To address this issue, we propose a training-free framework for student simulation. We begin by constructing a cognitive prototype for each student using a knowledge graph, which captures their understanding of concepts from past learning records. This prototype is then mapped to new tasks to predict student performance. Next, we simulate student solutions based on these predictions and iteratively refine them using a beam search method to better replicate realistic mistakes. To validate our approach, we construct the Student_100 dataset, consisting of 100 students working on Python programming and 5,000 learning records. Experimental results show that our method consistently outperforms baseline models, achieving 100% improvement in simulation accuracy and realism. - 2025.acl-long.488 + 2025.acl-long.488 wu-etal-2025-embracing 10.18653/v1/2025.acl-long.488 + + Author update. <fixed-case>CADR</fixed-case>eview: Automatically Reviewing <fixed-case>CAD</fixed-case> Programs with Error Detection and Correction @@ -10641,9 +10652,11 @@ SinaZarrießBielefeld University 14956-14975 Communication among humans relies on conversational grounding, allowing interlocutors to reach mutual understanding even when they do not have perfect knowledge and must resolve discrepancies in each other’s beliefs. This paper investigates how large language models (LLMs) manage common ground in cases where they (don’t) possess knowledge, focusing on facts in the political domain where the risk of misinformation and grounding failure is high. We examine LLMs’ ability to answer direct knowledge questions and loaded questions that presuppose misinformation.We evaluate whether loaded questions lead LLMs to engage in active grounding and correct false user beliefs, in connection to their level of knowledge and their political bias.Our findings highlight significant challenges in LLMs’ ability to engage in grounding and reject false user beliefs, raising concerns about their role in mitigating misinformation in political discourse. - 2025.acl-long.728 + 2025.acl-long.728 lachenmaier-etal-2025-llms 10.18653/v1/2025.acl-long.728 + + Updated format. <fixed-case>G</fixed-case>raph<fixed-case>C</fixed-case>heck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking @@ -11753,9 +11766,11 @@ SuhyunKimKyung Hee University 16489-16507 We introduce a novel framework for consolidating multi-turn adversarial “jailbreak” prompts into single-turn queries, significantly reducing the manual overhead required for adversarial testing of large language models (LLMs). While multi-turn human jailbreaks have been shown to yield high attack success rates (ASRs), they demand considerable human effort and time. 
Our proposed Multi-turn-to-Single-turn (M2S) methods—Hyphenize, Numberize, and Pythonize—systematically reformat multi-turn dialogues into structured single-turn prompts. Despite eliminating iterative back-and-forth interactions, these reformatted prompts preserve and often enhance adversarial potency: in extensive evaluations on the Multi-turn Human Jailbreak (MHJ) dataset, M2S methods yield ASRs ranging from 70.6 % to 95.9 % across various state-of-the-art LLMs. Remarkably, our single-turn prompts outperform the original multi-turn attacks by up to 17.5 % in absolute ASR, while reducing token usage by more than half on average. Further analyses reveal that embedding malicious requests in enumerated or code-like structures exploits “contextual blindness,” undermining both native guardrails and external input-output safeguards. By consolidating multi-turn conversations into efficient single-turn prompts, our M2S framework provides a powerful tool for large-scale red-teaming and exposes critical vulnerabilities in contemporary LLM defenses. All code, data, and conversion prompts are available for reproducibility and further investigations: https://github.com/Junuha/M2S_DATA - 2025.acl-long.805 + 2025.acl-long.805 ha-etal-2025-one 10.18653/v1/2025.acl-long.805 + + Title update. <fixed-case>RAE</fixed-case>mo<fixed-case>LLM</fixed-case>: Retrieval Augmented <fixed-case>LLM</fixed-case>s for Cross-Domain Misinformation Detection Using In-Context Learning Based on Emotional Information @@ -14130,7 +14145,7 @@ Jordan LeeBoyd-GraberUniversity of Maryland, College Park 19586-19587 Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams’ timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration. - 2025.acl-long.962 + 2025.acl-long.962 sung-etal-2025-grace 10.18653/v1/2025.acl-long.962 @@ -18347,9 +18362,11 @@ NiranjanBalasubramanianState University of New York, Stony Brook 26039-26057 LLM generated code often contains security issues. We address two key challenges in improving secure code generation. First, obtaining high quality training data covering a broad set of security issues is critical. To address this, we introduce a method for distilling a preference dataset of insecure and secure code pairs from frontier LLMs, along with a security reasoning that explains the issues and the fix. The key idea here is to make use of security knowledge sources to devise a systematic prompting strategy that ensures broad coverage. Second, aligning models to secure code requires focusing on localized regions of code. 
Direct preference optimization methods, like SimPO, are not designed to handle these localized differences and turn out to be ineffective. We address this with a new localized preference optimization algorithm that masks the security related tokens in both the winning (secure) and losing (insecure) responses. To prevent loss in code quality, we also add a regularizer. Evaluations show that both training on our dataset, DiSCo, and the new preference optimization algorithm, LPO, yield substantial reductions in code insecurity while also improving overall code quality. Code and dataset are available at https://github.com/StonyBrookNLP/disco-lpo. - 2025.acl-long.1263 + 2025.acl-long.1263 hasan-etal-2025-teaching 10.18653/v1/2025.acl-long.1263 + + Author name update. Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in <fixed-case>LM</fixed-case>s @@ -20980,9 +20997,11 @@ GhassenKarrayUniversity of Zurich 30005-30031 This article examines LLMs’ ability to correctly label simple inferences with partisan conclusions. For this, we develop a dataset with both formal and material inferences, containing logically equivalent pairs of inferences with conclusions that favor either the political left or the political right. This allows us to focus on political bias as a source of decrease in performance. Our samples are synthetically generated and thus highly controlled, covering both English and German. We assess the performance of 16 configurations of both open and proprietary state-of-the-art LLMs on that dataset, finding generally unreliable performance as well as widespread political bias which, in the case of the English samples, persists throughout our experimental settings. - 2025.acl-long.1450 + 2025.acl-long.1450 gubelmann-karray-2025-assessing 10.18653/v1/2025.acl-long.1450 + + Adding acknowledgements section. <fixed-case>PARME</fixed-case>: Parallel Corpora for Low-Resourced <fixed-case>M</fixed-case>iddle <fixed-case>E</fixed-case>astern Languages @@ -21459,9 +21478,11 @@ MinZhangHarbin Institute of Technology, Shenzhen 30678-30701 Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks. Despite great success, recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs. To study the reason behind these limitations, we propose VGCure, a comprehensive benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs reveal that LVLMs are weak in basic graph understanding and reasoning tasks, particularly those concerning relational or structurally complex information. Based on this observation, we propose a structure-aware fine-tuning framework to enhance LVLMs with structure learning abilities through three self-supervised learning tasks. Experiments validate the effectiveness of our method in improving LVLMs’ performance on fundamental and downstream graph learning tasks, as well as enhancing their robustness against complex visual graphs. - 2025.acl-long.1482 + 2025.acl-long.1482 zhu-etal-2025-benchmarking 10.18653/v1/2025.acl-long.1482 + + Updated the Acknowledgements. 
<fixed-case>ISR</fixed-case>: Self-Refining Referring Expressions for Entity Grounding @@ -21794,9 +21815,11 @@ MinZhangHarbin Institute of Technology, Shenzhen 31156-31171 Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, while the ranking order of output generated by policy changes, their overall distribution remains relatively stable. This stability allows the conversion of the sampling process from the target policy into a computationally efficient re-ranking of preference data. Building on this hypothesis, we propose a new framework that leverages the model’s intrinsic safety judgment capability to extract reward signals, which are then used to calculate label confidence for preference reordering. Extensive experiments and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, remarkably enhancing the safety performance while avoiding about 300x computational overheads. - 2025.acl-long.1504 + 2025.acl-long.1504 qiyuan-etal-2025-efficient 10.18653/v1/2025.acl-long.1504 + + Updated ack. <fixed-case>E</fixed-case>nglish-based acoustic models perform well in the forced alignment of two <fixed-case>E</fixed-case>nglish-based Pacific Creoles @@ -22560,9 +22583,11 @@ YongLi 32400-32423 Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban aerial spaces remain to be explored. We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We manually controlled drones to collect 3D embodied motion video data from real-world cities and simulated environments, resulting in 1.5k video clips. Then we design a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely-used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between different tasks, showing that causal reasoning has a strong correlation with recall, perception, and navigation, while the abilities for counterfactual and associative reasoning exhibit lower correlation with other tasks. We also validate the potential for Sim-to-Real transfer in urban embodiment through fine-tuning. - 2025.acl-long.1558 + 2025.acl-long.1558 zhao-etal-2025-urbanvideo 10.18653/v1/2025.acl-long.1558 + + Added a footnote to clarify affiliations, updated author affiliations, and revised the caption for Figure 2. <fixed-case>HELIOS</fixed-case>: Harmonizing Early Fusion, Late Fusion, and <fixed-case>LLM</fixed-case> Reasoning for Multi-Granular Table-Text Retrieval @@ -23695,9 +23720,11 @@ JonathanMayUniversity of Southern California and USC/ISI 381-413 Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. 
We investigate the use of synthetic data in Fact Verification (FV) and Evidence-based Question Answering (QA) by incrementally replacing human-generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be improved by including as few as 125 human-generated data points. We show that matching the performance gain of a little human data requires an order of magnitude more synthetic data, and then estimate price ratios at which human annotation would be a more cost-effective solution. Our results suggest that even when human annotation at scale is infeasible, there is great value to having a small proportion of the dataset being human-generated. - 2025.acl-short.30 + 2025.acl-short.30 ashok-may-2025-little 10.18653/v1/2025.acl-short.30 + + Added a sponsor. Seeking Rational Demonstrations for Large Language Models: A Domain Generalization Approach to Unsupervised Cross-Domain Keyphrase Generation @@ -24598,9 +24625,11 @@ AiTiAwI2R 22-30 We introduce MERaLiON-AudioLLM, the first general-purpose audio-based large language model designed for multitask learning, with a particular focus on Singlish understanding. Trained on 62 million multimodal instruction samples comprising a total of 260k hours of audio, it exhibits strong generalization across a diverse set of tasks, including—but not limited to—automatic speech recognition, spoken question answering, speech translation, and paralinguistic analysis. Our results show significant improvements in local speech recognition and task-specific understanding, making MERaLiON-AudioLLM a leading solution for region-specific AI applications. An interactive demo has been developed to enable user-friendly interactions, supported by a backend with customized caching and load-balancing mechanisms. We benchmark the model across a broad range of multilingual and multitask scenarios, where it demonstrates competitive performance compared to other open-source models. The demo page, model weights and videos are publically accessible. - 2025.acl-demo.3 + 2025.acl-demo.3 he-etal-2025-meralion 10.18653/v1/2025.acl-demo.3 + + Updated Ack. <fixed-case>N</fixed-case>ame<fixed-case>T</fixed-case>ag 3: A Tool and a Service for Multilingual/Multitagset <fixed-case>NER</fixed-case> diff --git a/data/xml/2025.argmining.xml b/data/xml/2025.argmining.xml index 04961d6bdf..2422fdb783 100644 --- a/data/xml/2025.argmining.xml +++ b/data/xml/2025.argmining.xml @@ -211,9 +211,11 @@ ElsLefever 168-180 Definition generation models trained on dictionary data are generally expected to produce neutral and unbiased output while capturing the contextual nuances. However, previous studies have shown that generated definitions can inherit biases from both the underlying models and the input context. This paper examines the extent to which stance-related bias in argumentative data influences the generated definitions. In particular, we train a model on a slang-based dictionary to explore the feasibility of generating persuasive definitions that concisely reflect opposing parties’ understandings of contested terms. Through this study, we provide new insights into bias propagation in definition generation and its implications for definition generation applications and argument mining. 
- 2025.argmining-1.16 + 2025.argmining-1.16 evgrafova-etal-2025-stance 10.18653/v1/2025.argmining-1.16 + + Various updates. Reproducing the Argument Quality Prediction of Project Debater diff --git a/data/xml/2025.bionlp.xml b/data/xml/2025.bionlp.xml index eadf2861b0..1598bcf199 100644 --- a/data/xml/2025.bionlp.xml +++ b/data/xml/2025.bionlp.xml @@ -155,11 +155,13 @@ RamakanthKavuluruUniversity of Kentucky 101-113 Extractive question answering over clinical text is a crucial need to help deal with the deluge of clinical text generated in hospitals. While encoder models (e.g., BERT) have been popular for this reading comprehension–style question answering task, recently encoder-decoder models (e.g., T5) are on the rise. There is also the emergence of preference optimization techniques to align decoder-only LLMs with human preferences. In this paper, we combine encoder-decoder models with the direct preference optimization (DPO) method for the RadQA radiology question answering task. Our approach achieves a 12–15 F1 point improvement over previous state-of-the-art models. To the best of our knowledge, this effort is the first to show that DPO method also works for reading comprehension via novel heuristics to generate preference data without human inputs. - 2025.bionlp-1.10 + 2025.bionlp-1.10 2025.bionlp-1.10.SupplementaryMaterial.zip 2025.bionlp-1.10.SupplementaryMaterial.txt nahian-kavuluru-2025-radqa 10.18653/v1/2025.bionlp-1.10 + + Codebase link update. Gender-Neutral Large Language Models for Medical Applications: Reducing Bias in <fixed-case>P</fixed-case>ub<fixed-case>M</fixed-case>ed Abstracts @@ -649,9 +651,11 @@ TitipatAchakulvisutDepartment of Biomedical Engineering, Mahidol University 96-103 This paper presents an approach to answering patient-specific medical questions using electronic health record (EHR) grounding with ArchEHR-QA 2025 datasets. We address medical question answering as an alignment problem, focusing on generating responses factually consistent with patient-specific clinical notes through in-context learning techniques. We show that LLM-generated responses, used as few-shot examples with GPT-4.1 and Gemini-2.5-Pro, significantly outperform baseline approaches (overall score = 49.1), achieving strict precision, recall, and F1-micro scores of 60.6, 53.6, and 56.9, respectively, on the ArchEHR-QA 2025 test leaderboard. It achieves textual similarity between answers and essential evidence using BLEU, ROUGE, SARI, BERTScore, AlignScore, and MEDCON scores of 6.0, 32.1, 65.8, 36.4, 64.3, and 43.6, respectively. Our findings highlight the effectiveness of combining EHR grounding with few-shot examples for personalized medical question answering, establishing a promising approach for developing accurate and personalized medical question answering systems. We release our code at https://github.com/biodatlab/archehr-qa-lamar. - 2025.bionlp-share.12 + 2025.bionlp-share.12 yoadsanit-etal-2025-lamar 10.18653/v1/2025.bionlp-share.12 + + Minor fixes. 
Neural at <fixed-case>A</fixed-case>rch<fixed-case>EHR</fixed-case>-<fixed-case>QA</fixed-case> 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering @@ -912,9 +916,11 @@ HongyiXinShanghai Jiao Tong University 275-280 We propose a unified, multi-stage lay summarization pipeline for BioLaySumm 2025 (Subtask 1.1) that (1) selects and summarizes key article sections via BioBART, (2) retrieves K-shot demonstrations using BGE embeddings for in-context Llama 3 8B prompting, (3) applies LoRA adapters to Llama 3 8B for supervised fine-tuning, (4) merges section summaries with a second BioBART pass, and (5) refines outputs through reinforcement learning (PPO & GRPO) using a composite reward of factuality (AlignScore, SummaC), relevance (ROUGE-L, BERTScore), and readability (LENS, FKGL, DCRS, CLI). On PLOS and eLife validation sets, our complete system reduces DCRS from 9.23 to 8.56 and reduces CLI from 12.98 to 12.65, ranking 3rd in readability, and outperforms the llama3 finetune baseline in AlignScore (0.722 to 0.862), ranking 5th in factuality, demonstrating balanced gains across readability, relevance, and factuality. - 2025.bionlp-share.33 + 2025.bionlp-share.33 xu-etal-2025-team 10.18653/v1/2025.bionlp-share.33 + + This revision mainly updated some citations. <fixed-case>V</fixed-case>e<fixed-case>R</fixed-case>ea<fixed-case>F</fixed-case>ine: Iterative Verification Reasoning Refinement <fixed-case>RAG</fixed-case> for Hallucination-Resistant on Open-Ended Clinical <fixed-case>QA</fixed-case> diff --git a/data/xml/2025.coling.xml b/data/xml/2025.coling.xml index 13b13f2baf..74e2e6eaad 100644 --- a/data/xml/2025.coling.xml +++ b/data/xml/2025.coling.xml @@ -10149,10 +10149,11 @@ JuheePark 794–806 In-vehicle speech recognition (IVSR) systems are crucial components of modern automotive interfaces, enabling hands-free control and enhancing user safety. However, traditional IVSR systems often struggle with interpreting user intent accurately due to limitations in contextual understanding and ambiguity resolution, leading to user frustration. This paper introduces LLM ContextBridge, a novel hybrid architecture that integrates Pretrained Language Model-based intent classification with Large Language Models to enhance both command recognition and dialogue management. LLM ContextBridge serves as a seamless bridge between traditional natural language understanding techniques and LLMs, combining the precise intent recognition of conventional NLU with the contextual handling and ambiguity resolution capabilities of LLMs. This approach significantly improves recognition accuracy and user experience, particularly in complex, multi-turn dialogues. Experimental results show notable improvements in task success rates and user satisfaction, demonstrating that LLM ContextBridge can make IVSR systems more intuitive, responsive, and context-aware. - 2025.coling-industry.66 + 2025.coling-industry.66 chun-etal-2025-llm Fixed references. + Updated author info. Neural Document Segmentation Using Weighted Sliding Windows with Transformer Encoders diff --git a/data/xml/2025.conll.xml b/data/xml/2025.conll.xml index cbce8d3a84..b05e21c3e0 100644 --- a/data/xml/2025.conll.xml +++ b/data/xml/2025.conll.xml @@ -308,10 +308,12 @@ NathanSchneiderGeorgetown University 365-376 Construction Grammar hypothesizes that knowledge of a language consists chiefly of knowledge of form–meaning pairs (“constructions”) that include vocabulary, general grammar rules, and even idiosyncratic patterns. 
Recent work has shown that transformer language models represent at least some constructional patterns, including ones where the construction is rare overall. In this work, we probe BERT’s representation of the form and meaning of a minor construction of English, the NPN (noun–preposition–noun) construction—exhibited in such expressions as face to face and day to day—which is known to be polysemous. We construct a benchmark dataset of semantically annotated corpus instances (including distractors that superficially resemble the construction). With this dataset, we train and evaluate probing classifiers. They achieve decent discrimination of the construction from distractors, as well as sense disambiguation among true instances of the construction, revealing that BERT embeddings carry indications of the construction’s semantics.Moreover, artificially permuting the word order of true construction instances causes them to be rejected, indicating sensitivity to matters of form. We conclude that BERT does latently encode at least some knowledge of the NPN construction going beyond a surface syntactic pattern and lexical cues. - 2025.conll-1.24 + 2025.conll-1.24 2025.conll-1.24.software.zip scivetti-schneider-2025-construction 10.18653/v1/2025.conll-1.24 + + Minor updates. Evidence of Generative Syntax in <fixed-case>LLM</fixed-case>s diff --git a/data/xml/2025.depling.xml b/data/xml/2025.depling.xml index 19bd4e81f6..602fa945de 100644 --- a/data/xml/2025.depling.xml +++ b/data/xml/2025.depling.xml @@ -16,8 +16,11 @@ 979-8-89176-290-9 - 2025.depling-1.0 + 2025.depling-1.0 depling-ws-syntaxfest-2025-1 + + Corrected a typo. + Minor removal. A Typology of Non-Projective Patterns in Unas’s and Teti’s Pyramid Texts @@ -54,8 +57,10 @@ LeoWannerBarcelona Supercomputing Center 36-53 While the competence of LLMs to cope with agreement constraints has been widely tested in English, only a very limited number of works deals with morphologically rich(er) languages. In this work, we experiment with 25 mono- and multilingual LLMs, applying them to a collection of more than 5,000 test examples that cover the main agreement phenomena in three Romance languages (Italian, Portuguese, and Spanish) and one Slavic Language (Russian). We identify which of the agreement phenomena are most difficult for which models and challenge some common assumptions of what makes a good model. The test suites into which the test examples are organized are openly available and can be easily adapted to other agreement phenomena and other languages for further research. - 2025.depling-1.4 + 2025.depling-1.4 taboas-garcia-wanner-2025-assessing + + Added acknowledgments. Introducing <fixed-case>KIP</fixed-case>arla Forest: seeds for a <fixed-case>UD</fixed-case> annotation of interactional syntax diff --git a/data/xml/2025.findings.xml b/data/xml/2025.findings.xml index e3c763dd4a..0508734bde 100644 --- a/data/xml/2025.findings.xml +++ b/data/xml/2025.findings.xml @@ -5249,9 +5249,11 @@ NicolasThomesorbonne université 7030-7046 Reinforcement learning (RL) is a promising approach for aligning large language models (LLMs) knowledge with sequential decision-making tasks. However, few studies have thoroughly investigated the impact on LLM agents capabilities of fine-tuning them with RL in a specific environment. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. 
Our findings reveal that the performance of LLMs degrades when faced with prompt formulations different from those used during the RL training phase. Besides, we analyze the source of this sensitivity by examining the model’s internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs. - 2025.findings-naacl.390 + 2025.findings-naacl.390 aissi-etal-2025-reinforcement 10.18653/v1/2025.findings-naacl.390 + + Author detail update. An empirical study of validating synthetic data for formula generation @@ -14021,9 +14023,12 @@ YixuanYuanThe Chinese University of Hong Kong 10345-10359 Large Language Models (LLMs) are transforming healthcare through LLM-based agents that can understand and assist with medical tasks. This survey examines the architectures, applications, and challenges of LLM-based agents in medicine. We analyze key components including system profiles, clinical planning, medical reasoning frameworks, and external capacity enhancement. The survey covers major applications in clinical decision support, medical documentation, training simulations, and healthcare service optimization, along with evaluation frameworks and metrics. While these agents show promise in enhancing healthcare delivery, challenges remain in hallucination management, multimodal integration, implementation, and ethics. We conclude by highlighting future directions in medical reasoning, physical system integration, and training simulations, providing researchers and practitioners with a structured overview of the field’s current state and prospects. - 2025.findings-acl.539 + 2025.findings-acl.539 wang-etal-2025-survey 10.18653/v1/2025.findings-acl.539 + + Corrected a few citations. + Updated citations. Context-Robust Knowledge Editing for Language Models @@ -19455,9 +19460,11 @@ DenizGunduzImperial College London 18189-18204 Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative sampling and the concept of channel simulation from information theory, which aims at simulating a noisy channel using as few bits as possible. This connection allows us to provide an information-theoretic analysis of the speed up that can be achieved by speculative sampling. Leveraging this link, we derive an explicit relation between generation speed-up and the number of tokens k generated by the draft model for large k, which serves as an upper bound for all k. We also propose a novel speculative sampling method via exponential races called ERSS that matches state-of-the-art performance. - 2025.findings-acl.936 + 2025.findings-acl.936 kobus-gunduz-2025-speculative 10.18653/v1/2025.findings-acl.936 + + Added footnote. Going Beyond Your Expectations in Latency Metrics for Simultaneous Speech Translation @@ -25585,9 +25592,11 @@ JonathanMayUniversity of Southern California and USC/ISI 26744-26759 Non-verbal communication (NVC) is an integral part of human language, but it has been overlooked in natural language processing research. Studying NVC in general is challenging because of its high variance in interpretation among individuals and cultures, but mime—the theatrical technique of suggesting intent using only gesture, expression, and movement—is a subset of NVC with much lower human interpretation variance. 
As a gateway for evaluating vision-language models on their understanding of NVC, we propose Mime Identification-based Multimodal Evaluation (MIME), a gesture recognition task built upon a novel corpus of mimed activity comprising 86 unique gestures with a variety of perturbations applied to the avatar, background, and viewpoint for evaluating recognition robustness. We find that both open-weight and API-based vision-language models perform significantly worse than humans at identifying mimed gestures in MIME, motivating the need for increased research for instilling more robust understanding of human actions for VLMs. - 2025.findings-acl.1372 + 2025.findings-acl.1372 cho-etal-2025-vision 10.18653/v1/2025.findings-acl.1372 + + Minor updates. Training Language Model to Critique for Better Refinement diff --git a/data/xml/2025.gebnlp.xml b/data/xml/2025.gebnlp.xml index da4506bbbd..292a04b2cc 100644 --- a/data/xml/2025.gebnlp.xml +++ b/data/xml/2025.gebnlp.xml @@ -64,9 +64,11 @@ MaschaKurpicz-BrikiBFH - Bern University of Applied Sciences 33-51 Bias in Natural Language Processing (NLP) applications has become a critical issue, with many methods developed to measure and mitigate bias in word embeddings and language models. However, most approaches focus on single categories such as gender or ethnicity, neglecting the intersectionality of biases, particularly in non-English languages. This paper addresses these gaps by studying both single-category and intersectional biases in Italian word embeddings and language models. We extend existing bias metrics to Italian, introducing GG-FISE, a novel method for detecting intersectional bias while accounting for grammatical gender. We also adapt the CrowS-Pairs dataset and bias metric to Italian. Through a series of experiments using WEAT, SEAT, and LPBS tests, we identify significant biases along gender and ethnic lines, with particular attention to biases against Romanian and South Asian populations. Our results highlight the need for culturally adapted methods to detect and address biases in multilingual and intersectional contexts. - 2025.gebnlp-1.3 + 2025.gebnlp-1.3 puttick-kurpicz-briki-2025-detecting 10.18653/v1/2025.gebnlp-1.3 + + Minor updates. Power(ful) Associations: Rethinking “Stereotype” for <fixed-case>NLP</fixed-case> diff --git a/data/xml/2025.gem.xml b/data/xml/2025.gem.xml index b7accde96b..993d29dc3e 100644 --- a/data/xml/2025.gem.xml +++ b/data/xml/2025.gem.xml @@ -743,8 +743,10 @@ AkikoAizawaNational Institute of Informatics 973-973 Keyphrase generation refers to the task of producing a set of words or phrases that summarises the content of a document. Continuous efforts have been dedicated to this task over the past few years, spreading across multiple lines of research, such as model architectures, data resources, and use-case scenarios. Yet, the current state of keyphrase generation remains unknown as there has been no attempt to review and analyse previous work. In this paper, we bridge this gap by presenting an analysis of over 50 research papers on keyphrase generation, offering a comprehensive overview of recent progress, limitations, and open challenges. Our findings highlight several critical issues in current evaluation practices, such as the concerning similarity among commonly-used benchmark datasets and inconsistencies in metric calculations leading to overestimated performances. 
Additionally, we address the limited availability of pre-trained models by releasing a strong PLM-based model for keyphrase generation as an effort to facilitate future research. - 2025.gem-1.76 + 2025.gem-1.76 boudin-aizawa-2025-analysis + + Updated the paper. <fixed-case>U</fixed-case>-<fixed-case>MATH</fixed-case>: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models diff --git a/data/xml/2025.iwpt.xml b/data/xml/2025.iwpt.xml index 263bfc42a8..7153f07402 100644 --- a/data/xml/2025.iwpt.xml +++ b/data/xml/2025.iwpt.xml @@ -15,8 +15,10 @@ 979-8-89176-294-7 - 2025.iwpt-1.0 + 2025.iwpt-1.0 iwpt-syntaxfest-2025-1 + + Typo correction. An Efficient Parser for Bounded-Order Product-Free <fixed-case>L</fixed-case>ambek Categorial Grammar via Term Graph diff --git a/data/xml/2025.knowllm.xml b/data/xml/2025.knowllm.xml index 0403fa2474..c1be4e66fb 100644 --- a/data/xml/2025.knowllm.xml +++ b/data/xml/2025.knowllm.xml @@ -174,9 +174,11 @@ ZhengChenHong Kong University of Science and Technology 120-139 Chinese homophones, prevalent in Internet culture, bring rich linguistic twists that are challenging for language models. While native speakers disambiguate them through phonological reasoning and contextual understanding, it remains untested how well LLMs perform on this task and whether LLMs also achieve this via similar reasoning processes or merely through memorization of homophone-original word pairs during training.In this paper, we present HomoP-CN, the first Chinese Internet homophones dataset with systematic perturbations for evaluating LLMs’ homophone restoration capabilities. Using this benchmark, we investigated the influence of semantic, phonological, and graphemic features on LLMs’ restoration accuracy, measured the reliance levels of each model on memorization during restoration through consistency ratios under controlled perturbations, and assessed the effectiveness of various prompting strategies, including contextual cues, pinyin augmentation, few-shot learning, and thought-chain approaches. - 2025.knowllm-1.11 + 2025.knowllm-1.11 ma-etal-2025-reasoning 10.18653/v1/2025.knowllm-1.11 + + Correct author list. Superfluous Instruction: Vulnerabilities Stemming from Task-Specific Superficial Expressions in Instruction Templates diff --git a/data/xml/2025.law.xml b/data/xml/2025.law.xml index 6dd0459955..614cd1b9ca 100644 --- a/data/xml/2025.law.xml +++ b/data/xml/2025.law.xml @@ -278,9 +278,11 @@ NataliaPatiño MazzottiGoethe University Frankfurt 279-284 In this paper, we identify types of uncertainty in interlinear glossed text (IGT) annotation, a common notation for language data in linguistic research. - 2025.law-1.23 + 2025.law-1.23 ionov-patino-mazzotti-2025-addressing 10.18653/v1/2025.law-1.23 + + Corrected errors. Illuminating Logical Fallacies with the <fixed-case>CAMPFIRE</fixed-case> Corpus diff --git a/data/xml/2025.naacl.xml b/data/xml/2025.naacl.xml index feb619dc98..71f1e42c2d 100644 --- a/data/xml/2025.naacl.xml +++ b/data/xml/2025.naacl.xml @@ -1045,9 +1045,11 @@ IsabelleAugensteinUniversity of Copenhagen 1607-1627 Studying human values is instrumental for cross-cultural research, enabling a better understanding of preferences and behaviour of society at large and communities therein. To study the dynamics of communities online, we propose a method to computationally analyse values present on Reddit. Our method allows analysis at scale, complementing survey based approaches. 
We train a value relevance and a value polarity classifier, which we thoroughly evaluate using in-domain and out-of-domain human annotations. Using these, we automatically annotate over nine million posts across 12k subreddits with Schwartz values. Our analysis unveils both previously recorded and novel insights into the values prevalent within various online communities. For instance, we discover a very negative stance towards conformity in the Vegan and AbolishTheMonarchy subreddits. Additionally, our study of geographically specific subreddits highlights the correlation between traditional values and conservative U.S. states. Through our work, we demonstrate how our dataset and method can be used as a complementary tool for qualitative study of online communication. - 2025.naacl-long.77 + 2025.naacl-long.77 borenstein-etal-2025-investigating 10.18653/v1/2025.naacl-long.77 + + Added Ack. Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation @@ -6288,9 +6290,11 @@ LeiLiSchool of Computer Science, Carnegie Mellon University 9077-9090 The lottery ticket hypothesis posits the existence of “winning tickets” within a randomly initialized neural network. Do winning tickets exist for LLMs in fine-tuning scenarios? How can we find such winning tickets? In this paper, we propose KS-Lottery, a method to identify a small subset of LLM parameters highly effective in multilingual fine-tuning. Our key idea is to use Kolmogorov-Smirnov Test to analyze the distribution shift of parameters before and after fine-tuning. We further theoretically prove that KS-Lottery can find the certified winning tickets in the embedding layer, fine-tuning on the found parameters is guaranteed to perform as well as full fine-tuning. Comparing KS-Lottery with other tuning algorithms on translation tasks, the experimental results show that KS-Lottery finds a much smaller set of parameters for fine-tuning while achieving the comparable performance as full fine-tuning LLM. Surprisingly, we find that fine-tuning 18 tokens’ embedding of LLaMA suffices to reach the fine-tuning translation performance . - 2025.naacl-long.458 + 2025.naacl-long.458 yuan-etal-2025-ks 10.18653/v1/2025.naacl-long.458 + + Minor updates. <fixed-case>PA</fixed-case>-<fixed-case>RAG</fixed-case>: <fixed-case>RAG</fixed-case> Alignment via Multi-Perspective Preference Optimization @@ -8656,9 +8660,11 @@ SaptarshiGhoshIndian Institute of Technology Kharagpur 12688-12704 Large language models (LLMs) are increasingly recognized for their exceptional generative capabilities and versatility across various tasks. However, the high inference costs associated with these models have not received adequate attention, particularly when compared to the focus on training costs in existing research. In response to this gap, our study conducts a comprehensive benchmarking of LLM inference energy across a wide range of NLP tasks, where we analyze the impact of different models, tasks, prompts, and system-related factors on inference energy. Specifically, our experiments reveal several interesting insights, including strong correlation of inference energy with output token length and response time. Also, we find that quantization and optimal batch sizes, along with targeted prompt phrases, can significantly reduce energy usage. This study is the first to thoroughly benchmark LLM inference across such a diverse range of aspects, providing insights and offering several recommendations for improving energy efficiency in model deployment.
- 2025.naacl-long.632 + 2025.naacl-long.632 poddar-etal-2025-towards 10.18653/v1/2025.naacl-long.632 + + Add equal contribution note <fixed-case>CSR</fixed-case>-Bench: Benchmarking <fixed-case>LLM</fixed-case> Agents in Deployment of Computer Science Research Repositories diff --git a/data/xml/2025.quasy.xml b/data/xml/2025.quasy.xml index cbcccd726e..f684f0cee3 100644 --- a/data/xml/2025.quasy.xml +++ b/data/xml/2025.quasy.xml @@ -16,8 +16,10 @@ 979-8-89176-293-0 - 2025.quasy-1.0 + 2025.quasy-1.0 quasy-ws-syntaxfest-2025-1 + + Typo correction. Subject-Verb Agreement Alternations in <fixed-case>S</fixed-case>panish Pseudopartitive Constructions: A Corpus Study @@ -54,8 +56,10 @@ SylvainKahaneUniversité Paris Nanterre 26-38 In this paper, we develop a data-driven contrastive framework to extract common and distinctive linguistic descriptions from syntactic treebanks. The extracted contrastive rules are defined by a statistically significant difference in precision and classified as common and distinctive rules across the set of treebanks. We illustrate our method by working on object word order using Universal Dependencies (UD) treebanks in 6 Romance languages: Brazilian Portuguese, Catalan, French, Italian, Romanian and Spanish. We discuss the limitations faced due to inconsistent annotation and the feasibility of conducting contrasting studies using the UD collection. - 2025.quasy-1.5 + 2025.quasy-1.5 herrera-etal-2025-extraction + + Minor fixes. A Quantitative Study of Syntactic Complexity across Genres: Dependency Distance in <fixed-case>E</fixed-case>nglish and <fixed-case>C</fixed-case>hinese diff --git a/data/xml/2025.udw.xml b/data/xml/2025.udw.xml index e3e5eaa7ac..c63d5ade65 100644 --- a/data/xml/2025.udw.xml +++ b/data/xml/2025.udw.xml @@ -16,8 +16,10 @@ 979-8-89176-292-3 - 2025.udw-1.0 + 2025.udw-1.0 udw-ws-2025-1 + + Typo correction. Reference and Modification in <fixed-case>U</fixed-case>niversal <fixed-case>D</fixed-case>ependencies diff --git a/data/xml/2025.woah.xml b/data/xml/2025.woah.xml index 07b1cd1f52..d1ee3eaa46 100644 --- a/data/xml/2025.woah.xml +++ b/data/xml/2025.woah.xml @@ -13,7 +13,7 @@
Vienna, Austria
August 2025 - 2025.woah-1 + 2025.woah-1 woah ws 979-8-89176-105-6 @@ -385,8 +385,7 @@ MarcosGarciaUniversidade de Santiago de Compostela 426-457 Conspiracist narratives posit an omnipotent, evil group causing harm throughout domains. However, modern-day online conspiracism is often more erratic, consisting of loosely connected posts displaying a general anti-establishment attitude pervaded by negative emotions. We gather a dataset of 300 conspiracist and mainstream, Telegram channels in Italian and English and use the automatic extraction of entities and emotion detection to compare structural characteristics of both types of channels. We create a co-occurrence network of entities to analyze how the different types of channels introduce and use them across posts and topics. We find that conspiracist channels are characterized by anger. Moreover, co-occurrence networks of entities appearing in conspiracist channels are more dense. We theorize that this reflects a narrative structure where all actants are pushed into a single domain. Conspiracist channels disproportionately associate the most central group of entities with anger and fear. We do not find evidence that entities in conspiracist narratives occur across more topics. This could indicate an erratic type of online conspiracism where everything can be connected to everything and that is characterized by a high number of entities and high levels of anger. - 2025.woah-1.41 - 2025.woah-1.41.SupplementaryMaterial.zip + 2025.woah-1.41 2025.woah-1.41.SupplementaryMaterial.zip laken-etal-2025-multilingual
diff --git a/data/xml/N13.xml b/data/xml/N13.xml index b9b1a51e96..b230946b36 100644 --- a/data/xml/N13.xml +++ b/data/xml/N13.xml @@ -1176,9 +1176,11 @@ TheresaWilson DavidYarowsky 1010–1019 - N13-1121 + N13-1121 bergsma-etal-2013-broadly
To Link or Not to Link? A Study on End-to-End Tweet Entity Linking diff --git a/data/xml/P19.xml b/data/xml/P19.xml index 31c3570572..0cdab3ee01 100644 --- a/data/xml/P19.xml +++ b/data/xml/P19.xml @@ -6473,11 +6473,13 @@ DavidReitter 5127–5136 We examine the benefits of visual context in training neural language models to perform next-word prediction. A multi-modal neural architecture is introduced that outperform its equivalent trained on language alone with a 2% decrease in perplexity, even when no visual context is available at test. Fine-tuning the embeddings of a pre-trained state-of-the-art bidirectional language model (BERT) in the language modeling framework yields a 3.5% improvement. The advantage for training with visual context when testing without is robust across different languages (English, German and Spanish) and different models (GRU, LSTM, Delta-RNN, as well as those that use BERT embeddings). Thus, language models perform better when they learn like a baby, i.e, in a multi-modal environment. This finding is compatible with the theory of situated cognition: language is inseparable from its physical context. - P19-1506 + P19-1506 P19-1506.Supplementary.pdf Relating Simple Sentence Representations in Deep Neural Networks and the Brain diff --git a/data/yaml/name_variants.yaml b/data/yaml/name_variants.yaml index 1468d7a0bc..064ed1313b 100644 --- a/data/yaml/name_variants.yaml +++ b/data/yaml/name_variants.yaml @@ -23,6 +23,9 @@ - {first: Chalamalasetti, last: Kranti} - canonical: {first: Felicia, last: Körner} id: felicia-koerner + orcid: 0000-0002-4086-5338 + degree: Ludwig Maximilian University of Munich + comment: LMU variants: - {first: Felicia, last: Koerner} - canonical: {first: Pranav, last: A} diff --git a/data/yaml/sigs/sigdial.yaml b/data/yaml/sigs/sigdial.yaml index 01bfa3d963..d0da8faac4 100644 --- a/data/yaml/sigs/sigdial.yaml +++ b/data/yaml/sigs/sigdial.yaml @@ -1,6 +1,6 @@ Name: ACL/ISCA Special Interest Group on Discourse and Dialogue ShortName: SIGDIAL -URL: http://www.aclweb.org/sigdial +URL: https://sigdial.org Meetings: - 2024: - 2024.sigdial-1 # Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue