United States

Historical documents encompass a wealth of cultural treasures but suffer from severe damage over time, including missing characters, paper damage, and ink erosion. However, existing document processing methods primarily focus on tasks such as binarization and enhancement, neglecting the repair of these damages. To this end, we present a new task, termed Historical Document Repair (HDR), which aims to predict the original appearance of damaged historical documents. To fill the gap in this field, we propose a large-scale dataset HDR28K and a diffusion-based network DiffHDR for historical document repair. Specifically, HDR28K contains 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations. Moreover, DiffHDR augments the vanilla diffusion framework with semantic and spatial information and a meticulously designed character perceptual loss for contextual and visual coherence. Experimental results demonstrate that the proposed DiffHDR trained using HDR28K significantly surpasses existing approaches and exhibits remarkable performance in handling real damaged documents. Notably, DiffHDR can also be extended to document editing and text block generation, showcasing its high flexibility and generalization capacity. We believe this study could pioneer a new direction of document processing and contribute to the inheritance of invaluable cultures and civilizations. The dataset and code will be publicly available.

AAAI 2025

Predicting the Original Appearance of Damaged Historical Documents

technical paper

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Recent advances in large language model assistants have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to the labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing attack success rate. Additionally, methods that decrease the cosine similarity from historical embeddings with semantic diversity rewards lead to novelty stagnation as history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better enhancing the resiliency of blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. Overall, our method provides an effective and efficient approach to LLM red teaming, accelerating real-world deployment.

DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints

To address catastrophic forgetting in Continual Relation Extraction (CRE), many current approaches rely on memory buffers to rehearse previously learned knowledge while acquiring new tasks. Recently, prompt-based methods have emerged as potent alternatives to rehearsal-based strategies, demonstrating strong empirical performance. However, upon analyzing existing prompt-based approaches for CRE, we identified several critical limitations, such as inaccurate prompt selection, inadequate mechanisms for mitigating forgetting in shared parameters, and suboptimal handling of cross-task and within-task variances. To overcome these challenges, we draw inspiration from the relationship between prefix tuning and mixture of experts, proposing a novel approach that employs a prompt pool for each task, capturing variations within each task while enhancing cross-task variances. Furthermore, we incorporate a generative model to consolidate prior knowledge within shared parameters, eliminating the need for explicit data storage. Extensive experiments validate the efficacy of our approach, demonstrating superior performance over state-of-the-art prompt-based and rehearsal-free methods in continual relation extraction.

Adaptive Prompting for Continual Relation Extraction: A Within-Task Variance Perspective

The profusion of knowledge encoded in large language models (LLMs) and their ability to apply this knowledge zero-shot in a range of settings makes them promising candidates for use in decision-making. However, they are currently limited by their inability to provide outputs which can be faithfully explained and effectively contested to correct mistakes. In this paper, we attempt to reconcile these strengths and weaknesses by introducing argumentative LLMs (aLLMs), a method for augmenting LLMs with argumentative reasoning. Concretely, aLLMs construct argumentation frameworks, which then serve as the basis for formal reasoning in support of decision-making. The interpretable nature of these argumentation frameworks and formal reasoning means that any decision made by aLLMs may be explained and contested. We evaluate aLLMs' performance experimentally in comparison with state-of-the-art techniques, in the context of the decision-making task of claim verification. We also define novel properties to characterise contestability and assess aLLMs formally in terms of these properties.

Argumentative Large Language Models for Explainable and Contestable Claim Verification

Video captioning automatically generates natural language phrases to explain the contents of video frames. When deploying captioning models in specialized domains, active learning can help to reduce the high annotation cost. However, existing active video captioning methods rely on conventional uncertainty estimation, which can be highly unreliable because the generative nature of the captioning process is much more complex than standard supervised learning tasks. Both our empirical evaluation and theoretical investigation reveal that entropy-based uncertainty estimation is in fact inflated, which will mislead active video sampling. Another challenge arises from the rich content of videos, as each video could be described in multiple ways. A single uncertainty score obtained from one possible caption does not capture the diversity induced by the rich content. To fill this critical gap, we propose to identify multiple sources of uncertainty and perform novel hierarchical aggregation to integrate uncertainty from distinct sources mathematically. This yields a holistic uncertainty metric that quantifies the overall informativeness of video content for active sampling. The overall uncertainty is built upon conditional vacuity, which is a novel extension of the second-order uncertainty introduced along with the evidential learning framework to the generative setting, leading to more robust uncertainty estimation without inflation. Both theoretical analysis and experimental evaluation are conducted to justify the effectiveness of the proposed framework for complex uncertainty estimation and interactive learning.

Hierarchical Multi-Source Uncertainty Aggregation for Interactive Video Captioning
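
For readers unfamiliar with vacuity, the sketch below shows the standard second-order uncertainty from evidential learning that the abstract above builds on. It is not the paper's conditional vacuity or its hierarchical aggregation, which the abstract does not specify. The function name `vacuity` and the example evidence vectors are illustrative assumptions. With non-negative evidence e_k over K classes, the Dirichlet parameters are alpha_k = e_k + 1 and vacuity is K divided by the sum of the alpha_k.

```python
from typing import Sequence


def vacuity(evidence: Sequence[float]) -> float:
    """Standard evidential-learning vacuity: K / sum(e_k + 1).

    Hypothetical helper for illustration; the paper's *conditional*
    vacuity for generative captioning is not reproduced here.
    """
    k = len(evidence)
    if k == 0:
        raise ValueError("evidence must not be empty")
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    # Dirichlet strength S is the sum of alpha_k = e_k + 1.
    strength = sum(e + 1.0 for e in evidence)
    return k / strength


# No evidence at all: maximal vacuity.
no_evidence = vacuity([0.0, 0.0, 0.0, 0.0])
# Plenty of evidence spread evenly: the expected class probabilities are
# uniform (maximal entropy), yet vacuity is low because evidence exists.
even_evidence = vacuity([3.0, 3.0, 3.0, 3.0])
print(no_evidence, even_evidence)
```

The second case illustrates why a first-order measure such as entropy and a second-order measure such as vacuity can disagree: the model is undecided between classes but not lacking evidence.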

Deep learning based dehazing networks trained on paired synthetic data have shown impressive performance, but they struggle with significant degradation in generalization ability on real-world hazy scenes. In this paper, we propose a lightweight Retinex-based Generative Adversarial Network (RetinexGAN) for real-world image dehazing using unpaired data. Our RetinexGAN consists of two stages: self-supervised pre-training and weakly-supervised fine-tuning. During the pre-training, we reduce the image dehazing task to an illumination-reflectance decomposition task based on the duality correlation between Retinex and dehazing. Specifically, a decomposition network named DecomNet is constructed to obtain an illumination and a reflectance simultaneously. Moreover, a self-supervised learning strategy is developed to construct the connection between the preliminary dehazed result and the input hazy image, which constrains the solution space of DecomNet and accelerates training, leading to a more realistic dehazed result. In the fine-tuning stage, we develop a dual DTCWT-based attention module and embed it into the U-Net architecture to further improve the quality of the preliminary result in the frequency domain. In addition, adversarial learning is employed to constrain the relevance between the clean image and the final dehazed result in a weakly supervised manner, which can promote more natural results. Extensive experiments on several real-world datasets demonstrate that our proposed RetinexGAN performs favorably over state-of-the-art dehazing methods in visual quality and quantitative evaluation.

Dehaze-RetinexGAN: Real-World Image Dehazing via Retinex-based Generative Adversarial Network

Traffic prediction is critical for optimizing travel scheduling and enhancing public safety, yet the complex spatial and temporal dynamics within traffic data present significant challenges for accurate forecasting. In this paper, we introduce a novel model, the Spatiotemporal-aware Trend-Seasonality Decomposition Network (STDN). This model begins by constructing a dynamic graph structure to represent traffic flow and incorporates novel spatio-temporal embeddings to jointly capture global traffic dynamics. The representations learned are further refined by a specially designed trend-seasonality decomposition module, which disentangles the trend-cyclical component and seasonal component for each traffic node at different times within the graph. These components are subsequently processed through an encoder-decoder network to generate the final predictions. Extensive experiments conducted on real-world traffic datasets demonstrate that STDN achieves superior performance with remarkable computation cost. Furthermore, we have released a new traffic dataset named JiNan, which features unique inner-city dynamics, thereby enriching the scenario comprehensiveness in traffic prediction evaluation. All source code and data are available.

Spatiotemporal-aware Trend-Seasonality Decomposition Network for Traffic Flow Forecasting
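
As a rough illustration of what a trend-seasonality split does, the sketch below applies a classical moving-average decomposition to a single node's series. This is a stand-in, not STDN's learned decomposition module, which the abstract above does not detail. The function names, the window size, and the sample values are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def moving_average(series: Sequence[float], window: int) -> List[float]:
    """Centered moving average with edge values repeated as padding."""
    if not series:
        raise ValueError("series must not be empty")
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose(series: Sequence[float], window: int = 3) -> Tuple[List[float], List[float]]:
    """Split a series into a smooth trend and the residual (seasonal) part."""
    trend = moving_average(series, window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal


# Hypothetical traffic-flow readings for one node.
flow = [10.0, 14.0, 9.0, 15.0, 11.0, 16.0]
trend, seasonal = decompose(flow, window=3)
print(trend)
print(seasonal)
```

By construction the two components add back to the original series, so a downstream model can process them separately without losing information.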

The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. ReasonEval employs *validity* and *redundancy* to characterize the reasoning quality, along with accompanying LLMs that assess them automatically. We explore different design options for the LLM-based evaluators and empirically demonstrate that ReasonEval, when instantiated with base models possessing strong mathematical knowledge and trained with high-quality labeled data, consistently outperforms baseline methods in the meta-evaluation datasets. We also highlight the strong generalization capabilities of ReasonEval. By utilizing ReasonEval to evaluate LLMs specialized in math, we find that an increase in final-answer accuracy does not necessarily guarantee an improvement in the overall quality of the reasoning steps for challenging mathematical problems. Additionally, we observe that ReasonEval can play a significant role in data selection. We open-source the best-performing model, meta-evaluation script, and all evaluation results to facilitate future research.

Evaluating Mathematical Reasoning Beyond Accuracy

Given that AI systems are set to play a pivotal role in future decision-making processes, their trustworthiness and reliability are of critical concern. Due to their scale and complexity, modern AI systems resist direct interpretation, and alternative ways are needed to establish trust in those systems and determine how well they align with human values. We argue that good measures of the information processing similarities between AI and humans may be able to achieve these same ends. While representational alignment (RA) approaches measure similarity between the internal states of two systems, the associated data can be expensive and difficult to collect for human systems. In contrast, behavioural alignment (BA) comparisons are cheaper and easier, but questions remain as to their sensitivity and reliability. We propose two new behavioural alignment metrics: misclassification agreement, which measures the similarity between the errors of two systems on the same instances, and class-level error similarity, which measures the similarity between the error distributions of two systems. We show that our metrics correlate well with RA metrics and provide complementary information to another BA metric within a range of domains, and set the scene for a new approach to value alignment.

Measuring Error Alignment for Decision-Making Systems
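
To make the two metrics named in the abstract above concrete, here is a minimal sketch under loudly stated assumptions. The abstract gives no formulas, so both definitions below are simplified stand-ins. Misclassification agreement is taken as the share of jointly misclassified instances on which the two systems give the same wrong label. Class-level error similarity is taken as one minus the Jensen-Shannon divergence between the two systems' distributions over (true label, predicted label) errors. The paper may define both differently, for example with chance correction.

```python
from collections import Counter
from math import log2
from typing import Dict, Hashable, Sequence, Tuple

Label = Hashable


def misclassification_agreement(y_true: Sequence[Label], pred_a: Sequence[Label],
                                pred_b: Sequence[Label]) -> float:
    """Assumed definition: among instances both systems get wrong,
    the fraction where they choose the same wrong label."""
    both_wrong = [(a, b) for t, a, b in zip(y_true, pred_a, pred_b)
                  if a != t and b != t]
    if not both_wrong:
        return 0.0
    return sum(a == b for a, b in both_wrong) / len(both_wrong)


def error_distribution(y_true: Sequence[Label],
                       pred: Sequence[Label]) -> Dict[Tuple[Label, Label], float]:
    """Normalized counts of (true, predicted) pairs over the errors only."""
    errors = Counter((t, p) for t, p in zip(y_true, pred) if t != p)
    total = sum(errors.values())
    return {k: v / total for k, v in errors.items()} if total else {}


def class_level_error_similarity(y_true: Sequence[Label], pred_a: Sequence[Label],
                                 pred_b: Sequence[Label]) -> float:
    """Assumed definition: 1 - Jensen-Shannon divergence (base 2)."""
    p = error_distribution(y_true, pred_a)
    q = error_distribution(y_true, pred_b)
    if not p or not q:
        return 0.0
    jsd = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            jsd += 0.5 * pk * log2(pk / mk)
        if qk > 0:
            jsd += 0.5 * qk * log2(qk / mk)
    return 1.0 - jsd


# Hypothetical labels and predictions for two systems.
y_true = ["cat", "cat", "dog", "dog", "bird"]
system_a = ["dog", "cat", "cat", "dog", "cat"]
system_b = ["dog", "cat", "bird", "dog", "cat"]
print(misclassification_agreement(y_true, system_a, system_b))
print(class_level_error_similarity(y_true, system_a, system_b))
```

The first metric works at the instance level and needs both systems evaluated on the same items, while the second compares aggregate error patterns only.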

Reinforcement Learning from Human Feedback (RLHF) aligns language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base models. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate RM effectiveness, focusing on feature imprint, feature resistance, and feature robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them -- feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to slightly perturbed texts. Our experiments, utilizing open-source components like the Anthropic/hh-rlhf preference dataset and OpenAssistant RMs, reveal significant imprints of target features and a notable sensitivity to spoiler features. We observed a 26% resistance incidence in portions of the dataset where LM labelers disagreed with human preferences. We also find that misalignment stems from confusing entries in the alignment dataset. These findings underscore the importance of scrutinizing both RMs and alignment datasets for a deeper understanding of value alignment.

SEAL: Systematic Error Analysis for Value ALignment
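
The alignment resistance described in the SEAL abstract above, the proportion of preference pairs on which the reward model fails to match the human preference, can be sketched as follows. The function name, the score lists, and the choice to count ties as failures are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def alignment_resistance(chosen_scores: Sequence[float],
                         rejected_scores: Sequence[float]) -> float:
    """Fraction of pairs where the RM does not score the human-chosen
    response strictly higher than the rejected one.

    Counting ties as failures is an assumption of this sketch.
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("at least one preference pair is required")
    failures = sum(1 for c, r in zip(chosen_scores, rejected_scores) if c <= r)
    return failures / len(chosen_scores)


# Hypothetical RM scores for four (chosen, rejected) response pairs.
chosen = [1.2, 0.3, -0.5, 2.0]
rejected = [0.4, 0.9, -0.1, 1.0]
resistance = alignment_resistance(chosen, rejected)
print(resistance)  # 0.5: the RM disagrees with the human label on 2 of 4 pairs
```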

Existing studies explore the explainability of Grammatical Error Correction (GEC) in a limited scenario, where they ignore the interaction between corrections and explanations and have not established a corresponding comprehensive benchmark. To bridge the gap, this paper first introduces the task of EXplainable GEC (**EXGEC**), which focuses on the integral role of both correction and explanation tasks. To facilitate the task, we propose **EXCGEC**, a tailored benchmark for Chinese EXGEC consisting of 8,216 explanation-augmented samples featuring the design of hybrid edit-wise explanations. We then benchmark several series of LLMs in multi-task learning settings, including post-explaining and pre-explaining. To promote the development of the task, we also build a comprehensive evaluation suite by leveraging existing automatic metrics and conduct human evaluation experiments to demonstrate the human consistency of the automatic metrics for free-text explanations. Our experiments reveal the effectiveness of evaluating free-text explanations using traditional metrics like METEOR and ROUGE, and the inferior performance of multi-task models compared to the pipeline solution, indicating the challenge of establishing positive effects when learning both tasks. All the code and data will be released after the review.
