United States

Step-level reward models (SRMs) can significantly enhance mathematical reasoning performance through process supervision or step-level preference alignment based on reinforcement learning. The performance of SRMs is pivotal, as they serve as critical guidelines, ensuring that each step in the reasoning process is aligned with desired outcomes. Recently, AlphaZero-like methods, where Monte Carlo Tree Search (MCTS) is employed for automatic step-level preference annotation, have proven particularly effective. However, the precise mechanisms behind the success of SRMs remain largely unexplored. To address this gap, this study delves into the counterintuitive aspects of SRMs, particularly focusing on MCTS-based approaches. Our findings reveal that the removal of natural language descriptions of thought processes has minimal impact on the efficacy of SRMs. Furthermore, we demonstrate that SRMs are adept at assessing the complex logical coherence present in mathematical language but have difficulty doing so in natural language. These insights provide a nuanced understanding of the core elements that drive effective step-level reward modeling in mathematical reasoning. By shedding light on these mechanisms, this study offers valuable guidance for developing more efficient and streamlined SRMs, which can be achieved by focusing on the crucial parts of mathematical reasoning.

AAAI 2025

What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning

snlp

language models

poster
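
As a rough illustration of the step-level preference annotation mentioned in the abstract above, the sketch below turns value estimates for sibling reasoning steps into preference pairs and scores a pair with a Bradley-Terry style loss. The function names, the `margin` parameter, and the use of Q-value gaps are assumptions made for illustration; the abstract does not specify the paper's actual annotation rule or training objective.

```python
import math
from itertools import combinations


def preference_pairs(q_values, margin=0.1):
    """Return (preferred, rejected) index pairs for sibling steps.

    q_values: hypothetical MCTS value estimates, one per candidate step.
    A pair is kept only if the value gap is at least `margin` (an assumed rule).
    """
    pairs = []
    for i, j in combinations(range(len(q_values)), 2):
        gap = q_values[i] - q_values[j]
        if gap >= margin:
            pairs.append((i, j))
        elif -gap >= margin:
            pairs.append((j, i))
    return pairs


def pairwise_loss(reward_preferred, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_preferred - r_rejected))."""
    diff = reward_preferred - reward_rejected
    return math.log1p(math.exp(-diff))


if __name__ == "__main__":
    pairs = preference_pairs([0.9, 0.2, 0.85])
    print(pairs)
    print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

The loss is small when the reward model already scores the preferred step higher and grows when the ordering is reversed, which is the behavior a step-level preference objective needs.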

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Diabetic retinopathy (DR), with its large patient population, has become a formidable threat to human visual health. In the clinical diagnosis of DR, multi-view fundus images are considered more suitable because of their wide coverage of the field of view. Therefore, unlike most previous single-view DR grading methods, we design a dynamic selection-driven multi-view DR grading method to better fit clinical scenarios. Since lesion information plays a key role in DR diagnosis, previous methods usually boost model performance by enhancing lesion features. However, during actual diagnosis, ophthalmologists not only focus on the crucial parts but also exclude irrelevant features so that these do not compromise the accuracy of their judgment. To this end, we introduce the idea of dynamic selection and design a series of selection mechanisms from fine granularity to coarse granularity. In this work, we first introduce an Ophthalmic Image Reader (OIR) agent to provide the model with pixel-level prompts of suspected lesion areas. Moreover, a Multi-View Token Selection Module (MVTSM) is designed to prune redundant feature tokens and realize dynamic selection of key information. In the final decision stage, we dynamically fuse multi-view features through the novel Multi-View Mixture of Experts Module (MVMoEM), to enhance key views and reduce the impact of conflicting views. Extensive experiments on a large multi-view fundus image dataset with 34,452 images demonstrate that our method performs favorably against state-of-the-art models. The code is available in the attachment.

Like an Ophthalmologist: Dynamic Selection Driven Multi-View Learning for Diabetic Retinopathy Grading

With the advancement of graph representation learning, self-supervised graph contrastive learning (GCL) has emerged as a key technique in the field. In GCL, positive and negative samples are generated through data augmentation. While recent works have introduced model-based methods to enhance positive graph augmentations, they often overlook the importance of negative samples, relying instead on rule-based methods that can fail to capture meaningful graph patterns. To address this issue, we propose a novel model-based adversarial contrastive graph augmentation (ACGA) method that automatically generates both positive graph samples with minimal sufficient information and hard negative graph samples. Additionally, we provide a theoretical framework to analyze the process of positive and negative graph augmentation in self-supervised GCL. We evaluate our ACGA method through extensive experiments on representative benchmark datasets, and the results demonstrate that ACGA outperforms state-of-the-art baselines.

Adversarial Contrastive Graph Augmentation with Counterfactual Regularization

From image to video understanding, the capabilities of Multi-modal LLMs (MLLMs) are increasingly powerful. However, most existing video understanding benchmarks are relatively short, which makes them inadequate for effectively evaluating the long-sequence modeling capabilities of MLLMs. This highlights the urgent need for a comprehensive and integrated long video understanding benchmark to assess the ability of MLLMs thoroughly. To this end, we propose ALLVB (ALL-in-one long Video understanding Benchmark). ALLVB's main contributions include: 1) It integrates 9 major video understanding tasks. These tasks are converted into video Q&A formats, allowing a single benchmark to evaluate 9 different video understanding capabilities of MLLMs, highlighting the versatility, comprehensiveness, and challenging nature of ALLVB. 2) A fully automated annotation pipeline using GPT-4o is designed, requiring only human quality control, which facilitates the maintenance and expansion of the benchmark. 3) It contains 1,376 videos across 16 categories, averaging nearly 2 hours each, with a total of 252k Q&As. To the best of our knowledge, it is the largest long video understanding benchmark in terms of the number of videos, average duration, and number of Q&As. We have tested various mainstream MLLMs on ALLVB, and the results indicate that even the most advanced commercial models have significant room for improvement. This reflects the benchmark's challenging nature and demonstrates the substantial potential for development in long video understanding.

ALLVB: All-in-One Long Video Understanding Benchmark

In recent years, the distributed training of foundation models (FMs) has seen a surge in popularity. In particular, federated learning enables collaborative model training among edge clients while safeguarding the privacy of their data. However, federated training of FMs across resource-constrained and highly heterogeneous edge devices encounters several challenges. These include the difficulty of deploying FMs on clients with limited computational resources and the high computation and communication costs associated with fine-tuning and collaborative training. To address these challenges, we propose FedCAMS, a Cluster-Aware Framework with Knowledge-Aware Model Search. Specifically, FedCAMS incorporates a multi-factor heterogeneity-aware clustering algorithm to group clients into clusters based on data distribution and resource limitations, and selects an appropriate model as the cluster model within each cluster. In consideration of the computational limitations of different client devices, we design knowledge-aware model architecture search (KAMAS), which allows each client to identify the optimal sub-model from the cluster model without any training. After local training, a partial aggregation method is employed for intra-cluster aggregation. Finally, cluster-aware knowledge transfer facilitates knowledge sharing between clusters and the server, addressing model heterogeneity and reducing communication overhead. Extensive experiments demonstrate that FedCAMS outperforms state-of-the-art baselines by 3-10% in accuracy.

Cluster Based Heterogeneous Federated Foundation Model Adaptation and Fine-Tuning

Classical Transformer-based line segment detection methods have delivered impressive results. However, we observe that some accurately detected line segments are assigned low confidence scores during prediction, causing them to be ranked lower and potentially suppressed. Additionally, these models often require prolonged training periods to achieve strong performance, largely due to the necessity of bipartite matching. In this paper, we introduce RANK-LETR, a novel Transformer-based line segment detection method. Our approach leverages learnable geometric information to refine the ranking of predicted line segments by enhancing the confidence scores of high-quality predictions in a posterior verification step. We also propose a new line segment proposal method, wherein the feature point nearest to the centroid of the line segment directly predicts the location, significantly improving training efficiency and stability. Moreover, we introduce a line segment ranking loss to stabilize rankings during training, thereby enhancing the generalization capability of the model. Experimental results demonstrate that our method outperforms other Transformer-based and CNN-based approaches in prediction accuracy while requiring fewer training epochs than previous Transformer-based models.

Improving Transformer Based Line Segment Detection with Matched Predicting and Re-ranking

Partially linear models (PLMs) have attracted much attention in the field of statistical machine learning. In particular, the variable selection ability of PLMs has been studied extensively because of the strong demand for model interpretability. However, few existing works address the false discovery rate (FDR) controllability of variable selection associated with PLMs. To address this issue, we formulate a new Knockoffs Inference scheme for Linear And Nonlinear Discoverer (called KI-LAND), where FDR is controlled with respect to both linear and nonlinear variables for automatic structure discovery. For the proposed KI-LAND, theoretical guarantees are established for both FDR controllability and power, and experimental evaluations are provided to validate its effectiveness on simulated examples and real data.

Knockoffs Inference for Partially Linear Models with Automatic Structure Discovery
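
To make the FDR-control idea in the abstract above concrete, here is a minimal sketch of the generic knockoff+ selection rule: given one statistic W_j per variable (large positive values suggest a real signal, and signs are symmetric under the null), pick the smallest threshold whose estimated false discovery proportion is at most the target level q. This is the standard knockoff filter rule, not KI-LAND itself; how KI-LAND builds its statistics for linear and nonlinear components is not described in the abstract, and the function names and example values are hypothetical.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns infinity when no threshold qualifies."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def select_variables(w, q):
    """Indices of variables whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


if __name__ == "__main__":
    # Hypothetical statistics for ten variables.
    w = [5.0, 4.0, 3.5, 3.0, -0.5, 2.5, 0.2, -0.1, 2.0, 1.5]
    print(knockoff_plus_threshold(w, q=0.2))
    print(select_variables(w, q=0.2))
```

The count of large negative statistics serves as an estimate of how many of the large positive ones are false discoveries, which is what lets the rule control FDR without p-values.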

Hypergraph neural networks (HGNNs) have shown promise in handling tasks characterized by high-order correlations, achieving notable success across various applications. However, there has been limited focus on heterophilous hypergraph learning (HHL), contrasting with the increasing attention given to graph neural networks designed for graphs exhibiting heterophily. This paper aims to pave the way for HHL by addressing key gaps from multiple perspectives: measurement, dataset diversity, and baseline model development. Firstly, we introduce metrics to quantify heterophily in hypergraphs, providing a numerical basis for assessing the homophily/heterophily ratio. Secondly, we develop diverse benchmark datasets across various real-world scenarios, facilitating comprehensive evaluations of existing HGNNs and advancing research in HHL. Additionally, as a novel baseline model, we propose HyperUFG, a framelet-based hypergraph neural network integrating both low-pass and high-pass filters. Extensive experiments conducted on synthetic and benchmark datasets highlight the challenges current HGNNs face with heterophilous hypergraphs, while showcasing that HyperUFG performs competitively and often outperforms many existing models in such scenarios. Overall, our study underscores the urgent need for further exploration and development in this emerging field, with the potential to inspire and guide future research in HHL.

When Hypergraph Meets Heterophily: New Benchmark Datasets and Baseline

3D color lookup tables (LUTs) enable precise color manipulation by mapping input RGB values to specific output RGB values. 3D LUTs are instrumental in various applications, including video editing, in-camera processing, photographic filters, computer graphics, and color processing for displays. While an individual LUT does not incur a high memory overhead, software and devices may need to store dozens to hundreds of LUTs that can take over 100 MB. This work aims to develop a neural network architecture that can encode hundreds of LUTs in a single compact representation. To this end, we propose a model with a memory footprint of less than 0.25 MB that can reconstruct 512 LUTs with only minor color distortion ($\bar{\Delta}E_M \leq 2.0$) over the entire color gamut. We also show that our network can weight colors to provide further quality gains on natural image colors ($\bar{\Delta}E_M \leq 1.0$). Finally, we show that minor modifications to the network architecture enable a bijective encoding that produces LUTs that are invertible, allowing for reverse color processing.

Efficient Neural Network Encoding for 3D Color Lookup Tables
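
For readers unfamiliar with how a conventional 3D LUT maps input RGB to output RGB, the sketch below builds a small identity LUT, applies it with trilinear interpolation, and estimates its storage size. It shows the baseline representation that the abstract above proposes to compress, not the paper's neural encoding. The grid size of 33, the 4-byte values, and all function names are assumptions chosen for illustration.

```python
def identity_lut(n):
    """Build an n x n x n LUT (n >= 2) mapping each color in [0, 1] to itself."""
    step = 1.0 / (n - 1)
    return [[[(r * step, g * step, b * step) for b in range(n)]
             for g in range(n)] for r in range(n)]


def apply_lut(lut, rgb):
    """Map one RGB triple in [0, 1] through the LUT with trilinear interpolation."""
    n = len(lut)
    idx, frac = [], []
    for c in rgb:
        pos = min(max(c, 0.0), 1.0) * (n - 1)  # position on the grid axis
        lo = min(int(pos), n - 2)              # lower corner of the enclosing cell
        idx.append(lo)
        frac.append(pos - lo)
    out = [0.0, 0.0, 0.0]
    # Blend the eight corners of the cell, weighted by proximity to each one.
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                weight = ((frac[0] if dr else 1 - frac[0])
                          * (frac[1] if dg else 1 - frac[1])
                          * (frac[2] if db else 1 - frac[2]))
                node = lut[idx[0] + dr][idx[1] + dg][idx[2] + db]
                for k in range(3):
                    out[k] += weight * node[k]
    return tuple(out)


def lut_size_bytes(n, bytes_per_value=4):
    """Storage for one LUT: n^3 grid nodes, 3 output channels per node."""
    return n ** 3 * 3 * bytes_per_value


if __name__ == "__main__":
    lut = identity_lut(33)
    print(apply_lut(lut, (0.25, 0.5, 0.9)))
    print(lut_size_bytes(33))
```

Under these assumed settings a single LUT is well under 1 MB, so the storage problem the abstract describes arises only when many LUTs are kept side by side.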

Recent lightweight image captioning models using retrieved data mainly focus on text prompts. However, previous works utilize the retrieved text only as text prompts, and the visual information relies solely on the CLIP visual embedding. As a result, the image descriptions inherent in the prompt are not sufficiently reflected in the visual embedding space. To tackle this issue, we propose ViPCap, a novel retrieval text-based visual prompt for lightweight image captioning. ViPCap leverages the retrieved text with image information as visual prompts to enhance the ability of the model to capture relevant visual information. By mapping text prompts into the CLIP space and generating multiple randomized Gaussian distributions, our method leverages sampling to explore randomly augmented distributions and effectively retrieves the semantic features that contain image information. These retrieved features are integrated into the image and designated as the visual prompt, leading to performance improvements on datasets such as COCO, Flickr30k, and NoCaps. Experimental results demonstrate that ViPCap significantly outperforms prior lightweight captioning models in efficiency and effectiveness, demonstrating the potential for a plug-and-play solution.

ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning

Blind image super-resolution (blind SR) aims to restore a high-resolution (HR) image from a low-resolution (LR) image with unknown degradation. Many existing methods explicitly estimate degradation information from various LR images. However, in most cases, image degradations are independent of image content, so estimates that are influenced by image content can be inaccurate. Unlike existing works, we design a dual-encoder for degradation representation (DEDR) to preclude the influence of image content from LR images. This helps extract the intrinsic degradation representation more accurately. To the best of our knowledge, this paper is the first work that estimates degradation representation by filtering out image content. Based on the degradation representation extracted by DEDR, we present a novel framework, named degradation representation aware transform network (DRAT), for blind SR. We propose global degradation aware (GDA) blocks to propagate degradation information across spatial and channel dimensions, in which a degradation representation transform module (DRT) is introduced to render features degradation-aware, thereby enhancing the restoration of LR images. Extensive experiments are conducted on three benchmark datasets (including Gaussian 8, DIV2KRK, and real-world datasets) under large scaling factors with complex degradations. The experimental results demonstrate that DRAT surpasses state-of-the-art supervised kernel estimation and unsupervised degradation prediction methods. The code will be released on GitHub.
