
The rapid advancement of large language models (LLMs) has led to significant improvements in their capabilities, but also to increased concerns about their alignment with human values and intentions. Current alignment strategies, including adaptive training and inference-time methods, have demonstrated potential in this area. However, these approaches still struggle to balance deployment complexity and capability across tasks of varying difficulty. In this work, we introduce the Streaming Distribution Induce Aligner (*Stream Aligner*), a novel alignment paradigm that combines efficiency with enhanced performance in various tasks throughout the generation process. *Stream Aligner* achieves dynamic sentence-level correction by using a small model to learn the preferences of the suffix sentence, iteratively correcting the suffix sentence output by the upstream model, and then using the corrected sentence to replace the suffix sentence in subsequent generations. Compared to *Aligner*, our experiments demonstrate that *Stream Aligner* reduces reliance on the capabilities of additional models, enhances the reasoning abilities of LLMs, and decreases latency during user interaction. Specifically, the *Stream Aligner*-2B model achieves improvements of 76.1% in helpfulness and 36.0% in harmlessness on the tested Llama2-70B-chat model, and *Stream Aligner*-8B achieves an improvement of 3.5% in the math ability of the tested Llama3-70B-chat model.
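The sentence-level correction loop described in this abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: `stream_align`, `toy_upstream`, and `toy_corrector` are illustrative stand-ins, not the authors' implementation, and the two toy functions replace what would be an upstream LLM and a small trained corrector model. The sketch only shows the control flow: draft a sentence, correct it, and feed the corrected sentence (not the draft) back as context.

```python
from typing import Callable, List


def stream_align(
    prompt: str,
    upstream_next_sentence: Callable[[str, List[str]], str],
    correct_sentence: Callable[[str, List[str], str], str],
    max_sentences: int = 8,
) -> List[str]:
    """Generate sentence by sentence, correcting each new suffix sentence."""
    accepted: List[str] = []
    for _ in range(max_sentences):
        # The upstream model drafts the next sentence given the corrected prefix.
        draft = upstream_next_sentence(prompt, accepted)
        if not draft:  # assumption: an empty string signals end of generation
            break
        # The small model rewrites the draft; the corrected sentence replaces
        # the draft in the context used for all later steps.
        accepted.append(correct_sentence(prompt, accepted, draft))
    return accepted


# Toy stand-ins (hypothetical): a scripted "upstream model" and a
# string-replacing "corrector", used only to make the loop runnable.
def toy_upstream(prompt: str, prefix: List[str]) -> str:
    script = ["Step one is risky.", "Step two follows."]
    return script[len(prefix)] if len(prefix) < len(script) else ""


def toy_corrector(prompt: str, prefix: List[str], draft: str) -> str:
    return draft.replace("risky", "safe")


result = stream_align("How do I do X?", toy_upstream, toy_corrector)
print(result)
```

Because each correction is applied before the next sentence is drafted, later sentences are conditioned on corrected text, which is the property the abstract attributes to the streaming design.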

AAAI 2025

Stream Aligner: Efficient Sentence-Level Alignment via Distribution Induction

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)




Farmers rely on in-field observations to make well-informed crop management decisions to maximize profit and minimize adverse environmental impact.
However, obtaining real-world crop state measurements is labor-intensive, time-consuming and expensive.
In most cases, it is not feasible to gather crop state measurements before every decision moment.
Moreover, in previous research pertaining to farm management optimization, these observations are often assumed to be readily available without any cost, which is unrealistic.
Hence, enabling optimization without the need to have *temporally complete* crop state observations is important.
An approach to that problem is to include measuring as part of decision making.
As a solution, we apply reinforcement learning (RL) to recommend opportune moments to simultaneously measure crop features and apply nitrogen fertilizer.
With realistic considerations, we design an RL environment with explicit crop feature measuring costs.
While balancing costs, we find that an RL agent, trained with recurrent PPO, discovers adaptive measuring policies that follow critical crop development stages, with results aligned with what domain experts would consider a sensible approach.
Our results highlight the importance of measuring when crop feature measurements are not readily available.
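To make the idea of an environment with explicit measuring costs concrete, here is a toy sketch. It is not the authors' environment: the class `ToyMeasureEnv`, its cost values, and its linear "crop growth" dynamics are all assumptions chosen for illustration. It shows only the core mechanism, where the agent pays to observe the crop state and otherwise receives no observation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyMeasureEnv:
    """Toy environment: observations are only available at a cost."""
    measure_cost: float = 1.0     # hypothetical cost of one crop measurement
    fertilizer_cost: float = 0.5  # hypothetical cost of one fertilizer application
    horizon: int = 5              # number of decision moments in a season
    t: int = 0
    crop_state: float = 0.0       # hidden state, stands in for crop features

    def step(self, measure: bool, fertilize: bool) -> Tuple[Optional[float], float, bool]:
        # Toy dynamics (assumption): fertilizing doubles growth for this step.
        self.crop_state += 2.0 if fertilize else 1.0
        self.t += 1
        # Both actions are charged explicitly in the reward.
        reward = -(self.measure_cost * measure) - (self.fertilizer_cost * fertilize)
        done = self.t >= self.horizon
        if done:
            reward += self.crop_state  # yield revenue arrives at harvest
        # Without measuring, the agent receives no observation.
        obs = self.crop_state if measure else None
        return obs, reward, done


env = ToyMeasureEnv()
total = 0.0
observations = []
for t in range(env.horizon):
    # Hand-written policy: measure and fertilize together at one moment only.
    obs, reward, done = env.step(measure=(t == 2), fertilize=(t == 2))
    observations.append(obs)
    total += reward
print(observations, total)
```

A recurrent policy is a natural fit for this kind of setting because the agent must carry information across the steps where `obs` is `None`.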

To Measure or Not: A Cost-Sensitive, Selective Measuring Environment for Agricultural Management Decisions with Reinforcement Learning

In recent years, the rapid development of Large Language Models (LLMs) has highlighted the urgent need for large-scale, high-quality, and diverse data. We have launched an LLM data co-creation platform aimed at bringing together a wide range of participants to contribute data. Within six months, the platform has attracted over 10,000 participants who contributed more than 150,000 data entries across more than 200 tasks. An observable user cohort was constructed around the question, "Who is the best data contributor?" along with sub-questions concerning user preferences, task competence, and more. Through a detailed analysis of data contributors, this paper reveals several data collection patterns related to human factors. It reveals that contributors who provide high-quality data often do not meet initial expectations, as their behavior exhibits typical characteristics of the Dunning-Kruger effect. This paper examines the cognitive bias between users' self-assessment and actual abilities, where individuals tend to overestimate their capabilities in certain tasks, leading to a decreased willingness to continue contributing and a consequent waste of human resources. To address this issue, we propose a task reassignment method based on multi-task fine-tuning of small language models (SLMs) to better align user groups with appropriate task types. After the reallocation, we observed a significant increase in user engagement and platform benefits, along with improved overall platform efficiency. The versatility of this method makes it applicable to broader data collection scenarios.

Cognitive Bias and Reassignment: Who Can Contribute High Quality LLM Data

Existing video fact-checking datasets often lack detailed evidence and explanations, compromising the reliability and interpretability of fact-checking methods. To address these gaps, we developed a novel dataset featuring comprehensive annotations for each news item, including veracity labels, the rationales behind these labels, and supporting evidence. This dataset significantly enhances models' ability to accurately identify and explain video content. 
We also present an explainable automatic framework, **3MFact**, utilizing **M**ulti-role **M**ultimodal **M**odels for video **Fact**-checking. Our framework iteratively gathers and synthesizes online evidence to progressively determine the veracity label, generating three key outputs: veracity label, rationale, and supporting evidence. We aim for this work to be a pioneering effort, providing robust support for the field of video fact-checking.

Pioneering Explainable Video Fact-Checking with a New Dataset and Multi-role Multimodal Model Approach

Extracting fine-grained OCR text from aged documents in diacritic languages remains challenging due to unexpected artifacts, time-induced degradation, and lack of datasets. While standalone spell correction approaches have been proposed, they show limited performance for historical documents due to numerous possible OCR error combinations and differences between modern and classical corpus distributions.
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text, supported by large language models. This technique generates high-precision pseudo-page-to-page labels for diacritic languages, where small strokes pose significant challenges in historical conditions. The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences.
Our post-processing method, which generated a large OCR dataset of classical Vietnamese books, achieved a mean grading score of 8.72 on a 10-point scale. This outperformed the state-of-the-art transformer-based Vietnamese spell correction model, which scored 7.03, when evaluated on a sampled subset of the dataset. We also trained a baseline OCR model to assess and compare it with well-known engines. Experimental results demonstrate the strength of our baseline model compared to widely used open-source solutions. The resulting dataset will be released publicly to support future studies.

Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition

Social segregation in cities, spanning racial, residential, and income dimensions, is becoming increasingly diverse and severe. As urban spaces and social relations grow more complex, residents in metropolitan areas experience different levels of social segregation. If not promptly addressed, this may lead to an increase in crime rates, heightened social tensions, and other serious social issues. Effectively quantifying and analyzing the structures within urban spaces and resident interactions has become crucial for addressing these segregation issues. Previous studies have mainly focused on surface-level indicators of urban segregation, lacking comprehensive analysis from the perspectives of urban structure and mobility. This limitation fails to capture the full complexity of current segregation phenomena. To address this gap, we propose a framework named **Motif**-Enhanced **G**raph **P**rototype **L**earning (**MotifGPL**). The MotifGPL framework comprises three key modules: prototype-based graph structure extraction, motif distribution discovery, and urban graph structure reconstruction. Specifically, we use graph structure prototype learning to extract significant prototypes from both the urban spatial graph and the origin-destination graph, incorporating key urban attributes such as points of interest, street view images, and flow index. To enhance interpretability, the motif distribution discovery module matches each prototype with similar motifs, which represent simpler graph structures reflecting local patterns. Finally, we use the motif distribution results to guide the reconstruction of the two graphs. This model facilitates a detailed exploration of urban spatial structures and resident mobility patterns, allowing us to identify and analyze the key motif patterns that influence urban social segregation, guiding the reconstruction of urban graph structures for lower segregation.
Extensive experimental results demonstrate that MotifGPL can effectively reveal the key motif patterns influencing urban social segregation and provide robust guidance for mitigating this phenomenon.

MotifGPL: Motif-Enhanced Graph Prototype Learning for Deciphering Urban Social Segregation

Biophysical models offer valuable insights into climate-phenology relationships in both natural and agricultural settings. However, there are substantial structural discrepancies across models which require site-specific recalibration, often yielding inconsistent predictions under similar climate scenarios. Machine learning methods offer data-driven solutions, but often lack interpretability and alignment with existing knowledge. We present a phenology model describing dormancy in fruit trees, integrating conventional biophysical models with a neural network to address their structural disparities. We evaluate our hybrid model in an extensive case study predicting cherry tree phenology in Japan, South Korea and Switzerland. Our approach consistently outperforms both traditional biophysical and machine learning models in predicting blooming dates across years. Additionally, the neural network's adaptability facilitates parameter learning for specific tree varieties, enabling robust generalization to new sites without site-specific recalibration. This hybrid model leverages both biophysical constraints and data-driven flexibility, offering a promising avenue for accurate and interpretable phenology modeling.

Hybrid Phenology Modeling for Predicting Temperature Effects on Tree Dormancy

Compositional generalization is the capability of a model to understand novel compositions composed of seen concepts. There are multiple levels of novel compositions, including phrase-phrase level, phrase-word level, and word-word level. Existing methods achieve promising compositional generalization, but the consistency of compositional generalization across multiple levels of novel compositions remains unexplored. Consistency here means that a model should generalize simultaneously to a phrase-phrase level novel composition and to the phrase-word and word-word level novel compositions that can be derived from it. In this paper, we propose a meta-learning based framework for achieving consistent compositional generalization across multiple levels. The basic idea is to progressively learn compositions from simple to complex for consistency. Specifically, we divide the original training set into multiple validation sets based on compositional complexity, and introduce multiple meta-weight-nets to generate sample weights for samples in different validation sets. To fit the validation sets in order of increasing compositional complexity, we optimize the parameters of each meta-weight-net independently and sequentially in a multilevel optimization manner. We build a GQA-CCG dataset to quantitatively evaluate the consistency. Experimental results on visual question answering and temporal video grounding demonstrate the effectiveness of the proposed framework.

Consistency of Compositional Generalization Across Multiple Levels

Single Domain Generalization (SDG) is critical in medical imaging applications. Recently, Vision Foundation Models (VFMs) have spearheaded a trend in AI development due to their robust generalizability and versatility. This work aims to fully explore the generalization capabilities of VFMs alongside the domain-specific expertise of specialized models, thoroughly investigating the boundaries of their respective capabilities, thereby collaboratively addressing SDG challenges within medical imaging. We propose a framework for **Colla**borative reasoning between **S**pecialized and **U**niversal models for **S**ingle **D**omain **G**eneralization (***CollaSU-SDG***) in medical imaging. Specifically, we first design a model-aware perturbation injection method from the perspective of single-source domain data, enabling differentiated and adaptive perturbation injection for two different scales of models. Then, a domain expansion adapter is designed for the VFM to adapt to the augmented single-source domain medical data. Lastly, we introduce an adaptive hierarchical transfer and dynamic dense prompting method that facilitates collaborative reasoning between the specialized and universal models, eliminating the need for explicit prompts. Through these designs, ***CollaSU-SDG*** fully leverages the strengths of both specialized and universal models, achieving robust out-of-distribution generalization capabilities on single-source domain data. Experimental results demonstrate that ***CollaSU-SDG*** significantly advances the state-of-the-art performance across a wide range of medical datasets.
All the code and pre-trained weights will be publicly available.

Tuning, Perturbating, and Collaborating: Harnessing Vision Foundation Models for Single Domain Generalization on Medical Imaging

Understanding the 3D semantics of a scene is a fundamental problem for various scenarios such as embodied agents.
While NeRFs and 3DGS excel at novel-view synthesis, previous methods for understanding their semantics have been limited to incomplete 3D understanding: their segmentation results are 2D masks and their supervision is anchored at 2D pixels.
This paper revisits the problem setting to pursue a better 3D understanding of a scene modeled by NeRFs and 3DGS as follows. 1) We directly supervise the 3D points to train the language embedding field. It achieves state-of-the-art accuracy without relying on multi-scale language embeddings. 2) We transfer the pre-trained language field to 3DGS, achieving the first real-time rendering speed without sacrificing training time or accuracy. 3) We introduce a 3D querying and evaluation protocol for assessing the reconstructed geometry and semantics together. Code, checkpoints, and annotations will be available online.

Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space

Object detection, particularly open-vocabulary object detection, plays a crucial role in Earth sciences, such as environmental monitoring, natural disaster assessment, and land-use planning. However, existing open-vocabulary detectors, primarily trained on natural-world images, struggle to generalize to remote sensing images due to a significant data domain gap. Thus, this paper aims to advance the development of open-vocabulary object detection in the remote sensing community. To achieve this, we first reformulate the task as Locate Anything on Earth (LAE) with the goal of detecting any novel concepts on Earth. We then develop the LAE-Label Engine, which collects, auto-annotates, and unifies up to 10 remote sensing datasets, creating LAE-1M, the first large-scale remote sensing object detection dataset with broad category coverage. Using LAE-1M, we further propose and train the novel LAE-DINO Model, the first open-vocabulary foundation object detector for the LAE task, featuring Dynamic Vocabulary Construction (DVC) and Visual-Guided Text Prompt Learning (VisGT) modules. DVC dynamically constructs vocabulary for each training batch, while VisGT maps visual features to semantic space, enhancing text features. We conduct comprehensive experiments on the established remote sensing benchmarks DIOR and DOTAv2.0, as well as our newly introduced 80-class LAE-80C benchmark. Results demonstrate the advantages of the LAE-1M dataset and the effectiveness of the LAE-DINO method. All the datasets and codes will be available at https://anonymous.4open.science/r/locate-anything-on-earth/.
