AAAI 2025

Multi-Scale Contrastive Learning for Video Temporal Grounding

Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels are downsampled to accommodate increasing moment length, their capacity to capture information is reduced, which degrades the moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key methodology is to leverage samples from the feature space emanating from multiple stages of the video encoder itself, requiring neither data augmentation nor online memory banks to obtain positive and negative samples. To enable such an extension, we introduce a sampling process to draw multiple video moments corresponding to a common query. Subsequently, by utilizing these moments' representations across video encoder layers, we instantiate a novel form of multi-scale and cross-scale contrastive learning that links local short-range video moments with global long-range video moments. Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding.
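As a rough illustration of the cross-scale idea, the sketch below uses a generic InfoNCE-style loss: a short-range moment representation is pulled toward a long-range representation tied to the same query and pushed away from representations tied to other queries. It is not the paper's loss. The function names, the cosine similarity, the temperature value, and the toy vectors are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (hypothetical helper)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy representations: one short-range moment from a lower pyramid level,
# and long-range moments from a higher level (same query vs. other queries).
short_range = [1.0, 0.0]
long_range_same_query = [0.9, 0.1]
long_range_other_queries = [[0.0, 1.0], [-1.0, 0.2]]

loss_aligned = info_nce(short_range, long_range_same_query,
                        long_range_other_queries)
loss_misaligned = info_nce(short_range, long_range_other_queries[0],
                           [long_range_same_query, long_range_other_queries[1]])
```

In this toy setup `loss_aligned` is much smaller than `loss_misaligned`, which is the behaviour a contrastive objective rewards when moments for the same query agree across scales.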

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.





The design of multi-item, multi-bidder auctions involves a delicate balancing act of economic objectives, bidder incentives, and real-world complexities. Efficient auctions, that is, auctions that allocate items to maximize total bidder value, are practically desirable since they promote the most economically beneficial use of resources. Arguably the biggest drawback of efficient auctions, however, is their potential to generate very low revenue. In this work, we show how the auction designer can artificially inject competition into the auction to boost revenue while striving to maintain efficiency. First, we invent a new auction family that enables the auction designer to specify competition in a precise, expressive, and interpretable way. We then introduce a new model of bidder behavior and individual rationality to understand how bidders act when prices are too competitive. Next, under our bidder behavior model, we use our new competitive auction class to derive the globally revenue-optimal efficient auction under two different knowledge models for the auction designer: knowledge of full bidder value distributions and knowledge of bidder value quantiles. Finally, we study a third knowledge model for the auction designer: knowledge of historical bidder valuation data. In this setting we present sample and computationally efficient learning algorithms that find high-revenue probably-efficient competitive auctions from bidder data. Our learning algorithms are instance adaptive and can be run in parallel across bidders, unlike most prior approaches to data-driven auction design.

Increasing Revenue in Efficient Combinatorial Auctions by Learning to Generate Artificial Competition

Causal discovery is essential across various scientific fields to uncover causal structures within data. Traditional methods relying on observational data have limitations due to confounding variables. This paper presents an optimization-based approach using integer programming (IP) to design minimal intervention sets that ensure causal structure identifiability. Our method provides exact and modular solutions, adaptable to different experimental settings and constraints. A comparative analysis across different settings demonstrates its effectiveness, applicability, and robustness.

Causal Discovery by Interventions via Integer Programming

We propose a dynamic Computed Tomography (CT) reconstruction framework called STNF4D (SpatioTemporal-aware Neural Fields). First, we decompose the dynamic CT scene into four 3D hash grids. Compared to the plane decomposition method, this decomposition enhances the model's capacity while keeping the representation compact and efficient. However, in densely predicted high-resolution dynamic CT scenes, the lack of constraints and hash conflicts in the hash grid features lead to obvious dot-like artifacts and blurring in the reconstructed images. To address these issues, we propose the Spatiotemporal Transformer (ST-Former), which guides the model in selecting and optimizing features by sensing the spatiotemporal information in different hash grids, significantly improving the quality of reconstructed images. We conducted experiments on medical and industrial datasets covering various motion types, sampling schemes, and reconstruction resolutions. Experimental results show that our method outperforms the second-best method by 5.99 dB and 4.27 dB in medical and industrial scenes, respectively.

Spatiotemporal-aware Neural Fields for Dynamic CT Reconstruction

Multi-object tracking is a challenging vision task that requires simultaneous reasoning about object detection and object association. Conventional solutions use the frame as the basic unit and typically rely on a motion predictor that exploits appearance features to associate detected candidates, leading to insufficient adaptability to long-term associations. In this study, we propose a section-based multi-object tracking approach that integrates a temporal coherent Object Flow Tracker (OFTrack), capable of achieving simultaneous multi-frame tracking by treating multiple consecutive frames as the basic processing unit, denoted as a “section”. Our OFTrack boosts the optical flow to the object flow by employing object perception and section-based motion estimation strategies. Object perception adopts object-aware sampling and scale-aware correlation to enable precise target discrimination. Motion estimation models the correlation of different objects across multiple frames via specialized temporal-spatial attention to achieve robust association in very long videos. Additionally, to address the oscillation of unpredictable trajectories in multi-frame estimation, we have designed a temporal coherent enhancement that includes trajectory masking pre-training and a smoothing constraint on trajectory curves. Comprehensive experiments on several widely used benchmarks demonstrate the superior performance of our approach.

Temporal Coherent Object Flow for Multi-Object Tracking

Exponential-family harmoniums (EFHs) generalize the restricted Boltzmann machine beyond Bernoulli random variables to other exponential families. Here we show how to extend the EFH beyond standard exponential families (Poisson, Gaussian, etc.), by allowing the sufficient statistics for the hidden units to be arbitrary functions of the observed data, parameterized by deep neural networks. This rules out the standard sampling scheme, block Gibbs sampling, so we replace it with a form of Langevin dynamics within Gibbs, inspired by a recent method for training Gaussian restricted Boltzmann machines (GRBMs). With Gibbs-Langevin, the GRBM can successfully model small data sets like MNIST and CelebA-32, but struggles with CIFAR-10, and cannot scale to larger images because it lacks convolutions. In contrast, our neural-network EFHs generate high-quality samples from CIFAR-10 and scale well to CelebA-HQ. Furthermore, we compared our model with a standard energy-based model with a similar neural-network architecture and the same number of parameters. Our training method improved image generation performance by approximately 25% across different datasets. Additionally, our model is competitive with noise conditional score network models, which utilize more complex neural networks (U-nets) and require considerably more sampling steps.

Exponential-Family Harmoniums with Neural Sufficient Statistics
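For readers unfamiliar with Langevin dynamics, the sketch below shows a plain unadjusted Langevin update, x ← x − (ε/2)∇E(x) + √ε·ξ with ξ drawn from a standard normal, applied to a one-dimensional quadratic energy. It illustrates only the generic update rule, not the Gibbs-Langevin scheme described in the abstract. The function names, step size, chain length, and toy energy are assumptions chosen for the example.

```python
import math
import random


def langevin_step(x, grad_energy, step_size, rng):
    """One unadjusted Langevin update for a scalar state x."""
    noise = rng.gauss(0.0, 1.0)
    return x - 0.5 * step_size * grad_energy(x) + math.sqrt(step_size) * noise


def sample_chain(grad_energy, n_steps, step_size=0.1, x0=0.0, seed=0):
    """Run a single Langevin chain and return every visited state."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x = langevin_step(x, grad_energy, step_size, rng)
        samples.append(x)
    return samples


def quadratic_grad(x):
    """Gradient of the toy energy E(x) = x^2 / 2 (standard normal target)."""
    return x


samples = sample_chain(quadratic_grad, n_steps=20000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

For this toy energy the chain's samples should have a mean near 0 and a variance near 1, with a small bias from the finite step size.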

As large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to generate diverse, complex prompts and dynamically explore the weaknesses of these models. To tackle these challenges, we introduce the **S**elf-**E**volving **A**dversarial **S**afety (**SEAS**) optimization framework, which includes both a SEAS dataset and a SEAS pipeline. The SEAS dataset comprises complex adversarial prompts, while the SEAS pipeline operates through three stages: Initialization, Attack, and Adversarial Optimization. This framework generates a diverse range of adversarial prompts and dynamically explores the model's vulnerabilities to enhance its security. Our contributions include a novel adversarial framework, a comprehensive safety dataset, and empirical evidence demonstrating the effectiveness of SEAS. After three iterations, the Target model achieves a security level comparable to GPT-4.

SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models

Formal XAI is an emerging field that focuses on providing explanations with mathematical guarantees for the decisions made by machine learning models. Recent work in this area has centered on the computation of "sufficient reasons". Given a model $\mathcal{M}$ and an input instance $\mathbf{x}$, a sufficient reason for the decision $\mathcal{M}(\mathbf{x})$ is a subset $S$ of the features of $\mathbf{x}$ such that for any instance $\mathbf{z}$ that has the same values as $\mathbf{x}$ for every feature in $S$, it holds that $\mathcal{M}(\mathbf{x}) = \mathcal{M}(\mathbf{z})$. Intuitively, this means that the features in $S$ are sufficient to fully justify the classification of $\mathbf{x}$ by $\mathcal{M}$. For sufficient reasons to be useful in practice, they should be as small as possible, and a natural way to reduce the size of sufficient reasons is to consider a probabilistic relaxation: the probability of $\mathcal{M}(\mathbf{x}) = \mathcal{M}(\mathbf{z})$ must be at least some value $\delta \in (0,1]$, where $\mathbf{z}$ is a random instance compatible with $\mathbf{x}$. Computing small $\delta$-sufficient reasons ($\delta$-SRs) is known to be a theoretically hard problem; even over decision trees, traditionally deemed simple and interpretable models, strong inapproximability results make the efficient computation of small $\delta$-SRs unlikely. We propose the notion of $(\delta, \epsilon)$-SR, a simple relaxation of $\delta$-SRs, and show that while this relaxation is hard to compute over decision trees (as a simple consequence of the previous inapproximability results), it can be computed efficiently over linear models.

Probabilistic Explanations for Linear Models
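The definition of a $\delta$-sufficient reason in the abstract above can be checked by brute force on a tiny linear classifier. The sketch below fixes the features in $S$ to their values in $\mathbf{x}$, enumerates every completion of the remaining features, and reports the fraction that keeps the decision. It assumes binary features drawn uniformly at random, which the abstract does not specify, and it does not implement the $(\delta, \epsilon)$ relaxation. The function names, weights, and instance are hypothetical.

```python
from itertools import product


def predict(weights, bias, x):
    """Decision of a linear classifier: True when the score is non-negative."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias >= 0


def agreement_probability(weights, bias, x, fixed):
    """Fraction of completions z (agreeing with x on `fixed`) that keep M(x).

    Assumes binary features and a uniform distribution over the free ones.
    """
    free = [i for i in range(len(x)) if i not in fixed]
    target = predict(weights, bias, x)
    agree = 0
    for values in product([0, 1], repeat=len(free)):
        z = list(x)
        for i, v in zip(free, values):
            z[i] = v
        agree += predict(weights, bias, z) == target
    return agree / 2 ** len(free)


def is_delta_sufficient(weights, bias, x, fixed, delta):
    """True when `fixed` is a delta-sufficient reason under the assumptions above."""
    return agreement_probability(weights, bias, x, fixed) >= delta


weights = [2.0, 1.0, 1.0]
bias = -2.5
x = [1, 1, 0]

p_first_only = agreement_probability(weights, bias, x, {0})    # 0.75
p_first_two = agreement_probability(weights, bias, x, {0, 1})  # 1.0
```

Here fixing only the first feature preserves the decision for three of the four completions, so it is a 0.75-sufficient reason but not a sufficient reason in the strict sense; fixing the first two features is.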

Existing causal learning algorithms focus on micro-level causal discovery, confronting significant challenges in identifying the influence of macro systems, composed of micro-level variables, on other variables. This difficulty arises because the causal relationships in macro systems are often mediated through micro-level causal interactions, which can lead to erroneous causal discovery or omission when dispersed. To address this issue, we propose the Emergence-inspired Multi-granularity Causal learning (EMCausal) method. Inspired by emergent phenomena of aggregating micro-level nodes into macro-level entities, EMCausal introduces a progressive mapping encoder to simulate the process, thereby capturing the causal relationships driven by these macro entities. Next, it introduces a causal consistency constraint to collaboratively reconstruct micro-level nodes using macro-level representations, enabling the learning of a multi-granular causal structure. Experimental results on both synthetic and real datasets demonstrate that EMCausal can identify causal graphs under the influence of causal emergence, outperforming competitive baselines in terms of accuracy and robustness.

Emergence-Inspired Multi-Granularity Causal Learning

Although recent years have witnessed significant advancements in medical image segmentation, the pervasive issue of domain shift among medical images from diverse centres hinders the effective deployment of pre-trained models. Many Test-time Adaptation (TTA) methods have been proposed to address this issue by fine-tuning pre-trained models with test data during inference. These methods, however, often suffer from less-satisfactory optimization due to suboptimal optimization direction (dictated by the gradient) and fixed step-size (predicated on the learning rate). In this paper, we propose the Gradient alignment-based Test-time adaptation (GraTa) method to improve both the gradient direction and learning rate in the optimization procedure. Unlike conventional TTA methods, which primarily optimize the pseudo gradient derived from a self-supervised objective, our method incorporates an auxiliary gradient with the pseudo one to facilitate gradient alignment. Such gradient alignment enables the model to excavate the similarities between different gradients and correct the gradient direction to approximate the empirical gradient related to the current segmentation task. Additionally, we design a dynamic learning rate based on the cosine similarity between the pseudo and auxiliary gradients, thereby empowering the adaptive fine-tuning of pre-trained models on diverse test data. Extensive experiments establish the effectiveness of the proposed gradient alignment and dynamic learning rate and substantiate the superiority of our GraTa method over other state-of-the-art TTA methods on a benchmark medical image segmentation task. The code and weights of pre-trained source models will be available.

Gradient Alignment Improves Test-Time Adaptation for Medical Image Segmentation
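One simple way to picture a learning rate driven by gradient agreement is sketched below. The abstract states only that the dynamic learning rate is based on the cosine similarity between the pseudo and auxiliary gradients, so the specific rule used here (scaling a base rate by the similarity and clipping at zero) is an assumption for illustration and not the GraTa formula. All names and values are hypothetical.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine similarity of two flattened gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(g1, g2))
    denom = math.sqrt(sum(a * a for a in g1)) * math.sqrt(sum(b * b for b in g2))
    return dot / denom if denom > 0 else 0.0


def dynamic_learning_rate(pseudo_grad, aux_grad, base_lr=1e-3):
    """Assumed rule: shrink the step when the two gradients disagree."""
    return base_lr * max(0.0, cosine_similarity(pseudo_grad, aux_grad))


lr_aligned = dynamic_learning_rate([1.0, 2.0], [2.0, 4.0])
lr_orthogonal = dynamic_learning_rate([1.0, 0.0], [0.0, 1.0])
lr_opposed = dynamic_learning_rate([1.0, 2.0], [-1.0, -2.0])
```

Under this assumed rule, aligned gradients keep roughly the full base rate, while orthogonal or opposed gradients suppress the update entirely.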

Low-rank adaptation (LoRA) is an efficient strategy for adapting latent diffusion models (LDMs) on a private dataset to generate specific images by minimizing the adaptation loss. However, LoRA-adapted LDMs are vulnerable to membership inference (MI) attacks that can judge whether a particular data point belongs to the private dataset, thus leading to privacy leakage. To defend against MI attacks, we first propose a straightforward solution: Membership-Privacy-preserving LoRA (MP-LoRA). MP-LoRA is formulated as a min-max optimization problem where a proxy attack model is trained by maximizing its MI gain while the LDM is adapted by minimizing the sum of the adaptation loss and the MI gain of the proxy attack model. However, we empirically find that MP-LoRA suffers from unstable optimization, and our theoretical analysis suggests that the potential cause is unconstrained local smoothness, which impedes the privacy-preserving adaptation. To mitigate this issue, we further propose a Stable Membership-Privacy-preserving LoRA (SMP-LoRA) that adapts the LDM by minimizing the ratio of the adaptation loss to the MI gain. In addition, we theoretically prove that the local smoothness of SMP-LoRA can be constrained by the gradient norm, leading to improved convergence. Our experimental results corroborate that SMP-LoRA can indeed defend against MI attacks and generate high-quality images. Our code is available at https://anonymous.4open.science/r/StablePrivateLoRA-7030/README.md.
