United States

Audiovisual Automatic Speech Recognition (AV-ASR) aims to improve speech recognition accuracy by leveraging visual signals. It is particularly challenging in unconstrained scenarios across various domains (e.g., real-world videos) due to noisy acoustic environments, spontaneous speech, and the uncertain use of visual information. Most previous works fine-tune audio-only ASR models on audiovisual datasets, optimizing them for conventional ASR objectives. However, they often neglect visual features and common errors in unconstrained video scenarios. In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data via simulating common errors that occurred in AV-ASR from two focals: manipulating the audio or vision input and rewriting the output transcript. Second, we propose BPO-AVASR, a Bifocal Preference Optimization method to improve AV-ASR models by leveraging both input-side and output-side preference. Extensive experiments demonstrate that our approach significantly improves speech recognition accuracy across various domains, outperforming previous state-of-the-art models on real-world video speech recognition. We will open-source our preference dataset and codes to support future research.

AAAI 2025

Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization

speech signal processing

snlp

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



Diffusion models have demonstrated outstanding performance in generative tasks, making them ideal candidates for image editing. Recent studies highlight their ability to apply desired edits effectively by following textual instructions, yet with two key challenges remaining. First, these models struggle to apply multiple edits simultaneously, resulting in computational inefficiencies due to their reliance on sequential processing. Second, relying on textual prompts to determine the editing region can lead to unintended alterations to the image.  We introduce FunEditor, an efficient diffusion model designed to learn atomic editing functions and perform complex edits by aggregating simpler functions. This approach enables complex editing tasks, such as object movement, by aggregating multiple functions and applying them simultaneously to specific areas. Our experiments demonstrate that FunEditor significantly outperforms recent inference-time optimization methods and fine-tuned models, either quantitatively across various metrics or through visual comparisons or both, on complex tasks like object movement and object pasting. In the meantime, with only 4 steps of inference, FunEditor achieves 5--24$\times$ inference speedups over existing popular methods.

FunEditor: Achieving Complex Image Edits via Function Aggregation with Diffusion Models

First-order linear temporal logic (FOLTL) is a flexible and expressive formalism capable of naturally describing complex behaviors and properties. Although the logic is in general highly undecidable, the idea of using it as a specification language for the verification of complex infinite-state systems is appealing. However, a missing piece, which has proved to be an invaluable tool in dealing with other temporal logics, is an automaton model capable of capturing the logic. In this paper we address this issue, by defining and studying such a model, which we call first-order automaton. We define this very general class of automata, and the corresponding notion of regular first-order language, showing their closure under most common language-theoretic operations. We show how they can capture any FOLTL formula over any signature and theory, and provide sufficient conditions for the semi-decidability of their non-emptiness problem. Then, to show the usefulness of the formalism, we prove the decidability of monodic FOLTL, a classic result known in the literature, with a simpler and direct proof.

First-Order Automata

Although distributed machine learning (distributed ML) is gaining considerable attention in the community, prior works have independently looked at instances of distributed ML in either the training or the inference phase. No prior work has examined the combined robustness stemming from distributing both the learning and the inference process.

In this work, we explore, for the first time, the robustness of distributed ML models that are fully heterogeneous in training data, architecture, scheduler, optimizer, and other model parameters. Supported by theory and extensive experimental validation using CIFAR10 and FashionMNIST, we show that such properly distributed ML instantiations achieve across-the-board improvements in accuracy-robustness tradeoffs against state-of-the-art transfer-based attacks that could otherwise not be realized by current ensemble or federated learning instantiations. For instance, our experiments on CIFAR10 show that for CSE, one of the most powerful state-of-the-art transfer-based attacks, our method improves robust accuracy by up to 40%, with a minimal impact on clean task accuracy.

On the Robustness of Distributed Machine Learning Against Transfer Attacks

Since its introduction, the transformer has shifted the development trajectory away from traditional models (e.g., RNN, MLP) in time series forecasting, which is attributed to its ability to capture global dependencies within temporal tokens. Follow-up studies have largely involved altering the tokenization and self-attention modules to better adapt Transformers for addressing special challenges like non-stationarity, channel-wise dependency, and variable correlation in time series. However, we found that the expressive capability of sequence representation is a key factor influencing Transformer performance in time forecasting after investigating several representative methods, where there is an almost linear relationship between sequence representation entropy and mean square error, with more diverse representations performing better. In this paper, we propose a novel attention mechanism with Sequence Complementors and prove feasible from an information theory perspective, where these learnable sequences are able to provide complementary information beyond current input to feed attention. We further enhance the Sequence Complementors via a diversification loss that is theoretically covered. The empirical evaluation of both long-term and short-term forecasting has confirmed its superiority over the recent state-of-the-art methods.

Sequence Complementor: Complementing Transformers for Time Series Forecasting with Learnable Sequences

The design of multi-item, multi-bidder auctions involves a delicate balancing act of economic objectives, bidder incentives, and real-world complexities. Efficient auctions, that is, auctions that allocate items to maximize total bidder value, are practically desirable since they promote the most economically beneficial use of resources. Arguably the biggest drawback of efficient auctions, however, is their potential to generate very low revenue. In this work, we show how the auction designer can artificially inject competition into the auction to boost revenue while striving to maintain efficiency. First, we invent a new auction family that enables the auction designer to specify competition in a precise, expressive, and interpretable way. We then introduce a new model of bidder behavior and individual rationality to understand how bidders act when prices are too competitive. Next, under our bidder behavior model, we use our new competitive auction class to derive the globally revenue-optimal efficient auction under two different knowledge models for the auction designer: knowledge of full bidder value distributions and knowledge of bidder value quantiles. Finally, we study a third knowledge model for the auction designer: knowledge of historical bidder valuation data. In this setting we present sample and computationally efficient learning algorithms that find high-revenue probably-efficient competitive auctions from bidder data. Our learning algorithms are instance adaptive and can be run in parallel across bidders, unlike most prior approaches to data-driven auction design.

Increasing Revenue in Efficient Combinatorial Auctions by Learning to Generate Artificial Competition

Causal discovery is essential across various scientific fields to uncover causal structures within data. Traditional methods relying on observational data have limitations due to confounding variables. This paper presents an optimization-based approach using integer programming (IP) to design minimal intervention sets that ensure causal structure identifiability. Our method provides exact and modular solutions, adaptable to different experimental settings and constraints. We demonstrate its effectiveness through comparative analysis across different settings demonstrating its applicability and robustness.

Causal Discovery by Interventions via Integer Programming

We propose a dynamic Computed Tomography (CT) reconstruction framework called STNF4D (SpatioTemporal-aware Neural Fields). First, we decompose the dynamic CT scene into four 3D hash grids. Compared to the plane decomposition method, this method enhances the model's capacity while keeping the representation compact and efficient. However, in densely predicted high-resolution dynamic CT scenes, the lack of constraints and hash conflicts in the hash grid features lead to obvious dot-like artifact and blurring in the reconstructed images. To address these issues, we propose the Spatiotemporal Transformer (ST-Former) that guides the model in selecting and optimizing features by sensing the spatiotemporal information in different hash grids, significantly improving the quality of reconstructed images. We conducted experiments on medical and industrial datasets covering various motion types, sampling schemes, and reconstruction resolutions. Experimental results show that our method outperforms the second-best by 5.99 dB and 4.27 dB in medical and industrial scenes, respectively.

Spatiotemporal-aware Neural Fields for Dynamic CT Reconstruction

Multi-object tracking is a challenging vision task that requires simultaneous reasoning about object detection and object association. Conventional solutions use frame as the basic unit and typically rely on a motion predictor that exploits the appearance features to associate detected candidates, leading to insufficient adaptability to long-term associations. In this study, we propose a section-based multi-object tracking approach that integrates a temporal coherent Object Flow Tracker (OFTrack), capable of achieving simultaneous multi-frame tracking by treating multiple consecutive frames as the basic processing unit, denoted as a “section”. Our OFTrack boosts the optical flow to the object flow by employing object perception and section-based motion estimation strategies. Object perception adopts object-aware sampling and scale-aware correlation to enable precise target discrimination. Motion estimation models the correlation of different objects in multi-frames via specialized temporal-spatial attention to achieve robust association in very long videos. Additionally, to address the oscillation of unpredictable trajectories in multi-frame estimation, we have designed temporal coherent enhancement including the trajectory masking pre-training and the smoothing constraint on trajectory curves. Comprehensive experiments on several widely used benchmarks demonstrate the superior performance of our approach.

Temporal Coherent Object Flow for Multi-Object Tracking

Exponential-family harmoniums (EFHs) generalize the restricted Boltzmann machine beyond Bernoulli random variables to other exponential families. Here we show how to extend the EFH beyond standard exponential families (Poisson, Gaussian, etc.), by allowing the sufficient statistics for the hidden units to be arbitrary functions of the observed data, parameterized by deep neural networks. This rules out the standard sampling scheme, block Gibbs sampling, so we replace it with a form of Langevin dynamics within Gibbs, inspired by a recent method for training Gaussian restricted Boltzmann machines (GRBMs). With Gibbs-Langevin, the GRBM can successfully model small data sets like MNIST and CelebA-32, but struggles with CIFAR-10, and cannot scale to larger images because it lacks convolutions. In contrast, our neural-network EFHs generate high-quality samples from CIFAR-10 and scale well to CelebA-HQ. Furthermore, we compared our model with a standard energy-based model of a similar neural-network architecture and same number of parameters. Our training method significantly improved image generation performance by approximately 25% across different datasets. Additionally, our model is competitive with noise conditional score network models, which utilize more complex neural networks (U-nets) and require considerably more sampling steps.

Exponential-Family Harmoniums with Neural Sufficient Statistics

As large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to generate diverse, complex prompts and dynamically explore the weaknesses of these models. To tackle these challenges, we introduce the **S**elf-**E**volving **A**dversarial **S**afety (**SEAS**) optimization framework, which includes both a SEAS dataset and a SEAS pipeline. The SEAS dataset comprises complex adversarial prompts, while the SEAS pipeline operates through three stages: Initialization, Attack, and Adversarial Optimization. This framework generates a diverse range of adversarial prompts and dynamically explores the model's vulnerabilities to enhance its security. Our contributions include a novel adversarial framework, a comprehensive safety dataset, and empirical evidence demonstrating the effectiveness of SEAS. After three iterations, the Target model achieves a security level comparable to GPT-4.

Premium content

Next from AAAI 2025

FunEditor: Achieving Complex Image Edits via Function Aggregation with Diffusion Models

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES