United States

Despite recent progress, existing frame interpolation methods still struggle with extremely high-resolution input and with challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce HiFI, a patch-based cascaded pixel diffusion model for frame interpolation that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low to high resolution, can help significantly with large or complex motion that requires both global context for a coarse solution and detailed context for high-resolution output. However, in contrast to prior work on cascaded diffusion models, which performs diffusion at increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and the prior solution. We show that this technique drastically reduces memory usage at inference time and allows a single model to solve both frame interpolation (the base model’s task) and spatial up-sampling at test time, which saves training cost. We show that HiFI helps significantly with high-resolution input and with complex repeated textures that require global context. HiFI matches or exceeds state-of-the-art performance on multiple benchmarks (Vimeo, Xiph, X-Test, SEPE-8K). On our newly introduced dataset, which focuses on particularly challenging cases, HiFI also significantly outperforms other baselines.
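The patch-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper name `patch_grid`, the `overlap` parameter, and the tiling rule are assumptions made purely for illustration. The sketch only shows how a high-resolution frame can be covered by fixed-size patches, so that a model always receives input at the same resolution.

```python
def patch_grid(height, width, patch, overlap=0):
    """Top-left corners of fixed-size patches that cover a height x width image.

    Hypothetical helper for illustration; not taken from the HiFI paper.
    """
    if patch > height or patch > width:
        raise ValueError("patch must fit inside the image")
    stride = patch - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than patch")

    def starts(size):
        # Regular steps, plus one final patch flush with the border if needed.
        pos = list(range(0, size - patch + 1, stride))
        if pos[-1] != size - patch:
            pos.append(size - patch)
        return pos

    return [(top, left) for top in starts(height) for left in starts(width)]


# Every patch has the same size, so the input to each model call stays
# constant no matter how large the full frame is.
corners = patch_grid(height=2160, width=3840, patch=512, overlap=64)
print(len(corners), "patches of 512x512")
```

In a cascade of this kind, each patch would presumably be processed together with the matching region of the upsampled lower-resolution result, which is how global context could reach a local patch.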

AAAI 2025

High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion

motion tracking

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)



Low-light image enhancement (LIE) aims at precisely and efficiently recovering an image degraded by poor illumination. Recent advanced LIE techniques use deep neural networks, which require many paired low-light/normal-light images, large numbers of network parameters, and substantial computational resources. As a result, their practicality is limited. In this work, we devise a novel unsupervised LIE framework based on diffusion priors and lookup tables (DPLUT) to achieve efficient low-light image recovery. The proposed approach comprises two critical components: a light adjustment lookup table (LLUT) and a noise suppression lookup table (NLUT). LLUT is optimized with a set of unsupervised losses. It aims at predicting pixel-wise curve parameters for the dynamic range adjustment of a specific image. NLUT is designed to remove the noise that is amplified when the image is brightened. As diffusion models are sensitive to noise, diffusion priors are introduced to achieve high-performance noise suppression. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in terms of visual quality and efficiency.
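To make the lookup-table idea concrete, here is a small sketch of curve-based brightening through a precomputed table. The abstract does not specify the curve, so the quadratic form below is an assumption, as are the names `curve`, `build_lut`, and `apply_lut`. The sketch also uses a single global parameter `alpha`, whereas the abstract describes pixel-wise curve parameters.

```python
def curve(x, alpha):
    """Quadratic adjustment curve on [0, 1] (assumed form, not from the paper).

    For alpha in [-1, 1] the curve is monotonic and fixes 0 and 1.
    """
    return x + alpha * x * (1.0 - x)


def build_lut(alpha):
    """Precompute the curve once for all 256 possible 8-bit intensities."""
    return [round(255 * curve(v / 255, alpha)) for v in range(256)]


def apply_lut(pixels, lut):
    """Enhancement then costs one table lookup per pixel."""
    return [lut[p] for p in pixels]


lut = build_lut(alpha=0.8)
print(apply_lut([0, 32, 128, 255], lut))
```

The design point is that the expensive part (evaluating the curve) is done once per table, so applying it to an image is a cheap indexing operation.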

DPLUT: Unsupervised Low-light Image Enhancement with Lookup Tables and Diffusion Priors

Existing low-light image enhancement (LIE) methods have achieved noteworthy success in handling synthetic distortions, yet they often fall short in practical applications. The limitations arise from two inherent challenges in real-world LIE: 1) collecting distorted/clean image pairs is often impractical and sometimes impossible, and 2) accurately modeling complex degradations is a non-trivial problem. To overcome them, we propose the Attribute Guidance Diffusion framework (AGLLDiff), a training-free method for effective real-world LIE. Instead of specifically defining the degradation process, AGLLDiff shifts the paradigm and models the desired attributes of normal-light images, such as exposure, structure, and color. These attributes are readily available and impose no assumptions about the degradation process, so they can guide the diffusion sampling process toward a reliable, high-quality solution space. Extensive experiments demonstrate that our approach outperforms the current leading unsupervised LIE methods across benchmarks in terms of distortion-based and perceptual-based metrics, and that it performs well even under sophisticated in-the-wild degradation.

AGLLDiff: Guiding Diffusion Models Towards Unsupervised Training-free Real-world Low-light Image Enhancement

Facial expression recognition (FER) has suffered from label ambiguity due to the inherent subjectivity of facial expressions. Additionally, the class imbalance prevalent in real-world scenarios further complicates the challenges in FER. Although many studies have shown impressive improvements, they typically address only one of these issues, leading to suboptimal results. To address both challenges simultaneously, we propose a novel framework called Navigating Label Ambiguity (NLA), which is robust in real-world conditions. The core idea behind NLA is to adaptively assign weights based on sample ambiguity while minimizing the impact of noise. To achieve this, NLA consists of two main components: Noise-aware Adaptive Weighting (NAW) and consistency regularization. Specifically, NAW adjusts weights by assigning higher importance to ambiguous samples and lower importance to noisy ones, based on the relationship between the intermediate prediction scores for the ground truth and the nearest negative. We also enhance the reliability of our NLA by incorporating a regularization term that ensures consistent latent distributions. Consequently, NLA effectively handles not only noise but also class imbalance by allowing the model to progressively focus on more challenging ambiguous samples that mainly belong to the minority class. Extensive experiments demonstrate that NLA outperforms existing methods in both overall and mean accuracy, confirming its robustness against noise and class imbalance. To the best of our knowledge, we are the first to tackle both problems within a single framework.

Navigating Label Ambiguity for Facial Expression Recognition in the Wild

Knowledge Distillation (KD) is essential in transferring dark knowledge from a large teacher to a small student network, such that the student can be much more efficient than the teacher while retaining comparable accuracy. Existing KD methods, however, rely on a large teacher trained specifically for the target task, which is both inflexible and inefficient. In this paper, we argue that an SSL-pretrained model can effectively act as the teacher and that its dark knowledge can be captured by the coordinate system, or linear subspace, in which the features lie. We then need only one forward pass of the teacher, after which we tailor the coordinate system (TCS) for the student network. Our TCS method is teacher-free, applies to diverse architectures, works well for KD and practical few-shot learning, and allows cross-architecture distillation with a large capacity gap. Experiments show that TCS achieves significantly higher accuracy than state-of-the-art KD methods, while requiring only roughly half of their training time and GPU memory costs.

All You Need in Knowledge Distillation Is a Tailored Coordinate System

Recovering a 4D world from monocular video is a crucial yet challenging task.
Conventional methods usually rely on the assumptions of multi-view videos, known camera parameters, or static scenes.
In this paper, we relax all these constraints and tackle a highly ambitious but practical task: With only one monocular video without camera parameters, we aim to recover the dynamic 3D world alongside the camera poses.
To solve this, we introduce **GFlow**, a new framework that utilizes only 2D priors (depth and optical flow) to lift a video to a 4D scene, as a flow of 3D Gaussians through space and time. 
GFlow starts by segmenting the video into still and moving parts, then alternates between optimizing camera poses and the dynamics of the 3D Gaussian points.
This method ensures consistency among adjacent points and smooth transitions between frames.
Since dynamic scenes continually introduce new visual content, we present prior-driven initialization and pixel-wise densification strategies for Gaussian points to integrate new content.
By combining all these techniques, GFlow transcends the boundaries of 4D recovery from casual videos; it naturally enables tracking of points and segmentation of moving objects across frames.
Additionally, GFlow estimates the camera poses for each frame, enabling novel view synthesis by changing camera pose. This capability facilitates extensive scene-level or object-level editing, highlighting GFlow's versatility and effectiveness.

GFlow: Recovering 4D World from Monocular Video

Cascade ranking architecture, composed of matching, pre-ranking, ranking, and re-ranking stages, is usually adopted to balance efficiency and effectiveness in real-world recommendation systems (RS). As a middle stage of an RS, pre-ranking aims to quickly filter out the low-quality items selected at the matching stage and forward high-quality items to the ranking stage. Existing pre-ranking approaches suffer from two main problems: 1) the Sample Selection Bias (SSB) problem, which heavily limits improvement in filtering out low-quality items because the data flow between stages is ignored; and 2) the Ranking Consistency (RC) problem, in which the ranked lists of the ranking stage and the preceding pre-ranking stage may be inconsistent. As a result, competitive items with high scores at the ranking stage may not be selected because of low scores at the pre-ranking stage. Both problems may cause sub-optimal performance, but previous works usually focus on only one of them. In this paper, we propose a novel Sample Debias and Ranking Consistency Joint Learning Framework (SDCL) to jointly alleviate the SSB and RC problems. SDCL consists of two main modules: 1) a Multi-Task Distillation Module (MTD), which enhances the ability to identify high-quality items by distilling knowledge across all tasks simultaneously from the more complex ranking model, which is jointly trained with the pre-ranking model; and 2) an Adaptive Negative Sample Learning Module (ANSL), which improves the filtering of low-quality items by adaptively adjusting the learning weights of negative samples based on the current performance of the model. SDCL seamlessly integrates the two modules in an end-to-end multi-task learning framework. Evaluations on both real-world large-scale traffic logs and an online A/B test demonstrate the efficacy and superiority of SDCL.

Both Supply and Precision: Sample Debias and Ranking Consistency Joint Learning for Large Scale Pre-Ranking System

Driven by the rapid development of deep learning technology, the YOLO series has set a new benchmark for real-time object detectors. Additionally, transformer-based structures have emerged as the most powerful solution in the field, greatly extending the model's receptive field and achieving significant performance improvements. However, this improvement comes at a cost, as the quadratic complexity of the self-attention mechanism increases the computational burden of the model. To address this problem, we introduce a simple yet effective baseline approach called Mamba YOLO. Our contributions are as follows: 1) We propose the ODMamba backbone, which introduces a State Space Model (SSM) with linear complexity to address the quadratic complexity of self-attention. Unlike other Transformer-based and SSM-based methods, ODMamba is simple to train and requires no pretraining. 2) To meet real-time requirements, we design the macro structure of ODMamba and determine the optimal stage ratio and scaling size. 3) We design the RG Block, which employs a multi-branch structure to model the channel dimensions and addresses possible limitations of SSMs in sequence modeling, such as insufficient receptive fields and weak image localization. This design captures localized image dependencies more accurately. Extensive experiments on the publicly available COCO benchmark dataset show that Mamba YOLO achieves state-of-the-art performance compared to previous methods. Specifically, a tiny version of Mamba YOLO achieves a 7.5% improvement in mAP on a single 4090 GPU with an inference time of 1.5 ms.
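The linear-complexity claim for state space models can be seen in a toy recurrence. The sketch below is a generic scalar linear SSM, not the ODMamba backbone or Mamba's selective scan; the function name `ssm_scan` and the fixed scalar parameters `a`, `b`, `c` are illustrative assumptions. Each step touches only the previous state, so the cost grows linearly with sequence length, whereas self-attention compares every pair of positions.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c * h_t over a sequence.

    One pass over the input: O(len(xs)) time and O(1) state.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # state update depends only on the previous state
        ys.append(c * h)   # readout of the current state
    return ys


# An impulse decays geometrically through the state.
print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))  # [1.0, 0.5, 0.25]
```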

Mamba YOLO: A Simple Baseline for Object Detection with State Space Model

The rapid advancement in self-supervised representation learning has highlighted its potential to leverage unlabeled data for learning rich visual representations. However, the existing techniques, particularly those employing different augmentations of the same image, often rely on a limited set of simple transformations that cannot fully capture variations in the real world. This constrains the diversity and quality of samples, which leads to sub-optimal representations. In this paper, we introduce a framework that enriches the self-supervised learning (SSL) paradigm by utilizing generative models to produce semantically consistent image augmentations. By directly conditioning generative models on a source image, our method enables the generation of diverse augmentations while maintaining the semantics of the source image, thus offering a richer set of data for SSL. Our extensive experimental results on various joint-embedding SSL techniques demonstrate that our framework significantly enhances the quality of learned visual representations by up to 10% Top-1 accuracy in downstream tasks. This research demonstrates that incorporating generative models into the joint-embedding SSL workflow opens new avenues for exploring the potential of synthetic data. This development paves the way for more robust and versatile representation learning techniques.

Can Generative Models Improve Self-Supervised Representation Learning?

In today’s information-rich era, users rely heavily on recommender systems to identify relevant content. Graph structures, renowned for their ability to model intricate user-content relationships, have become essential to these systems. However, the accuracy of recommendations hinges critically on the quality of node representations within these graphs. Personalized recommendations strive to enhance uniqueness by maximizing the dissimilarity between representations (known as uniformity) while simultaneously ensuring that the representations align closely with the content users engage with (known as alignment). Balancing these conflicting objectives remains a challenge for optimal recommendation performance. To tackle this challenge, we propose an approach called SIURec, which differs significantly from previous studies. Rather than relying on manual weight selection between uniformity and alignment and optimizing uniformity solely on the final representation, SIURec adopts an adaptive adjustment method that learns the optimal weight between uniformity and alignment automatically. By optimizing uniformity at every convolutional layer, SIURec captures users’ sub-interests more effectively, ultimately leading to improved recommendation accuracy. Experimental results on four datasets demonstrate that SIURec achieves superior learning of uniformity (with an average improvement of 4.26% in accuracy compared to eleven SOTA methods) and exhibits robustness across different hyperparameter settings. Our implementation is available at https://anonymous.4open.science/r/SIURec-5A0D.
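The two objectives named in the abstract can be sketched using the alignment and uniformity losses that are standard in the representation-learning literature. This is an assumption about the general form only, not SIURec's actual objective: the function names and the hand-set weight `gamma` are illustrative, and the abstract states that SIURec learns this weight automatically and applies uniformity at every convolutional layer.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment(users, items):
    """Mean squared distance between each user and its paired positive item."""
    return sum(sq_dist(u, i) for u, i in zip(users, items)) / len(users)


def uniformity(embs, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower = more spread out)."""
    pairs = [(i, j) for i in range(len(embs)) for j in range(i + 1, len(embs))]
    mean = sum(math.exp(-t * sq_dist(embs[i], embs[j])) for i, j in pairs) / len(pairs)
    return math.log(mean)


def combined_loss(users, items, gamma):
    # gamma is fixed by hand here; a learned weight would replace it.
    return alignment(users, items) + gamma * (uniformity(users) + uniformity(items)) / 2


users = [normalize(v) for v in ([1.0, 0.2], [-0.3, 1.0], [-1.0, -0.5])]
items = [normalize(v) for v in ([0.9, 0.1], [-0.1, 1.0], [-0.8, -0.7])]
print(combined_loss(users, items, gamma=0.5))
```

The tension the abstract describes is visible here: pulling paired embeddings together lowers `alignment`, while spreading all embeddings apart lowers `uniformity`, and `gamma` sets the trade-off.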

Sub-Interest-Aware Representation Uniformity for Recommender System

Conditioning image generation facilitates seamless editing and the creation of photorealistic images. However, conditioning on noisy or Out-of-Distribution (OoD) images poses significant challenges, particularly in balancing fidelity to the input and realism of the output. We introduce Confident Ordinary Differential Editing (CODE), a novel approach for image synthesis that effectively handles OoD guidance images. Utilizing a diffusion model as a generative prior, CODE enhances images through score-based updates along the probability-flow Ordinary Differential Equation (ODE) trajectory. This method requires no task-specific training, handcrafted modules, or assumptions, and is compatible with any diffusion model. Positioned at the intersection of conditional image generation and blind image restoration, CODE operates in a fully blind manner, relying solely on a pre-trained generative model. Our method introduces an alternative approach to blind restoration: instead of targeting a specific ground truth image based on assumptions about the underlying corruption, CODE aims to increase the likelihood of the input image while maintaining fidelity. This results in the most probable in-distribution image around the input. Our contributions are twofold. First, CODE introduces a novel ODE-based editing method that provides enhanced control, realism, and fidelity compared to its SDE-based counterparts. Second, we introduce a confidence interval-based clipping method, which improves CODE’s effectiveness by allowing it to disregard certain pixels or information, thus enhancing the restoration process in a blind manner. Experimental results demonstrate CODE’s effectiveness over existing methods, particularly in scenarios involving severe degradation or OoD inputs.
