United States

Scene Text Recognition (STR) methods have demonstrated robust performance in word-level text recognition. However, in real applications the text image is sometimes long due to detected with multiple horizontal words. It triggers the requirement to build long text recognition models from readily available short (i.e., word-level) text datasets, which has been less studied previously. In this paper, we term this task Out of Length (OOL) text recognition. We establish the first Long Text Benchmark (LTB) to facilitate the assessment of different methods in long text recognition. Meanwhile, we propose a novel method called OOL Text Recognition with sub-String Matching (SMTR). SMTR comprises two cross-attention-based modules: one encodes a sub-string containing multiple characters into next and previous queries, and the other employs the queries to attend to the image features, matching the sub-string and simultaneously recognizing its next and previous character. SMTR can recognize text of arbitrary length by iterating the process above. To avoid being trapped in recognizing similar sub-strings, we introduce a regularization training to compel SMTR to effectively discover subtle differences between similar sub-strings for precise matching. In addition, we propose an inference augmentation strategy to alleviate confusion caused by identical sub-strings in the same text and improve the overall recognition efficiency. Extensive experimental results reveal that SMTR, even when trained exclusively on short text, outperforms existing methods in public short text benchmarks and exhibits a clear advantage on LTB. Code and LTB will be released.

AAAI 2025

Out of Length Text Recognition with Sub-String Matching

categorization

object detection

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



This paper investigates a challenging problem of zero-shot learning in the multi-label scenario (MLZSL), wherein the model is trained to recognize multiple unseen classes within a sample (e.g., an image) based on seen classes and auxiliary knowledge, e.g., semantic information. Existing methods usually resort to analyzing the relationship of various seen classes residing in a sample from the dimension of spatial or semantic characteristics and transferring the learned model to unseen ones. However, they neglect the integrity of local and global features. Although the use of the attention structure will accurately locate local features, especially objects, it will significantly lose its integrity, and the relationship between classes will also be affected. Rough processing of global features will also directly affect comprehensiveness. This neglect will make the model lose its grasp of the main components of the image. Relying only on the local existence of seen classes during the inference stage introduces unavoidable bias. In this paper, we propose a novel and comprehensive visual-semantic framework for MLZSL, dubbed Epsilon, to fully make use of such properties and enable a more accurate and robust visual-semantic projection. In terms of spatial information, we achieve effective refinement by group aggregating image features into several semantic prompts. It can aggregate semantic information rather than class information, preserving the correlation between semantics. In terms of global semantics, we use global forward propagation to collect as much information as possible to ensure that semantics are not omitted. Experiments on large-scale MLZSL benchmark datasets NUS-Wide and Open-Images-v4 demonstrate that the proposed Epsilon outperforms other state-of-the-art methods with large margins.

Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning

Dynamic Vision Sensors (DVS) capture event data with high temporal resolution and low power consumption, presenting a more efficient solution for visual processing in dynamic and real-time scenarios compared to conventional video capture methods. Event data augmentation serve as an essential method for overcoming the limitation of scale and diversity in event datasets.   Our comparative experiments demonstrate that the two factors, spatial integrity and temporal continuity, can significantly affect the capacity of event data augmentation, which are guarantee for maintaining the sparsity and high dynamic range characteristics unique to event data. However, existing augmentation methods often neglect the preservation of spatial integrity and temporal continuity. To address this, we developed a novel event data augmentation strategy EventZoom, which employs a temporal progressive strategy, embedding transformed samples into the original samples through progressive scaling and shifting. The scaling process avoids the spatial information loss associated with cropping, while the progressive strategy prevents interruptions or abrupt changes in temporal information. We validated EventZoom across various supervised learning frameworks.  The experimental results show that EventZoom consistently outperforms existing event data augmentation methods with SOTA performance. For the first time, we have concurrently employed Semi-supervised and Unsupervised learning to verify feasibility on event augmentation algorithms, demonstrating the applicability and effectiveness of EventZoom as a powerful event-based data augmentation tool in handling real-world scenes with high dynamics and variability environments.

EventZoom: A Progressive Approach to Event-Based Data Augmentation for Enhanced Neuromorphic Vision

There are two crucial aspects of reliable autonomous driving systems: the reasoning behind decision-making and the precision of environmental perception. This paper introduces DME-Driver, a new autonomous driving system that enhances performance and robustness by fully leveraging the two crucial aspects. This system comprises two main models. The first, the Decision Maker, is responsible for providing logical driving instructions. The second, the Executor, receives these instructions and generates precise control signals for the vehicles. To ensure explainable and reliable driving decisions, we build the Decision-Maker based on a large vision language model. This model follows the logic employed by experienced human drivers and simulates making decisions in a safe and reasonable manner. On the other hand, the generation of accurate control signals relies on precise and detailed environmental perception, where 3D scene perception models excel. Therefore, a planning-oriented perception model is employed as the Executor. It translates the logical decisions made by the Decision-Maker into accurate control signals for the self-driving cars. To effectively train the proposed system, a new dataset named Human-driver Behavior and Decision-making (HBD) dataset has been collected. This dataset encompasses a diverse range of human driver behaviors and their underlying motivations. By leveraging this dataset, our system achieves high-precision planning accuracy through a logical thinking process.

DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving

Shadow is a phenomenon that degenerates the image quality and decreases the performance of downstream vision algorithms. Despite that current image shadow removal methods have achieved promising progress, many of them require an externally obtained shadow mask as a necessary part of the input data, which not only introduces additional workload but also leads to degenerated performance near the shadow boundary due to the inaccuracy of the mask. Some of them do not require the shadow mask, however, they need to simultaneously consider the restoration of the brightness and color information along with the preservation of the texture and structure information inside the shadow region without external clues, which poses highly ill-posedness and makes the results prone to artifacts. In this paper, we propose **Pol-ShaRe**, the first **Pol**arization-guided image **Sha**dow **Re**moval solution, to remove shadow in a mask-free manner with fewer artifacts. Specifically, it consists of a two-stage pipeline to relieve the ill-posedness and a neural network tailored to the pipeline to suppress the artifacts. Experimental results show that our Pol-ShaRe achieves state-of-the-art performance on both synthetic and real-world images.

Polarization Guided Mask-Free Shadow Removal

Drug response prediction (DRP) is a longstanding challenge in modern oncology that underpins personalized treatment. Early DRP methods, trained on label-rich cell line samples, suffer from performance degradation when applied to label-scarce patient samples due to the distribution shift. Recently, a few transfer learning efforts have addressed this issue by aligning cell line (source domain) and patient (target domain) data via unsupervised domain adaptation (UDA). However, these efforts often treat each drug's response prediction as an isolated task, requiring model retraining when the drug changes; and focus only on aligning data distributions as a whole, neglecting the category (e.g., different cancers or tissues) confusion problem. To address these limitations, we propose a knowledge-guided domain adaptation model to transfer the DRP from cell lines to patients, named TransDRP. Specifically, TransDRP operates in two phases: pre-training and adaptation. In the first phase, we pre-train a multi-label graph neural network using molecular knowledge, to simultaneously predict responses for various drugs and capture their interdependencies. In the second phase, we implement a global-local domain adversarial strategy with clinical knowledge, to encourage representation alignment within same cancer categories and separation among different cancer categories across domains. Extensive experiments demonstrate that TransDRP outperforms state-of-the-art UDA methods in both transfer efficiency and precision for the patient DRP.

Knowledge-guided domain adaptation model for transferring drug response prediction from cell lines to patients

Model counting is a fundamental task that involves determining the number of satisfying assignments to a logical formula, typically in conjunctive normal form (CNF). While CNF model counting has received extensive attention over recent decades, interest in Pseudo-Boolean (PB) model counting is emerging partly due to the greater flexibility of PB formulas. As such, we observed feature gaps in existing PB counters such as a lack of support for projected and incremental settings, which could hinder adoption.

In this work, our main contribution is the introduction of the PB model counter PI-PBC, the first exact PB model counter with support for projected and incremental model counting. Our counter, PI-PBC, uses our Least Occurrence Weighted Min Degree (LOW-MD) computation ordering heuristic to support projected model counting and a cache mechanism to enable incremental model counting. In our evaluations, PI-PBC completed at least 1.40x the number of benchmarks of competing methods for projected model counting and at least 1.18x of competing methods in incremental model counting.

Towards Projected and Incremental Pseudo-Boolean Model Counting

Large vision-language models show tremendous potential in understanding visual information through human languages. However, they are prone to suffer from object hallucination, i.e., the generated image descriptions contain objects that do not exist in the image. In this paper, we reveal that object hallucination can be attributed to overconfidence in irrelevant visual features when soft visual tokens map to the LLM's word embedding space. Specifically, by figuring out the semantic similarity between visual tokens and LLM's word embedding, we observe that the smoothness of similarity distribution strongly correlates with the emergence of object hallucinations. To mitigate hallucinations, we propose using the Variational Information Bottleneck (VIB) to alleviate overconfidence by introducing stochastic noise, facilitating the constraining of irrelevant information. Furthermore, we propose an entropy-based noise-controlling strategy to enable the injected noise to be adaptively constrained regarding the smoothness of the similarity distribution. We adapt the proposed AdaVIB across distinct model architectures. Experimental results demonstrate that the proposed AdaVIB mitigates object hallucinations by effectively alleviating the overconfidence in irrelevant visual features, with consistent improvements on two object hallucination benchmarks.

Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow

Learned image compression (LIC) has achieved state-of-the-art rate-distortion performance, deemed promising for next-generation image compression techniques. However, pre-trained LIC models usually suffer from significant performance degradation when applied to out-of-training-domain images, implying their poor generalization capabilities. To tackle this problem, we propose a few-shot domain adaptation method for LIC by integrating plug-and-play adapters into pre-trained models. Drawing inspiration from the analogy between latent channels and frequency components, we examine domain gaps in LIC and observe that out-of-training-domain images disrupt pre-trained channel-wise decomposition. Consequently, we introduce a method for channel-wise re-allocation using convolution-based adapters and low-rank adapters, which are lightweight and compatible to mainstream LIC schemes. Extensive experiments across multiple domains and multiple representative LIC schemes demonstrate that our method significantly enhances pre-trained models, achieving comparable performance to H.266/VVC intra coding with merely 25 target-domain samples. Additionally, our method matches the performance of full-model finetune while transmitting fewer than 2% of the parameters.

Few-Shot Domain Adaptation for Learned Image Compression

Online to batch conversion involves constructing a new batch learner by utilizing a series of models generated by an existing online learning algorithm, for achieving generalization guarantees under i.i.d assumption. However, when applied to real-world streaming applications such as streaming recommender systems, the data stream may be sampled from time-varying distributions instead of persistently being i.i.d. This poses a challenge in terms of out-of-distribution (OOD) generalization. Existing approaches employ fixed conversion mechanisms that are unable to adapt to novel testing distributions, hindering the testing accuracy of the batch learner. To address these issues, we propose AdaO2B, an adaptive online to batch conversion approach under the bandit setting. AdaO2B is designed to be aware of the distribution shifts in the testing data and achieves OOD generalization guarantees. Specifically, AdaO2B can dynamically combine the sequence of models learned by a contextual bandit algorithm and determine appropriate combination weights using a context-aware weighting function. This innovative approach allows for the conversion of a sequence of models into a batch learner that facilitates OOD generalization. Theoretical analysis provides justification for why and how the learned adaptive batch learner can achieve OOD generalization error guarantees. Experimental results have demonstrated that AdaO2B significantly outperforms state-of-the-art baselines on both synthetic data and real-world data.

AdaO2B: Adaptive Online to Batch Conversion for Out-of-Distribution Generalization

In recent years, agents have become capable of communicating seamlessly via natural language and navigating in environments that involve cooperation and competition, a fact that can introduce social dilemmas. Due to the interleaving of cooperation and competition, understanding agents' decision-making in such environments is challenging, and humans can benefit from obtaining explanations. However, such environments and scenarios have rarely been explored in the context of explainable AI. While some explanation methods for cooperative environments can be applied in mixed-motive setups, they do not address inter-agent competition, cheap-talk, or implicit communication by actions. In this work, we design explanation methods to address these issues. Then, we proceed to establish generality and demonstrate the applicability of the methods to three games with vastly different properties.
Lastly, we demonstrate the effectiveness and usefulness of the methods for humans in two mixed-motive games. The first is a challenging 7-player game called no-press Diplomacy. The second is a 3-player game inspired by the Prisoner's Dilemma, featuring communication in natural language.

Premium content

Next from AAAI 2025

Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES