United States

Quantifying image complexity at the entity level is straightforward, but the assessment of semantic complexity has been largely overlooked. As a matter of fact, there are differences in semantic complexity across images. For example, the "Cookie Theft" picture is widely used to assess human language and cognitive abilities. Compared to most images, it contains richer semantics, allowing it to tell a vivid and engaging story. There is a need for more images like "Cookie Theft" to cater to people from different cultural backgrounds and eras. Additionally, semantically rich images can benefit the development of vision models, as images with limited semantics are becoming less challenging for these models. Assessing the semantic complexity requires human experts and empirical evidence. Automatic evaluation of how semantically rich an image is will benefit not only researchers in human cognition but also AI models. In response, we propose the Image Semantic Assessment (ISA) task to address this problem. We introduce the first ISA dataset and a novel method that leverages language to solve this vision problem. Experiments on our dataset demonstrate the effectiveness of our approach.

AAAI 2025

Is Your Image a Good Storyteller?

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Measuring the similarity of the internal representations of deep neural networks is an important and challenging problem. Model stitching has been proposed as a possible approach, where two half-networks are connected by mapping the output of the first half-network to the input of the second one. The representations are considered functionally similar if the resulting stitched network achieves good task-specific performance. The mapping is normally created by training an affine stitching layer on the task at hand while freezing the two half-networks, a method called task loss matching. Here, we argue that task loss matching may be very misleading as a similarity measure. For example, it can indicate very high similarity between very distant layers, whose representations are known to have different functional properties. Moreover, it can indicate that very distant layers are more similar than architecturally corresponding layers. Even more surprisingly, when comparing layers within the same network, task loss matching often indicates that some layers are more similar to a given layer than that layer is to itself. We argue that the main reason behind these problems is that task loss matching tends to create out-of-distribution representations to improve task-specific performance. We demonstrate that direct matching (where the mapping minimizes the distance between the stitched representations) does not suffer from these problems. We compare task loss matching, direct matching, and well-known similarity metrics such as CCA and CKA. We conclude that direct matching strikes a good balance between the structural and functional requirements for a good similarity measure.

How Not to Stitch Representations to Measure Similarity: Task Loss Matching Versus Direct Matching

Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs' capability on common coding tasks (e.g., bubble sort, greatest common divisor), leaving **domain-specific coding tasks** (e.g., computation, system, cryptography) unexplored. To fill this gap, we propose a multi-domain code benchmark, DOMAINEVAL, designed to evaluate LLMs' coding capabilities thoroughly. Our pipeline works in a fully automated manner, enabling a push-button construction from code repositories into formatted subjects under study. Interesting findings are observed by evaluating 12 representative LLMs against DOMAINEVAL. We notice that LLMs are generally good at **computation** tasks while falling short on **cryptography and system** coding tasks. The performance gap can be as much as 68.94% (80.94% - 12.0%) in some LLMs. We also observe that generating more samples can increase the overall performance of LLMs, while the domain bias may even increase. The contributions of this study include a code generation benchmark dataset DOMAINEVAL, encompassing six popular domains, a fully automated pipeline for constructing code benchmarks, and an identification of the limitations of LLMs in code generation tasks based on their performance on DOMAINEVAL, providing directions for future research improvements.

DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation
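The abstract's remark that generating more samples raises overall performance can be illustrated with the standard unbiased pass@k estimator popularized by HumanEval. That DOMAINEVAL reports exactly this metric is an assumption here; the function name `pass_at_k` and the sample counts are hypothetical. The last lines reproduce the domain gap quoted in the abstract.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct.

    Probability that at least one of k samples, chosen without
    replacement from the n generated ones, passes the tests.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: with 1 correct sample out of 4, drawing more helps.
one_draw = pass_at_k(n=4, c=1, k=1)
two_draws = pass_at_k(n=4, c=1, k=2)

# Domain gap quoted in the abstract, in percentage points.
gap = round(80.94 - 12.0, 2)
```

Because the estimator is monotone in `k`, a domain with a very low single-sample rate can still trail far behind a strong domain after more sampling, which is consistent with the abstract's note that the domain bias may persist or grow.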

Information retrieval methods often rely on a single embedding model trained on large, general-domain datasets like MSMARCO. While this approach can yield a retriever with reasonable overall performance, training a model on domain-specific data can yield better results within that domain. While prior work in information retrieval has tackled this problem via multi-task training or providing domain knowledge to an instruction-following retriever, the topic of combining different expert domain-specific retrievers has remained unexplored, despite its popularity in language model generation settings. In this work, we introduce RouterRetriever, a retrieval model that leverages multiple domain-specific experts and a routing mechanism to select the most appropriate expert for each query. It is lightweight and allows easy addition or removal of gates without additional training. Evaluation on the BEIR benchmark demonstrates that RouterRetriever outperforms both MSMARCO-trained (+2.1 absolute nDCG@10) and multi-task trained (+3.2) models. To achieve this, we developed a routing mechanism that shows higher performance over other routing techniques (+1.8 on average) that have been successfully employed in language modeling settings. Moreover, the benefit generalizes to other datasets even when there are no experts for that dataset. RouterRetriever is the first work to demonstrate the benefits of using multiple domain-specific expert embedding models with effective routing techniques compared to relying on a single embedding model for all domains in retrieval tasks.

RouterRetriever: Routing over a Mixture of Expert Embedding Models
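The routing idea can be sketched as follows. The abstract does not specify how the router scores experts, so this example assumes a simple scheme: each expert is represented by one centroid vector and the query goes to the expert with the highest cosine similarity. The names `cosine`, `route`, and the toy expert vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def route(query_vec, expert_centroids):
    """Return the name of the expert whose centroid is closest to the query."""
    return max(expert_centroids,
               key=lambda name: cosine(query_vec, expert_centroids[name]))


# Toy 3-dimensional centroids, one per hypothetical domain expert.
experts = {
    "biomedical": [0.9, 0.1, 0.0],
    "finance": [0.1, 0.9, 0.0],
    "general": [0.3, 0.3, 0.3],
}

first_choice = route([0.0, 1.0, 0.1], experts)

# Removing an expert needs no retraining in this sketch: drop its entry.
remaining = {name: vec for name, vec in experts.items() if name != "finance"}
fallback_choice = route([0.0, 1.0, 0.1], remaining)
```

Keeping the router as a lookup over per-expert vectors is what makes adding or removing an expert cheap in this sketch; the selected expert's embedding model would then encode the query for retrieval.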

In recent years, as data and problem sizes have increased, distributed learning has become an essential tool for training high-performance models. However, the communication bottleneck, especially for high-dimensional data, is a challenge. Several techniques have been developed to overcome this problem. These include communication compression and the implementation of local steps, which work particularly well when there is similarity of local data samples. In this paper, we study the synergy of these two approaches for efficient distributed optimization. Using variance reduction and error feedback frameworks, we present the first theoretically grounded accelerated algorithms with unbiased and biased compression for distributed problems under similarity. In terms of communication time, our theory gives $\tilde{\mathcal{O}} \left(  1+\left[ M^{-\frac{1}{4}} + \omega^{-\frac{1}{2}} \right]\sqrt{\frac{\delta}{\mu}}  \right)$ complexity for unbiased compressors and $\tilde{\mathcal{O}}\left(1+\beta^{\frac{1}{4}}\sqrt{\frac{\delta}{\mu}}\right)$ for biased ones, where $M$ is the number of computational nodes, $\beta$ is the compression power, $\delta$ is the similarity measure and $\mu$ is the strong convexity parameter of the objective. Our theoretical results set a new record and are confirmed by experiments on different average losses and datasets.

Accelerated Methods with Compressed Communications for Distributed Optimization Problems Under Data Similarity
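To get a feel for the two bounds, the expressions inside the $\tilde{\mathcal{O}}(\cdot)$ can be evaluated numerically exactly as they appear in the abstract. This is only an illustration: constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$ are ignored, $\omega$ is not defined in the abstract and is treated here as an opaque positive compressor parameter, and the function names and sample values are hypothetical.

```python
import math


def unbiased_bound(M: float, omega: float, delta: float, mu: float) -> float:
    """1 + (M^(-1/4) + omega^(-1/2)) * sqrt(delta / mu), as written in the abstract."""
    return 1.0 + (M ** -0.25 + omega ** -0.5) * math.sqrt(delta / mu)


def biased_bound(beta: float, delta: float, mu: float) -> float:
    """1 + beta^(1/4) * sqrt(delta / mu), as written in the abstract."""
    return 1.0 + beta ** 0.25 * math.sqrt(delta / mu)


# Hypothetical values: 16 nodes, omega = 4, beta = 16, delta / mu = 1 / 4.
unbiased_value = unbiased_bound(M=16, omega=4, delta=1.0, mu=4.0)
biased_value = biased_bound(beta=16, delta=1.0, mu=4.0)
```

Both expressions scale with $\sqrt{\delta/\mu}$, so stronger similarity between local data (smaller $\delta$) shrinks the communication term toward the constant 1.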

The performance of graph neural networks (GNNs) has been shown to depend on whether effective node information is passed sufficiently. Discrete curvature (*Ricci curvature*) is used to study graph connectivity and information propagation efficiency from a geometric perspective, and has been used in recent years to explore efficient message-passing structures for GNNs. However, most empirical studies are based on directly observed graph structures or heuristic topological assumptions, and lack in-depth exploration of the underlying optimal information transport structures for downstream tasks. We suggest that graph curvature optimization is more in-depth and essential than directly rewiring or learning the graph structure, offering richer message-passing characterization and better information transport interpretability. From both graph geometry and information theory perspectives, we propose the novel Discrete **Curv**ature **G**raph **I**nformation **B**ottleneck (**CurvGIB**) framework to optimize the information transport structure and learn better node representations simultaneously. CurvGIB advances the *Variational Information Bottleneck* (*VIB*) principle for Ricci curvature optimization to learn the optimal information transport pattern for specific downstream tasks. The learned Ricci curvature is used to refine the optimal transport structure of the graph, and the node representation is fully and efficiently learned. Moreover, to address the computational complexity of Ricci curvature differentiation, we combine *Ricci flow* and *VIB* to derive a curvature optimization approximation that forms a tractable IB objective function. Extensive experiments on various datasets demonstrate the superior effectiveness and interpretability of CurvGIB.

Discrete Curvature Graph Information Bottleneck

The study of enhancing the robustness against adversarial examples has always been a topic of widespread interest, leading to the development of numerous adversarial defense techniques. These methods aim to mitigate the effects of deliberately introduced perturbations in input data designed to deceive models and reduce their accuracy. Evaluating the effectiveness of these defense strategies poses a significant challenge. The recently introduced AutoAttack technique has been recognized as a standardized method for assessing model robustness. However, the computational demands of the AutoAttack method significantly limit its applicability, underscoring the urgent need for efficient evaluation techniques. Our research indicates that relaxing constraints at specific stages of the attack can lead to the development of models capable of executing more efficient and powerful attacks on deep neural networks. We further introduce an attack method that approximates the size of perturbations from the outside and propose the Constraint Relaxation (CR) attack method. Based on experiments with 105 robust models, our approach demonstrates superiority over AutoAttack in terms of attack success rate, achieving a significant acceleration of 38.3 times in forward propagation and 15.9 times in backward propagation. Additionally, our ablation experiments highlight the significant effectiveness of the constraint relaxation method.

Efficient Robustness Evaluation via Constraint Relaxation

Computer-aided design (CAD) significantly enhances the efficiency, accuracy, and innovation of design processes by enabling precise 2D and 3D modeling, extensive analysis, and optimization. These models encapsulate and facilitate the modification of detailed parameters in construction processes, which is essential for creating precise 3D shapes. Existing methods for creating CAD models rely on latent vectors or point clouds, which are difficult to obtain and not cost-effective. Recent advances in Multimodal Large Language Models (MLLMs) have inspired researchers to use natural language instructions and images for CAD model construction. However, these models still struggle with inferring 3D spatial location and orientation, leading to inaccuracies in determining the spatial 3D starting points and extrusion directions for constructing geometries. This work introduces CAD-GPT, a CAD synthesis method with spatial reasoning-enhanced MLLM. In this method, we propose a 3D Modeling Spatial Mechanism for accurately inferring three types of spatial information: 1) 3D global location, thereby ensuring the correct spatial position of each 3D shape; 2) 3D sketch plane angles, which enables accurate orientation of the 2D sketch plane each time it is constructed; 3) 2D sketch location translation, through which the precision of the size and shape of the 2D sketches is guaranteed. Extensive experiments demonstrate that CAD-GPT consistently outperforms existing state-of-the-art methods in CAD model synthesis, both quantitatively and qualitatively.

CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs

While Neural Radiance Fields (NeRFs) have advanced the frontiers of novel view synthesis (NVS) using LiDAR data, they still struggle in dynamic scenes. Due to the low frequency and sparsity characteristics of LiDAR point clouds, it is challenging to spontaneously learn a dynamic and consistent scene representation from posed scans. In this paper, we propose STGC-NeRF, a novel LiDAR NeRF method that combines spatial-temporal geometry consistency to enhance the reconstruction of dynamic scenes. First, we propose a temporal geometry consistency regularization to enhance the regression of time-varying scene geometries from low-frequency LiDAR sequences. By estimating the pointwise correspondences between synthetic (or real) and real frames at different times, we convert them into various forms of temporal supervision. This alleviates the inconsistency caused by moving objects in dynamic scenes. Second, to improve the reconstruction of sparse LiDAR data, we propose spatial geometric consistency constraints. By computing multiple neighborhood feature descriptors incorporating geometric and contextual information, we capture structural geometry information from sparse LiDAR data. This helps encourage consistent direction, smoothness, and detail of the local surface. Extensive experiments on the KITTI-360 and nuScenes datasets demonstrate that STGC-NeRF outperforms state-of-the-art methods in both geometry and intensity accuracy for dynamic LiDAR scene reconstruction.

STGC-NeRF: Spatial-Temporal Geometric Consistency for LiDAR Neural Radiance Fields in Dynamic Scenes

As AI-based decision-makers increasingly influence decisions that affect humans, it is crucial to ensure their decisions are fair and unbiased. Most algorithms for fair decision-making provide probabilistic guarantees of fairness over the long run, not providing any guarantees at specific intervals, such as yearly or quarterly. In this paper, we introduce a novel neurosymbolic approach to guarantee fairness in every finite run through the use of a symbolic runtime enforcer called a *fairness shield*. The fairness shield monitors and minimally intervenes in the decision-maker’s decisions to ensure that fairness criteria are met either within a bounded horizon or periodically, while also minimizing the costs associated with such interventions as specified by a given cost function. Given a distribution over future decisions and their costs, we present algorithms to compute fairness shields by solving a bounded-horizon optimal control problem. We present synthesis algorithms for four types of fairness shields, each tailored to different operational settings. Our empirical evaluation demonstrates the effectiveness of these shields in ensuring fairness while maintaining cost efficiency across various scenarios.

Fairness Shields: Safeguarding against Biased Decision Makers

Despite the widespread use of LLMs due to their superior performance in various tasks, their high computational costs often lead potential users to opt for the pretraining-finetuning pipeline. However, biases prevalent in manually constructed datasets can introduce spurious correlations between tokens and labels, creating the so-called *shortcuts* and hindering the generalizability of fine-tuned models. Existing debiasing methods often rely on prior knowledge of specific dataset biases, which is challenging to acquire a priori. We propose RAZOR (Rewriting And Zero-bias Optimization Refinement), a novel, unsupervised, and data-focused debiasing approach based on text rewriting for shortcut mitigation. RAZOR leverages LLMs to iteratively rewrite potentially biased text segments by replacing them with heuristically selected alternatives in a shortcut space defined by token statistics and positional information. This process aims to align surface-level text features more closely with diverse label distributions, thereby promoting the learning of genuine linguistic patterns. Compared with unsupervised SoTA models, RAZOR improves by $3.5$% on the FEVER and $6.5$% on MNLI and SNLI datasets according to the F1 score. Additionally, RAZOR effectively mitigates specific known biases, reducing bias-related terms by $\times 2$ without requiring prior bias information, a result that is on par with SoTA models that leverage prior information. Our work prioritizes data manipulation over architectural modifications, emphasizing the pivotal role of data quality in enhancing model performance and fairness. This research contributes to developing more robust evaluation benchmarks for debiasing methods by incorporating metrics for bias reduction and overall model efficacy.
