Image-guided object assembly is a burgeoning research topic in computer vision. This paper introduces a novel task: translating multi-view images of a structural 3D model (for example, one constructed with building blocks drawn from a 3D-object library) into a detailed sequence of assembly instructions executable by a robotic arm. Given multi-view images of the target 3D model to replicate, a model designed for this task must address several sub-tasks, including recognizing the individual components used in constructing the 3D model, estimating the geometric pose of each component, and deducing a feasible assembly order that adheres to physical rules. Establishing accurate 2D-3D correspondence between multi-view images and 3D objects is technically challenging. To tackle this, we propose an end-to-end model known as the Neural Assembler. This model learns an object graph in which each vertex represents a component recognized from the images and the edges specify the topology of the 3D model, enabling the derivation of an assembly plan. We establish benchmarks for this task and conduct comprehensive empirical evaluations of Neural Assembler and alternative solutions. In our experiments, Neural Assembler outperforms the alternative solutions.
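To make the step from an object graph to an assembly plan concrete, here is a minimal sketch. It is not the Neural Assembler itself. It assumes, purely for illustration, that each edge is a directed "supports" relation (the lower component must be placed before the upper one) and derives an order with a standard topological sort. The function name `assembly_order` and the component names are hypothetical.

```python
from collections import deque


def assembly_order(components, supports):
    """Return an order in which each component follows everything that supports it.

    components: list of hashable component ids (hypothetical names).
    supports:   list of (below, above) pairs; an assumed reading of the graph edges.
    """
    indegree = {c: 0 for c in components}
    children = {c: [] for c in components}
    for below, above in supports:
        children[below].append(above)
        indegree[above] += 1

    # Kahn's algorithm: repeatedly place components whose supports are all placed.
    ready = deque(c for c in components if indegree[c] == 0)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for nxt in children[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(components):
        raise ValueError("cyclic support relation: no feasible assembly order")
    return order


parts = ["base", "pillar_a", "pillar_b", "roof"]
edges = [("base", "pillar_a"), ("base", "pillar_b"),
         ("pillar_a", "roof"), ("pillar_b", "roof")]
print(assembly_order(parts, edges))
```

In this toy structure the base is placed first and the roof last. A real system would also have to recover the components, their poses and the edges from the images, which this sketch takes as given.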

Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images

Video question answering plays a vital role in computer vision, and recent advances in large language models have further propelled the development of this field. However, existing video question answering techniques are often limited in their grasp of fine-grained video content in the spatial dimension. This limitation mainly stems from the fixed, low-resolution input of video frames. While some approaches that use high-resolution inputs partially alleviate the problem, they introduce an excessive computational burden by encoding the entire high-resolution image. In this work, we propose a granularity-adaptive spatial evidence tokenization model for video question answering. Our method introduces multi-granular visual tokenization in the spatial dimension to produce video tokens at various granularities based on the question. It highlights spatially activated patches at low resolution through a granularity weighting module and then adaptively encodes these activated patches at high resolution to supplement detail. To mitigate the computational overhead associated with high-resolution frame encoding, a masking and sparsity acceleration module is developed for efficient visual tokenization. Moreover, a granularity compression module is designed to dynamically select and compress visual tokens of varying granularities based on the question. We conduct extensive experiments on 11 mainstream video question answering datasets, and the experimental results demonstrate the effectiveness of our proposed method. Code available at: https://code-website.wixsite.com/anonymous-web.

Granularity-Adaptive Spatial Evidence Tokenization for Video Question Answering

Over the past few years, research on vision-and-language navigation (VLN) has made tremendous progress. Many previous works attempted to improve performance from different aspects such as training strategy, data augmentation, and pre-training. This work focuses on a rarely explored aspect of VLN, namely how the trajectory is organized and encoded during navigation. Most existing state-of-the-art VLN models adopt a vanilla sequential strategy for encoding trajectories. Such a strategy takes the whole trajectory as a single sequence to estimate the current state, no matter whether the agent moved smoothly or made mistakes and backtracked in the past. We show that sequential encoding may largely lose this kind of fine-grained structure in the trajectory, which could hamper later state estimation and decision making. To solve this problem, this work proposes a novel tree-structured trajectory encoding strategy. The whole trajectory is organized as a tree rooted at the starting position and encoded using our Tree-Transformer module to fully extract the fine-grained historical information. In addition, as the spatial topology can be easily embedded in the trajectory tree, we further design a tree-based action space that allows the agent to make long-range error corrections in one decision. We implement the holistic agent based on a cross-modal transformer and train it with a newly proposed Tree-nDTW reward. On the benchmark dataset R2R, our model achieves a success rate (SR) of 68% on val-unseen and 66% on test. We further conduct extensive ablation studies and analyses to provide more insight into the effectiveness of our designs.
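The difference between a flat sequence and a tree-organized trajectory can be illustrated with a small data-structure sketch. This is not the paper's Tree-Transformer or its action space. It assumes only that a revisited viewpoint is treated as a backtrack, so the next new viewpoint becomes a sibling branch instead of extending one long sequence. The class `TrajectoryTree` and its method names are hypothetical.

```python
class TrajectoryTree:
    """Toy trajectory tree rooted at the starting viewpoint (illustrative only)."""

    def __init__(self, start):
        self.parent = {start: None}
        self.children = {start: []}
        self.current = start

    def move(self, viewpoint):
        """Record one step of the agent.

        A new viewpoint is attached as a child of the current node.
        A known viewpoint is treated as a backtrack: only the cursor moves.
        """
        if viewpoint not in self.parent:
            self.parent[viewpoint] = self.current
            self.children[viewpoint] = []
            self.children[self.current].append(viewpoint)
        self.current = viewpoint

    def path_from_root(self):
        """Return the branch from the start to the current node, without detours."""
        node, path = self.current, []
        while node is not None:
            path.append(node)
            node = self.parent[node]
        return path[::-1]

    def jump_targets(self):
        """All recorded nodes; an assumed stand-in for a tree-based action space."""
        return list(self.parent)


tree = TrajectoryTree("A")
for step in ["B", "C", "B", "D"]:  # the agent visits C, backtracks to B, then tries D
    tree.move(step)
print(tree.path_from_root())
print(tree.children["B"])
```

The visited sequence A, B, C, B, D is stored as a tree in which C and D are both children of B, so the mistaken detour to C stays visible as a separate branch instead of being mixed into the current path.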

Tree-Structured Trajectory Encoding for Vision-and-Language Navigation

Robotics 3

technical paper

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

AAAI 2025

Poster Session 2

poster

CV: Language and Vision 3

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-23 is the Thirty-Seventh AAAI Conference on Artificial Intelligence. The theme of this conference is to create collaborative bridges within and beyond AI. Like previous AAAI conferences, AAAI-23 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and two new activities: a Bridge Program and a Lab Program. Many of these activities are tailored to the theme of bridges and all are selected according to the highest standards, with additional programs for students and young researchers. 
AAAI is providing you with a conference planner, which you can use to help organize your itinerary of activities. This includes talks to attend in person, talks to attend remotely, breaks with colleagues, and your sightseeing activities. To access this conference planner, please go to [https://aaai-2023.takemobi.io](https://aaai-2023.takemobi.io).

In order to access this site, you need to register. If you haven't already, please register [here](https://aaai.org/Conferences/AAAI-23/registration/).


AAAI 2023

Yadong Mu

3

Presentations

Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images

Granularity-Adaptive Spatial Evidence Tokenization for Video Question Answering

Tree-Structured Trajectory Encoding for Vision-and-Language Navigation
