This paper studies the spatio-temporal video grounding task, which aims to localize a spatio-temporal tube in an untrimmed video based on the given text description of an event. Existing one-stage approaches suffer from insufficient space-time interaction in two aspects: i) less precise prediction of event temporal boundaries, and ii) inconsistency in object prediction for the same event across adjacent frames. To address these issues, we propose a framework of Comprehensive Space-Time entAnglement (CoSTA) to densely entangle space-time multi-modal features for spatio-temporal localization. Specifically, we propose a space-time collaborative encoder to extract comprehensive video features and leverage Transformer to perform spatio-temporal multi-modal understanding. Our entangled decoder couples temporal boundary prediction and spatial localization via an entangled query, boasting an enhanced ability to capture object-event relationships. We conduct extensive experiments on the challenging benchmarks of HC-STVG and VidSTG, where CoSTA outperforms existing state-of-the-art methods, demonstrating its effectiveness for this task.

CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding

In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects from image simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a $1$D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction. Each pair of dual queries are designed to have shared positional parts but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single query design) and empowers Transformer decoder to leverage phrase mask-guided attention to improve the performance. To evaluate the performance of PEG, we also propose a new metric CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves $91.04\%$ and $83.51\%$ in terms of recall rate on RefCOCO testA and testB with a ResNet-101 backbone. We will release the code after the blind review.

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Main Track - Computer Vision

technical paper

We are pleased to announce the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24), which will be held in Vancouver, British Columbia at the Vancouver Convention Centre – West Building from 20-27 February, 2024.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-24 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

We expect for AAAI-24 to be an in-person conference – one author of all accepted papers will be expected to present work in person unless there are exceptional circumstances that prevent this.

In order to access the AAAI-24 event page you need to register [here](https://aaai.org/aaai-conference/registration/)

AAAI 2024

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. 

Posters & Demos III

poster

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-23 is the Thirty-Seventh AAAI Conference on Artificial Intelligence. The theme of this conference is to create collaborative bridges within and beyond AI. Like previous AAAI conferences, AAAI-23 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and two new activities: a Bridge Program and a Lab Program. Many of these activities are tailored to the theme of bridges and all are selected according to the highest standards, with additional programs for students and young researchers. 
AAAI is providing you with a conference planner, which you can use to help organize your itinerary of activities. This includes talks to attend in person, talks to attend remotely, breaks with colleagues and your site seeing activities. To access this conference planner, please go to [https://aaai-2023.takemobi.io](https://aaai-2023.takemobi.io).

In order to access this site, you need to register. If you haven't already, please register [here](https://aaai.org/Conferences/AAAI-23/registration/).


AAAI 2023

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines.

Yaoyuan Liang

2

SHORT BIO

Presentations

CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES