United States

Although large-scale video-language pre-training models, which usually build a global alignment between the video and the text, have achieved remarkable progress on various downstream tasks, the idea of adopting fine-grained information during the pre-training stage is not well explored. In this work, we propose STOA-VLP, a pre-training framework that jointly models object and action information across spatial and temporal dimensions. More specifically, the model regards object trajectories across frames and multiple action features from the video as fine-grained features. In addition, we design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model. The first is the dynamic object-text alignment task, which builds a better connection between object trajectories and the relevant noun tokens. The second is the spatial-temporal action set prediction task, which guides the model to generate consistent action features by predicting actions found in the text. Extensive experiments on three downstream tasks (video captioning, text-video retrieval, and video question answering) demonstrate the effectiveness of our proposed STOA-VLP (e.g., a 3.7 Rouge-L improvement on the MSR-VTT video captioning benchmark and a 2.9% accuracy improvement on the MSVD video question answering benchmark, compared to previous approaches).

AAAI 2023

STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

video-language

multi-modality

pre-training

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-23 is the Thirty-Seventh AAAI Conference on Artificial Intelligence. The theme of this conference is to create collaborative bridges within and beyond AI. Like previous AAAI conferences, AAAI-23 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and two new activities: a Bridge Program and a Lab Program. Many of these activities are tailored to the theme of bridges and all are selected according to the highest standards, with additional programs for students and young researchers. 
AAAI is providing you with a conference planner, which you can use to help organize your itinerary of activities. This includes talks to attend in person, talks to attend remotely, breaks with colleagues, and your sightseeing activities. To access this conference planner, please go to [https://aaai-2023.takemobi.io](https://aaai-2023.takemobi.io).

In order to access this site, you need to register. If you haven't already, please register [here](https://aaai.org/Conferences/AAAI-23/registration/).



poster

Recent research has shown that integrating domain knowledge in deep learning architectures is effective -- it helps reduce the amount of required data, supports reasoning over complex tasks, thus improving the quality of the models' output, and improves the interpretability of models. However, the research community is missing a convened benchmark for systematically evaluating knowledge integration methods.
In this work, we create a collection of benchmarks that includes ten tasks in the domains of natural language processing and computer vision. In all cases, we model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints.
We report the results of these models using a new set of extended evaluation criteria in addition to the task performances for a more in-depth analysis. This effort provides a framework for a more comprehensive and systematic comparison of constraint integration techniques and for identifying the related research challenges. It will facilitate further research for alleviating some problems of state-of-the-art neural models.

GLUECons: A Generic Benchmark for Learning Under Constraints

Self-supervised monocular depth estimation has been widely studied recently. Most of the work has focused on improving performance on benchmark datasets, such as KITTI, but has offered few experiments on generalization performance. In this paper, we investigate how backbone networks (e.g., CNNs, Transformers, and CNN-Transformer hybrid models) affect the generalization of monocular depth estimation. We first evaluate state-of-the-art models on diverse public datasets that were never seen during network training. Next, we investigate the effects of texture-biased and shape-biased representations using the various texture-shifted datasets that we generated. We observe that Transformers exhibit a strong shape bias, whereas CNNs exhibit a strong texture bias. We also find that shape-biased models show better generalization performance for monocular depth estimation compared to texture-biased models. Based on these observations, we design a new CNN-Transformer hybrid network with a multi-level adaptive feature fusion module, called MonoFormer. The design intuition behind MonoFormer is to increase shape bias by employing Transformers while compensating for the weak locality bias of Transformers by adaptively fusing multi-level representations. Extensive experiments show that the proposed method achieves state-of-the-art performance on various public datasets. Our method also shows the best generalization ability among the competitive methods.

Deep Digging into the Generalization of Self-supervised Monocular Depth Estimation

Many deep spatio-temporal learning methods have been proposed for crowd flow modeling in recent years. However, most of them focus on designing a spatial and temporal convolution mechanism to aggregate information from nearby nodes and historical observations for a pre-defined prediction task. In contrast to existing research, this paper aims to provide a generic and dynamic representation learning method for crowd flow modeling. The main idea of our method is to maintain a continuous-time representation for each node and to update the representations of all nodes continuously according to the streaming observed data. Along this line, a particular encoder-decoder architecture is proposed, where the encoder converts newly occurring transactions into a timestamped message, and the representations of related nodes are then updated according to the generated message. The role of the decoder is to guide the representation learning process by reconstructing the observed transactions based on the most recent node representations. Moreover, a number of virtual nodes are added to discover macro-level spatial patterns and to share representations among spatially interacting stations. Experiments have been conducted on two real-world datasets for four popular prediction tasks in crowd flow modeling. The results demonstrate that our method achieves better prediction performance than baseline methods on all tasks.

Generic and Dynamic Graph Representation Learning for Crowd Flow Modeling

Spreadsheets are an important and unique type of business document for data storage, analysis, and presentation. What distinguishes spreadsheets from most other types of digital documents is that they give users high flexibility in organizing data on the grid. Existing related techniques mainly focus on tabular data and cannot understand the entire sheet. On the one hand, spreadsheets have no explicit separation between tabular data and other information, leaving a gap for the deployment of such techniques. On the other hand, pervasive data dependence and semantic relations across the sheet require comprehensive modeling of all the information rather than only the tables. In this paper, we propose SheetPT, the first pre-training technique on spreadsheets to enable effective representation learning under this scenario. For computational effectiveness and efficiency, we propose the coherent chunk, an intermediate semantic unit of sheet structure, and we accordingly devise a hierarchical attention-based architecture to capture contextual information across different structural granularities. Three pre-training objectives are also designed to ensure sufficient training on millions of spreadsheets. Two representative downstream tasks, formula prediction and sheet structure recognition, are used to evaluate its capability, and the strong results show its superiority over existing state-of-the-art methods.

SheetPT: Spreadsheet Pre-training Based on Hierarchical Attention Network

Classic option pricing models, such as the Black-Scholes formula, often depend on some rigid assumptions on the dynamics of the underlying asset prices. These assumptions are inevitably violated in practice and thus induce model risk. To mitigate this, robust option pricing that only requires the no-arbitrage principle has attracted a great deal of attention among researchers. In this paper, we give new robust upper bounds for option prices based on a novel $\eta$-momentum trading strategy. Our bounds for European options are tighter for most common moneyness, volatility, and expiration date setups than those presented in [DKM16]. Our bounds for average strike Asian options are the first closed-form robust upper bounds for those options. Numerical simulations demonstrate that our bounds significantly outperform the benchmarks for both European and Asian options.

Tighter Robust Upper Bounds for Options via No-Regret Learning
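
For readers unfamiliar with the model-based baseline that the abstract above contrasts against, the following is a minimal sketch of the classical Black-Scholes price of a European call, set next to the trivial model-free bound that a call can never be worth more than the underlying asset. It does not implement the paper's $\eta$-momentum bound, whose form is not given in the abstract; the function names and the example parameters are illustrative assumptions.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot: float, strike: float, rate: float,
                       vol: float, maturity: float) -> float:
    """Black-Scholes price of a European call.

    Assumes constant volatility and interest rate (the rigid assumptions
    the abstract refers to) and no dividends.
    """
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def trivial_upper_bound(spot: float) -> float:
    """Model-free no-arbitrage bound: a call is worth at most the asset itself."""
    return spot


# Hypothetical at-the-money example: one year to expiry, 20% volatility.
price = black_scholes_call(spot=100.0, strike=100.0, rate=0.05, vol=0.2, maturity=1.0)
print(round(price, 4), "<=", trivial_upper_bound(100.0))
```

The gap between the model price and the trivial bound is the room in which robust upper bounds of the kind described in the abstract operate.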

Inspired by the recent success of sequence modeling in RL and the use of masked language models for pre-training, we propose a masked model for pre-training in RL, RePreM (Representation Pre-training with Masked Model), which trains the encoder combined with transformer blocks to predict the masked states or actions in a trajectory. RePreM is simple but effective compared to existing representation pre-training methods in RL. It avoids algorithmic sophistication (such as data augmentation or estimating multiple models) with sequence modeling and generates a representation that captures long-term dynamics well. Empirically, we demonstrate the effectiveness of RePreM in various tasks, including dynamic prediction, transfer learning, and sample-efficient RL with both value-based and actor-critic methods. Moreover, we show that RePreM scales well with dataset size, dataset quality, and the scale of the encoder, which indicates its potential towards big RL models.

RePreM: Representation Pre-training with Masked Model for Reinforcement Learning
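
The masking step described in the abstract above can be illustrated with a small sketch. It builds the input and targets for a masked-prediction objective over a trajectory and nothing more; the encoder and transformer blocks are omitted. The interleaved state/action token layout, the `mask_ratio` value, and all names are assumptions made for illustration, not details taken from the paper.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token


def mask_trajectory(states, actions, mask_ratio=0.25, seed=0):
    """Interleave states and actions, then hide a random subset of tokens.

    Returns the masked token sequence and a dict mapping each masked
    position to the original value a model would be trained to predict.
    """
    rng = random.Random(seed)
    tokens = []
    for state, action in zip(states, actions):
        tokens.append(("state", state))
        tokens.append(("action", action))

    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))

    masked = list(tokens)
    targets = {}
    for pos in positions:
        kind, value = tokens[pos]
        targets[pos] = value          # prediction target
        masked[pos] = (kind, MASK)    # model input sees only the mask
    return masked, targets


# Toy trajectory with four steps.
masked, targets = mask_trajectory(
    states=["s0", "s1", "s2", "s3"],
    actions=["a0", "a1", "a2", "a3"],
)
print(masked)
print(targets)
```

A sequence model would then be trained to recover `targets` from `masked`, which is the sense in which the representation must capture dynamics across the trajectory.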

Although k-means clustering has been widely studied due to its simplicity, k-means methods still have two fatal drawbacks. First, they need to initialize the cluster centers, which causes unstable clustering performance. Second, they perform poorly on non-Gaussian datasets. Inspired by the affinity matrix, we propose a novel multi-view k-means based on the adjacency matrix. It maps the affinity matrix to the distance matrix according to the principle that every sample has a small distance from the points in its neighborhood and a large distance from the points outside of the neighborhood. Moreover, this method exploits the complementary information embedded in different views by minimizing the tensor Schatten p-norm regularizer on the third-order tensor that consists of the cluster assignment matrices of different views. Additionally, this method avoids initializing cluster centroids and thus obtains stable performance. There is also no need to compute the means of clusters, so our model is not sensitive to outliers. An experiment on a toy dataset shows excellent performance on non-Gaussian data, and further experiments on several benchmark datasets demonstrate the superiority of our proposed method.

Centerless Multi-View K-means Based on The Adjacency Matrix
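
The principle stated in the abstract above (small distance to points inside a sample's neighborhood, large distance to points outside it) can be sketched for a single view as follows. The binary k-nearest-neighbor adjacency and the mapping `distance = 1 - adjacency` are simplifying assumptions chosen for illustration; the paper's actual mapping and its multi-view tensor regularizer are not reproduced here.

```python
def knn_adjacency(points, k):
    """Binary k-nearest-neighbor adjacency; every point is adjacent to itself."""
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1
        others = [j for j in range(n) if j != i]
        # Sort the other points by squared Euclidean distance to point i.
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
        for j in others[:k]:
            adj[i][j] = 1
    return adj


def adjacency_to_distance(adj):
    """Map adjacency to distance: 0 inside the neighborhood, 1 outside it."""
    return [[1 - a for a in row] for row in adj]


# Two well-separated pairs of one-dimensional points.
points = [(0.0,), (0.1,), (5.0,), (5.1,)]
dist = adjacency_to_distance(knn_adjacency(points, k=1))
for row in dist:
    print(row)
```

Because the resulting distances depend only on neighborhood membership, no cluster means are needed to compare samples, which is consistent with the centerless design the abstract describes.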

Biomedical entity linking (EL) is the task of linking mentions in a biomedical document to corresponding entities in a knowledge base (KB). The challenge in biomedical EL lies in leveraging mention context to select the most appropriate entity among possible candidates. Although some EL models achieve competitive results by retrieving candidate entities and then exploiting context to re-rank them, these re-ranking models concatenate mention context with one candidate at a time. They lack fine-grained interaction among candidates and may fail on ambiguous mentions when several candidates have high lexical similarity. We address this issue with a re-ranking model based on prompt tuning, which represents the mention context and all candidates at once, letting the candidates under comparison attend to each other. We also propose a KB-enhanced self-supervised pretraining strategy. Instead of the large-scale pretraining on biomedical EL data used in previous work, we use masked language modeling with synonyms from the KB. Our method achieves state-of-the-art results on 3 biomedical EL datasets: NCBI disease, BC5CDR, and COMETA, showing the effectiveness of cross-entity interaction and the KB-enhanced pretraining strategy.

Improving Biomedical Entity Linking with Cross-Entity Interaction

Non-autoregressive neural machine translation (NAT) models are proposed to accelerate the inference process while maintaining relatively high performance. However, existing NAT models struggle to achieve the desired efficiency-quality trade-off. On the one hand, fully NAT models with efficient inference perform worse than their autoregressive counterparts. On the other hand, iterative NAT models can achieve comparable performance, but at the cost of a diminished speed advantage. In this paper, we propose RenewNAT, a flexible framework with high efficiency and effectiveness, to incorporate the merits of fully and iterative NAT models. RenewNAT first generates the potential translation results and then renews them in a single pass. It can achieve significant performance improvements at the same expense as traditional NAT models (without introducing additional model parameters and decoding latency). Experimental results on various translation benchmarks (e.g., 4 WMT) show that our framework consistently improves the performance of strong fully NAT methods (e.g., GLAT and DSLP) without additional speed overhead.

RenewNAT: Renewing Potential Translation for Non-Autoregressive Transformer

Lottery tickets (LTs) are able to discover accurate and sparse subnetworks that can be trained in isolation to match the performance of dense networks. Ensembling, in parallel, is one of the oldest time-proven tricks in machine learning to improve performance by combining the output of multiple independent models. However, the benefits of ensembling in the context of LTs are diluted, since an ensemble does not directly lead to stronger sparse subnetworks but instead leverages their predictions for a better decision. In this work, we first observe that directly averaging the weights of adjacent learned subnetworks significantly boosts the performance of LTs. Encouraged by this observation, we further propose an alternative way to perform an "ensemble" over the subnetworks identified by iterative magnitude pruning via a simple interpolating strategy. We call our method "Lottery Pools". In contrast to the naive ensemble, which brings no performance gains to each single subnetwork, Lottery Pools yields much stronger sparse subnetworks than the original LTs without requiring any extra training or inference cost. Across various modern architectures on CIFAR-10/100 and ImageNet, we show that our method achieves significant performance gains in both in-distribution and out-of-distribution scenarios. Impressively, evaluated with VGG-16 and ResNet-18, the produced sparse subnetworks outperform the original LTs by up to 1.88% on CIFAR-100 and 2.36% on CIFAR-100-C; the resulting dense network surpasses the pre-trained dense model by up to 2.22% on CIFAR-100 and 2.38% on CIFAR-100-C.
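
The weight-averaging idea in the abstract above can be sketched on flat weight vectors. The example interpolates a target subnetwork with an adjacent one and then re-applies the target's pruning mask so that sparsity is preserved. The re-masking step, the fixed coefficient `alpha`, and all names are illustrative assumptions; how the paper selects interpolation coefficients is not stated in the abstract.

```python
def interpolate_subnetworks(w_target, w_neighbor, mask_target, alpha=0.5):
    """Linearly interpolate two weight vectors, keeping the target's sparsity.

    w_target, w_neighbor: weights of two adjacent subnetworks (same length).
    mask_target: 0/1 pruning mask of the target subnetwork.
    alpha: interpolation coefficient (hypothetical fixed value).
    """
    return [
        m * (alpha * wt + (1.0 - alpha) * wn)
        for wt, wn, m in zip(w_target, w_neighbor, mask_target)
    ]


# Toy example: the neighbor is denser than the target.
w_sparse = [1.0, 0.0, 2.0, 0.0]
w_denser = [3.0, 4.0, 0.0, 0.0]
mask = [1, 0, 1, 0]

pooled = interpolate_subnetworks(w_sparse, w_denser, mask, alpha=0.5)
print(pooled)  # [2.0, 0.0, 1.0, 0.0]
```

The result is a single weight vector with the same sparsity pattern as the target, so no extra inference cost is incurred, unlike an ensemble that averages predictions from several models.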
