Thailand

Video captioning combines video understanding and language generation. Unlike image captioning, which describes a static image with details of almost every object, video captioning usually considers a sequence of frames and is biased towards focused objects, e.g., the objects that stay in focus regardless of the changing background. Therefore, detecting and properly accommodating focused objects is critical in video captioning. To enforce the description of focused objects and achieve controllable video captioning, we propose an Object-Oriented Non-Autoregressive approach (O2NA), which performs caption generation in three steps: 1) identify the focused objects and predict their locations in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a draft caption; and 3) combine video information to refine the draft caption into a fluent final caption. Since the focused objects are generated and located ahead of other words, it is difficult to apply the word-by-word autoregressive generation process; instead, we adopt a non-autoregressive approach. Experiments on two benchmark datasets, MSR-VTT and MSVD, demonstrate the effectiveness of O2NA, which achieves results competitive with the state of the art while offering both higher diversity and higher inference speed.

ACL-IJCNLP 2021

O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning

**Welcome to ACL-IJCNLP 2021!**

The conference is jointly organized by the Association for Computational Linguistics (ACL) and the Asian Federation of Natural Language Processing (AFNLP).

As in previous years, the program of the conference includes a poster session, tutorials, workshops and demonstrations in addition to the main conference.


We were able to keep the registration fees similar to those charged for the virtual ACL 2020. A single fee covers attendance at the main conference and any or all tutorials and workshops. The fees are $125 Regular Early and $175 Regular Late; $50 Student Early and $75 Student Late. Early registration closes at midnight on July 11, 2021 (Eastern Daylight Time).

**Reminder:** It is ACL’s policy that at least one author of each accepted paper (including Findings of ACL papers) must register for the conference.

**Reminder 2:** The Underline site will open closer to the event. If you have already registered, you will receive access details.

Registration is now open



Multi-intent SLU can handle multiple intents in an utterance and has attracted increasing attention. However, state-of-the-art joint models rely heavily on autoregressive approaches, resulting in two issues: slow inference speed and information leakage. In this paper, we explore a non-autoregressive model for joint multiple intent detection and slot filling that is both faster and more accurate. Specifically, we propose a Global-Locally Graph Interaction Network (GL-GIN), in which a local slot-aware graph interaction layer models slot dependency to alleviate the uncoordinated slots problem, while a global intent-slot graph interaction layer models the interaction between multiple intents and all slots in the utterance. Experimental results on two public datasets show that our framework achieves state-of-the-art performance while being 11.5 times faster.

GL-GIN: Fast and Accurate Non-Autoregressive Model for Joint Multiple Intent Detection and Slot Filling

The introduction of pretrained language models has reduced many complex task-specific NLP models to simple lightweight layers.
An exception to this trend is coreference resolution, where a sophisticated task-specific model is appended to a pretrained transformer encoder.
While highly effective, the model has a very large memory footprint -- primarily due to dynamically-constructed span and span-pair representations -- which hinders the processing of complete documents and the ability to train on multiple instances in a single batch.
We introduce a lightweight end-to-end coreference model that removes the dependency on span representations, handcrafted features, and heuristics.
Our model performs competitively with the current standard model, while being simpler and more efficient.

Coreference Resolution without Span Representations

Both performance and efficiency are crucial factors for sequence labeling tasks in many real-world scenarios. Although pre-trained models (PTMs) have significantly improved the performance of various sequence labeling tasks, their computational cost is high. To alleviate this problem, we extend the recently successful early-exit mechanism to accelerate the inference of PTMs for sequence labeling tasks.
However, existing early-exit mechanisms are specifically designed for sequence-level tasks rather than sequence labeling. In this paper, we first propose a simple extension of sentence-level early-exit for sequence labeling tasks. To further reduce the computational cost, we also propose a token-level early-exit mechanism that allows some tokens to exit early at different layers. Considering the local dependency inherent in sequence labeling, we employ a window-based criterion to decide whether or not a token should exit. Token-level early-exit introduces a gap between training and inference, so we add an extra self-sampling fine-tuning stage to alleviate it.
Extensive experiments on three popular sequence labeling tasks show that our approach can save 66% to 75% of the inference cost with minimal performance degradation. Compared with competitive compressed models such as DistilBERT, our approach achieves better performance under the same speed-up ratios of 2×, 3×, and 4×.
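
A window-based exit check of the kind described above might look like the following minimal sketch. It assumes that per-token uncertainty is measured by the entropy of the label distribution and that a token may exit only when the largest entropy within its window falls below a threshold; the function names, the window size, and the threshold are illustrative choices, not details taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one token's label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def tokens_to_exit(label_probs, window=1, threshold=0.3):
    """Return one boolean per token: True if the token may exit at this layer.

    Assumption (hypothetical criterion): a token exits only if the highest
    entropy among itself and its neighbours within `window` positions is
    below `threshold`.
    """
    ent = [entropy(p) for p in label_probs]
    n = len(ent)
    decisions = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        decisions.append(max(ent[lo:hi]) < threshold)
    return decisions


# Toy label distributions for four tokens at some intermediate layer.
probs = [
    [0.98, 0.01, 0.01],    # confident
    [0.97, 0.02, 0.01],    # confident, but adjacent to an uncertain token
    [0.40, 0.35, 0.25],    # uncertain
    [0.99, 0.005, 0.005],  # confident, but adjacent to an uncertain token
]
print(tokens_to_exit(probs, window=1, threshold=0.3))
```

With `window=0` the check reduces to a purely per-token decision, which ignores the local dependency between neighbouring labels that motivates the window in the first place.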

Accelerating BERT Inference for Sequence Labeling via Early-Exit

Although existing Named Entity Recognition (NER) models have achieved promising performance, they suffer from certain drawbacks. Sequence labeling-based NER models do not perform well in recognizing long entities because they focus only on word-level information, while segment-based NER models, which process segments instead of single words, are unable to capture the word-level dependencies within a segment. Moreover, as boundary detection and type prediction may cooperate with each other in the NER task, it is also important for the two sub-tasks to mutually reinforce each other by sharing their information. In this paper, we propose a novel Modularized Interaction Network (MIN) model that utilizes both segment-level information and word-level dependencies, and incorporates an interaction mechanism to support information sharing between boundary detection and type prediction, enhancing performance on the NER task. We conduct extensive experiments on three NER benchmark datasets. The results show that the proposed MIN model outperforms the current state-of-the-art models.

Modularized Interaction Network for Named Entity Recognition

Many joint entity relation extraction models set up two separate label spaces for the two sub-tasks (i.e., entity detection and relation classification). We argue that this setting may hinder the information interaction between entities and relations. In this work, we propose to eliminate the different treatment of the two sub-tasks' label spaces. The input of our model is a table containing all word pairs from a sentence. Entities and relations are represented by squares and rectangles in the table. We apply a unified classifier to predict each cell's label, which unifies the learning of the two sub-tasks. For testing, an effective (yet fast) approximate decoder is proposed for finding squares and rectangles in tables. Experiments on three benchmarks (ACE04, ACE05, SciERC) show that, using only half the number of parameters, our model achieves accuracy competitive with the best extractor, and is faster.
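
To make the table representation concrete, the sketch below fills a word-pair table in which each entity becomes a square on the diagonal and each relation becomes a rectangle spanning the rows of the head entity and the columns of the tail entity. The function name, the tuple layout, the `"O"` null label, and the example sentence are assumptions made for illustration; the sketch covers only the construction of the target table, not the classifier or the decoder.

```python
def build_label_table(n_words, entities, relations, null_label="O"):
    """Build an n_words x n_words table of cell labels.

    entities:  list of (start, end, type) with inclusive word indices.
    relations: list of (head_entity_index, tail_entity_index, type).
    Both layouts are hypothetical conventions chosen for this sketch.
    """
    table = [[null_label] * n_words for _ in range(n_words)]

    # Each entity span becomes a square block on the diagonal.
    for start, end, etype in entities:
        for i in range(start, end + 1):
            for j in range(start, end + 1):
                table[i][j] = etype

    # Each relation becomes a rectangle: head rows by tail columns.
    for head, tail, rtype in relations:
        hs, he, _ = entities[head]
        ts, te, _ = entities[tail]
        for i in range(hs, he + 1):
            for j in range(ts, te + 1):
                table[i][j] = rtype

    return table


words = ["David", "Perkins", "lives", "in", "California"]
entities = [(0, 1, "PER"), (4, 4, "GPE")]
relations = [(0, 1, "PHYS")]  # entity 0 (head) -> entity 1 (tail)

table = build_label_table(len(words), entities, relations)
for word, row in zip(words, table):
    print(f"{word:<12}", row)
```

Because entity and relation labels share the same cells, a single classifier over one label set can be trained on every cell of such a table.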

UniRE: A Unified Label Space for Entity Relation Extraction

Continual learning has gained increasing attention in recent years, thanks to its biological interpretation and efficiency in many real-world applications. As a typical continual learning task, continual relation extraction (CRE) aims to extract relations between entities from texts, where the samples of different relations are delivered to the model continuously. Previous work has shown that storing typical samples of old relations in memory can help the model keep a stable understanding of old relations and avoid forgetting them. However, most methods depend heavily on the memory size because they simply replay these memorized samples in subsequent tasks. To fully utilize the memorized samples, in this paper we employ relation prototypes to extract useful information about each relation. Specifically, the prototype embedding for a specific relation is computed from the memorized samples of this relation, which are selected by the K-means algorithm. The prototypes of all relations observed at the current learning stage are used to re-initialize a memory network that refines subsequent sample embeddings, which ensures the model's stable understanding of all observed relations when learning a new task. Compared with previous CRE models, our model utilizes the memory information sufficiently and efficiently, resulting in enhanced CRE performance. Our experiments show that the proposed model outperforms the state-of-the-art CRE models and has a clear advantage in avoiding catastrophic forgetting. The code and datasets are released at https://github.com/fd2014cl/RP-CRE.
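
One simple way to compute a prototype from memorized samples is to average their embeddings, as in the sketch below. Averaging is an assumption here, since the abstract only says the prototype is computed from the memorized samples; the K-means selection step is not shown, and the `memory` dictionary is assumed to already hold the selected sample embeddings for each relation. All names and values are illustrative.

```python
def relation_prototypes(memory):
    """Map each relation to the element-wise mean of its memorized embeddings.

    memory: dict of relation name -> non-empty list of equal-length vectors.
    Averaging is an assumed choice, not a detail confirmed by the abstract.
    """
    prototypes = {}
    for relation, embeddings in memory.items():
        dim = len(embeddings[0])
        count = len(embeddings)
        prototypes[relation] = [
            sum(vec[d] for vec in embeddings) / count for d in range(dim)
        ]
    return prototypes


# Toy 2-dimensional embeddings standing in for encoder outputs.
memory = {
    "founded_by": [[1.0, 0.0], [3.0, 2.0]],
    "born_in": [[0.0, 4.0], [0.0, 2.0], [3.0, 0.0]],
}
protos = relation_prototypes(memory)
print(protos)
```

Because each prototype summarizes all memorized samples of a relation in one vector, it can be recomputed cheaply whenever the memory is updated with a new task.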

Refining Sample Embeddings with Relation Prototypes to Enhance Continual Relation Extraction

Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. This leaves researchers to collect new small-scale datasets, resulting in inconsistencies across published works. In this study, we present CoDesc, a large parallel dataset composed of 4.2 million Java methods and natural language descriptions. With extensive analysis, we identify and remove prevailing noise patterns from the dataset. We demonstrate the proficiency of CoDesc in two complementary tasks for code–description pairs: code summarization and code search. We show that the dataset helps improve code search by up to 22% and achieves a new state of the art in code summarization. Furthermore, we show CoDesc's effectiveness in a pre-training and fine-tuning setup, opening possibilities for building pretrained language models for Java. To facilitate future research, we release the dataset, a data processing tool, and a benchmark at https://github.com/csebuetnlp/CoDesc.

CoDesc: A Large Code–Description Parallel Dataset

Since traditional tokenizers are isolated from the downstream task and model, they cannot output a tokenization suited to that task and model, although recent studies suggest that appropriate tokenization improves performance.
In this paper, we propose a novel method to find an appropriate tokenization for a given downstream model by jointly optimizing a tokenizer and the model.
The proposed method has no restriction except for using loss values computed by the downstream model to train the tokenizer, and thus, we can apply the proposed method to any NLP task.
Moreover, the proposed method can be used to explore the appropriate tokenization for an already trained model as post-processing.
Therefore, the proposed method is applicable to various situations.
We evaluated whether our method contributes to improving performance on text classification in three languages and machine translation in eight language pairs.
Experimental results show that our proposed method improves the performance by determining appropriate tokenizations.

Joint Optimization of Tokenization and Downstream Model

The attention layer has become a prevalent component for improving the effectiveness of neural network models on NLP tasks. Why attention is effective, and how interpretable it is, has attracted widespread discussion. Current studies mostly investigate the effect of the attention mechanism based on the attention distribution it generates within a single neural network structure. However, they do not consider the changes in the semantic capability of different model components caused by the attention mechanism, which can vary across network structures. In this paper, we propose a comprehensive analytical framework that exploits a convex hull representation of sequence semantics in an n-dimensional Semantic Euclidean Space and defines a series of indicators to capture the impact of attention on sequence semantics. Through a series of experiments on various NLP tasks and three representative recurrent units, we analyze why and how attention benefits the semantic capacity of different types of recurrent neural networks based on the indicators defined in the proposed framework.

How does Attention Affect the Model?

Recently, chest X-ray report generation, which aims to automatically generate descriptions of given chest X-ray images, has received growing research interest. The key challenge of chest X-ray report generation is to accurately capture and describe the abnormal regions. In most cases, the normal regions dominate the entire chest X-ray image, and the corresponding descriptions of these normal regions dominate the final report. Due to such data bias, learning-based models may fail to attend to abnormal regions. In this work, to effectively capture and describe abnormal regions, we propose the Contrastive Attention (CA) model. Instead of focusing solely on the current input image, the CA model compares the current input image with normal images to distill the contrastive information. The acquired contrastive information can better represent the visual features of abnormal regions. According to experiments on the public IU-X-ray and MIMIC-CXR datasets, incorporating our CA into several existing models can boost their performance across most metrics. In addition, our analysis shows that the CA model can help existing models better attend to the abnormal regions and provide more accurate descriptions, which are crucial for an interpretable diagnosis. In particular, we achieve state-of-the-art results on the two public datasets.
