Thailand

Videos are more informative than images because they capture the dynamics of the scene. By representing motion in videos, we can capture dynamic activities. In this work, we introduce GPT-4 generated motion descriptions that capture fine-grained motion in activities and apply them to three action datasets. We evaluated several video-text models on the task of retrieving motion descriptions. We found that they fall far behind human expert performance on two action datasets, raising the question of whether video-text models understand motion in videos. To address this, we introduce a method of improving motion understanding in video-text models by utilizing motion descriptions. This method proves to be effective on two action datasets for the motion description retrieval task. The results draw attention to the need for quality captions involving fine-grained motion information in existing datasets and demonstrate the effectiveness of the proposed pipeline in understanding fine-grained motion during video-text retrieval.
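
The motion description retrieval task mentioned above can be illustrated with a minimal sketch. It assumes each video and each candidate description has already been mapped to an embedding vector by some video-text model; the function names, the toy vectors, and the choice of cosine similarity with recall@k are illustrative assumptions, not the authors' pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_descriptions(video_emb, desc_embs):
    """Return description indices sorted from most to least similar to the video."""
    scores = [cosine(video_emb, d) for d in desc_embs]
    return sorted(range(len(desc_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(video_embs, desc_embs, gold, k=1):
    """Fraction of videos whose gold description index appears in the top-k ranking."""
    hits = sum(
        1
        for v, g in zip(video_embs, gold)
        if g in rank_descriptions(v, desc_embs)[:k]
    )
    return hits / len(video_embs)


# Hypothetical toy embeddings: three videos and three candidate motion descriptions.
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
descriptions = [[0.9, 0.1], [0.1, 0.9], [1.0, -1.0]]
gold = [0, 1, 2]  # index of the correct description for each video

r1 = recall_at_k(videos, descriptions, gold, k=1)
r3 = recall_at_k(videos, descriptions, gold, k=3)
```

In this toy setup the third video is deliberately mismatched with its gold description, so recall@1 is below 1 while recall@3 is not.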

ACL 2024

Diving Deep into the Motion Representation of Video-Text Models

multimodal vlms.

motion understanding

video-text models

poster

### Welcome!
The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) will take place in Bangkok, Thailand from August 11th to 16th, 2024. Our Virtual Poster Sessions will take place online Thursday, August 22, 2024.


While large language and vision-language models showcase impressive capabilities, they face a notable limitation: the inability to connect language with the physical world. To bridge this gap, research has focused on embodied language learning, where the language learner is situated in the world, perceives it, and interacts with it. This article explores the current standing of research in embodied language learning, highlighting opportunities and discussing common challenges. Lastly, it identifies existing gaps from the perspective of language understanding research within the embodied world and suggests potential future directions.

Embodied Language Learning: Opportunities, Challenges, and Future Directions

It is increasingly common to evaluate the same coreference resolution (CR) model on multiple datasets. Do these multi-dataset evaluations allow us to draw meaningful conclusions about model generalization? Or, do they rather reflect the idiosyncrasies of a particular experimental setup (e.g., the specific datasets used)? To study this, we view evaluation through the lens of measurement modeling, a framework commonly used in the social sciences for analyzing the validity of measurements. By taking this perspective, we show how multi-dataset evaluations risk conflating different factors concerning what, precisely, is being measured. This in turn makes it difficult to draw more generalizable conclusions from these evaluations. For instance, we show that across seven datasets, measurements intended to reflect CR model generalization are often correlated with differences in both how coreference is defined and how it is operationalized; this limits our ability to draw conclusions regarding the ability of CR models to generalize across any singular dimension. We believe the measurement modeling framework provides the needed vocabulary for discussing challenges surrounding what is actually being measured by CR evaluations.

Challenges to Evaluating the Generalization of Coreference Resolution Models: A Measurement Modeling Perspective

In this paper, we propose a textless acoustic model with a self-supervised distillation strategy for noise-robust expressive speech-to-speech translation (S2ST).
Recently proposed expressive S2ST systems have achieved impressive expressivity preservation performance by cascading a unit-to-speech (U2S) generator to the speech-to-unit translation model.
However, these systems are vulnerable to noise in the input speech, which is to be expected in real-world translation scenarios.
To address this limitation, we propose a U2S generator that incorporates a distillation with no label (DINO) self-supervised training strategy into its pretraining process.
Because the proposed method captures noise-agnostic expressivity representations, it can generate high-quality speech even in noisy environments.
Objective and subjective evaluation results verified that the proposed method significantly improved the performance of the expressive S2ST system in noisy environments while maintaining competitive performance in clean environments.

Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation

We investigate intention detection in persuasive multi-turn dialogs employing the largest available Large Language Models (LLMs).
Much of the prior research measures the intention detection capability of machine learning models without considering the conversational history.
To evaluate LLMs' intention detection capability in conversation, we modified the existing datasets of persuasive conversation and created datasets using a multiple-choice paradigm.
It is crucial to consider others' perspectives through their utterances when engaging in a persuasive conversation, especially when making a request or reply that is inconvenient for others.
This feature makes persuasive dialogue well suited to building datasets for measuring intention detection capability.
We incorporate the concept of "face acts," which categorize how utterances affect mental states.
This approach enables us to measure intention detection capability by focusing on crucial intentions and to conduct comprehensible analysis according to intention types.

Evaluating Intention Detection Capability of Large Language Models in Persuasive Dialogues

A substantial body of work has provided evidence that the lexicons of natural languages are organized to support efficient communication. However, existing work has largely focused on word-internal properties, such as Zipf’s observation that more frequent words are optimized in form to minimize communicative cost. Here, we investigate the hypothesis that efficient lexicon organization is also reflected in valency, or the combinations and orders of additional words and phrases a verb selects for in a sentence. We consider two measures of valency diversity for verbs: valency frame count (VFC), the number of distinct frames associated with a verb, and valency frame entropy (VFE), the average information content of frame selection associated with a verb. Using data from 79 languages, we provide evidence that more frequent verbs are associated with a greater diversity of valency frames, suggesting that the organization of valency is consistent with communicative efficiency principles. We discuss our findings in relation to classical findings such as Zipf’s meaning-frequency law and the principle of least effort, as well as implications for theories of valency and communicative efficiency principles.

More frequent verbs are associated with more diverse valency frames: Efficient principles at the lexicon-grammar interface
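
The two valency diversity measures defined in the abstract above can be sketched in a few lines. This is a minimal illustration that assumes VFE is the Shannon entropy (in bits) of a verb's observed frame distribution; the frame labels and function names are hypothetical, and the paper may estimate these quantities differently.

```python
import math
from collections import Counter


def valency_frame_count(frames):
    """VFC: number of distinct valency frames observed with a verb."""
    return len(set(frames))


def valency_frame_entropy(frames):
    """VFE: Shannon entropy (base 2) of the verb's frame distribution.

    Assumption: probabilities are plain relative frequencies of the frames.
    """
    counts = Counter(frames)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical frame observations for a single verb.
observed = ["NP NP", "NP PP", "NP NP", "NP"]

vfc = valency_frame_count(observed)    # 3 distinct frames
vfe = valency_frame_entropy(observed)  # probabilities 0.5, 0.25, 0.25 give 1.5 bits
```

A verb that always appears with the same frame has VFC 1 and VFE 0, while a verb spread evenly over many frames scores high on both measures.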

Metaphor interpretation is a difficult task in natural language understanding. The development of relevant techniques in this domain is slow, mostly because of the lack of large annotated datasets and effective pre-trained language models (PLMs) for metaphor learning. Thus, we propose a large annotated dataset and a PLM for the metaphor interpretation task. Our foundation model is based on a novel anomalous language modeling (ALM) method, which we benchmark with comparable PLM baselines on the new dataset, finding that it largely improves model performance on metaphor identification and interpretation.

MetaPro 2.0: Computational Metaphor Processing on the Effectiveness of Anomalous Language Modeling

We consider the task of building a dialogue system that can motivate users to adopt positive lifestyle changes: Motivational Interviewing (MI). Addressing such a task requires a system that can infer *how* to motivate the user effectively. We propose DIIR, a framework that is capable of learning and applying conversation strategies in the form of natural language inductive rules from expert demonstrations. Automatic and human evaluations on instruction-following large language models show that the natural language strategy descriptions discovered by DIIR can improve active listening skills, reduce unsolicited advice, and promote more collaborative and less authoritative conversations, outperforming in-context demonstrations that are over 50 times longer.

Few-shot Dialogue Strategy Learning for Motivational Interviewing via Inductive Reasoning

Traditional Dialogue State Tracking (DST) has focused on tracking preferences and intents in conversations centered around specific tasks (e.g. booking services). These conventional systems assume a relatively restricted conversation flow in which each turn gradually offers new information. However, advancements in Large Language Models (LLMs) have ushered in more versatile open-domain chat systems in which extended dialogue sessions encompassing numerous tasks and topics are common, in turn requiring new conversational tracking tools in order to successfully orchestrate such systems. Addressing these challenges, we introduce a novel approach combining dialogue segmentation and state tracking within open-domain dialogues, tailored for zero-shot applications appropriate to a true open-domain dialogue system. Our proposed method S3-DST employs a unique structured prompting technique and *Pre-Analytical Recollection*, a novel grounding mechanism we designed for improving long context tracking. Tested on proprietary anonymized open-domain dialogue datasets as well as publicly available DST and segmentation datasets, S3-DST consistently outperforms the state-of-the-art, showcasing its effectiveness and adaptability for state tracking in the next wave of LLM-based chat systems. We also release S3-DST annotations with GPT-4 on a curated subset of LMSYS-Chat-1M to be used as a testbed to fuel research in this direction.

S3-DST: Structured Open-Domain Dialogue Segmentation and State Tracking in the Era of LLMs

The long-standing one-to-many problem of gold standard responses in open-domain dialogue systems presents challenges for automatic evaluation metrics. 
Though prior works have demonstrated some success by applying powerful Large Language Models (LLMs), existing approaches still struggle with the one-to-many problem and exhibit subpar performance in domain-specific scenarios. We assume that the commonsense reasoning biases within LLMs may hinder their performance in domain-specific evaluations. To address both issues, we propose a novel framework, SLIDE (Small and Large Integrated for Dialogue Evaluation), that leverages both a small, specialised model (SLM) and LLMs for the evaluation of open-domain dialogues.
Our approach introduces several techniques: (1) contrastive learning to differentiate between robust and non-robust response embeddings; (2) a novel metric for semantic sensitivity that combines embedding cosine distances with similarity learned through neural networks; and (3) a strategy for incorporating the evaluation results from both the SLM and LLMs.
Our empirical results demonstrate that our approach achieves state-of-the-art performance in both the classification and evaluation tasks, and additionally the SLIDE evaluator exhibits better correlation with human judgements. Our code is available at https://github.com/hegehongcha/SLIDE-ACL2024.

SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation

Large Language Models (LLMs) have demonstrated remarkable performance in assisting humans in programming and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capacities of LLMs suffer from severe limitations. First, most benchmarks are insufficient as they focus on a narrow range of popular programming languages and specific tasks, whereas real-world software development scenarios show a critical need to implement systems with multilingual and multitask programming environments to satisfy diverse requirements. Second, most benchmarks fail to consider the actual executability and the consistency of execution results of the generated code. To bridge these gaps between existing benchmarks and expectations from practical applications, we introduce **CodeScope**, an execution-based, multilingual, multitask, multidimensional evaluation benchmark for comprehensively measuring LLM capabilities on coding tasks. CodeScope covers **43 programming languages** and **eight coding tasks**. It evaluates the coding performance of LLMs from three dimensions (perspectives): **length**, **difficulty**, and **efficiency**. To facilitate execution-based evaluations of code generation, we develop **MultiCodeEngine**, an automated code execution engine that supports 14 programming languages. Finally, we systematically evaluate and analyze eight mainstream LLMs and demonstrate the superior breadth and challenges of CodeScope for evaluating LLMs on code understanding and generation tasks compared to other benchmarks. The CodeScope benchmark and code are publicly available at https://github.com/WeixiangYAN/CodeScope.
