Vision-language tasks such as VQA, SNLI-VE, and VCR are challenging because they require a model to reason about the semantics of both the visual world and natural language. Supervised methods for vision-language tasks have been well studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works have exploited its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly consider global-level matching (e.g., the whole image or sentence). However, we find that fine-grained visual and textual information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework that takes advantage of fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms previous zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method.
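To make the idea of combining global-level and fine-grained matching concrete, the following is a minimal sketch, not the authors' implementation. It assumes embeddings have already been produced by some image-text encoder such as CLIP; the function names `cosine` and `pick_answer`, the weight `alpha`, the averaging scheme, and the toy vectors are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def pick_answer(image_vec, region_vecs, candidate_vecs, keyword_vecs, alpha=0.5):
    """Score each candidate answer and return (best_index, scores).

    Global score: similarity between the whole image and the whole sentence.
    Fine-grained score (an assumed scheme): for each keyword, take its best
    matching image region, then average over the keywords.
    """
    scores = []
    for sentence_vec, keywords in zip(candidate_vecs, keyword_vecs):
        global_score = cosine(image_vec, sentence_vec)
        fine_score = sum(
            max(cosine(region, kw) for region in region_vecs) for kw in keywords
        ) / len(keywords)
        scores.append((1 - alpha) * global_score + alpha * fine_score)
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


# Toy 2-D embeddings standing in for encoder outputs.
image_vec = [1.0, 0.0]
region_vecs = [[1.0, 0.0], [0.0, 1.0]]
candidate_vecs = [[1.0, 0.0], [0.0, 1.0]]        # one vector per candidate sentence
keyword_vecs = [[[1.0, 0.0]], [[-1.0, 0.0]]]     # keyword vectors per candidate

best, scores = pick_answer(image_vec, region_vecs, candidate_vecs, keyword_vecs)
```

With these toy vectors the first candidate agrees with the image at both the global and the keyword level, so it receives the higher combined score.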


UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Visual commonsense understanding requires Vision Language (VL) models to not only understand image and text but also cross-reference between them to fully integrate and comprehend the visual scene described. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, it is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge. We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information but not the opposite; (2) visual information is generally underutilized compared with text.

Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

Large-scale visual-linguistic pre-training aims to capture the generic representations from multimodal features, which are essential for downstream vision-language tasks. Existing methods mostly focus on learning the semantic connections between visual objects and linguistic content, which tend to be recognition-level information and may not be sufficient for commonsensical reasoning tasks like VCR. In this paper, we propose a novel commonsensical vision-language pre-training framework to bridge the gap. We first augment the conventional image-caption pre-training datasets with commonsense inferences from a visual-linguistic GPT-2. To pre-train models on image, caption and commonsense inferences together, we propose two new tasks: masked commonsense modeling (MCM) and commonsense type prediction (CTP). To reduce the shortcut effect between captions and commonsense inferences, we further introduce the domain-wise adaptive masking that dynamically adjusts the masking ratio. Experimental results on downstream tasks, VCR and VQA, show the improvement of our pre-training strategy over previous methods. Human evaluation also validates the relevance, informativeness, and diversity of the generated commonsense inferences. Overall, we demonstrate the potential of incorporating commonsense knowledge into the conventional recognition-level visual-linguistic pre-training.

Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks

Pre-trained contextual vision-and-language (V&L) models have achieved impressive performance on various benchmarks. However, existing models require a large amount of parallel image-caption data for pre-training. Such data are costly to collect and require cumbersome curation. Inspired by unsupervised machine translation, we investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora. In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora and introduce the object tags detected by an object recognition model as anchor points to bridge two modalities. We find that such a simple approach achieves performance close to a model pre-trained with aligned data, on four English V&L benchmarks. Our work challenges the widely held notion that aligned data is necessary for V&L pre-training, while significantly reducing the amount of supervision needed for V&L models.
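The "mask-and-predict" objective mentioned above can be illustrated with a small sketch of the masking step alone. This is an assumption-laden illustration rather than the paper's procedure: the helper `mask_tokens`, the `[MASK]` symbol, the masking ratios, and the placeholder region tokens are hypothetical, and the image-only example simply represents detected object tags as plain words appended to the region placeholders.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, ratio, rng):
    """Replace a fraction of tokens with MASK.

    Returns the masked sequence and a dict mapping each masked position
    to the original token, which is what a model would be asked to predict.
    """
    n_mask = max(1, round(ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets


rng = random.Random(0)  # seeded so the sketch is reproducible

# Text-only example: an ordinary sentence with no paired image.
text_only = ["a", "dog", "chases", "a", "ball"]
masked_text, text_targets = mask_tokens(text_only, 0.4, rng)

# Image-only example: region placeholders plus detected object tags
# ("dog", "ball") that share vocabulary with the text corpus.
image_only = ["<region_0>", "<region_1>", "dog", "ball"]
masked_image, image_targets = mask_tokens(image_only, 0.5, rng)
```

Because the object tags are ordinary words, the same vocabulary appears in both the text-only and image-only sequences, which is the sense in which tags can act as anchor points between the two modalities.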

Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

Virtual Poster Session 3

poster

### Welcome to ACL 2023, the 61st Annual Meeting of the Association for Computational Linguistics! 
 The conference will be held in Toronto, Canada, July 9-14, 2023. 
Following the succession of the recent conferences in our field, ACL 2023 will adopt a hybrid format.
While the impact of Covid has considerably diminished in terms of traveling, obtaining visas to Canada
entails a very long process. Moreover, the global economic conditions pose challenges for many individuals to travel to conferences. Recognizing these circumstances, we know many participants may not be
able to attend the conference in person. Therefore, we are committed to providing a great virtual platform
so everyone has the opportunity to interact with other participants and enjoy the conference. Based on the
current registered participants, approximately 30% have chosen to attend the conference virtually. Whether
you join us in person or virtually, we sincerely hope everyone has a remarkable conference experience. 
This General Chair’s message is where I express my gratitude to the many individuals who have made
enormous contributions to the conference over the past year.

Read [**ACL 2023 General Chair's message**](https://docs.google.com/document/d/1WobYM7norbG4dI48s75HfJoD89qgX5a_F-6U8AteLSA/edit?usp=sharing/) in full.

##### **[Conference Handbook](https://2023.aclweb.org/downloads/acl2023-handbook.pdf)**

ACL 2023

The Association for Computational Linguistics (ACL) is the premier international scientific and professional society for people working on computational problems involving human language, a field often referred to as either computational linguistics or natural language processing.

**Topics:** Commonsense Reasoning 
Dialogue and Interactive Systems 
Discourse and Pragmatics 
Information Extraction

Virtual Poster Session 13

## Welcome to EMNLP 2022!
I am delighted to welcome you to EMNLP 2022! I believe this conference has been complicated beyond any precedent. Over the past year, it’s been thrilling to see the organization team approach each new puzzle with creativity and enthusiasm. We hope that those participating in Abu Dhabi as well as those joining remotely will leave the conference feeling newly inspired by the program and newly connected to our ever-growing community. Following EMNLP 2021 and major NLP conferences since, EMNLP 2022 is “hybrid,” serving both virtual and in-person participants.

Our key innovations for EMNLP 2022 include:

* EMNLP 2022 is “hybrid” in a second sense, as well: we allowed both direct and rolling review paper submissions, building on the pilot experiment of EMNLP 2021, which considered a small number of ARR submissions. 
* Familiar from NAACL but new to EMNLP, we’ve added an industry track.
* During the conference, “portals” will link virtual poster sessions to in-person conference participants during poster sessions each day.
* The first ACL-family conference in the United Arab Emirates.

 *Message from Noah A. Smith, University of Washington and Allen Institute for AI, Seattle, Washington, USA* 
***EMNLP 2022 General Chair***
 
[![](https://assets.underline.io/uploads/markdown_image/1/image/9eec7d4a287ee18c278b08229290aa83.png)](https://drive.google.com/file/d/1OlPv6QBeo62VVTughj2jkiLeyHd1WnUt/view)
 
[![](https://assets.underline.io/uploads/markdown_image/1/image/a3db7a768409f05192210d98601edb25.png)](https://emnlp2022.rocket.chat/)

To access this site you need to register. Please register [here](https://2022.emnlp.org/registration/).


EMNLP 2022

Welcome!
EMNLP 2022 will take place in Abu Dhabi from December 7th to December 11th, 2022. It will be held in hybrid mode, with both online and in-person participation.

**Organizers:** Antoine Bosselut, Xiang Lorraine Li, Bill Yuchen Lin, Vered Shwartz, Bodhisattwa Prasad Majumder, Yash Kumar Lal, Rachel Rudinger, Xiang Ren, Niket Tandon, Vilém Zouhar
**Description:** We organize this workshop to encourage discussion of current progress on building machines with commonsense knowledge and reasoning abilities. We aim to bring together researchers from different areas (e.g., NLP, computer vision, computational neuroscience, psychology) to communicate promising working directions in the area of commonsense reasoning.
**Please visit our [website](https://csrr-workshop.github.io/)**

(CL_Commonsense) Workshop on Commonsense Representation and Reasoning

workshop paper

# Welcome everyone to ACL 2022!

The 60th Annual Meeting of the Association for Computational Linguistics is taking place May 22-27, 2022 as a hybrid event, in Dublin and online. We are happy to welcome all of you to this anniversary edition with an almost 50-50 in-person and virtual participation. 
The main conference program features oral presentations, in-person and virtual posters and demo sessions, a plenary session for our best paper presentations and awards, three amazing keynote events and two new initiatives of invited talks: Spotlight Talks for Young Rising Stars (STIRS) and The Next Big Idea Talks. Posters (including Findings of ACL 2022) and demos are grouped by areas for both the in-person and the virtual sessions. For the virtual component, the talks will be on Zoom and the posters and the demos will be in GatherTown. The Student Research Workshop will have an oral session and a poster session as part of Poster Session 1. The program also features eight Tutorials and 28 Workshops. 

 
We wish you a wonderful conference! 
[**The ACL 2022 Organizing Committee**](https://www.2022.aclweb.org/organisers)
 
[**Conference Handbook**](https://drive.google.com/file/d/1_BUCMfhMVrjG9E2e71aHdHeE28KSje0l/view?usp=sharing) 
[**Mini Handbook**](https://drive.google.com/file/d/1qlBKl0wzmlVF1oCeMQl3BahLd9nLP5Ce/view?usp=sharing) 
[**Posters and Demo guides**](https://drive.google.com/file/d/1UucMAoCNncIOaH1rMMDa0owuG9qgvJTG/view?usp=sharing)

ACL 2022

The Association for Computational Linguistics (ACL) is the premier international scientific and professional society for people working on computational problems involving human language, a field often referred to as either computational linguistics or natural language processing (NLP). 

**Session Chair:** Aishwarya Padmakumar

15A-Oral: Language Grounding to Vision, Robotics and Beyond

technical paper

2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

**Whova App** 
Stay in touch with your fellow conference attendees via the [Whova App](https://whova.com/portal/webapp/nacon_202106/)

**Conference Structure**
https://2021.naacl.org/blog/conference-structure/

**Walkthrough video of how to NAACL 2021** 

Please take a moment to view this video explaining how to navigate the platform, attend sessions, and network with other attendees.


<figure class="video_container">
 <iframe src="https://screencast-o-matic.com/watch/crhwbGVh3vx?v=6&ff=1&title=0&controls=1" width=640 height=350 frameborder="0" allowfullscreen="true"> </iframe>
</figure>

Haoxuan You


Presentations

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks

Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
