United States

Direct Preference Optimization (DPO) has recently been extended from aligning large language models (LLMs) to aligning text-to-image models with human preferences, which has generated considerable interest within the community. However, we observe that these approaches rely solely on minimizing the reverse Kullback-Leibler divergence between the fine-tuned model and the reference model during the alignment process, and neglect other divergence constraints. In this study, we extend the reverse Kullback-Leibler divergence in the alignment paradigm of text-to-image models to $f$-divergence, aiming for better alignment performance together with good generation diversity. We provide the generalized formula of the alignment paradigm under the $f$-divergence condition and thoroughly analyze the impact of different divergence constraints on the alignment process from the perspective of gradient fields. We conduct a comprehensive evaluation of image-text alignment performance, human value alignment performance, and generation diversity performance under different divergence constraints, and the results indicate that alignment based on Jensen-Shannon divergence achieves the best trade-off among them. The choice of divergence used for aligning text-to-image models significantly affects the trade-off between alignment performance (especially human value alignment) and generation diversity, which highlights the need to select an appropriate divergence for practical applications.
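As a rough illustration of the divergence family mentioned in this abstract, the following minimal sketch computes an $f$-divergence between two discrete distributions. It is not the paper's alignment objective. It assumes the common convention $D_f(p \| q) = \sum_x q(x) f(p(x)/q(x))$ with $p$ standing for the fine-tuned model and $q$ for the reference model; the function names and the toy distributions are hypothetical.

```python
import math


def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)), assuming q(x) > 0 everywhere."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


def f_reverse_kl(u):
    # Generator f(u) = u log u yields KL(p || q). With p as the fine-tuned
    # model and q as the reference (an assumed convention), this is the
    # "reverse KL" constraint of standard DPO.
    return u * math.log(u) if u > 0 else 0.0


def f_jensen_shannon(u):
    # Generator f(u) = u log u - (u + 1) log((u + 1) / 2) yields
    # KL(p || m) + KL(q || m) with m = (p + q) / 2, i.e. twice the
    # Jensen-Shannon divergence in natural-log units.
    left = u * math.log(u) if u > 0 else 0.0
    return left - (u + 1) * math.log((u + 1) / 2)


# Hypothetical toy distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

reverse_kl = f_divergence(p, q, f_reverse_kl)
js_pq = f_divergence(p, q, f_jensen_shannon)
js_qp = f_divergence(q, p, f_jensen_shannon)
```

Swapping the generator `f` changes the constraint without touching the rest of the computation. In this sketch the Jensen-Shannon generator is symmetric in its arguments and bounded by $2\log 2$, whereas the KL generator is neither.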

AAAI 2025

Generalizing Alignment Paradigm of Text-to-Image Generation with Preferences Through f-Divergence Minimization

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)


Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is time-consuming and prone to intra- and inter-sonographer variability. Existing methods primarily rely on image-based approaches that capture standard frames and then classify the input frames across different anatomies. This ignores the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method that assists sonographers by enabling them to capture a quick ultrasound sweep. When the sonographer then provides a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mtIoU on the ultrasound datasets and by 5.35% mtIoU on the Ego4D dataset, using 96% fewer tokens. MCAT’s efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying ultrasound-based screening and diagnosis, and allowing sonographers to examine more patients. The code will be available at xxx.github.com and in supplementary material.
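The results above are reported in mtIoU. A minimal sketch of how such a score could be computed follows, assuming mtIoU denotes the mean temporal intersection-over-union between predicted and ground-truth clips, each given as a (start, end) pair. The function names and example clips are hypothetical and not taken from the paper.

```python
def temporal_iou(pred, truth):
    """Temporal IoU between two clips given as (start, end) with start <= end."""
    (pred_start, pred_end), (true_start, true_end) = pred, truth
    # Overlap is zero when the clips do not intersect.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_temporal_iou(preds, truths):
    """Average temporal IoU over paired predictions and ground-truth clips."""
    scores = [temporal_iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)


# Hypothetical clips in seconds: one partial overlap, one exact match.
predicted = [(0.0, 10.0), (20.0, 30.0)]
annotated = [(5.0, 15.0), (20.0, 30.0)]
score = mean_temporal_iou(predicted, annotated)
```

In this example the first pair overlaps for 5 of 15 seconds and the second pair matches exactly, so the mean is the average of 1/3 and 1.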

MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer

Local governments around the world are making consequential decisions on behalf of their constituents, and these constituents are responding with requests, advice, and assessments of their officials at public meetings. So many small meetings cannot be covered by traditional newsrooms at scale. We propose PublicSpeak, a probabilistic framework which can utilize meeting structure, domain knowledge, and linguistic information to discover public remarks in local government meetings. We then use our approach to inspect the issues raised by constituents in 7 cities across the United States. We evaluate our approach on a novel dataset of local government meetings and find that PublicSpeak improves over state-of-the-art by 10% on average, with gains of up to 40%.

PUBLICSPEAK: Hearing the Public with a Probabilistic Framework

Large language models (LLMs) offer a valuable technology for various applications in healthcare. However, their tendency to hallucinate and the existing reliance on proprietary systems pose challenges in environments that involve critical decision-making and strict data privacy regulations, such as healthcare, where trust in such systems is paramount. By combining the strengths and discounting the weaknesses of humans and AI, the field of Human-AI Collaboration (HAIC) presents one front for tackling these challenges and hence improving trust. This paper presents a novel HAIC $\textit{guided deferral}$ system that can simultaneously parse medical reports and defer uncertain predictions to humans with intelligent guidance. We develop a methodology that builds efficient, effective, and open-source LLMs for this purpose, for real-world deployment in healthcare. We conduct a pilot study which showcases the effectiveness of our proposed system in practice. Additionally, we highlight drawbacks of standard calibration metrics in imbalanced data scenarios commonly found in healthcare, and suggest a simple yet effective solution: the Imbalanced Expected Calibration Error ($\mathrm{ECE_{Imb}}$). We release our code for practitioners wishing to replicate our system.
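For context on the calibration metric this abstract critiques, here is a minimal sketch of the standard binned Expected Calibration Error. The abstract does not define $\mathrm{ECE_{Imb}}$, so this sketch does not attempt to reproduce it. The equal-width binning, the function name, and the example values are assumptions for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece


# Hypothetical predictions: 90% confidence, but only 3 of 4 are correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

Because each bin is weighted by its share of all samples, bins dominated by the majority class contribute most of the score.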

Trustworthy and Practical AI for Healthcare: A Guided Deferral System with Large Language Models

Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is especially designed to detect culturally-specific toxic language. We evaluate 10 S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when holistically scoring the toxicity of a prompt, and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g., microaggressions, bias). We release this dataset to help further reduce harmful uses of these models and improve their safe deployment.

RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?

Tropical cyclone (TC) intensity forecasting is crucial for early disaster warning and emergency decision-making. Numerous researchers have explored deep-learning methods to address computational and post-processing issues in operational forecasting. However, these methods exhibit subpar long-term forecasting capabilities. We use two strategies to enhance long-term forecasting. (1) Enhancing the matching between TC intensity and spatial information improves long-term forecasting performance. (2) Incorporating physical knowledge and physical constraints helps mitigate the accumulation of forecasting errors. To implement these strategies, we propose the VQLTI framework. VQLTI transfers the TC intensity information to a discrete latent space while retaining the spatial information differences, using large-scale spatial meteorological data as conditions. Furthermore, we leverage the forecast from the weather prediction model FengWu to provide additional physical knowledge for VQLTI. Additionally, we calculate the potential intensity (PI) to impose physical constraints on the latent variables. In global long-term TC intensity forecasting, VQLTI achieves state-of-the-art results for 24h to 120h forecasts, with the MSW (Maximum Sustained Wind) forecast error reduced by 35.65%-42.51% compared to ECMWF-IFS. The code implementation is available at https://anonymous.4open.science/r/VQLTI-CEF2.

VQLTI: Long-Term Tropical Cyclone Intensity Forecasting with Physical Constraints

The soaring drug and substance use crisis in the United States has claimed more than half a million lives in the past decade and remains a major public health threat. The ability to predict drug overdose deaths at the county level can help local communities develop action plans in response to emerging changes. Applying off-the-shelf machine learning algorithms for prediction can be challenging due to the heterogeneous risk profiles of the counties and suppressed data in common publicly available data sources. To fill these gaps, we develop a cluster-aware supervised learning (CASL) framework to enhance the prediction of county-level drug overdose deaths. This CASL model simultaneously clusters counties into groups based on geographical and socioeconomic characteristics and minimizes the loss function that accounts for suppressed values and cluster-specific regularization. Our computational study uses real-world data from 2010 to 2021, focusing on the ten states most severely impacted by the drug overdose crisis. The results demonstrate that our proposed CASL framework significantly outperforms state-of-the-art methods by achieving a superior balance in prediction accuracy for both unsuppressed and suppressed observations. The proposed model also uncovers different clusters of counties, capturing the underlying heterogeneity in the patterns of overdose mortality among counties of various characteristics.

A Spatio-temporal Cluster-aware Supervised Learning Framework for Predicting County-level Drug Overdose Deaths

Understanding internal joint loading is critical for diagnosing gait-related diseases such as knee osteoarthritis; however, current methods of measuring joint risk factors with force plates and 3D motion capture systems are time-consuming, expensive, and restricted to controlled lab settings, limiting their applicability to real-world contexts. Thus, in this paper, we aim to enable large-scale, cost-effective diagnosis of joint-related diseases via three key contributions: the development and deployment of novel instrumented insoles, the curation of a large multimodal biomechanics dataset, VidSole, and the evaluation of a baseline deep learning pipeline to predict internal joint loading factors. Our VidSole dataset combines the forces and moments measured by the insoles with RGB video from two viewpoints, 3D body motion capture, and force plate data for over 2,6000 trials of 52 participants performing four fundamental activities of daily living (sit-to-stand, stand-to-sit, walking, and running). We feed the insole data and kinematic parameters extractable from video (i.e., pose, knee angle) into a deep learning pipeline, consisting of ensembled Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) models, to first classify the activities of daily living (99.16% accuracy) and then estimate knee adduction moment (KAM). We validate the model’s KAM estimation by demonstrating that our mean absolute error (MAE) falls within the acceptable range of < (0.5% to 2.1%)$\times$body weight$\times$height, the current threshold for accurately detecting knee osteoarthritis with KAM. Thus, we believe our instrumented insoles, VidSole dataset, and deep learning pipeline are useful for accurately quantifying joint kinetic measurements and can assist in preventing joint-related diseases.

VidSole: A Multimodal Dataset for Joint Kinetics Quantification and Disease Detection with Deep Learning

Knowledge Tracing (KT) is a crucial component in the education field, which focuses on depicting students’ learning states and assessing students’ mastery of subjects. With the rise of modern online learning platforms, particularly massive open online courses (MOOCs), an abundance of interaction data has injected new vitality into the development of KT technology. Previous research commonly adopts deterministic representations to capture students’ knowledge states, which neglects the uncertainty in student interactions and thus fails to model the true knowledge state in the learning process. In light of this, we propose an Uncertainty-Aware Knowledge Tracing model (UKT) which employs stochastic distribution embeddings to represent the uncertainty in student interactions, with a Wasserstein self-attention mechanism designed to capture the transition of state distribution in student learning behaviors. Additionally, we introduce the aleatory uncertainty-aware contrastive learning loss, which strengthens the model’s robustness towards detrimental uncertainties and ensures a more accurate knowledge-mastery assessment. Rigorous empirical studies on six real-world datasets demonstrate that UKT not only significantly surpasses existing deep learning-based models in KT prediction, but also shows unique advantages in handling the uncertainty of student interactions. All data and code are publicly available at https://anonymous.4open.science/r/UKT.

Uncertainty-aware Knowledge Tracing

Medical benchmark datasets significantly contribute to developing Large Language Models (LLMs) for medical knowledge extraction, diagnosis, summarization, and other uses.
Yet, current benchmarks are mainly derived from exam questions given to medical students or cases described in the medical literature, lacking the complexity of real-world patient cases that deviate from classic textbook abstractions. These include rare diseases, uncommon presentations of common diseases, and unexpected treatment responses. 
Here, we construct the Clinically Uncommon Patient Cases and Diagnoses Dataset (CUPCase) based on 3,563 real-world case reports from BMC, which we formulate into diagnoses in open-ended textual format and as multiple-choice options with distractors.
Using this dataset, we evaluate the ability of state-of-the-art LLMs, including both general-purpose and Clinical LLMs, to identify and correctly diagnose a patient case, and test models' performance when only partial information about cases is available.
Our findings show that general-purpose GPT-4o attains the best performance in both the multiple-choice task (average accuracy of 87.9%) and the open-ended task (BERTScore F1 of 0.764), outperforming several LLMs with a focus on the medical domain such as Meditron-70B and MedLM-Large. 
Moreover, GPT-4o was able to maintain 87% and 88% of its performance with only the first 20% of tokens of the case presentation in multiple-choice and free text, respectively, highlighting the potential of LLMs to aid in early diagnosis in real-world cases. An error analysis demonstrates the complexity of the task, and attempts to hypothesise about the models' reasoning.
CUPCase expands our ability to evaluate LLMs for clinical decision support in an open and reproducible manner.

CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset

Federated Learning (FL) in healthcare ensures patient privacy by allowing hospitals to collaboratively train machine learning models while keeping sensitive medical data secure and localized. Most existing research in FL has concentrated on unimodal scenarios, where all healthcare institutes share the same type of data. However, in real-world healthcare situations, some clients may have access to multiple types of data pertaining to the same disease. Multimodal Federated Learning (MMFL) utilizes multiple modalities in each client to build a more powerful FL model than its unimodal counterpart. However, the impact of missing modalities in different clients, called modality incongruity, has been greatly overlooked. This paper, for the first time, analyses the impact of modality incongruity and reveals its connection with data heterogeneity across participating clients. In particular, we inspect whether incongruent MMFL with unimodal and multimodal clients is more beneficial than unimodal FL. Furthermore, we examine three potential routes for addressing this issue. Firstly, we study the effectiveness of various self-attention mechanisms towards incongruity-agnostic information fusion in MMFL. Secondly, we introduce a modality imputation network (MIN) pre-trained in a multimodal client for modality translation in unimodal clients and investigate its potential towards mitigating the missing modality problem. Thirdly, we assess the capability of client-level and server-level regularization techniques towards mitigating modality incongruity effects. Experiments are conducted with Chest X-Ray and radiology reports under several MMFL settings on two publicly available real-world datasets, MIMIC-CXR and Open-I.
