United States

Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex, heterogeneous medical records. Large language models (LLMs) present a promising solution to automate various procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low- and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across languages, highlighting risks for clinical deployment of LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought or Retrieval-Augmented Generation (RAG) by themselves fall short of closing this performance gap, often failing to improve performance across all languages and lacking specificity for the medical domain. To address this issue, we propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference-time debiasing method leveraging retrieval-augmented generation and self-verification. Our approach not only improves performance across all languages but also significantly reduces the multilingual bias gap, facilitating equitable LLM application across the globe.

AAAI 2025

Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)



Understanding internal joint loading is critical for diagnosing gait-related diseases such as knee osteoarthritis; however, current methods of measuring joint risk factors with force plates and 3D motion capture systems are time-consuming, expensive, and restricted to controlled lab settings, limiting their applicability to real-world contexts. Thus, in this paper, we aim to enable large-scale, cost-effective diagnosis of joint-related diseases via three key contributions: the development and deployment of novel instrumented insoles, the curation of a large multimodal biomechanics dataset, VidSole, and the evaluation of a baseline deep learning pipeline to predict internal joint loading factors. Our VidSole dataset combines the forces and moments measured by the insoles with RGB video from two viewpoints, 3D body motion capture, and force plate data for over 2,600 trials of 52 participants performing four fundamental activities of daily living (sit-to-stand, stand-to-sit, walking, and running). We feed the insole data and kinematic parameters extractable from video (i.e., pose and knee angle) into a deep learning pipeline, consisting of ensembled Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) models, to first classify the activity of daily living (99.16% accuracy) and then estimate the knee adduction moment (KAM). We validate the model’s KAM estimation by demonstrating that our mean absolute error (MAE) falls within the acceptable range of < (0.5% to 2.1%) $\times$ body weight $\times$ height, the current threshold for accurately detecting knee osteoarthritis with KAM. Thus, we believe our instrumented insoles, VidSole dataset, and deep learning pipeline are useful for accurately quantifying joint kinetic measurements and can assist in preventing joint-related diseases.

VidSole: A Multimodal Dataset for Joint Kinetics Quantification and Disease Detection with Deep Learning

Knowledge Tracing (KT) is a crucial component in the education field, which focuses on depicting students’ learning states and assessing students’ mastery of subjects. With the rise of modern online learning platforms, particularly massive open online courses (MOOCs), an abundance of interaction data has injected new vitality into the development of KT technology. Previous research commonly adopts deterministic representations to capture students’ knowledge states, which neglects the uncertainty during student interactions and thus fails to model the true knowledge state in the learning process. In light of this, we propose an Uncertainty-Aware Knowledge Tracing model (UKT) which employs stochastic distribution embeddings to represent the uncertainty in student interactions, with a Wasserstein self-attention mechanism designed to capture the transition of state distribution in student learning behaviors. Additionally, we introduce an aleatory uncertainty-aware contrastive learning loss, which strengthens the model’s robustness towards detrimental uncertainties and ensures a more accurate knowledge-mastery assessment. Rigorous empirical studies on six real-world datasets demonstrate that UKT not only significantly surpasses existing deep learning-based models in KT prediction, but also shows unique advantages in handling the uncertainty of student interactions. All data and code are publicly available at https://anonymous.4open.science/r/UKT.
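To make the idea of attention over distribution embeddings concrete, here is a minimal sketch, not the authors' implementation. It assumes each interaction is embedded as a Gaussian with diagonal covariance (a mean vector and a standard-deviation vector), and it scores pairs of interactions by the negative squared 2-Wasserstein distance, which for diagonal Gaussians has the closed form $\|\mu_1-\mu_2\|^2 + \|\sigma_1-\sigma_2\|^2$. The function names `w2_squared` and `wasserstein_attention`, the Gaussian assumption, and the causal mask are illustrative choices that the abstract does not specify.

```python
import numpy as np


def w2_squared(mu_q, sigma_q, mu_k, sigma_k):
    """Pairwise squared 2-Wasserstein distance between diagonal Gaussians.

    mu_*, sigma_*: arrays of shape (T, d); sigma holds standard deviations.
    Returns an array of shape (T_q, T_k).
    """
    mean_term = ((mu_q[:, None, :] - mu_k[None, :, :]) ** 2).sum(axis=-1)
    std_term = ((sigma_q[:, None, :] - sigma_k[None, :, :]) ** 2).sum(axis=-1)
    return mean_term + std_term


def wasserstein_attention(mu, sigma, causal=True):
    """Attention weights from negative W2^2 distances (hypothetical scoring rule)."""
    scores = -w2_squared(mu, sigma, mu, sigma)
    if causal:
        # A student's state at step t may only attend to steps <= t.
        steps = scores.shape[0]
        future = np.triu(np.ones((steps, steps), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable row-wise softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

Closer distributions receive larger weights, so an interaction with a wide standard deviation (high uncertainty) is treated as distant from a confident one even when their means coincide.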

Uncertainty-aware Knowledge Tracing

Medical benchmark datasets significantly contribute to developing Large Language Models (LLMs) for medical knowledge extraction, diagnosis, summarization, and other uses.
Yet, current benchmarks are mainly derived from exam questions given to medical students or cases described in the medical literature, lacking the complexity of real-world patient cases that deviate from classic textbook abstractions. These include rare diseases, uncommon presentations of common diseases, and unexpected treatment responses. 
Here, we construct the Clinically Uncommon Patient Cases and Diagnoses Dataset (CUPCase) based on 3,563 real-world case reports from BMC, which we formulate into diagnoses in open-ended textual format and as multiple-choice options with distractors.
Using this dataset, we evaluate the ability of state-of-the-art LLMs, including both general-purpose and clinical LLMs, to identify and correctly diagnose a patient case, and test models' performance when only partial information about cases is available.
Our findings show that general-purpose GPT-4o attains the best performance in both the multiple-choice task (average accuracy of 87.9%) and the open-ended task (BERTScore F1 of 0.764), outperforming several LLMs with a focus on the medical domain such as Meditron-70B and MedLM-Large.
Moreover, GPT-4o was able to maintain 87% and 88% of its performance with only the first 20% of tokens of the case presentation in the multiple-choice and free-text tasks, respectively, highlighting the potential of LLMs to aid in early diagnosis in real-world cases. An error analysis demonstrates the complexity of the task and attempts to hypothesise about the models' reasoning.
CUPCase expands our ability to evaluate LLMs for clinical decision support in an open and reproducible manner.

CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset

Federated Learning in healthcare ensures patient privacy by allowing hospitals to collaboratively train machine learning models while keeping sensitive medical data secure and localized. Most existing research in Federated Learning (FL) has concentrated on unimodal scenarios, where all healthcare institutes share the same type of data. However, in real-world healthcare situations, some clients may have access to multiple types of data pertaining to the same disease. Multimodal Federated Learning (MMFL) utilizes multiple modalities in each client to build a more powerful Federated Learning (FL) model than its unimodal counterpart. However, the impact of missing modality in different clients, called modality incongruity, has been greatly overlooked. This paper, for the first time, analyses the impact of modality incongruity and reveals its connection with data heterogeneity across participating clients. We particularly inspect whether incongruent MMFL with unimodal and multimodal clients is more beneficial than unimodal FL. Furthermore, we examine three potential routes of addressing this issue. Firstly, we study the effectiveness of various self-attention mechanisms towards incongruity-agnostic information fusion in MMFL. Secondly, we introduce a modality imputation network (MIN) pre-trained in a multimodal client for modality translation in unimodal clients and investigate its potential towards mitigating the missing modality problem. Thirdly, we assess the capability of client-level and server-level regularization techniques towards mitigating modality incongruity effects. Experiments are conducted with Chest X-Ray and radiology reports under several MMFL settings on two publicly available real-world datasets, MIMIC-CXR and Open-I.

Incongruent Multimodal Federated Learning for Medical Vision and Language-based Multi-label Disease Detection

As global populations age rapidly, incorporating age-specific considerations into urban planning is crucial for transforming cities into supportive environments for both aging populations and sustainable development. However, current urban development practices fall significantly short in implementing age-friendly planning, leading to elderly services that are insufficient and unevenly distributed across regions. This underscores the urgent need for age-friendly urban renewal strategies. To tackle this challenge, our work focuses on generating optimized planning schemes for urban aging facilities, tailored to the unique demands of age-friendly community planning. We introduce a novel framework called **F**airness-driven **A**ge-friendly community **P**lanning via **C**onditional **D**iffusion generation (FAP-CD) that utilizes a conditioned graph denoising diffusion probabilistic model to learn the conditional joint probability distribution of aging facilities and their spatial relationships at a fine-grained regional level. Specifically, this approach generates optimized spatial distributions of facilities from noisy graphs, conditioned on the needs of the elderly during the denoising diffusion process. In the training phase, we incorporate a demand-fairness pre-training module that leverages an attention mechanism and min-max optimization to integrate community demand features with facility characteristics, ensuring a balanced distribution of services across different regions. Additionally, we use a discrete graph structure to represent potential walkable accessibility within regional road networks, serving as a guiding condition to accelerate model sampling. We also design a graph denoising network with an attribute augmentation module and a hybrid graph message aggregation module to enhance the integration of neighbor and global node and edge information. 
Empirical results across multiple metrics highlight our method's superior ability to balance age-friendly needs with regional equity, achieving an average improvement of 41% over various competitive baseline models.

FAP-CD: Fairness-Driven Age-Friendly Community Planning via Conditional Diffusion Generation

Rendering photorealistic head avatars from arbitrary viewpoints is crucial for various applications like virtual reality. Although previous methods based on Neural Radiance Fields (NeRF) can achieve impressive results, they lack fidelity and efficiency. Recent methods using 3D Gaussian Splatting (3DGS) have improved rendering quality and real-time performance but still require significant storage overhead. In this paper, we introduce a method called GraphAvatar that utilizes Graph Neural Networks (GNN) to generate 3D Gaussians for the head avatar. Specifically, GraphAvatar trains a geometric GNN and an appearance GNN to generate the attributes of the 3D Gaussians from the tracked mesh. Therefore, our method can store the GNN models instead of the 3D Gaussians, significantly reducing the storage overhead to just 10MB. To reduce the impact of face-tracking errors, we also present a novel graph-guided optimization module to refine face-tracking parameters during training. Finally, we introduce a 3D-aware enhancer for post-processing to enhance the rendering quality. We conduct comprehensive experiments to demonstrate the advantages of GraphAvatar, surpassing existing methods in visual fidelity and storage consumption. The ablation study sheds light on the trade-offs between rendering quality and model size. The code will be released.

GraphAvatar: Compact Head Avatars with GNN-Generated 3D Gaussians

Incomplete multi-view clustering (IMVC) has garnered increasing attention in recent years due to the common issue of missing data in multi-view datasets. The primary approach to address this challenge involves recovering the missing views before applying conventional multi-view clustering methods. Although imputation-based IMVC methods have achieved significant improvements, they still encounter notable limitations: 1) heavy reliance  on paired data for training the data recovery module, which is impractical in real scenarios with high missing data rates; 2) the generated data often lacks diversity and discriminability, resulting in suboptimal clustering results. 
To address these shortcomings, we propose a novel IMVC method called Diffusion Contrastive Generation (DCG). Motivated by the consistency between the diffusion and clustering processes, DCG learns the distribution characteristics to enhance clustering by applying forward diffusion and reverse denoising processes to intra-view data.
By performing contrastive learning on a limited  set of paired multi-view samples, DCG can align the generated views with the real views, facilitating accurate recovery of views across arbitrary missing view scenarios. Additionally, DCG integrates instance-level and category-level interactive learning to exploit the consistent and complementary information available in multi-view data, achieving robust and end-to-end clustering. 
Extensive experiments demonstrate that our method outperforms state-of-the-art approaches.

Incomplete Multi-view Clustering via Diffusion Contrastive Generation

We address the challenging task of neural machine translation (NMT) in the entertainment domain, where the objective is to automatically translate a given dialogue from source-language content into a target language. This task has various applications, particularly in automatic dubbing, subtitling, and other content localization tasks, enabling source content to reach a wider audience. Traditional NMT systems typically translate individual sentences in isolation, without facilitating knowledge transfer of crucial elements such as the *context* and *style* from previously encountered sentences. In this work, we emphasize the significance of these fundamental aspects in producing pertinent and captivating translations. We demonstrate their significance through several examples and propose a novel framework for entertainment translation, which, to our knowledge, is the first of its kind. Furthermore, we introduce an algorithm to estimate the context and style of the current *session* and use these estimations to generate a *prompt* that guides a Large Language Model (LLM) to generate high-quality translations. Our method is both language- and LLM-agnostic, making it a general-purpose tool. We demonstrate the effectiveness of our algorithm through various numerical studies and observe significant improvement in the COMET scores over various state-of-the-art LLMs. Moreover, our proposed method consistently outperforms baseline LLMs in terms of win-ratio.

Enhancing Entertainment Translation for Indian Languages Using Adaptive Context, Style and LLMs

We analyze a distributed algorithm to compute a low-rank matrix factorization on $N$ clients, each holding a local dataset $\mathbf{S}^i \in \mathbb{R}^{n_i \times d}$. Mathematically, we seek to solve $\min_{\mathbf{U}^i \in \mathbb{R}^{n_i\times r}, \mathbf{V}\in \mathbb{R}^{d \times r} } \frac{1}{2} \sum_{i=1}^N \|\mathbf{S}^i - \mathbf{U}^i \mathbf{V}^\top\|^2_{\text{F}}$. Considering a power initialization of $\mathbf{V}$, we rewrite the previous smooth non-convex problem into a smooth strongly-convex problem that we solve using a parallel Nesterov gradient descent, potentially requiring a single step of communication at the initialization step. For any client $i$ in $\{1, \dots, N\}$, we obtain a global $\mathbf{V}$ in $\mathbb{R}^{d \times r}$ common to all clients and a local variable $\mathbf{U}^i$ in $\mathbb{R}^{n_i \times r}$. We provide a linear rate of convergence of the excess loss which depends on $\sigma_{\max} / \sigma_{r}$, where $\sigma_{r}$ is the $r^{\mathrm{th}}$ singular value of the concatenation $\mathbf{S}$ of the matrices $(\mathbf{S}^i)^N_{i=1}$. This result improves the rates of convergence given in the literature, which depend on $\sigma_{\max}^2 / \sigma_{\min}^2$. We provide an upper bound on the Frobenius-norm error of reconstruction under the power initialization strategy. We complete our analysis with experiments on both synthetic and real data.
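The structure of this objective can be illustrated with a small NumPy sketch, written under stated assumptions rather than as the paper's algorithm. It assumes the shared factor $\mathbf{V}$ is initialized by summing client-side random sketches $(\mathbf{S}^i)^\top \Phi^i$ with Gaussian $\Phi^i$ (one communication round), after which $\mathbf{V}$ is kept fixed and each client's problem in $\mathbf{U}^i$ becomes an ordinary least-squares problem. For brevity the sketch solves that local problem in closed form with `numpy.linalg.lstsq` instead of the Nesterov gradient descent the abstract describes. The names `power_init` and `local_factor` are hypothetical.

```python
import numpy as np


def power_init(client_data, rank, rng):
    """Assumed initialization: V = sum_i S_i^T Phi_i with Gaussian Phi_i.

    Each client computes its own d x r sketch locally; summing the sketches
    is the single communication step in this illustration.
    """
    sketches = [S.T @ rng.standard_normal((S.shape[0], rank)) for S in client_data]
    return sum(sketches)


def local_factor(S, V):
    """Solve min_U 0.5 * ||S - U V^T||_F^2 for one client with V fixed.

    Equivalent to the least-squares system V U^T = S^T, solved locally
    with no further communication.
    """
    U_transposed, *_ = np.linalg.lstsq(V, S.T, rcond=None)
    return U_transposed.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data of exact rank 3, split row-wise across three clients.
    S = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 8))
    clients = [S[:10], S[10:18], S[18:]]
    V = power_init(clients, rank=3, rng=rng)
    loss = 0.5 * sum(np.linalg.norm(Si - local_factor(Si, V) @ V.T) ** 2 for Si in clients)
    print(f"excess loss on exactly low-rank data: {loss:.2e}")
```

Because $\mathbf{V}$ is fixed after initialization, the clients never exchange their local factors $\mathbf{U}^i$, which is what makes the per-client step parallel.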

In-depth Analysis of Low-rank Matrix Factorisation in a Federated Setting

Transformer-based models have recently achieved outstanding performance in image matting. However, their application to high-resolution images remains challenging due to the quadratic complexity of global self-attention. To address this issue, we propose MEMatte, a memory-efficient matting framework for processing high-resolution images. MEMatte incorporates a router before each global attention block, directing informative tokens to the global attention while routing other tokens to a Lightweight Token Refinement Module (LTRM). Specifically, the router employs a local-global strategy to predict the routing probability of each token, and the LTRM utilizes efficient modules to simulate global attention. Additionally, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which allows each router to dynamically route tokens based on image content and the stages of attention block in the network. Furthermore, we construct an ultra high-resolution image matting dataset, UHR-395, comprising 35,500 training images and 1,000 test images, with an average resolution of $4872\times6017$. This dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotation. Extensive experiments demonstrate that MEMatte outperforms existing methods on both high-resolution and real-world datasets, significantly reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark.
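The routing idea can be sketched in a few lines of NumPy. This is a simplified illustration, not MEMatte's router or its BATR mechanism: it assumes the routing probabilities are already given, uses a fixed top-k rule (the hypothetical `keep_ratio` parameter) in place of adaptive routing, and stands in an identity mapping for the lightweight refinement module. The point it shows is that the quadratic attention matrix is built only over the selected tokens.

```python
import numpy as np


def softmax_attention(x):
    """Plain single-head self-attention over the rows of x (no learned weights)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x


def routed_block(tokens, router_probs, keep_ratio=0.25):
    """Send the highest-probability tokens to global attention, the rest elsewhere.

    tokens: (n, d) float array; router_probs: (n,) routing probabilities.
    Returns the block output and the indices routed to global attention.
    """
    n = tokens.shape[0]
    k = max(1, int(round(keep_ratio * n)))
    order = np.argsort(-router_probs, kind="stable")
    global_idx, light_idx = order[:k], order[k:]
    out = np.empty_like(tokens)
    # Attention cost is O(k^2) here instead of O(n^2) over all tokens.
    out[global_idx] = softmax_attention(tokens[global_idx])
    # Placeholder for a lightweight refinement module: identity mapping.
    out[light_idx] = tokens[light_idx]
    return out, global_idx
```

With `keep_ratio=0.25`, the attention matrix has one sixteenth as many entries as full global attention over the same tokens, which is the kind of saving the routing is meant to deliver.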
