EMNLP 2025

November 05, 2025

Suzhou, China

We assess whether AI systems can credibly evaluate investment risk appetite, a task that must be thoroughly validated before it is automated. We evaluate proprietary systems (GPT, Claude, Gemini) and open-weight models (LLaMA, DeepSeek, Mistral) on carefully curated user profiles that reflect real users and vary in attributes such as country and gender. The models exhibit significant variance in score distributions when user attributes that should not influence risk computation, such as country or gender, are changed. For example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles. While some models align closely with expected scores in the low- and mid-risk ranges, none maintain consistent scores across regions and demographics, thereby violating AI and finance regulations.
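
The consistency check described in the abstract can be illustrated with a minimal sketch: hold a profile's financial attributes fixed, vary only the attributes that should not matter, and measure how far the scores move. This is an assumption-laden illustration, not the authors' evaluation code. The profile fields, the placeholder country labels "A" and "B", and the `toy_score` function (a deliberately biased stand-in for a model call) are all hypothetical.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, List

Profile = Dict[str, object]


def counterfactual_profiles(base: Profile, protected: Dict[str, List[str]]) -> List[Profile]:
    """Return copies of `base` that differ only in the protected attributes."""
    keys = list(protected)
    variants = []
    for values in product(*(protected[k] for k in keys)):
        variant = dict(base)
        variant.update(zip(keys, values))
        variants.append(variant)
    return variants


def score_spread(base: Profile, protected: Dict[str, List[str]],
                 score_fn: Callable[[Profile], float]) -> float:
    """Max minus min score across counterfactual variants; 0 means fully consistent."""
    scores = [score_fn(p) for p in counterfactual_profiles(base, protected)]
    return max(scores) - min(scores)


def mean_score_by(attribute: str, bases: List[Profile], protected: Dict[str, List[str]],
                  score_fn: Callable[[Profile], float]) -> Dict[str, float]:
    """Average score per value of one protected attribute, over all variants."""
    buckets: Dict[str, List[float]] = {}
    for base in bases:
        for p in counterfactual_profiles(base, protected):
            buckets.setdefault(str(p[attribute]), []).append(score_fn(p))
    return {value: mean(scores) for value, scores in buckets.items()}


def toy_score(profile: Profile) -> float:
    # Hypothetical stand-in for a model call. It deliberately leaks the
    # country attribute so that the check has something to detect.
    score = 0.1 * float(profile["equity_share_pct"])
    if profile["country"] == "B":
        score += 1.0
    return score


# Hypothetical profile and attribute values, for illustration only.
base_profile = {"age": 40, "equity_share_pct": 50, "country": "A", "gender": "x"}
protected_values = {"country": ["A", "B"], "gender": ["x", "y"]}

spread = score_spread(base_profile, protected_values, toy_score)
by_country = mean_score_by("country", [base_profile], protected_values, toy_score)
by_gender = mean_score_by("gender", [base_profile], protected_values, toy_score)
```

In this sketch, a nonzero spread, or per-group means that differ for an attribute such as country, would flag the kind of inconsistency the abstract reports. In practice `toy_score` would be replaced by a call to the model under test, and the comparison would be run over many base profiles.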
