We present PricingLogic, the first benchmark that probes whether Large Language Models (LLMs) can reliably automate tourism-booking price calculation when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task to AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations involving interacting discounts. Evaluations of a range of LLMs reveal a steep performance drop on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning. These results highlight that, despite their general capabilities, today’s LLMs remain unreliable for revenue-critical applications without further safeguards or domain adaptation.
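To make the two difficulty tiers concrete, the following is a minimal sketch of the kind of calculation such a booking request might require. It is not taken from PricingLogic: the customer types, rates, discount percentages, thresholds, the order in which discounts are applied, and the function names `basic_price` and `bundled_price` are all hypothetical and chosen only for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# All values below are hypothetical; none come from the PricingLogic policies.
CUSTOMER_RATES = {
    "adult": Decimal("1.00"),
    "child": Decimal("0.50"),
    "senior": Decimal("0.80"),
}
BUNDLE_DISCOUNT = Decimal("0.10")  # assumed: 10% off when two or more tours are booked
GROUP_THRESHOLD = 10               # assumed: group discount starts at 10 travellers
GROUP_DISCOUNT = Decimal("0.05")   # assumed: 5% off for groups


def basic_price(base_fare, travellers):
    """Tier (i): scale the base fare by each traveller's customer-type rate and sum."""
    return sum((base_fare * CUSTOMER_RATES[t] for t in travellers), Decimal("0"))


def bundled_price(base_fares, travellers):
    """Tier (ii): price several tours, then apply discounts in a fixed, assumed order."""
    subtotal = sum((basic_price(f, travellers) for f in base_fares), Decimal("0"))
    if len(base_fares) >= 2:
        subtotal *= 1 - BUNDLE_DISCOUNT   # bundle discount applied first
    if len(travellers) >= GROUP_THRESHOLD:
        subtotal *= 1 - GROUP_DISCOUNT    # group discount applied to the reduced subtotal
    return subtotal.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    party = ["adult", "adult", "child"]
    print(basic_price(Decimal("100"), party))
    print(bundled_price([Decimal("100"), Decimal("60")], party))
```

Even in this toy version, the harder tier requires deciding which discounts apply and in what order they compound, which is the kind of rule interpretation and arithmetic the abstract reports as failure points.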