Test-time scaling enhances the reasoning of large language models (LLMs) such as DeepSeek-R1 and OpenAI's o1 by extending inference-time chain-of-thought traces. However, the legal reasoning capabilities of these models remain underexplored. We conduct the first systematic evaluation of 10 LLMs --- including both reasoning and general-purpose models --- across 17 Chinese and English legal benchmarks spanning statutory and case-law traditions. To bridge the domain gap, we curate a chain-of-thought-annotated legal corpus and train Legal-R1-14B, an open-source legal specialist model. Legal-R1-14B outperforms both o1-preview and DeepSeek-R1 on several benchmarks, establishing a new baseline for legal reasoning. Error analysis reveals persistent challenges --- outdated legal knowledge, reasoning failures, and factual hallucinations --- highlighting key directions for future work on legal-domain LLMs.