keywords:
farsense
cultural alignment
farsi
llm evaluation
persian language
synthetic data
low-resource nlp
human-in-the-loop
commonsense reasoning
counterfactual reasoning
benchmark
fine-tuning
Although Farsi is widely spoken, no comprehensive benchmark exists for assessing commonsense reasoning in language models. We therefore present FarSense, a six-task benchmark for Farsi covering True/False Judgment, Multiple-Choice Questions, Explanation, Cause-Effect Inference, Counterfactual Reasoning, and Knowledge Completion. Starting from Farsi Wikipedia, we filtered out noise and retained ~4,210 passages, rewrote them into realistic daily scenarios, and derived the above tasks from each scenario. We first judged scenario and task generation quality via native-speaker annotations on outputs from five major LLMs: GPT-4o, Gemini-2.5-Flash, Mistral-Large, Qwen-Plus, and DeepSeek-Chat. Gemini-2.5-Flash performed best, so we used it to generate a large-scale dataset, which we then finalized through a meticulous two-step human validation process. Using FarSense, we measured the commonsense ability of the same five flagship LLMs and also fine-tuned six compact models (1B-24B parameters) before re-evaluating them. To ensure broad applicability, task wording was designed to minimize dialectal, cultural, or religious bias. Experiments show that targeted fine-tuning yields substantial gains, confirming FarSense as a reliable, openly licensed resource for advancing reproducible commonsense-reasoning research in Farsi NLP. We publicly release all code and data at https://github.com/KamyarZeinalipour/FarSense.
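To make the six-task structure concrete, the following is a minimal sketch of what a FarSense-style evaluation loop could look like. The item schema, field names, and exact-match scoring below are illustrative assumptions, not the released data format; consult the repository for the actual schema and evaluation code.

```python
# Hypothetical sketch of a FarSense-style evaluation loop.
# The Item schema and exact-match scoring are assumptions for
# illustration, not the benchmark's released format.

from dataclasses import dataclass

# The six task types described in the abstract.
TASKS = (
    "true_false",
    "multiple_choice",
    "explanation",
    "cause_effect",
    "counterfactual",
    "knowledge_completion",
)

@dataclass
class Item:
    task: str    # one of TASKS
    prompt: str  # scenario-derived question (Farsi in the real data)
    gold: str    # reference answer

def accuracy(items, predict):
    """Exact-match accuracy of `predict` over a list of Items."""
    correct = sum(predict(it) == it.gold for it in items)
    return correct / len(items)

# Toy usage with a trivial predictor (illustrative only).
toy = [Item("true_false", "Water boils at 100 C at sea level.", "True")]
print(accuracy(toy, lambda it: "True"))  # 1.0
```

In practice, `predict` would wrap an LLM call, and generative tasks such as Explanation would need a softer metric than exact match (e.g., human or LLM-based judgment, as the two-step validation in the paper suggests).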