We introduce AraSafe, the first native Arabic large-scale safety benchmark for large language models (LLMs), addressing the pressing need for culturally and linguistically representative evaluation resources. The dataset consists of 12K naturally occurring, human-written Arabic prompts spanning diverse domains, such as linguistics, social studies, sciences, and safety. Each prompt was independently annotated by two expert annotators into one of nine fine-grained safety categories, including 'Illegal Activities', 'Violence or Harm', 'Privacy Violation', and 'Hate Speech'. To enrich the representation of harmful content, we augmented the dataset with 12K synthetic harmful prompts generated using GPT-4o via carefully designed prompt engineering techniques. We benchmark a number of Arabic-centric and multilingual models in the 7–13B parameter range, including Jais, AceGPT, Allam, Fanar, Llama-3, Gemma-2, and Qwen3, as well as BERT-based fine-tuned models. GPT-4o was used as an upper-bound reference baseline. Our evaluation reveals critical safety blind spots in Arabic LLMs and underscores the necessity of localized, culturally grounded benchmarks for building responsible AI systems.
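As a rough illustration of the BERT-based fine-tuned baselines mentioned in the abstract, the sketch below fine-tunes a BERT checkpoint on a nine-way safety classification task with Hugging Face Transformers. This is a minimal sketch, not the authors' pipeline: the checkpoint name, file names, column names, and hyperparameters are assumptions for illustration, and AraSafe's actual data format may differ.

```python
# Minimal sketch (assumed setup, not the AraSafe authors' code): fine-tune a
# BERT-based classifier on nine safety categories using Hugging Face Transformers.
# Assumes CSV files with a "prompt" text column and an integer "label" column (0-8).
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

MODEL_NAME = "aubmindlab/bert-base-arabertv2"  # illustrative Arabic BERT; any BERT variant works
NUM_LABELS = 9  # the nine fine-grained safety categories described in the abstract

# Hypothetical file names; replace with the actual dataset splits.
dataset = load_dataset(
    "csv",
    data_files={"train": "arasafe_train.csv", "validation": "arasafe_dev.csv"},
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Truncate long prompts to a fixed maximum sequence length.
    return tokenizer(batch["prompt"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS
)

args = TrainingArguments(
    output_dir="arasafe-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
print(trainer.evaluate())
```

A classifier like this could serve as a lightweight safety filter alongside the larger 7–13B LLMs evaluated in the paper; the abstract does not specify which BERT variant or training configuration the authors used.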