Automated Essay Scoring (AES) and Automatic Essay Feedback (AEF) systems aim to reduce the workload of human raters in educational assessment. However, most existing systems prioritize numeric scoring accuracy over feedback quality and are evaluated primarily on school-level writing. This paper presents the Multi-Agent Argumentation and Grammar Integrated Critiquer (MAGIC), a framework that uses five specialized agents to evaluate prompt adherence, persuasiveness, organization, vocabulary, and grammar for both holistic scoring and detailed feedback generation. To support evaluation at the college level, we collated a dataset of Graduate Record Examination (GRE) practice essays with expert-evaluated scores and feedback. MAGIC achieves substantial to near-perfect scoring agreement with human raters on the GRE data, outperforming baseline LLMs while providing enhanced interpretability through its multi-agent design. To evaluate feedback quality, we employ human annotators using a structured rubric and report inter-annotator agreement.
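
To make the multi-agent structure concrete, the sketch below shows one way such a framework could be wired: one agent per trait, each returning a trait score and a critique, with trait scores combined into a holistic score. This is a minimal illustration only; the agent prompts, the 0-6 score scale, and the simple-mean aggregation are assumptions for exposition, not MAGIC's published implementation.

```python
# Illustrative sketch of a trait-agent pipeline; names, scale, and the
# aggregation rule are assumptions, not MAGIC's actual implementation.
from dataclasses import dataclass

TRAITS = ["prompt adherence", "persuasiveness", "organization",
          "vocabulary", "grammar"]

@dataclass
class AgentReview:
    trait: str
    score: float    # assumed 0-6 holistic-style scale
    feedback: str   # natural-language critique from the trait agent

def run_trait_agent(trait: str, essay: str, prompt: str) -> AgentReview:
    """Placeholder for an LLM call with a trait-specific rubric prompt.
    Returns a dummy review here so the sketch runs end to end."""
    return AgentReview(trait=trait, score=4.0,
                       feedback=f"({trait} critique would go here)")

def critique(essay: str, prompt: str) -> tuple[float, list[AgentReview]]:
    """Run one specialized agent per trait, then combine trait scores
    into a holistic score (a simple mean here; the real system may
    weight or reconcile traits differently)."""
    reviews = [run_trait_agent(t, essay, prompt) for t in TRAITS]
    holistic = sum(r.score for r in reviews) / len(reviews)
    return holistic, reviews
```

Keeping each trait in its own agent is what gives the interpretability claimed above: the holistic score can be traced back to per-trait scores and critiques rather than a single opaque judgment.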

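The "substantial to near-perfect" phrasing follows the conventional interpretation bands for kappa-style agreement statistics. The abstract does not name the statistic; quadratic weighted kappa (QWK) is the standard choice in AES work, so the sketch below computes it with scikit-learn under that assumption, on made-up scores.

```python
# Sketch of rater-system agreement via quadratic weighted kappa (QWK).
# Whether MAGIC reports QWK specifically is an assumption; scores are
# illustrative, not from the paper.
from sklearn.metrics import cohen_kappa_score

human_scores  = [4, 5, 3, 6, 4, 2, 5]   # expert-assigned holistic scores
system_scores = [4, 5, 4, 6, 3, 2, 5]   # system-assigned holistic scores

qwk = cohen_kappa_score(human_scores, system_scores, weights="quadratic")
# Landis & Koch bands: 0.61-0.80 "substantial", 0.81-1.00 "almost perfect"
print(f"QWK = {qwk:.3f}")
```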