Currently, large language model (\textbf{L}LM)-based \textbf{O}pen-domain \textbf{N}atural language plannin\textbf{G} (LONG) still has considerable room for improvement. For example, generated plans are often not reusable, with incomplete intermediate states and missing steps, which hinders real-world applications. To remedy these flaws, this paper establishes a dataset and a baseline for LONG. The dataset, GOLD, is to date the largest collection of textual procedures paired with corresponding reusable formal planning domain definitions. The baseline, DIGGER, leverages entity-attribute-level action models, which reveal the relevant implicit physical properties (i.e., attributes) of salient entities in actions. DIGGER first extracts action models and builds typed entity lists from textual procedures. It then builds goal states for new tasks and instantiates grounded actions using domain prediction. Finally, an LLM generalizes the plans and translates them into textual procedures. Reference-based metrics, LLM-as-Judge, and human evaluation are used to evaluate LONG comprehensively. Experiments on GOLD show that DIGGER is stronger and more generalizable than recently proposed approaches and LLMs: it performs best on seen domains and applies to unseen domains without adaptation. Specifically, the best BLEU-1 score increases from 0.385 to 0.408 on seen domains and rises to 0.310 on unseen domains.
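For readers unfamiliar with the reference-based metric quoted above, the following is a minimal sketch of single-reference BLEU-1 (clipped unigram precision multiplied by a brevity penalty). It is an illustration only: the function name `bleu1`, the lowercasing, and the whitespace tokenization are assumptions made here, and the paper's actual evaluation setup (tokenizer, number of references, smoothing) is not specified in this abstract.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Single-reference BLEU-1 (hypothetical helper, not the paper's code)."""
    # Assumption: lowercase and split on whitespace; real setups may tokenize differently.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    cand_counts = Counter(cand)
    ref_counts = Counter(ref)

    # Clip each candidate token's count by how often it occurs in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = overlap / len(cand)

    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) >= len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))

    return bp * precision


if __name__ == "__main__":
    reference = "boil water then add pasta"
    print(bleu1("boil water then add pasta", reference))
    print(bleu1("boil water", reference))
```

A generated procedure that reproduces only part of the reference steps scores lower through the brevity penalty even when every word it contains is correct, which is why missing steps show up in this metric.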