We propose Paired by the Teacher (PbT), a two-stage teacher–student pipeline that synthesizes accurate input–output pairs without any human labeling or existing parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners have only raw outputs, such as recaps, highlights, or questions, or only raw inputs, such as dialogues, articles, or paragraphs, but rarely both sides of the parallel data unless they invest in human labeling. This mismatch forces small models either to learn from very few examples or to rely on costly, broad-scope synthetic examples produced by large LLMs. In PbT, a teacher LLM first transforms each unpaired example into a concise intermediate representation (IR), and a student model learns to invert this transformation, reconstructing the original input from the IR. This lets us pair each output with its generated input, yielding high-quality paired data. We evaluate PbT on five benchmarks—dialogue summarization (SAMSum, DialogSum), document summarization (XSum, CNNDM), and question generation (SQuAD)—as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on corpora generated by a 70B teacher and other unsupervised baselines, closing the gap to human-annotated pairs to within 2 ROUGE points. Human evaluation on SwitchBoard further confirms that only PbT meets the target summary lengths with concise, faithful outputs, while all baselines remain overly verbose.
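The two-stage pairing loop described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `teacher_to_ir` and `student_invert` are hypothetical stand-ins for the teacher LLM and the trained student model, implemented here as trivial string stubs so the data flow is concrete.

```python
# Hypothetical sketch of the PbT pairing loop: an unpaired output is
# compressed to an intermediate representation (IR) by the teacher, and
# the student inverts the IR into a synthetic input, producing a
# (synthetic_input, output) training pair. Both models are stubbed.

def teacher_to_ir(output_text: str) -> str:
    # Stand-in for the teacher LLM: compress the unpaired output
    # into a concise IR (stub: keep only the first sentence).
    return output_text.split(".")[0].strip()

def student_invert(ir: str) -> str:
    # Stand-in for the student model: reconstruct a plausible input
    # from the IR (stub: wrap the IR as a marked synthetic source).
    return f"<synthetic input for: {ir}>"

def pair_by_the_teacher(unpaired_outputs):
    """Turn output-only data into (synthetic_input, output) pairs."""
    pairs = []
    for output in unpaired_outputs:
        ir = teacher_to_ir(output)          # stage 1: teacher -> IR
        synthetic_input = student_invert(ir)  # stage 2: student inverts IR
        pairs.append((synthetic_input, output))
    return pairs

if __name__ == "__main__":
    outputs = ["Alice and Bob agree to meet at noon. They pick a cafe."]
    for src, tgt in pair_by_the_teacher(outputs):
        print(src, "->", tgt)
```

The resulting pairs can then be used to fine-tune a small generation model, which is how the 8B student in the abstract is trained.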