
Obtaining high-quality labeled datasets for e-commerce product information extraction remains challenging and costly. We present a systematic framework for generating trustworthy synthetic product data using Large Language Models (LLMs). The framework introduces three controlled modification strategies with built-in governance mechanisms: attribute-preserving modification, controlled negative example generation, and systematic attribute removal. Our approach implements responsible generation through brand anonymization, multi-stage validation, and semantic consistency enforcement. Human evaluation of 2,000 synthetic products demonstrates high quality (99.6% natural language, 96.5% valid attributes, 94.2% consistency). Downstream evaluation shows that synthetic data performs on par with real data (60.5% vs 60.8% accuracy), with hybrid configurations reaching 68.8% accuracy while reducing annotation costs by up to three orders of magnitude. Quantitative metrics show maintained lexical diversity (type-token ratio, TTR: 0.83 vs 0.84) and semantic fidelity (0.86 cosine similarity). Our framework thus provides a cost-effective, scalable solution for responsible synthetic data generation in resource-constrained scenarios.
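
The two quantitative metrics named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how tokens are split or which embedding model produces the vectors, so the regex tokenizer and the plain list-of-floats vectors below are assumptions made for illustration only, and the function names are hypothetical rather than taken from the authors' code.

```python
import math
import re


def type_token_ratio(text: str) -> float:
    """Lexical diversity: unique tokens divided by total tokens.

    Assumption: tokens are lowercase alphanumeric runs; the tokenizer
    actually used in the work is not stated in the abstract.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Assumption: u and v are embeddings of an original and a synthetic
    product description, produced by some model not specified here.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)
```

For example, `type_token_ratio("red cotton shirt red")` counts three unique tokens out of four. Note that TTR depends on text length, so comparing real and synthetic corpora is only meaningful when the texts are of similar size.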
