

Self-hosting large language models (LLMs) is increasingly appealing for organizations seeking privacy, cost control, and customization. Yet deploying and maintaining in-house models poses challenges in GPU utilization, workload routing, and reliability. We introduce Pick and Spin, a practical framework that makes self-hosted LLM orchestration scalable and economical. Built on Kubernetes, it integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module that balances cost, latency, and accuracy using both keyword heuristics and a lightweight DistilBERT classifier. We evaluate four models, Llama 3 (90B), Gemma 3 (27B), Qwen 3 (235B), and DeepSeek R1 (685B), across eight public benchmark datasets with five inference strategies and two routing variants, encompassing 3,200 prompts and 160,000 inference runs. Pick and Spin achieves up to 10% higher accuracy, 30% lower latency, and 33% lower GPU cost per query compared with static deployments. These results show that intelligent orchestration and efficient scaling enable enterprise-grade LLM performance on self-hosted infrastructure, bringing high-capacity AI within practical and affordable reach.
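
The abstract does not spell out how the hybrid routing module works, so the following is only a minimal illustrative sketch of the general idea: try keyword heuristics first, fall back to a classifier when no keyword matches, then choose the model whose cost, latency, and accuracy trade-off scores best. Everything named here is an assumption made for illustration, not the framework's actual design: the keyword lists, the `simple`/`complex` labels, the scoring weights, the model profiles, and the `classifier` callable (a stand-in for a DistilBERT-style classifier).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model figures; all values are relative and illustrative."""
    name: str
    cost: float      # relative GPU cost per query
    latency: float   # relative response latency
    accuracy: float  # expected accuracy in [0, 1]


# Hypothetical keyword lists; a real deployment would tune these.
KEYWORDS = {
    "complex": ("prove", "derive", "step by step"),
    "simple": ("translate", "summarize", "define"),
}


def keyword_route(prompt: str) -> Optional[str]:
    """Return a label if any keyword matches, otherwise None."""
    text = prompt.lower()
    for label, words in KEYWORDS.items():
        if any(word in text for word in words):
            return label
    return None


def hybrid_route(prompt: str, classifier: Callable[[str], str]) -> str:
    """Use the cheap keyword heuristic first, then fall back to the classifier."""
    label = keyword_route(prompt)
    return label if label is not None else classifier(prompt)


def pick_model(label: str, models: Sequence[ModelProfile]) -> ModelProfile:
    """Pick the model with the best accuracy-minus-cost-and-latency score.

    Assumption: prompts labelled "complex" weight accuracy three times as heavily.
    """
    w_accuracy = 3.0 if label == "complex" else 1.0

    def score(m: ModelProfile) -> float:
        return w_accuracy * m.accuracy - m.cost - m.latency

    return max(models, key=score)


# Hypothetical model profiles, not measurements from the paper.
MODELS = [
    ModelProfile("small-model", cost=0.1, latency=0.1, accuracy=0.6),
    ModelProfile("large-model", cost=0.5, latency=0.4, accuracy=0.9),
]
```

With these made-up profiles, a prompt containing "prove" is labelled `complex` and routed to `large-model`, while a prompt that the fallback classifier labels `simple` goes to `small-model`. Checking keywords before the classifier keeps the common case cheap, which is one plausible reason to combine the two signals.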
