Unified Gateway Architecture for Multi-Tenant Large Language Model Serving
DOI: https://doi.org/10.63278/jicrcr.vi.3432

Abstract
Enterprise adoption of large language models has revealed critical inefficiencies in current serving architectures, particularly for organizations deploying heterogeneous model fleets across multiple tenants. Existing solutions fragment prompt routing, key-value (KV) cache management, and safety enforcement across disparate components, resulting in elevated latency, redundant memory consumption, and inconsistent policy compliance. Gateway-Centric LLM Serving introduces a unified control plane that consolidates these functions into a dedicated gateway layer positioned between clients and model endpoints. The architecture enables dynamic model selection based on cost, latency, and domain constraints while exposing KV-caches as network-addressable resources for cross-session reuse. Centralized safety filters enforce organization-wide compliance policies, including redaction and jailbreak prevention, at the serving boundary. The routing decision is formalized as a multi-objective optimization with O(|M| log |M|) complexity, while cache operations achieve O(1) exact matching and O(log n) similarity search. Safety filtering maintains O(n × m) complexity with concurrent execution across pipeline stages. Evaluation on multi-tenant workloads with 10,000 requests across 3 tenants accessing 5 heterogeneous models demonstrates substantial improvements: P95 latency reduced by 51% (423 ms vs. 856 ms), P99 latency reduced by 62% (891 ms vs. 2,340 ms), cross-tenant cache reuse yielding 42% memory savings at a 58% hit rate, and a 73% reduction in policy violations compared to distributed enforcement. Cost analysis reveals a 34% TCO reduction and a 5.6-month return on investment for deployments exceeding 10M requests monthly. This architecture bridges distributed database gateway patterns with modern AI infrastructure, providing a blueprint for scalable, cost-efficient, and compliant LLM deployments.
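To make the routing formalization above concrete, the following minimal Python sketch shows one way the gateway's multi-objective model selection could look. It is illustrative only and not taken from the paper: the names ModelEndpoint, route_request, and the weight parameters are hypothetical, and the objective is assumed to be a simple weighted scalarization of normalized cost and latency under a hard domain constraint. The sort over the |M| candidate models dominates the running time, which is where the stated O(|M| log |M|) bound comes from.

```python
# Hypothetical sketch of the gateway's multi-objective routing decision.
# All names and the scoring scheme are assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass
class ModelEndpoint:
    name: str
    cost_per_1k_tokens: float   # assumed normalized to [0, 1] for scoring
    p95_latency_ms: float       # assumed normalized to [0, 1] for scoring
    domains: frozenset          # domains this model is approved to serve

def route_request(models, domain, w_cost=0.5, w_latency=0.5):
    """Pick the endpoint minimizing a weighted cost/latency objective,
    subject to the hard domain constraint. Sorting the |M| eligible
    candidates dominates, giving O(|M| log |M|) per routing decision."""
    eligible = [m for m in models if domain in m.domains]
    if not eligible:
        raise LookupError(f"no model approved for domain {domain!r}")
    # Scalarize the two objectives into a single score; lower is better.
    eligible.sort(key=lambda m: w_cost * m.cost_per_1k_tokens
                              + w_latency * m.p95_latency_ms)
    return eligible[0]

# Example usage with a toy two-model fleet (values are made up):
fleet = [
    ModelEndpoint("general-small", 0.2, 0.3, frozenset({"general"})),
    ModelEndpoint("domain-large",  0.6, 0.5, frozenset({"medical", "general"})),
]
print(route_request(fleet, domain="medical").name)  # -> domain-large
```

In a real deployment the weights would presumably be set per tenant or per request class, and the domain check generalizes to any hard constraint (compliance tier, data residency) that filters candidates before scoring.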




