AI Inference VM
This document describes the design and operation of the primary AI inference virtual machine used in the lab.
The system is optimized for low-latency inference, predictable throughput, and operational stability rather than experimentation at scale or model training.
Design principles: see Lab Philosophy.
Role & Scope
This virtual machine provides a unified inference endpoint for:
- Local and LAN-based API consumers
- Interactive experimentation
- Coding assistance (editor integrations)
- Custom automation and chat bots
Consumers include:
- OpenWebUI
- VS Code extensions (e.g. Cline)
- Custom Discord bots
- Direct OpenAI-compatible API clients
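As a concrete example, a direct client can use the standard OpenAI Python SDK pointed at the local endpoint. The address, API key, and model name below are illustrative placeholders, not the actual deployment values:

```python
# Minimal sketch of a direct OpenAI-compatible client.
# Endpoint, key, and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://inference.lab.local:8000/v1",  # assumed router address
    api_key="not-used-locally",                     # vLLM accepts any key unless one is configured
)

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",                    # assumed name for the low-latency model
    messages=[{"role": "user", "content": "Summarize the lab inference setup in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```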
The VM intentionally focuses on inference only.
Model training, multi-tenant scheduling, and high-availability orchestration
are explicitly out of scope.
Host Context
The VM runs on the primary compute host:
- Host: thor
- Platform: AMD EPYC 7402P (24 cores / 48 threads)
- Hypervisor: Proxmox VE
- NUMA: Enabled
- IOMMU / SR-IOV: Enabled
The host is tuned for latency-sensitive workloads and dedicates the majority of its resources to this VM.
Virtual Machine Configuration
- vCPUs: 48 (1:1 with host threads)
- Sockets: 1
- NUMA topology: Explicit 1:1 host ↔ guest mapping
- Memory: 128 GB dedicated
- Hugepages: Enabled (any page size)
- CPU model: host
- Machine type: q35
- Network: VirtIO, VLAN 110 (trusted)
- Storage: High-performance SSD-backed pool
Each NUMA node is bound to 16 GB of memory, ensuring locality between CPU, memory, and PCIe devices.
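A quick way to confirm the guest actually sees the intended NUMA layout is to read the kernel's sysfs view from inside the VM. This is an illustrative check, not part of the deployment tooling:

```python
# Illustrative guest-side check of NUMA topology (Linux sysfs).
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpulist = (node / "cpulist").read_text().strip()
    meminfo = (node / "meminfo").read_text()
    # The "MemTotal" line reports this node's memory in kB.
    total_kb = next(int(line.split()[-2]) for line in meminfo.splitlines() if "MemTotal" in line)
    print(f"{node.name}: CPUs {cpulist}, memory {total_kb / 1024 / 1024:.1f} GiB")
```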
GPU Configuration
Two GPUs are passed through directly to the VM:
- Model: NVIDIA GeForce RTX 3090 (Founders Edition)
- Quantity: 2
- Passthrough: Full PCIe passthrough (GPU + audio)
- NVLink: Present with physical bridge
- Power limit: 350 W enforced per card
- Persistence mode: Enabled
No mediated devices or sharing mechanisms are used. Each GPU is treated as a dedicated compute resource.
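The power-limit and persistence settings above are typically applied with standard `nvidia-smi` flags at boot. The sketch below is an assumption about how that step might look, wrapped in Python for consistency with the rest of the tooling, not the exact production script:

```python
# Illustrative sketch: enforce per-GPU power limit and persistence mode at boot.
# Uses standard nvidia-smi flags; would need to run as root (e.g. from a systemd oneshot unit).
import subprocess

GPU_INDICES = [0, 1]
POWER_LIMIT_W = 350

for idx in GPU_INDICES:
    # -pm 1 enables persistence mode; -pl sets the board power limit in watts.
    subprocess.run(["nvidia-smi", "-i", str(idx), "-pm", "1"], check=True)
    subprocess.run(["nvidia-smi", "-i", str(idx), "-pl", str(POWER_LIMIT_W)], check=True)
```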
Guest OS & Base Stack
- OS: Ubuntu 24.04.3 LTS
- Kernel: 6.8.x (PREEMPT_DYNAMIC)
- NVIDIA driver: 580.105.08
- CUDA: 13.0
- Container runtime: Docker 29.1.x
All inference services are containerized. The host OS runs no AI workloads directly.
Inference Architecture
Inference is provided by multiple vLLM instances exposed through a single,
unified API endpoint.
Serving Model
- Engine: vLLM
- API: OpenAI-compatible
- Deployment: One model per GPU by default
Typical steady-state configuration:
| GPU | Model | Role |
|---|---|---|
| GPU 0 | Qwen 2.5 7B (FP16) | Low-latency / fast responses |
| GPU 1 | Qwen 2.5 14B (AWQ) | Deeper reasoning |
A custom Python router sits in front of the inference engines, presenting a single API endpoint to clients and handling model selection and request routing.
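The router itself is custom and not reproduced here; the following is a minimal sketch of the pattern, assuming two vLLM backends and a simple model-name lookup. The model names, ports, and the FastAPI/httpx stack are illustrative, and streaming, authentication, and `/v1/models` aggregation are omitted for brevity:

```python
# Minimal sketch of an OpenAI-compatible routing proxy (illustrative, not the production router).
import httpx
from fastapi import FastAPI, HTTPException, Request

# Assumed mapping of model names to backend vLLM instances (one per GPU).
BACKENDS = {
    "qwen2.5-7b-instruct": "http://127.0.0.1:8001",  # GPU 0, low latency
    "qwen2.5-14b-awq": "http://127.0.0.1:8002",      # GPU 1, deeper reasoning
}

app = FastAPI()


@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    payload = await request.json()
    backend = BACKENDS.get(payload.get("model", ""))
    if backend is None:
        raise HTTPException(status_code=404, detail="unknown model")
    # Forward the request unchanged and return the backend's response body.
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.post(f"{backend}/v1/chat/completions", json=payload)
    return upstream.json()
```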
Alternative Deep Model Mode
An optional configuration allows both GPUs to be combined:
- Model: Qwen 2.5 32B (AWQ)
- Tensor parallelism: 2
- Deployment: One model spanning both GPUs
This mode is used selectively when deeper reasoning is required and the lower-latency models are intentionally taken offline.
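In this mode the deployment passes a tensor-parallel size of 2 to vLLM. The offline-style sketch below only illustrates the relevant engine options; the model identifier and limits are assumptions, and the production deployment uses the OpenAI-compatible server rather than the offline `LLM` class:

```python
# Illustrative: load one AWQ-quantized 32B model across both GPUs with tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # assumed model identifier
    quantization="awq",
    tensor_parallel_size=2,                 # one shard per RTX 3090
    gpu_memory_utilization=0.85,            # stability-oriented cap, see the tuning notes
    max_model_len=16384,                    # assumed context limit
)

outputs = llm.generate(
    ["Explain NUMA locality in one paragraph."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```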
Performance & Tuning Notes
Key tuning decisions:
- GPU memory utilization capped at 85–90% for stability
- Prefix caching and chunked prefill enabled
- Custom all-reduce disabled to avoid edge-case instability
- Conservative batching for larger models
- Explicit limits on concurrent sequences
The system favors consistent response times over maximum theoretical tokens-per-second.
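For reference, the tuning decisions above map onto standard vLLM server flags. The sketch below shows how one backend might be launched with them; the model, port, and exact values are examples, not the production settings:

```python
# Illustrative launch of a single vLLM backend reflecting the tuning choices above.
import subprocess

cmd = [
    "vllm", "serve", "Qwen/Qwen2.5-14B-Instruct-AWQ",
    "--port", "8002",
    "--gpu-memory-utilization", "0.85",   # stability headroom instead of maximum packing
    "--enable-prefix-caching",            # reuse KV cache for shared prompt prefixes
    "--enable-chunked-prefill",           # keep long prefills from stalling decode batches
    "--disable-custom-all-reduce",        # avoid edge-case instability
    "--max-num-seqs", "64",               # explicit cap on concurrent sequences
]
subprocess.run(cmd, check=True)
```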
Operational Characteristics
- Update cadence: As needed, typically after validation
- Failure recovery: Restart container → restart VM if needed
- Common failure modes: CUDA memory fragmentation, container-level faults
- Host reboots: Rare and intentional
The VM is designed to fail in obvious, recoverable ways.
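The container-first recovery path can be reduced to a small watchdog: probe the endpoint and restart the serving container on failure, escalating to a VM restart only if that does not help. A hedged sketch follows; the container name, endpoint, and threshold are assumptions:

```python
# Illustrative watchdog for the "restart container first" recovery path.
# Container name and endpoint are assumptions, not the production values.
import docker
import httpx

ENDPOINT = "http://127.0.0.1:8001/v1/models"
CONTAINER = "vllm-fast"


def healthy() -> bool:
    try:
        return httpx.get(ENDPOINT, timeout=10).status_code == 200
    except httpx.HTTPError:
        return False


if not healthy():
    # First line of recovery: restart the serving container.
    docker.from_env().containers.get(CONTAINER).restart()
    # If the endpoint stays unhealthy afterwards, a VM restart is the next (manual) step.
```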
Design Constraints & Rationale
- VM over bare metal: Clear isolation, predictable recovery, easier lifecycle management
- No Kubernetes: Unnecessary complexity for a single-node, GPU-bound workload
- No HA: Latency and simplicity prioritized over redundancy
This system is intentionally opinionated and optimized for a known workload profile rather than general-purpose AI experimentation.
Related Documentation
- AMD EPYC 7402P Proxmox Host — Host system providing GPU passthrough and NUMA topology
Applicability
This document describes the current primary AI inference system.
Future AI systems are expected to follow the same architectural principles unless explicitly documented otherwise.