AI Inference VM
This document describes the design and operation of the primary AI inference virtual machine used in the lab.
The system is optimized for low-latency inference, predictable throughput, and operational stability rather than experimentation at scale or model training.
Design principles: see Lab Philosophy.
Role & Scope
This virtual machine provides a unified inference endpoint for:
- Local and LAN-based API consumers
- Interactive experimentation
- Coding assistance (editor integrations)
- Custom automation and chat bots
Consumers include:
- OpenWebUI
- VS Code extensions (e.g. Cline)
- Custom Discord bots
- Direct OpenAI-compatible API clients
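As a concrete example, a direct client can use the standard OpenAI Python SDK pointed at the local endpoint. The address, API key, and model name below are illustrative placeholders, not the actual deployment values:

```python
# Minimal sketch of a direct OpenAI-compatible client.
# Endpoint, key, and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://inference.lab.local:8000/v1",  # assumed router address
    api_key="not-used-locally",                     # vLLM accepts any key unless one is configured
)

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",                    # assumed name for the low-latency model
    messages=[{"role": "user", "content": "Summarize the lab inference setup in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```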
The VM intentionally focuses on inference only.
Model training, multi-tenant scheduling, and high-availability orchestration
are explicitly out of scope.
Host Context
The VM runs on the primary compute host:
- Host: thor
- Platform: AMD EPYC 7402P (24 cores / 48 threads)
- Hypervisor: Proxmox VE
- NUMA: Enabled
- IOMMU / SR-IOV: Enabled
The host is tuned for latency-sensitive workloads and dedicates the majority of its resources to this VM.
Virtual Machine Configuration
- vCPUs: 48 (1:1 with host threads)
- Sockets: 1
- NUMA topology: Explicit 1:1 host ↔ guest mapping
- Memory: 128 GB dedicated
- Hugepages: Enabled (any page size)
- CPU model: host
- Machine type: q35
- Network: VirtIO, VLAN 110 (trusted)
- Storage: High-performance SSD-backed pool
Each NUMA node is bound to 16 GB of memory, ensuring locality between CPU, memory, and PCIe devices.
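A quick way to confirm the guest actually sees the intended NUMA layout is to read the kernel's sysfs view from inside the VM. This is an illustrative check, not part of the deployment tooling:

```python
# Illustrative guest-side check of NUMA topology (Linux sysfs).
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpulist = (node / "cpulist").read_text().strip()
    meminfo = (node / "meminfo").read_text()
    # The "MemTotal" line reports this node's memory in kB.
    total_kb = next(int(line.split()[-2]) for line in meminfo.splitlines() if "MemTotal" in line)
    print(f"{node.name}: CPUs {cpulist}, memory {total_kb / 1024 / 1024:.1f} GiB")
```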
GPU Configuration
Two GPUs are passed through directly to the VM:
- Model: NVIDIA GeForce RTX 3090 (Founders Edition)
- Quantity: 2
- Passthrough: Full PCIe passthrough (GPU + audio)
- NVLink: Present with physical bridge
- Power limit: 350 W enforced per card
- Persistence mode: Enabled
No mediated devices or sharing mechanisms are used. Each GPU is treated as a dedicated compute resource.
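The power-limit and persistence settings above are typically applied with standard `nvidia-smi` flags at boot. The sketch below is an assumption about how that step might look, wrapped in Python for consistency with the rest of the tooling, not the exact production script:

```python
# Illustrative sketch: enforce per-GPU power limit and persistence mode at boot.
# Uses standard nvidia-smi flags; would need to run as root (e.g. from a systemd oneshot unit).
import subprocess

GPU_INDICES = [0, 1]
POWER_LIMIT_W = 350

for idx in GPU_INDICES:
    # -pm 1 enables persistence mode; -pl sets the board power limit in watts.
    subprocess.run(["nvidia-smi", "-i", str(idx), "-pm", "1"], check=True)
    subprocess.run(["nvidia-smi", "-i", str(idx), "-pl", str(POWER_LIMIT_W)], check=True)
```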
Guest OS & Base Stack
- OS: Ubuntu 24.04.3 LTS
- Kernel: 6.8.x (PREEMPT_DYNAMIC)
- NVIDIA driver: 580.105.08
- CUDA: 13.0
- Container runtime: Docker 29.1.x
All inference services are containerized. The host OS runs no AI workloads directly.
Inference Architecture
Inference is provided by multiple vLLM instances exposed through a single,
unified API endpoint.
Serving Model
- Engine: vLLM
- API: OpenAI-compatible
- Deployment: One model per GPU by default
Typical steady-state configuration:
| GPU | Model | Role |
|---|---|---|
| GPU 0 | Qwen 2.5 7B (FP16) | Low-latency / fast responses |
| GPU 1 | Qwen 2.5 14B (AWQ) | Deeper reasoning |
A custom Python router sits in front of the inference engines, presenting a single API endpoint to clients and handling model selection and request routing.
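The router itself is custom and not reproduced here; the following is a minimal sketch of the pattern, assuming two vLLM backends and a simple model-name lookup. The model names, ports, and the FastAPI/httpx stack are illustrative, and streaming, authentication, and `/v1/models` aggregation are omitted for brevity:

```python
# Minimal sketch of an OpenAI-compatible routing proxy (illustrative, not the production router).
import httpx
from fastapi import FastAPI, HTTPException, Request

# Assumed mapping of model names to backend vLLM instances (one per GPU).
BACKENDS = {
    "qwen2.5-7b-instruct": "http://127.0.0.1:8001",  # GPU 0, low latency
    "qwen2.5-14b-awq": "http://127.0.0.1:8002",      # GPU 1, deeper reasoning
}

app = FastAPI()


@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    payload = await request.json()
    backend = BACKENDS.get(payload.get("model", ""))
    if backend is None:
        raise HTTPException(status_code=404, detail="unknown model")
    # Forward the request unchanged and return the backend's response body.
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.post(f"{backend}/v1/chat/completions", json=payload)
    return upstream.json()
```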
Alternative Deep Model Mode
An optional configuration allows both GPUs to be combined:
- Model: Qwen 2.5 32B (AWQ)
- Tensor parallelism: 2
- Deployment: One model spanning both GPUs
This mode is used selectively when deeper reasoning is required and the lower-latency models are intentionally taken offline.
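In this mode the deployment passes a tensor-parallel size of 2 to vLLM. The offline-style sketch below only illustrates the relevant engine options; the model identifier and limits are assumptions, and the production deployment uses the OpenAI-compatible server rather than the offline `LLM` class:

```python
# Illustrative: load one AWQ-quantized 32B model across both GPUs with tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # assumed model identifier
    quantization="awq",
    tensor_parallel_size=2,                 # one shard per RTX 3090
    gpu_memory_utilization=0.85,            # stability-oriented cap, see the tuning notes
    max_model_len=16384,                    # assumed context limit
)

outputs = llm.generate(
    ["Explain NUMA locality in one paragraph."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```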
Performance & Tuning Notes
Key tuning decisions:
- GPU memory utilization capped at 85–90% for stability
- Prefix caching and chunked prefill enabled
- Custom all-reduce disabled to avoid edge-case instability
- Conservative batching for larger models
- Explicit limits on concurrent sequences
The system favors consistent response times over maximum theoretical tokens-per-second.
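For reference, the tuning decisions above map onto standard vLLM server flags. The sketch below shows how one backend might be launched with them; the model, port, and exact values are examples, not the production settings:

```python
# Illustrative launch of a single vLLM backend reflecting the tuning choices above.
import subprocess

cmd = [
    "vllm", "serve", "Qwen/Qwen2.5-14B-Instruct-AWQ",
    "--port", "8002",
    "--gpu-memory-utilization", "0.85",   # stability headroom instead of maximum packing
    "--enable-prefix-caching",            # reuse KV cache for shared prompt prefixes
    "--enable-chunked-prefill",           # keep long prefills from stalling decode batches
    "--disable-custom-all-reduce",        # avoid edge-case instability
    "--max-num-seqs", "64",               # explicit cap on concurrent sequences
]
subprocess.run(cmd, check=True)
```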
Operational Characteristics
- Update cadence: As needed, typically after validation
- Failure recovery: Restart container → restart VM if needed
- Common failure modes: CUDA memory fragmentation, container-level faults
- Host reboots: Rare and intentional
The VM is designed to fail in obvious, recoverable ways.
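The container-first recovery path can be reduced to a small watchdog: probe the endpoint and restart the serving container on failure, escalating to a VM restart only if that does not help. A hedged sketch follows; the container name, endpoint, and threshold are assumptions:

```python
# Illustrative watchdog for the "restart container first" recovery path.
# Container name and endpoint are assumptions, not the production values.
import docker
import httpx

ENDPOINT = "http://127.0.0.1:8001/v1/models"
CONTAINER = "vllm-fast"


def healthy() -> bool:
    try:
        return httpx.get(ENDPOINT, timeout=10).status_code == 200
    except httpx.HTTPError:
        return False


if not healthy():
    # First line of recovery: restart the serving container.
    docker.from_env().containers.get(CONTAINER).restart()
    # If the endpoint stays unhealthy afterwards, a VM restart is the next (manual) step.
```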
Design Constraints & Rationale
- VM over bare metal: Clear isolation, predictable recovery, easier lifecycle management
- No Kubernetes: Unnecessary complexity for a single-node, GPU-bound workload
- No HA: Latency and simplicity prioritized over redundancy
This system is intentionally opinionated and optimized for a known workload profile rather than general-purpose AI experimentation.
Related Documentation
- AMD EPYC 7402P Proxmox Host — Host system providing GPU passthrough and NUMA topology
Applicability
This document describes the current primary AI inference system.
Future AI systems are expected to follow the same architectural principles unless explicitly documented otherwise.