AI Inference VM

This document describes the design and operation of the primary AI inference virtual machine used in the lab.

The system is optimized for low-latency inference, predictable throughput, and operational stability rather than experimentation at scale or model training.

Design principles: see Lab Philosophy.


Role & Scope

This virtual machine provides a unified inference endpoint for:

Consumers include:

The VM intentionally focuses on inference only.
Model training, multi-tenant scheduling, and high-availability orchestration are explicitly out of scope.


Host Context

The VM runs on the primary compute host:

The host is tuned for latency-sensitive workloads and dedicates the majority of its resources to this VM.


Virtual Machine Configuration

Each NUMA node is allocated 16 GB of memory, keeping CPU, memory, and PCIe devices local to one another.
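
As a sanity check, the NUMA placement of the passthrough GPUs can be read directly from sysfs on the host. The sketch below is illustrative only; the PCI addresses are hypothetical placeholders for the actual GPU addresses.

    # Illustrative NUMA-locality check for the passthrough GPUs.
    # The PCI addresses are hypothetical placeholders, not the real ones.
    from pathlib import Path

    GPU_PCI_ADDRESSES = ["0000:01:00.0", "0000:41:00.0"]  # placeholders

    for addr in GPU_PCI_ADDRESSES:
        numa_file = Path(f"/sys/bus/pci/devices/{addr}/numa_node")
        if numa_file.exists():
            print(f"{addr}: NUMA node {numa_file.read_text().strip()}")
        else:
            print(f"{addr}: device not found on this host")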


GPU Configuration

Two GPUs are passed through directly to the VM:

No mediated devices or sharing mechanisms are used. Each GPU is treated as a dedicated compute resource.
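
Inside the guest, both devices should therefore appear as ordinary, fully dedicated GPUs. A minimal verification sketch, assuming NVIDIA GPUs with the driver and the pynvml package installed in the guest:

    # Confirm that both passthrough GPUs are visible inside the guest.
    # Assumes NVIDIA GPUs and the pynvml package (nvidia-ml-py).
    import pynvml

    pynvml.nvmlInit()
    count = pynvml.nvmlDeviceGetCount()
    print(f"GPUs visible in guest: {count}")

    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml versions return bytes
            name = name.decode()
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: {name}, {mem.total // 2**20} MiB")

    pynvml.nvmlShutdown()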


Guest OS & Base Stack

All inference services are containerized. The host OS runs no AI workloads directly.


Inference Architecture

Inference is provided by multiple vLLM instances exposed through a single, unified API endpoint.

Serving Model

Typical steady-state configuration:

GPU      Model                  Role
GPU 0    Qwen 2.5 7B (FP16)     Low-latency / fast responses
GPU 1    Qwen 2.5 14B (AWQ)     Deeper reasoning
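
A hedged sketch of how such a layout can be brought up, with each vLLM instance started as an OpenAI-compatible server pinned to a single GPU via CUDA_VISIBLE_DEVICES. The model identifiers, ports, and flags are illustrative assumptions rather than the exact production values, and the vllm CLI entrypoint varies between vLLM versions.

    # Illustrative launcher: one vLLM OpenAI-compatible server per GPU.
    # Model names, ports, and flags are assumptions, not the production config.
    import os
    import subprocess

    INSTANCES = [
        # (GPU index, model, extra vLLM flags, port)
        ("0", "Qwen/Qwen2.5-7B-Instruct", ["--dtype", "float16"], 8001),
        ("1", "Qwen/Qwen2.5-14B-Instruct-AWQ", ["--quantization", "awq"], 8002),
    ]

    procs = []
    for gpu, model, extra, port in INSTANCES:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)  # pin to one GPU
        cmd = ["vllm", "serve", model, "--port", str(port), *extra]
        procs.append(subprocess.Popen(cmd, env=env))

    for p in procs:
        p.wait()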

A custom Python router sits in front of the inference engines and presents a single API endpoint to clients, abstracting model selection and routing.
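
The production router itself is custom and not reproduced here; the sketch below only illustrates the pattern, assuming FastAPI and httpx, a client-facing model name used as the routing key, and the backend ports from the launcher example above.

    # Sketch of the routing pattern only, not the actual production router.
    # A single OpenAI-compatible endpoint forwards each request to one of the
    # vLLM backends based on the requested model name.
    import httpx
    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    app = FastAPI()

    # Hypothetical mapping from client-facing model names to backend URLs.
    BACKENDS = {
        "fast": "http://127.0.0.1:8001",
        "deep": "http://127.0.0.1:8002",
    }

    @app.post("/v1/chat/completions")
    async def chat_completions(request: Request):
        payload = await request.json()
        backend = BACKENDS.get(payload.get("model"), BACKENDS["fast"])
        async with httpx.AsyncClient(timeout=120) as client:
            resp = await client.post(f"{backend}/v1/chat/completions", json=payload)
        return JSONResponse(content=resp.json(), status_code=resp.status_code)

In practice the router would also rewrite the model field to whatever name each backend actually serves; that detail is omitted here for brevity.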


Alternative Deep Model Mode

An optional configuration allows both GPUs to be combined:

This mode is used selectively when deeper reasoning is required; while it is active, the lower-latency models are intentionally taken offline.
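
The document does not spell out the mechanism for combining the GPUs; one common approach with vLLM is tensor parallelism across both devices. The sketch below is purely illustrative, and the model shown is a hypothetical choice rather than the documented deep-mode model.

    # Purely illustrative deep-model mode: shard one larger model across both
    # GPUs with vLLM tensor parallelism. The model name is hypothetical.
    import subprocess

    cmd = [
        "vllm", "serve", "Qwen/Qwen2.5-32B-Instruct-AWQ",  # hypothetical model
        "--tensor-parallel-size", "2",  # use both passthrough GPUs
        "--port", "8001",
    ]
    subprocess.run(cmd, check=True)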


Performance & Tuning Notes

Key tuning decisions:

The system favors consistent response times over maximum theoretical tokens-per-second.
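
Because the target is consistency rather than peak throughput, latency is best tracked as a distribution rather than an average. A minimal probe sketch; the endpoint URL, model name, and prompt are placeholders:

    # Minimal latency-consistency probe: watch the spread (median vs. p95)
    # rather than peak tokens-per-second. URL, model, and prompt are placeholders.
    import statistics
    import time

    import httpx

    ENDPOINT = "http://127.0.0.1:8000/v1/chat/completions"  # placeholder
    PAYLOAD = {
        "model": "fast",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 16,
    }

    samples = []
    for _ in range(20):
        start = time.monotonic()
        httpx.post(ENDPOINT, json=PAYLOAD, timeout=60).raise_for_status()
        samples.append(time.monotonic() - start)

    samples.sort()
    print(f"median: {statistics.median(samples):.3f}s")
    print(f"p95:    {samples[int(len(samples) * 0.95) - 1]:.3f}s")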


Operational Characteristics

The VM is designed to fail in obvious, recoverable ways.
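
In the same spirit, failure detection stays simple: a liveness probe that exits non-zero when the unified endpoint stops answering, so an external supervisor or cron job can restart the stack or alert. The URL below is a placeholder.

    # Simple liveness probe: exit non-zero if the unified endpoint is down,
    # leaving restart/alerting to an external supervisor. URL is a placeholder.
    import sys

    import httpx

    ENDPOINT = "http://127.0.0.1:8000/v1/models"  # placeholder

    try:
        httpx.get(ENDPOINT, timeout=5).raise_for_status()
    except Exception as exc:
        print(f"inference endpoint unhealthy: {exc}", file=sys.stderr)
        sys.exit(1)

    print("inference endpoint healthy")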


Design Constraints & Rationale

This system is intentionally opinionated and optimized for a known workload profile rather than general-purpose AI experimentation.



Applicability

This document describes the current primary AI inference system.

Future AI systems are expected to follow the same architectural principles unless explicitly documented otherwise.