Monitoring & Observability

This document describes the monitoring philosophy and planned observability stack for the lab network.

The focus is on actionable visibility rather than exhaustive metric collection.

Design principles: see Lab Philosophy.

Goals

Monitoring is intended to answer a small set of critical questions:

Is the network behaving as designed?
Are latency and loss within expected bounds?
Are failures localized and predictable?
Is policy enforcement still correct?

Metrics are collected to support diagnosis and confidence, not curiosity.

Philosophy

Observability is treated as a consumer of intent, not a source of truth.

Documentation defines how the network should behave
Configuration enforces that behavior
Monitoring verifies that reality matches intent

Monitoring is not used to dynamically control the network.
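
The verification step above can be sketched as a read-only comparison between documented intent and observed state. This is a minimal illustration, not the lab's actual tooling: the function name verify_intent, the Mismatch record, and the example keys and values are all assumptions made for the sketch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    key: str
    intended: object
    observed: object


def verify_intent(intended: dict, observed: dict) -> list[Mismatch]:
    """Return every documented key whose observed value differs from intent.

    Read-only by design: it reports drift and never changes the network.
    """
    mismatches = []
    for key, want in sorted(intended.items()):
        got = observed.get(key)  # None means the value was not observed at all
        if got != want:
            mismatches.append(Mismatch(key, want, got))
    return mismatches


# Hypothetical example data, not taken from the lab.
intended = {"wan1.role": "primary", "wan2.role": "backup", "vlan20.gateway": "10.0.20.1"}
observed = {"wan1.role": "primary", "wan2.role": "primary", "vlan20.gateway": "10.0.20.1"}

for m in verify_intent(intended, observed):
    print(f"DRIFT {m.key}: intended={m.intended!r} observed={m.observed!r}")
```

The sketch only iterates over documented keys, so anything observed but undocumented is ignored here; whether that should also be reported is a design decision left open.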

Planned Visibility Areas

The following areas are considered first-class monitoring targets:

Edge & WAN

Latency, loss, and jitter per WAN
CAKE queue behavior and drop statistics
Link utilization and saturation events
WAN failover and steering decisions
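
As a rough sketch of how per-WAN latency, loss, and jitter could be derived from a window of probe results, assuming probes are recorded as round-trip times in milliseconds with None for a lost probe. The function names, the sample values, and the thresholds are hypothetical. Jitter is computed here as the mean absolute difference between consecutive received RTTs, which is one simple definition among several.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Summarize one probe window; None entries are lost probes."""
    if not rtts_ms:
        raise ValueError("probe window is empty")
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    latency_ms = mean(received) if received else None
    # Mean absolute difference between consecutive received RTTs.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(deltas) if deltas else None
    return {"latency_ms": latency_ms, "loss_pct": loss_pct, "jitter_ms": jitter_ms}


def within_bounds(summary, max_latency_ms, max_loss_pct, max_jitter_ms):
    """Return True only if every metric is inside its documented bound."""
    if summary["latency_ms"] is None:
        return False  # every probe was lost
    jitter = summary["jitter_ms"] or 0.0  # a single reply has no jitter to measure
    return (
        summary["latency_ms"] <= max_latency_ms
        and summary["loss_pct"] <= max_loss_pct
        and jitter <= max_jitter_ms
    )


# Hypothetical window of five probes on one WAN, with one lost probe.
window = [20.0, 22.0, None, 21.0, 25.0]
summary = summarize_probes(window)
print(summary, within_bounds(summary, max_latency_ms=50, max_loss_pct=25, max_jitter_ms=5))
```

The bounds are passed in rather than hard-coded so that they can come from documented intent instead of living in the monitoring code.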

Switching Fabric

Link state and error counters
Uplink utilization and congestion
Hardware offload health
VLAN propagation consistency
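
VLAN propagation consistency, for example, could be checked by comparing the set of VLAN IDs each switch is expected to carry against what it actually reports. The sketch below assumes both are already available as plain dictionaries; the switch names, VLAN IDs, and the function name vlan_propagation_gaps are hypothetical.

```python
def vlan_propagation_gaps(expected, observed):
    """Map each switch to the VLANs it lacks and the ones it carries unexpectedly."""
    gaps = {}
    for switch, want in expected.items():
        have = observed.get(switch, set())  # an unreported switch carries nothing
        missing, extra = want - have, have - want
        if missing or extra:
            gaps[switch] = {"missing": sorted(missing), "unexpected": sorted(extra)}
    return gaps


# Hypothetical data: both switches should carry VLANs 10, 20, and 30.
EXPECTED = {"sw-core": {10, 20, 30}, "sw-access-1": {10, 20, 30}}
OBSERVED = {"sw-core": {10, 20, 30}, "sw-access-1": {10, 20, 99}}

for switch, gap in vlan_propagation_gaps(EXPECTED, OBSERVED).items():
    print(switch, gap)
```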

Wireless

Client RSSI and SNR trends
Roaming events and failures
Retry rates and airtime utilization
Per-SSID load characteristics

Policy & Enforcement

Firewall drop counters by class
Unexpected inter-VLAN traffic attempts
NAT rule hit counts
Policy routing anomalies
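
One way to surface unexpected inter-VLAN traffic attempts is to classify each new connection attempt by source and destination VLAN and compare the pair against a documented allow-list. This is a sketch under stated assumptions: the subnets, VLAN names, allowed pairs, and function names are invented for illustration, and flows are assumed to be (source IP, destination IP) pairs for connection initiations only, so return traffic of an allowed connection is not evaluated.

```python
import ipaddress

# Hypothetical VLAN-to-subnet mapping and allowed (source, destination) pairs.
VLAN_SUBNETS = {
    "trusted": ipaddress.ip_network("10.0.10.0/24"),
    "iot": ipaddress.ip_network("10.0.20.0/24"),
    "guest": ipaddress.ip_network("10.0.30.0/24"),
}
ALLOWED = {("trusted", "iot")}


def vlan_of(ip):
    """Return the VLAN name that contains this address, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for name, net in VLAN_SUBNETS.items():
        if addr in net:
            return name
    return None


def unexpected_inter_vlan(flows):
    """Return flows that cross between known VLANs without an allow-list entry."""
    flagged = []
    for src, dst in flows:
        s, d = vlan_of(src), vlan_of(dst)
        if s is None or d is None or s == d:
            continue  # not traffic between two different known VLANs
        if (s, d) not in ALLOWED:
            flagged.append((src, dst, s, d))
    return flagged


flows = [
    ("10.0.10.5", "10.0.20.9"),  # trusted to iot: allowed
    ("10.0.20.9", "10.0.10.5"),  # iot to trusted: not in the allow-list
    ("10.0.30.2", "10.0.30.3"),  # same VLAN
    ("10.0.30.2", "8.8.8.8"),    # destination outside any known VLAN
]
for flow in unexpected_inter_vlan(flows):
    print("UNEXPECTED", flow)
```

The check only reports; consistent with the scope of this document, it takes no action on the traffic it flags.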

Tooling Direction

Specific tools are intentionally not locked in yet.

Candidates include:

SNMP-based polling for infrastructure
Time-series metrics for latency and throughput
Log aggregation for firewall and control-plane events
Visualization layers for trend analysis

Final tooling choices will be made once NetBox is in place as the source of truth.

Scope Boundaries

Monitoring explicitly avoids:

Automated remediation
Closed-loop control systems
Self-modifying network behavior

All network changes remain operator-driven.

Design Notes

Visibility should reduce uncertainty, not increase noise
Metrics must map to documented intent
Failures should be obvious, not subtle
Monitoring follows architecture, not the reverse

This document applies to all current and future network infrastructure unless explicitly stated otherwise.