Monitoring & Observability
This document describes the monitoring philosophy and planned observability stack for the lab network.
The focus is on actionable visibility rather than exhaustive metric collection.
Design principles: see Lab Philosophy.
Goals
Monitoring is intended to answer a small set of critical questions:
- Is the network behaving as designed?
- Are latency and loss within expected bounds?
- Are failures localized and predictable?
- Is policy enforcement still correct?
Metrics are collected to support diagnosis and confidence, not curiosity.
Philosophy
Observability is treated as a consumer of intent, not a source of truth.
- Documentation defines how the network should behave
- Configuration enforces that behavior
- Monitoring verifies that reality matches intent
Monitoring is not used to dynamically control the network.
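The verify-only relationship can be illustrated with a minimal Python sketch. The function name `find_drift` and the sample keys and values are hypothetical and do not come from the lab's configuration. The check only reports differences between documented intent and observed state, and never writes anything back to the network.

```python
from typing import Any


def find_drift(intended: dict[str, Any], observed: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (intended_value, observed_value)} for every mismatch.

    Read-only by design: drift is reported, nothing is changed.
    A key that is absent on one side is reported as None on that side.
    """
    drift = {}
    for key in sorted(intended.keys() | observed.keys()):
        want = intended.get(key)
        have = observed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example values for illustration only.
intended = {"wan1.state": "up", "vlan20.gateway": "10.0.20.1"}
observed = {"wan1.state": "up", "vlan20.gateway": "10.0.20.254"}

for key, (want, have) in find_drift(intended, observed).items():
    print(f"DRIFT {key}: intended={want!r} observed={have!r}")
```

Any drift found this way is handed to an operator; the fix is made in documentation or configuration, not by the monitoring code.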
Planned Visibility Areas
The following areas are considered first-class monitoring targets:
Edge & WAN
- Latency, loss, and jitter per WAN link
- CAKE queue behavior and drop statistics
- Link utilization and saturation events
- WAN failover and steering decisions
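As a rough sketch of how latency, loss, and jitter could be derived from a batch of probe results and compared with expected bounds, consider the Python below. The function names, the sample values, and the bounds are assumptions made for illustration. Jitter is computed here as the mean absolute difference between consecutive replies, which is one simple definition among several.

```python
from statistics import mean
from typing import Optional


def summarize_probes(rtts_ms: list[Optional[float]]) -> dict[str, float]:
    """Summarize probe results for one WAN link; None marks a lost probe."""
    if not rtts_ms:
        raise ValueError("no probe results")
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    # With no replies at all, latency is undefined and reported as NaN.
    latency_ms = mean(replies) if replies else float("nan")
    # Jitter: mean absolute difference between consecutive received replies.
    deltas = [abs(b - a) for a, b in zip(replies, replies[1:])]
    jitter_ms = mean(deltas) if deltas else 0.0
    return {"latency_ms": latency_ms, "loss_pct": loss_pct, "jitter_ms": jitter_ms}


def out_of_bounds(summary: dict[str, float], bounds: dict[str, float]) -> list[str]:
    """Return metric names whose value is above its upper bound.

    NaN never compares as within bounds, so an undefined value is flagged too.
    """
    return [name for name, limit in bounds.items() if not summary[name] <= limit]


# Hypothetical samples and bounds for illustration only.
samples = [20.0, 22.0, None, 21.0, 25.0]
bounds = {"latency_ms": 40.0, "loss_pct": 1.0, "jitter_ms": 5.0}
print(out_of_bounds(summarize_probes(samples), bounds))
```

With these sample values one of five probes is lost, so only the loss bound is exceeded.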
Switching Fabric
- Link state and error counters
- Uplink utilization and congestion
- Hardware offload health
- VLAN propagation consistency
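A VLAN propagation check can be sketched as a set comparison. The Python below assumes, as a simplification, that every intended VLAN should be present on every switch. The switch names, the VLAN IDs, and the function name `vlan_inconsistencies` are hypothetical.

```python
def vlan_inconsistencies(
    intended: set[int], per_switch: dict[str, set[int]]
) -> dict[str, dict[str, list[int]]]:
    """Compare the VLANs observed on each switch with the intended set.

    Only switches that deviate from intent appear in the result.
    """
    report = {}
    for switch, vlans in sorted(per_switch.items()):
        missing = sorted(intended - vlans)       # documented but not present
        unexpected = sorted(vlans - intended)    # present but not documented
        if missing or unexpected:
            report[switch] = {"missing": missing, "unexpected": unexpected}
    return report


# Hypothetical inventory for illustration only.
intended_vlans = {10, 20, 30}
observed_vlans = {
    "core": {10, 20, 30},
    "access-1": {10, 30},
    "access-2": {10, 20, 30, 99},
}
print(vlan_inconsistencies(intended_vlans, observed_vlans))
```

A real fabric would likely need per-switch or per-trunk intent rather than one global set; the comparison itself stays the same.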
Wireless
- Client RSSI and SNR trends
- Roaming events and failures
- Retry rates and airtime utilization
- Per-SSID load characteristics
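Two of these wireless figures reduce to simple arithmetic, sketched below in Python. The function names and sample numbers are illustrative assumptions; how RSSI, noise floor, and frame counts are actually obtained depends on tooling that has not been chosen yet.

```python
def snr_db(rssi_dbm: float, noise_floor_dbm: float) -> float:
    """SNR in dB is the received signal level minus the noise floor (both in dBm)."""
    return rssi_dbm - noise_floor_dbm


def retry_rate_pct(retried_frames: int, total_frames: int) -> float:
    """Share of transmitted frames that were retries, as a percentage."""
    if total_frames <= 0:
        return 0.0  # nothing transmitted, so there is nothing to report
    return 100.0 * retried_frames / total_frames


# Hypothetical client readings for illustration only.
print(snr_db(-62.0, -95.0))        # 33.0 dB
print(retry_rate_pct(150, 1000))   # 15.0 percent
```

Tracking these values per client over time gives the trend view described above.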
Policy & Enforcement
- Firewall drop counters by class
- Unexpected inter-VLAN traffic attempts
- NAT rule hit counts
- Policy routing anomalies
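Drop and hit counters are usually cumulative, so the useful signal is the change between two snapshots. The Python below is a minimal sketch under that assumption. The class names and values are hypothetical, and a counter that goes backwards is treated as a reset rather than a wrap.

```python
def counter_deltas(previous: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Per-class increase between two snapshots of cumulative counters."""
    deltas = {}
    for name, now in current.items():
        before = previous.get(name, 0)  # a class not seen before starts at zero
        # A value lower than the previous one suggests the counter was reset
        # (for example by a reboot or ruleset reload), so count from zero
        # instead of reporting a negative change.
        deltas[name] = now - before if now >= before else now
    return deltas


# Hypothetical snapshots for illustration only.
earlier = {"inter_vlan_drop": 120, "wan_in_drop": 5000}
later = {"inter_vlan_drop": 123, "wan_in_drop": 40}
print(counter_deltas(earlier, later))
```

A non-zero delta on a class such as unexpected inter-VLAN traffic is worth surfacing even when the absolute count is small.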
Tooling Direction
Specific tools are intentionally not locked in yet.
Candidates include:
- SNMP-based polling for infrastructure
- Time-series metrics for latency and throughput
- Log aggregation for firewall and control-plane events
- Visualization layers for trend analysis
Final tooling choices will be made once NetBox is in place as the source of truth.
Scope Boundaries
Monitoring explicitly avoids:
- Automated remediation
- Closed-loop control systems
- Self-modifying network behavior
All network changes remain operator-driven.
Design Notes
- Visibility should reduce uncertainty, not increase noise
- Metrics must map to documented intent
- Failures should be obvious, not subtle
- Monitoring follows architecture, not the reverse
This document applies to all current and future network infrastructure unless explicitly stated otherwise.