NetOpsBench: Open Arena for NetOps in AI Infrastructure

NetOpsBench evaluates whether an agent can determine the network state (faulted or healthy), localize faults, classify the fault type, and do so at a measurable runtime cost. Scoring is localization-first: detecting a fault is required, but useful credit comes from identifying the correct device and interface.

Evaluation target

Each benchmark case pairs a generated scenario with ground truth:

expected verdict: fault_detected or network_healthy;
expected fault type for fault cases;
expected device;
expected interface when the fault is interface-applicable;
accepted peer-side equivalents for symmetric link-like faults.

The agent returns a DiagnosisResult with verdict, findings, confidence, evidence, reasoning, and optional metadata such as tool calls or token usage.
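As a rough illustration of the result shape described above, the following sketch models it with dataclasses. The field names follow the list in the text; the field types, the `Finding` class, and the example device and interface names are assumptions for illustration, not the package's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Finding:
    # Hypothetical shape of a single finding; not taken from the package.
    fault_type: str
    device: str
    interface: Optional[str] = None  # None when no interface applies


@dataclass
class DiagnosisResult:
    # "fault_detected", "network_healthy", or "inconclusive"
    verdict: str
    findings: List[Finding] = field(default_factory=list)
    confidence: float = 0.0
    evidence: List[str] = field(default_factory=list)
    reasoning: str = ""
    # Optional extras such as tool calls or token usage.
    metadata: Dict[str, Any] = field(default_factory=dict)


# Example: a diagnosis that localizes a link fault to one interface.
result = DiagnosisResult(
    verdict="fault_detected",
    findings=[Finding("link_down", "leaf1", "eth1")],
    confidence=0.9,
    evidence=["interface eth1 on leaf1 is operationally down"],
    reasoning="Loss on all paths through leaf1 eth1.",
    metadata={"tool_calls": 7},
)
```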

Test suite

The CLOS profiles use a spine-leaf fabric. The fat-tree profiles use distinct core, aggregation, and edge tiers. Scale increases both switch count and Pingmesh endpoint density.

Scale	Family	Fabric devices	Clients
XS	CLOS	2 spine + 2 leaf	2
Small	CLOS	2 spine + 4 leaf	8
Medium	CLOS	4 spine + 8 leaf	16
Large	CLOS	4 spine + 16 leaf	64
XLarge	CLOS	16 spine + 128 leaf	128
Fat-tree K8	Fat-tree	16 core + 32 aggregation + 32 edge	128
Fat-tree K12	Fat-tree	36 core + 72 aggregation + 72 edge	144
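The fat-tree switch counts in the table match the standard k-ary fat-tree construction: (k/2)^2 core switches and k^2/2 switches in each of the aggregation and edge tiers. The helper below is a small sketch of that arithmetic; the function name is hypothetical, and the Clients column is a benchmark choice that this formula does not derive.

```python
from typing import Dict


def fat_tree_switch_counts(k: int) -> Dict[str, int]:
    """Switch counts per tier for a standard k-ary fat-tree (k even)."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even integer")
    half = k // 2
    return {
        "core": half * half,      # (k/2)^2
        "aggregation": k * half,  # k pods, k/2 aggregation switches each
        "edge": k * half,         # k pods, k/2 edge switches each
    }


print(fat_tree_switch_counts(8))   # K8 profile
print(fat_tree_switch_counts(12))  # K12 profile
```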

The benchmark covers twelve canonical fault types:

Category	Fault types
Link	`link_down`, `link_flapping`
Routing	`blackhole_route`, `static_route_misconfig`, `bgp_neighbor_misconfig`, `route_policy_misconfig`
Impairment	`mtu_mismatch`, `packet_loss`, `packet_corruption`, `high_latency`
System	`device_down`
ACL	`acl_misconfig`

Each scale combines fault cases with healthy negative samples:

Scale	Fault cases	Healthy cases	Total
XS	12	2	14
Small	12	3	15
Medium	24	4	28
Large	48	4	52
XLarge	66	4	70
Fat-tree K8	66	4	70
Fat-tree K12	66	4	70

Scenarios are generated from the packaged canonical campaign spec. Running netopsbench benchmark prepare --scales <scale> materializes YAML files under scenarios/generated/<scale>/; --spec can select a custom campaign.

Observation validity

Pingmesh compares an explicit healthy baseline with the fault observation window. A path is reported as a latency spike when its median rtt_avg increases by at least 20 ms across the complete observation window. Peaks and individual ECMP port batches do not qualify. Packet loss requires both the configured percentage increase and at least three lost probes on that path; loss elsewhere in the same rack does not amplify weak evidence. Jitter remains available in the raw RTT samples but is not a formal anomaly. These rules are shared by built-in and custom topologies.
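The two anomaly rules above can be sketched as simple predicates. This is a minimal illustration, not the detector's code: the function names and argument shapes are hypothetical, and it assumes the configured loss increase is expressed in percentage points.

```python
from statistics import median
from typing import Sequence

LATENCY_SPIKE_MS = 20.0  # minimum increase of the median rtt_avg
MIN_LOST_PROBES = 3      # minimum lost probes on the same path


def is_latency_spike(baseline_rtt_avg: Sequence[float],
                     window_rtt_avg: Sequence[float]) -> bool:
    # Compare medians over the complete window, so a single peak
    # does not qualify on its own.
    return median(window_rtt_avg) - median(baseline_rtt_avg) >= LATENCY_SPIKE_MS


def is_packet_loss(baseline_loss_pct: float, window_loss_pct: float,
                   lost_probes: int, min_increase_pct: float) -> bool:
    # Both conditions must hold for the same path.
    increased = window_loss_pct - baseline_loss_pct >= min_increase_pct
    return increased and lost_probes >= MIN_LOST_PROBES


# One 45 ms peak leaves the median almost unchanged: not a spike.
print(is_latency_spike([1.0, 1.2, 1.1], [1.1, 45.0, 1.2]))
# A sustained shift of about 21 ms in the median: a spike.
print(is_latency_spike([1.0, 1.2, 1.1], [22.0, 23.0, 21.5]))
```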

The canonical packet-loss campaign applies 30% loss on one selected egress member. On redundant fabrics, ECMP can dilute the endpoint-visible symptom; the benchmark reports only the evidence actually measured and does not lower detector thresholds to force an anomaly. Large profiles use a three-second probe cycle so complete ECMP coverage is spread over 102 seconds (xlarge/K8) or 114 seconds (K12) instead of being emitted in a host-scheduling burst.
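To see why ECMP can dilute the symptom, consider a back-of-the-envelope model. It assumes probes are hashed uniformly across equal-cost members and that only one member on the path is impaired; real hashing is rarely this even, so treat the numbers as an intuition aid only. The function names are hypothetical.

```python
def expected_visible_loss(member_loss: float, equal_cost_members: int) -> float:
    """Expected endpoint-visible loss when one of N members drops packets."""
    if equal_cost_members < 1:
        raise ValueError("need at least one member")
    return member_loss / equal_cost_members


def expected_lost_probes(probes: int, member_loss: float,
                         equal_cost_members: int) -> float:
    """Expected number of lost probes on a path under the same model."""
    return probes * expected_visible_loss(member_loss, equal_cost_members)


# 30% loss on one of four members: roughly 7.5% seen at the endpoint.
print(expected_visible_loss(0.30, 4))
# With 20 probes on that path, the expected number of lost probes.
print(expected_lost_probes(20, 0.30, 4))
```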

Background traffic must start its complete topology-derived flow matrix before the baseline is captured. A partial matrix makes the case infrastructure-invalid; it is not presented to the agent as a degraded observation. The baseline is also invalid when coverage is incomplete, any path is unreachable, latency or network-MTU anomalies exist, or an unexplained local DF probe error occurs. A DF-only local send error is treated as a consequence of the measured path failure only when the same path and socket batch contains at least three lost ordinary RTT probes across the complete observation window and no other local error. Loss on another path or ECMP socket batch does not qualify.
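The validity conditions above can be read as a checklist. The sketch below encodes them under assumed names: `BaselineSummary` and both functions are hypothetical and only mirror the rules stated in the text.

```python
from dataclasses import dataclass


@dataclass
class BaselineSummary:
    # Hypothetical summary of one baseline capture.
    flow_matrix_complete: bool
    coverage_complete: bool
    unreachable_paths: int
    latency_anomalies: int
    mtu_anomalies: int
    unexplained_df_errors: int


def baseline_is_valid(b: BaselineSummary) -> bool:
    # Any single violation makes the case infrastructure-invalid.
    return (
        b.flow_matrix_complete
        and b.coverage_complete
        and b.unreachable_paths == 0
        and b.latency_anomalies == 0
        and b.mtu_anomalies == 0
        and b.unexplained_df_errors == 0
    )


def df_error_explained_by_path_failure(lost_rtt_probes_same_batch: int,
                                       other_local_errors: int) -> bool:
    # Counts must come from the same path and socket batch over the
    # complete observation window; loss elsewhere does not qualify.
    return lost_rtt_probes_same_batch >= 3 and other_local_errors == 0
```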

mtu_or_fragmentation_suspect is deliberately a symptom label, not a confirmed MTU root cause. In particular, an isolated DF-only symptom during link flapping still requires interface-MTU or multi-path evidence before an agent should diagnose mtu_mismatch.

Each topology sends native Pingmesh batches through its topology-local Telegraf listener. Multiple warm topologies may therefore keep observability active together; readiness and complete coverage remain mandatory for every case. Host capacity is not hidden by a longer timeout or a topology-specific detector threshold.

The 0.2 collector polls BGP every 10 seconds. On topologies with more than 128 routing devices, every poll writes the collection index, state transitions, and all non-Established neighbors; stable complete-neighbor snapshots are written every 60 seconds. Use the tool's reported data freshness when interpreting BGP evidence.

Scoring dimensions

Dimension	Meaning
Verdict quality	Whether the agent distinguishes faulted cases from healthy negative samples. The results page reports this as verdict F1-score.
Device localization	Whether the predicted device matches ground truth or an accepted equivalent.
Interface localization	Whether the predicted interface matches when interface-level ground truth exists.
Fault type classification	Whether the predicted fault label matches the ground-truth type.
Runtime efficiency	Diagnosis time, tool calls, and token usage.

Fault-case score is verdict-gated. If the agent does not return the correct fault verdict, localization score is zero.

When interface ground truth exists:

score = 0.5 * device_correct + 0.5 * interface_correct

When interface ground truth is unavailable, device localization receives the full localization score.
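Putting the verdict gate and the two localization cases together gives the following sketch. The function name and the use of `None` to mean "no interface ground truth" are assumptions; the actual scorer may structure this differently.

```python
from typing import Optional


def fault_case_score(verdict_correct: bool,
                     device_correct: bool,
                     interface_correct: Optional[bool]) -> float:
    """Localization score for one fault case.

    interface_correct is None when the case has no interface ground truth.
    """
    if not verdict_correct:
        return 0.0  # verdict gate: no localization credit
    if interface_correct is None:
        return 1.0 if device_correct else 0.0
    return 0.5 * device_correct + 0.5 * interface_correct


print(fault_case_score(True, True, False))  # right device, wrong interface
print(fault_case_score(False, True, True))  # wrong verdict gates the score
print(fault_case_score(True, True, None))   # device-only ground truth
```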

Healthy cases

Healthy negative samples evaluate false-positive control:

network_healthy receives full credit.
fault_detected and inconclusive receive zero.

Localization and fault-type aggregates should be interpreted on positive fault cases, not healthy cases.
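The healthy-case rule reduces to a single comparison. A minimal sketch, with a hypothetical function name:

```python
def healthy_case_score(verdict: str) -> float:
    # Only network_healthy earns credit on a healthy negative sample;
    # fault_detected and inconclusive both score zero.
    return 1.0 if verdict == "network_healthy" else 0.0


for v in ("network_healthy", "fault_detected", "inconclusive"):
    print(v, healthy_case_score(v))
```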

Symmetric locations

Some faults affect both ends of a link or path segment. For link_down, link_flapping, packet_loss, packet_corruption, high_latency, and mtu_mismatch, the scorer can accept an equivalent peer-side location when topology evidence proves both answers refer to the same impairment.
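One way to picture the peer-side rule is as a membership check against a set of accepted equivalents carried in the ground truth. The sketch below assumes locations are (device, interface) pairs and that the equivalents have already been derived from topology evidence; the names and the example devices are hypothetical.

```python
from typing import FrozenSet, Optional, Tuple

Location = Tuple[str, Optional[str]]  # (device, interface)

SYMMETRIC_FAULT_TYPES = frozenset({
    "link_down", "link_flapping", "packet_loss",
    "packet_corruption", "high_latency", "mtu_mismatch",
})


def location_accepted(fault_type: str,
                      predicted: Location,
                      expected: Location,
                      peer_equivalents: FrozenSet[Location] = frozenset()) -> bool:
    if predicted == expected:
        return True
    # Peer-side answers count only for symmetric, link-like faults.
    return fault_type in SYMMETRIC_FAULT_TYPES and predicted in peer_equivalents


peers = frozenset({("spine1", "eth3")})
print(location_accepted("link_down", ("spine1", "eth3"), ("leaf1", "eth1"), peers))
print(location_accepted("acl_misconfig", ("spine1", "eth3"), ("leaf1", "eth1"), peers))
```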

Optional semantic fault-type judge

Canonical string matching is the default. Set NETOPSBENCH_FAULT_TYPE_JUDGE_ENABLED=1 to allow an LLM judge for non-exact but semantically meaningful fault type descriptions, such as mapping "BGP peer AS mismatch" to bgp_neighbor_misconfig.

The judge affects only fault-type classification. Verdict, device localization, interface localization, and efficiency metrics remain deterministic. Reports include fault_type_judgment when semantic matching is used.
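The matching order described above (exact match first, judge only when enabled) might look like the sketch below. The `judge` parameter is a plain callable standing in for the LLM judge, whose real interface is not described here; only the environment variable name comes from the text.

```python
import os
from typing import Callable, Mapping, Optional

JUDGE_ENV = "NETOPSBENCH_FAULT_TYPE_JUDGE_ENABLED"


def fault_type_matches(predicted: str,
                       expected: str,
                       judge: Optional[Callable[[str, str], bool]] = None,
                       env: Optional[Mapping[str, str]] = None) -> bool:
    env = os.environ if env is None else env
    if predicted == expected:
        return True  # canonical string match, always deterministic
    if env.get(JUDGE_ENV) == "1" and judge is not None:
        return judge(predicted, expected)  # semantic fallback
    return False


# Toy stand-in for a judge: accepts one known paraphrase.
def toy_judge(predicted: str, expected: str) -> bool:
    return (predicted, expected) == ("BGP peer AS mismatch", "bgp_neighbor_misconfig")


print(fault_type_matches("BGP peer AS mismatch", "bgp_neighbor_misconfig",
                         judge=toy_judge, env={JUDGE_ENV: "1"}))
print(fault_type_matches("BGP peer AS mismatch", "bgp_neighbor_misconfig",
                         judge=toy_judge, env={}))
```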

Aggregate report fields

Metric	Interpretation
`detection_accuracy`	Fault/healthy verdict accuracy.
`detection_f1`	Fault/healthy verdict F1-score used in the results plots.
`device_localization_rate`	Share of fault cases whose predicted device matches ground truth or an accepted equivalent.
`interface_localization_rate`	Share of interface-applicable fault cases whose predicted interface matches.
`localization_composite_score`	Combined localization metric used for comparison.
`average_score`	Mean case score over the run.
`avg_time_seconds`	Mean diagnosis latency per case.
`avg_tool_calls`	Mean tool calls per case.
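For readers who want to reproduce the two verdict metrics by hand, the sketch below computes accuracy and F1 from paired verdict lists. It assumes `fault_detected` is the positive class and that an `inconclusive` answer on a fault case counts as a miss; the function name is hypothetical and the package's own aggregation may differ in detail.

```python
from typing import Dict, Sequence

POSITIVE = "fault_detected"


def detection_metrics(expected: Sequence[str],
                      predicted: Sequence[str]) -> Dict[str, float]:
    if len(expected) != len(predicted) or not expected:
        raise ValueError("need two non-empty sequences of equal length")
    pairs = list(zip(expected, predicted))
    tp = sum(e == POSITIVE and p == POSITIVE for e, p in pairs)
    fp = sum(e != POSITIVE and p == POSITIVE for e, p in pairs)
    fn = sum(e == POSITIVE and p != POSITIVE for e, p in pairs)
    correct = sum(e == p for e, p in pairs)
    denom = 2 * tp + fp + fn
    return {
        "detection_accuracy": correct / len(pairs),
        "detection_f1": (2 * tp / denom) if denom else 0.0,
    }


expected = ["fault_detected"] * 3 + ["network_healthy"]
predicted = ["fault_detected", "fault_detected", "network_healthy", "fault_detected"]
print(detection_metrics(expected, predicted))
```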

The scoring implementation lives in netopsbench.evaluator.scorer, with session scoring helpers under netopsbench.platform.session.scoring.

Benchmark Methodology