Benchmark Methodology

What NetOpsBench measures and how diagnosis outputs are scored.

NetOpsBench evaluates whether an agent can determine if a network is faulted or healthy, localize the fault, classify its type, and do all of this at a measurable runtime cost. Scoring is localization-first: detecting a fault is required, but most of the credit comes from identifying the correct device and interface.

Evaluation target

Each benchmark case pairs a generated scenario with ground truth:

  • expected verdict: fault_detected or network_healthy;
  • expected fault type for fault cases;
  • expected device;
  • expected interface when the fault is interface-applicable;
  • accepted peer-side equivalents for symmetric link-like faults.

The agent returns a DiagnosisResult with verdict, findings, confidence, evidence, reasoning, and optional metadata such as tool calls or token usage.
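
As a rough illustration of that return value, the sketch below models a DiagnosisResult as a Python dataclass. Only the top-level field names come from the description above. The field types, the Finding helper, and the metadata keys are assumptions made for this example, not the benchmark's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Finding:
    # Hypothetical shape: one suspected fault with its location.
    fault_type: Optional[str] = None
    device: Optional[str] = None
    interface: Optional[str] = None


@dataclass
class DiagnosisResult:
    # Verdict string, e.g. "fault_detected" or "network_healthy".
    verdict: str
    findings: List[Finding] = field(default_factory=list)
    confidence: float = 0.0
    evidence: List[str] = field(default_factory=list)
    reasoning: str = ""
    # Optional extras such as tool calls or token usage (key names assumed).
    metadata: Dict[str, Any] = field(default_factory=dict)


# Illustrative values only; device and interface names are made up.
result = DiagnosisResult(
    verdict="fault_detected",
    findings=[Finding("link_down", "leaf1", "eth1")],
    confidence=0.9,
    evidence=["leaf1 eth1 reports oper-state down"],
    reasoning="Probe failures converge on the leaf1 uplink.",
    metadata={"tool_calls": 7},
)
```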

Test suite

All topologies use a spine-leaf fabric. Scale increases both switch count and Pingmesh endpoint density.

Scale    Spines  Leafs  Clients  Network devices
XS       2       2      2        4
Small    2       4      8        6
Medium   4       8      16       12
Large    4       16     64       20
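
The figures in the scale table can be cross-checked with a few lines of Python. This sketch copies the spine, leaf, and client counts from the table and derives the network device count (spines plus leafs) and the average number of clients per leaf. The SCALES dictionary and the derive helper are names invented for this example, and the per-leaf average assumes clients are spread evenly, which the source does not state.

```python
# Counts copied from the scale table above.
SCALES = {
    "XS": {"spines": 2, "leafs": 2, "clients": 2},
    "Small": {"spines": 2, "leafs": 4, "clients": 8},
    "Medium": {"spines": 4, "leafs": 8, "clients": 16},
    "Large": {"spines": 4, "leafs": 16, "clients": 64},
}


def derive(scale):
    """Return derived figures for one scale (hypothetical helper)."""
    s = SCALES[scale]
    return {
        # Network devices are the switches: spines plus leafs.
        "network_devices": s["spines"] + s["leafs"],
        # Average endpoint density, assuming an even spread across leafs.
        "clients_per_leaf": s["clients"] / s["leafs"],
    }


for name in SCALES:
    print(name, derive(name))
```

Under these assumptions the client density per leaf grows from 1 at XS to 4 at Large, which matches the statement that scale increases endpoint density as well as switch count.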

The benchmark covers twelve canonical fault types:

Category    Fault types
Link        link_down, link_flapping
Routing     blackhole_route, static_route_misconfig, bgp_neighbor_misconfig, route_policy_misconfig
Impairment  mtu_mismatch, packet_loss, packet_corruption, high_latency
System      device_down
ACL         acl_misconfig

Each scale combines fault cases with healthy negative samples:

Scale    Fault cases  Healthy cases  Total
XS       12           2              14
Small    12           3              15
Medium   24           4              28
Large    48           4              52

Scenarios are generated from scenarios/specs/fault_campaign.yaml. Running netopsbench benchmark prepare --scales <scale> materializes YAML files under scenarios/generated/<scale>/.

Scoring dimensions

Dimension                  Meaning
Verdict quality            Whether the agent distinguishes faulted cases from healthy negative samples. The results page reports this as verdict F1-score.
Device localization        Whether the predicted device matches ground truth or an accepted equivalent.
Interface localization     Whether the predicted interface matches when interface-level ground truth exists.
Fault type classification  Whether the predicted fault label matches the ground-truth type.
Runtime efficiency         Diagnosis time, tool calls, and token usage.

The fault-case score is verdict-gated: if the agent does not return fault_detected for a fault case, the localization score is zero.

When interface ground truth exists:

score = 0.5 * device_correct + 0.5 * interface_correct

When interface ground truth is unavailable, device localization receives the full localization score.
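
The gating and weighting rules above can be sketched as a small Python function. This is an illustration of the stated formula, not the code in the scorer module. The function name and the use of None to signal missing interface ground truth are assumptions made for this example. How fault-type classification feeds into a case score is not covered here.

```python
def localization_score(verdict_correct, device_correct, interface_correct=None):
    """Sketch of the localization score for one fault case.

    interface_correct=None means the case has no interface ground truth
    (an encoding chosen for this example).
    """
    # Verdict gate: a wrong verdict zeroes the localization score.
    if not verdict_correct:
        return 0.0
    # No interface ground truth: the device carries the full score.
    if interface_correct is None:
        return 1.0 if device_correct else 0.0
    # Interface ground truth exists: equal weight on device and interface.
    return 0.5 * bool(device_correct) + 0.5 * bool(interface_correct)


print(localization_score(True, True, True))    # both correct
print(localization_score(True, True, False))   # right device, wrong interface
print(localization_score(False, True, True))   # wrong verdict
print(localization_score(True, True))          # no interface ground truth
```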

Healthy cases

Healthy negative samples evaluate false-positive control:

  • network_healthy receives full credit.
  • fault_detected and inconclusive receive zero.
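
The two rules above amount to an exact check on the verdict string. A minimal sketch, with a hypothetical function name:

```python
def healthy_case_score(verdict):
    """Score one healthy negative sample from the agent's verdict string."""
    # Only an explicit network_healthy verdict earns credit; fault_detected
    # and inconclusive both count as failures on a healthy case.
    return 1.0 if verdict == "network_healthy" else 0.0


for v in ("network_healthy", "fault_detected", "inconclusive"):
    print(v, healthy_case_score(v))
```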

Localization and fault-type aggregates should be interpreted on positive fault cases, not healthy cases.

Symmetric locations

Some faults affect both ends of a link or path segment. For link_down, link_flapping, packet_loss, packet_corruption, high_latency, and mtu_mismatch, the scorer can accept an equivalent peer-side location when topology evidence proves both answers refer to the same impairment.
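
One way to picture peer-side acceptance is a lookup from each link endpoint to its peer. The sketch below is an assumption about how such a check could look, not the scorer's actual logic: the links mapping stands in for whatever topology evidence the scorer consults, and the function name and the device and interface names are hypothetical.

```python
# Fault types listed above as eligible for peer-side equivalents.
SYMMETRIC_FAULT_TYPES = frozenset({
    "link_down", "link_flapping", "packet_loss",
    "packet_corruption", "high_latency", "mtu_mismatch",
})


def accepted_locations(fault_type, expected, links):
    """Return the set of (device, interface) answers to accept.

    expected: ground-truth (device, interface) pair.
    links: dict mapping each link endpoint to its peer endpoint.
    """
    accepted = {expected}
    # Only symmetric fault types may be answered from the other end.
    if fault_type in SYMMETRIC_FAULT_TYPES and expected in links:
        accepted.add(links[expected])
    return accepted


# Illustrative topology fragment: one leaf-to-spine link, both directions.
links = {
    ("leaf1", "eth1"): ("spine1", "eth3"),
    ("spine1", "eth3"): ("leaf1", "eth1"),
}
print(accepted_locations("link_down", ("leaf1", "eth1"), links))
print(accepted_locations("acl_misconfig", ("leaf1", "eth1"), links))
```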

Optional semantic fault-type judge

Canonical string matching is the default. Set NETOPSBENCH_FAULT_TYPE_JUDGE_ENABLED=1 to allow an LLM judge for non-exact but semantically meaningful fault type descriptions, such as mapping "BGP peer AS mismatch" to bgp_neighbor_misconfig.

The judge affects only fault-type classification. Verdict, device localization, interface localization, and efficiency metrics remain deterministic. Reports include fault_type_judgment when semantic matching is used.
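
The fallback order described above, exact match first and the judge only when enabled, can be sketched as follows. The environment variable name comes from the text. Everything else is an assumption for this example: the function names, the judge callable (a stub stands in for the LLM), and the choice to treat only the value "1" as enabled.

```python
import os


def judge_enabled(env=None):
    """True when the semantic judge is switched on (only "1" is assumed valid)."""
    env = os.environ if env is None else env
    return env.get("NETOPSBENCH_FAULT_TYPE_JUDGE_ENABLED") == "1"


def fault_type_matches(predicted, expected, judge=None, env=None):
    """Canonical string match first, then an optional semantic judge."""
    if predicted == expected:
        return True
    # The judge is consulted only for non-exact labels and only when enabled.
    if judge is not None and judge_enabled(env):
        return bool(judge(predicted, expected))
    return False


def stub_judge(predicted, expected):
    # Stand-in for an LLM call; recognizes one hard-coded paraphrase.
    return (predicted, expected) == ("BGP peer AS mismatch", "bgp_neighbor_misconfig")


on = {"NETOPSBENCH_FAULT_TYPE_JUDGE_ENABLED": "1"}
print(fault_type_matches("BGP peer AS mismatch", "bgp_neighbor_misconfig", stub_judge, on))
print(fault_type_matches("BGP peer AS mismatch", "bgp_neighbor_misconfig", stub_judge, {}))
```

Passing env explicitly keeps the example deterministic; with env left as None the function reads the real process environment.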

Aggregate report fields

Metric                        Interpretation
detection_accuracy            Fault/healthy verdict accuracy.
detection_f1                  Fault/healthy verdict F1-score used in the results plots.
device_localization_rate      Device-level troubleshooting precision.
interface_localization_rate   Interface-level precision for applicable cases.
localization_composite_score  Combined localization metric used for comparison.
average_score                 Mean case score over the run.
avg_time_seconds              Mean diagnosis latency per case.
avg_tool_calls                Mean tool calls per case.
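
For readers who want to reproduce the two verdict metrics by hand, the sketch below computes accuracy and F1 from pairs of expected and predicted verdicts. It assumes fault_detected is the positive class and uses the standard binary F1 definition; the source does not state which convention the scorer follows, and the function name is hypothetical.

```python
def detection_metrics(pairs):
    """Compute (accuracy, f1) from (expected, predicted) verdict pairs."""
    pos = "fault_detected"  # assumed positive class
    tp = sum(1 for e, p in pairs if e == pos and p == pos)
    fp = sum(1 for e, p in pairs if e != pos and p == pos)
    fn = sum(1 for e, p in pairs if e == pos and p != pos)
    correct = sum(1 for e, p in pairs if e == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Illustrative run: three fault cases and two healthy cases.
pairs = [
    ("fault_detected", "fault_detected"),
    ("fault_detected", "fault_detected"),
    ("fault_detected", "network_healthy"),
    ("network_healthy", "fault_detected"),
    ("network_healthy", "network_healthy"),
]
accuracy, f1 = detection_metrics(pairs)
print(accuracy, f1)
```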

The scoring implementation lives in netopsbench.evaluator.scorer, with session scoring helpers under netopsbench.platform.session.scoring.