Benchmark Results

Reference cross-model NetOpsBench results.

This page records the public trace dataset snapshot. Treat it as a reference snapshot under the documented methodology, not as a stable model ranking. Provider defaults, model versions, prompts, and tool policies can change absolute values.

Experiment scope

| Dimension | Values |
| --- | --- |
| Models | Kimi K2.6, DeepSeek V4 Pro, OpenAI GPT-5.5, MiniMax M3 |
| Topology scales | XS, Small, Medium, Large |
| Fault types | 12 canonical types across link, routing, impairment, system, and ACL categories |
| Quality metrics | Verdict F1-score, device localization, interface localization, composite score |
| Efficiency metrics | Diagnosis time, tool calls, input/output tokens |

Cross-scale snapshot

Each cell shows the value at XS scale followed by the value at Large scale (XS -> Large).

| Model | Verdict F1 (%) | Device Loc (%) | Interface Loc (%) | Composite (%) | Avg Time (s) | Avg Tools | Input Tokens (K) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Kimi K2.6 | 76.2 -> 84.7 | 66.7 -> 75.0 | 71.4 -> 71.4 | 64.3 -> 74.0 | 272.0 -> 399.5 | 36.5 -> 38.8 | 367.6 -> 741.1 |
| DeepSeek V4 Pro | 100.0 -> 97.9 | 83.3 -> 91.7 | 57.1 -> 57.1 | 78.6 -> 83.7 | 83.1 -> 83.1 | 24.9 -> 20.7 | 247.6 -> 477.3 |
| OpenAI GPT-5.5 | 95.7 -> 95.8 | 83.3 -> 79.2 | 57.1 -> 21.4 | 75.0 -> 60.6 | 71.9 -> 85.0 | 17.9 -> 15.9 | 85.4 -> 153.7 |
| MiniMax M3 | 80.0 -> 82.9 | 50.0 -> 62.5 | 71.4 -> 42.9 | 60.7 -> 58.7 | 214.3 -> 230.3 | 26.4 -> 24.0 | 133.3 -> 316.1 |

The main pattern is that verdict classification remains easier than precise localization. At Large scale, interface localization is the lowest-scoring quality metric for every model, ranging from 21.4% (OpenAI GPT-5.5) to 71.4% (Kimi K2.6).
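The pattern can be checked directly against the snapshot numbers. The sketch below is only an illustration: the `SNAPSHOT` dictionary and the helper names are hypothetical and not part of NetOpsBench. The values are the (XS, Large) percentages copied from the cross-scale snapshot table.

```python
# (XS, Large) percentages copied from the cross-scale snapshot table.
# The dictionary layout and function names are illustrative assumptions.
SNAPSHOT = {
    "Kimi K2.6": {
        "verdict_f1": (76.2, 84.7),
        "device_loc": (66.7, 75.0),
        "interface_loc": (71.4, 71.4),
    },
    "DeepSeek V4 Pro": {
        "verdict_f1": (100.0, 97.9),
        "device_loc": (83.3, 91.7),
        "interface_loc": (57.1, 57.1),
    },
    "OpenAI GPT-5.5": {
        "verdict_f1": (95.7, 95.8),
        "device_loc": (83.3, 79.2),
        "interface_loc": (57.1, 21.4),
    },
    "MiniMax M3": {
        "verdict_f1": (80.0, 82.9),
        "device_loc": (50.0, 62.5),
        "interface_loc": (71.4, 42.9),
    },
}


def weakest_metric_at_large(scores):
    """Return the metric name with the lowest Large-scale value."""
    return min(scores, key=lambda metric: scores[metric][1])


def verdict_interface_gap(scores):
    """Large-scale verdict F1 minus interface localization, in points."""
    return round(scores["verdict_f1"][1] - scores["interface_loc"][1], 1)


if __name__ == "__main__":
    for model, scores in SNAPSHOT.items():
        print(model, weakest_metric_at_large(scores), verdict_interface_gap(scores))
```

For every model in this snapshot the weakest Large-scale metric is interface localization. The gap to verdict F1 differs widely between models, so it is worth reading per model rather than as an average.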

Quality metrics

[Charts: Verdict F1-score, Device localization, Interface localization, Composite score]

Runtime cost

[Charts: Average diagnosis time, Tool calls, Input tokens, Output tokens]

Compare cost metrics with localization quality, not verdict quality alone. Higher tool or token use does not automatically improve device or interface precision.
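One way to make this comparison concrete is to list model pairs where the costlier model is also the less precise one. This is a minimal sketch, not the benchmark's own analysis: the `LARGE` dictionary and the function `costlier_but_less_precise` are hypothetical names, and the values are the Large-scale figures from the cross-scale snapshot table.

```python
# Large-scale values copied from the cross-scale snapshot table.
# Keys and function names are illustrative assumptions.
LARGE = {
    "Kimi K2.6": {"avg_tools": 38.8, "input_tokens_k": 741.1,
                  "device_loc": 75.0, "interface_loc": 71.4},
    "DeepSeek V4 Pro": {"avg_tools": 20.7, "input_tokens_k": 477.3,
                        "device_loc": 91.7, "interface_loc": 57.1},
    "OpenAI GPT-5.5": {"avg_tools": 15.9, "input_tokens_k": 153.7,
                       "device_loc": 79.2, "interface_loc": 21.4},
    "MiniMax M3": {"avg_tools": 24.0, "input_tokens_k": 316.1,
                   "device_loc": 62.5, "interface_loc": 42.9},
}


def costlier_but_less_precise(rows, cost, quality):
    """Return (a, b) pairs where model a spends more on `cost`
    than model b yet scores lower on `quality`."""
    return [
        (a, b)
        for a in rows
        for b in rows
        if rows[a][cost] > rows[b][cost] and rows[a][quality] < rows[b][quality]
    ]


if __name__ == "__main__":
    for quality in ("device_loc", "interface_loc"):
        pairs = costlier_but_less_precise(LARGE, "avg_tools", quality)
        print(quality, pairs)
```

With these numbers, four model pairs show more tool calls alongside lower device localization, and one pair shows the same for interface localization. A pairwise check like this only flags counterexamples; it does not establish why a model spends more or localizes worse.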