NetOpsBench: Open Arena for NetOps in AI Infrastructure

An agent is any Python object with a `diagnose(context)` method. NetOpsBench passes a `DiagnosticContext`; the agent returns a structured `DiagnosisResult`; the evaluator compares that result with the scenario's ground truth.

Agent contract

from netopsbench.sdk import DiagnosisResult

class MyAgent:
    def diagnose(self, context):
        return DiagnosisResult(
            agent_name="my-agent",
            verdict="fault_detected",
            findings={
                "fault_type": "link_down",
                "location": {"device": "leaf1", "interface": "Ethernet0"},
                "evidence": ["Pingmesh loss is concentrated on paths crossing leaf1."],
            },
            confidence=0.82,
            reasoning="Path symptoms and interface counters point to a leaf1 uplink.",
        )

Pass an agent instance to a session API such as `bench.sessions.run_scenario(...)`:

from netopsbench.sdk import NetOpsBench

with NetOpsBench(workspace=".") as bench:
    run = bench.sessions.run_scenario(
        scenario="scenarios/generated/xs/generated_link_down_xs_001.yaml",
        agent=MyAgent(),
    )
    report = run.wait(raise_on_failure=True)

Input: `DiagnosticContext`

Field	Meaning
`scenario_id`	Scenario or testcase identifier.
`topology`	Switches, clients, links, and scale-specific topology metadata.
`symptoms`	Episode observations, including Pingmesh summaries when available.
`tools`	Optional runtime or MCP-backed tool access.
`metadata`	Runtime metadata such as worker-specific environment values.

The reference LLM agent builds a compact prompt from topology size, symptom keys, episode data, Pingmesh windows, and high-signal anomalies. A custom agent may instead use rules, symbolic logic, retrieval, additional tool calls, or another model wrapper.
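As an illustration of the rule-based option, the sketch below implements `diagnose(context)` with a single threshold rule. It is not taken from the NetOpsBench codebase: `StubResult` and `StubContext` are local stand-ins so the example runs without the SDK, and the `loss_by_device` symptom key and the 0.2 threshold are assumptions made for illustration only. With the SDK installed, the stand-in result would be replaced by `DiagnosisResult`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StubResult:
    """Local stand-in for DiagnosisResult; fields follow the output table."""
    agent_name: str
    verdict: str
    findings: dict[str, Any] = field(default_factory=dict)
    confidence: float = 0.0
    reasoning: str = ""


@dataclass
class StubContext:
    """Local stand-in for DiagnosticContext; only the fields used here."""
    scenario_id: str
    symptoms: dict[str, Any] = field(default_factory=dict)


class RuleBasedAgent:
    # Hypothetical threshold: loss ratio above which a device is suspected.
    LOSS_THRESHOLD = 0.2

    def diagnose(self, context):
        # ASSUMPTION: symptoms["loss_by_device"] maps device name -> loss ratio.
        # The real symptom layout may differ; inspect context.symptoms first.
        loss = context.symptoms.get("loss_by_device", {})
        if not loss:
            return StubResult(
                "rule-agent", "inconclusive",
                reasoning="No per-device loss data was available.",
            )

        # Pick the device with the highest observed loss ratio.
        device, worst = max(loss.items(), key=lambda item: item[1])
        if worst < self.LOSS_THRESHOLD:
            return StubResult(
                "rule-agent", "network_healthy", confidence=0.6,
                reasoning=f"Highest loss ratio {worst:.2f} is below the threshold.",
            )

        return StubResult(
            agent_name="rule-agent",
            verdict="fault_detected",
            findings={
                "location": {"device": device},
                "evidence": [f"Loss ratio {worst:.2f} on paths crossing {device}."],
            },
            confidence=0.7,
            reasoning=f"Loss is concentrated on {device}.",
        )
```

A loss threshold alone cannot identify the fault type, so this sketch leaves `findings.fault_type` unset; a fuller rule set would add it.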

Output: `DiagnosisResult`

Field	Requirement
`agent_name`	Stable identifier used in reports.
`verdict`	One of `fault_detected`, `network_healthy`, or `inconclusive`.
`success`	`False` when the agent failed to complete.
`findings.fault_type`	Canonical fault label when a fault is detected.
`findings.location.device`	Predicted faulty device when applicable.
`findings.location.interface`	Predicted faulty interface when applicable.
`findings.evidence`	Concise evidence used to support the verdict.
`confidence`	Float between 0 and 1.
`reasoning`	Short explanation tied to observed evidence.
`metadata`	Optional provider, model, tool-call, token, or timing details.

Return structured fields rather than natural language alone. Free-form explanations are useful for review, but scoring depends on the `verdict`, fault type, and location fields.
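Because scoring reads the structured fields, it can help to check them before returning. The helper below is a hypothetical pre-flight check, not part of the SDK: `check_result_fields` and its messages are invented for this sketch, and it treats a missing fault type or device on a `fault_detected` verdict as a problem even though the table above only requires them "when applicable".

```python
ALLOWED_VERDICTS = {"fault_detected", "network_healthy", "inconclusive"}


def check_result_fields(verdict, findings, confidence):
    """Return a list of problems; an empty list means the fields look scoreable."""
    problems = []

    if verdict not in ALLOWED_VERDICTS:
        problems.append(f"unknown verdict: {verdict!r}")

    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number between 0 and 1")

    # A detected fault is only scoreable if it names what and where.
    if verdict == "fault_detected":
        if not findings.get("fault_type"):
            problems.append("fault_detected without findings.fault_type")
        if not findings.get("location", {}).get("device"):
            problems.append("fault_detected without findings.location.device")

    return problems
```

For example, `check_result_fields("fault_detected", {"fault_type": "link_down", "location": {"device": "leaf1"}}, 0.82)` returns an empty list.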

For reproducibility, NetOpsBench saves a per-attempt runtime trace beside the raw scenario result. The bundled reference agent captures private LLM and tool events by attaching `context.trace.langchain_callback()` to its LangChain-compatible runtime. Custom non-LangChain agents can use advanced manual recorder methods such as `context.trace.record_llm_request(...)` and `context.trace.record_llm_response(...)` when they need private model calls in the trace.

The harness writes uniquely named ATIF v1.7 `trajectory-<trace-id>.atif.json` artifacts for Harbor-style inspection while keeping ground truth out of the agent trajectory; scoring details are linked separately through `traces/results.jsonl`. Use `netopsbench trace view` to sync trace-enabled runs into the local Harbor viewer cache, or `netopsbench trace view <run_id>` to ensure a specific saved run is available in the viewer.

Trace storage preserves visible agent-environment interactions with secret redaction and per-field size limits. NetOpsBench does not monkeypatch arbitrary LLM SDKs, so fully private model prompts and responses are captured only when the agent uses a supported framework callback or the manual recorder methods.

Reference agent

`examples/agents/minimal_deepagent/agent.py` defines `MinimalDeepAgent`, the runnable LLM-backed reference implementation used by the public examples. It:

- selects a provider preset such as `openai`, `minimax`, `deepseek`, `zhipu`, or `kimi`;
- builds the built-in MCP server configuration;
- summarizes `DiagnosticContext` into a prompt;
- asks the model for a schema-constrained diagnosis;
- normalizes the model output into `DiagnosisResult`.

The schema in `examples/agents/minimal_deepagent/schema.py` constrains verdict values to the scorer-compatible labels. Keep the same discipline in custom agents.
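One way a custom agent might keep that discipline without a schema library is to coerce the model's reply in plain Python. The sketch below is not the reference agent's schema or normalizer: `normalize_model_output` is a hypothetical helper, and mapping unparseable replies to `success=False` is an assumption about what counts as a failed run.

```python
import json

ALLOWED_VERDICTS = ("fault_detected", "network_healthy", "inconclusive")


def normalize_model_output(raw_text):
    """Coerce a model's JSON reply into scorer-compatible fields."""
    try:
        data = json.loads(raw_text)
    except (TypeError, ValueError):
        data = None
    if not isinstance(data, dict):
        # ASSUMPTION: an unparseable reply is treated as a failed run.
        return {"verdict": "inconclusive", "success": False,
                "findings": {}, "confidence": 0.0}

    # Tolerate casing and separator drift such as "Fault Detected".
    verdict = str(data.get("verdict", "")).strip().lower()
    verdict = verdict.replace(" ", "_").replace("-", "_")
    if verdict not in ALLOWED_VERDICTS:
        verdict = "inconclusive"

    # Clamp confidence into [0, 1]; fall back to 0.0 if it is not numeric.
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    confidence = min(1.0, max(0.0, confidence))

    findings = data.get("findings")
    if not isinstance(findings, dict):
        findings = {}

    return {"verdict": verdict, "success": True,
            "findings": findings, "confidence": confidence}
```

The returned dictionary can then be unpacked into the result constructor; any label the scorer would not recognize degrades to `inconclusive` instead of reaching the report.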

Scoring implications

NetOpsBench scores the returned result along these dimensions:

Dimension	What must be correct
Detection	Faulted cases should return `fault_detected`; healthy cases should return `network_healthy`.
Device localization	`findings.location.device` should match the ground-truth device or an accepted equivalent location.
Interface localization	`findings.location.interface` should match when interface-level ground truth exists.
Fault type	`findings.fault_type` should match the canonical fault label, or a semantically equivalent label when the optional judge is enabled.
Efficiency	Reports may include diagnosis time, tool calls, and token usage from metadata.

For interface-applicable faults, the localization score is split between device and interface correctness. Healthy cases receive credit only for the correct `network_healthy` verdict.
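To make the split concrete, here is a rough sketch of how such scoring could be computed. It is not the NetOpsBench scorer: the 50/50 weighting, the rule that interface credit requires the correct device, and the zero localization score for undetected faults are all assumptions, and accepted equivalent locations are not modeled.

```python
def localization_score(predicted, truth, device_weight=0.5):
    """Score a predicted location dict against a ground-truth location dict."""
    device_ok = predicted.get("device") == truth.get("device")

    # Without interface-level ground truth, the device alone decides.
    if truth.get("interface") is None:
        return 1.0 if device_ok else 0.0

    # ASSUMPTION: an interface match only counts on the correct device.
    interface_ok = device_ok and predicted.get("interface") == truth.get("interface")
    return device_weight * device_ok + (1.0 - device_weight) * interface_ok


def score_case(verdict, location, truth):
    """truth is {'faulted': bool, 'location': {...}}; returns per-dimension scores."""
    if not truth["faulted"]:
        # Healthy cases earn credit only for the correct verdict.
        return {"detection": 1.0 if verdict == "network_healthy" else 0.0}

    detected = verdict == "fault_detected"
    return {
        "detection": 1.0 if detected else 0.0,
        "localization": (
            localization_score(location, truth["location"]) if detected else 0.0
        ),
    }
```

Under these assumptions, naming the right device but the wrong interface earns half of the localization score.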

Practical checklist

- Use exact verdict strings: `fault_detected`, `network_healthy`, `inconclusive`.
- Include device and interface when the evidence supports localization.
- Prefer canonical fault names from the built-in fault catalog.
- Return `success=False` and an `inconclusive` verdict when the agent runtime fails.
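The last checklist item can be handled once with a thin wrapper around any agent. The sketch below is an assumption-laden illustration: `SafeAgent` is a hypothetical name, `StubResult` is a local stand-in for `DiagnosisResult` so the code runs without the SDK, and the `elapsed_s` metadata key is invented for this example.

```python
import time
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StubResult:
    """Local stand-in for DiagnosisResult; fields follow the output table."""
    agent_name: str
    verdict: str
    success: bool = True
    findings: dict[str, Any] = field(default_factory=dict)
    confidence: float = 0.0
    reasoning: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)


class SafeAgent:
    """Wrap another agent so runtime errors become an inconclusive result."""

    def __init__(self, inner, name="safe-agent"):
        self.inner = inner
        self.name = name

    def diagnose(self, context):
        started = time.perf_counter()
        try:
            return self.inner.diagnose(context)
        except Exception as exc:
            # Report the failure in structured form instead of crashing the run.
            return StubResult(
                agent_name=self.name,
                verdict="inconclusive",
                success=False,
                reasoning=f"Agent runtime failed: {type(exc).__name__}",
                # "elapsed_s" is a hypothetical metadata key.
                metadata={"elapsed_s": time.perf_counter() - started},
            )
```

Wrapping as `SafeAgent(MyAgent())` leaves successful results untouched and converts any exception into the failure shape the checklist describes.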
