Custom Troubleshooting Agents

Implement the diagnosis contract that NetOpsBench evaluates.

An agent is any Python object with a diagnose(context) method. NetOpsBench passes a DiagnosticContext; the agent returns a structured DiagnosisResult; the evaluator compares that result with scenario ground truth.

Agent contract

from netopsbench.sdk import DiagnosisResult

class MyAgent:
    def diagnose(self, context):
        return DiagnosisResult(
            agent_name="my-agent",
            verdict="fault_detected",
            findings={
                "fault_type": "link_down",
                "location": {"device": "leaf1", "interface": "Ethernet0"},
                "evidence": ["Pingmesh loss is concentrated on paths crossing leaf1."],
            },
            confidence=0.82,
            reasoning="Path symptoms and interface counters point to a leaf1 uplink.",
        )
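
Because the contract is structural, an agent needs no base class. The sketch below expresses that idea with a typing.Protocol. DiagnosisAgent, RuleAgent, NotAnAgent, and is_agent are hypothetical names used only for illustration and are not part of the NetOpsBench SDK.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DiagnosisAgent(Protocol):
    """Hypothetical structural type: anything exposing diagnose(context)."""

    def diagnose(self, context: Any) -> Any:
        ...


class RuleAgent:
    # No base class required; having a diagnose method is what matters.
    def diagnose(self, context: Any) -> dict:
        return {"verdict": "inconclusive"}


class NotAnAgent:
    # Wrong method name, so this object does not satisfy the contract.
    def run(self, context: Any) -> dict:
        return {}


def is_agent(candidate: object) -> bool:
    # runtime_checkable protocols only verify that the method exists,
    # not its signature or return type.
    return isinstance(candidate, DiagnosisAgent)
```

A check like is_agent(RuleAgent()) can catch a misspelled method name before a run starts, but it cannot confirm that the returned object is a valid DiagnosisResult.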

Pass the agent to a session API:

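from netopsbench.sdk import NetOpsBench

# MyAgent is the class from the agent contract example above.
with NetOpsBench(workspace=".") as bench:
    run = bench.sessions.run_scenario(
        scenario="scenarios/generated/xs/generated_link_down_xs_001.yaml",
        agent=MyAgent(),
    )
    # Wait for the run to finish; raise_on_failure=True surfaces a failed run as an exception.
    report = run.wait(raise_on_failure=True)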

Input: DiagnosticContext

Field          Meaning
scenario_id    Scenario or testcase identifier.
topology       Switches, clients, links, and scale-specific topology metadata.
symptoms       Episode observations, including Pingmesh summaries when available.
tools          Optional runtime or MCP-backed tool access.
metadata       Runtime metadata such as worker-specific environment values.
ground_truth   Optional ground truth. Benchmark agents should not use it for diagnosis.

The reference LLM agent builds a compact prompt from topology size, symptom keys, episode data, Pingmesh windows, and high-signal anomalies. A custom agent may instead use rules, symbolic logic, retrieval, additional tool calls, or another model wrapper.
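
As one illustration of the rule-based option, the sketch below flags the device with the highest path loss. It makes several assumptions that the documentation does not confirm: the symptom key loss_by_device, the LOSS_THRESHOLD cutoff, and the confidence values are all hypothetical, and a SimpleNamespace stands in for DiagnosticContext so the example runs without NetOpsBench installed. It returns a plain dict that mirrors the DiagnosisResult fields; a real agent would construct DiagnosisResult instead.

```python
from types import SimpleNamespace

LOSS_THRESHOLD = 0.05  # hypothetical cutoff, not a NetOpsBench default


class ThresholdAgent:
    """Rule-based sketch: flag the device with the highest path loss."""

    def diagnose(self, context):
        # "loss_by_device" is an assumed symptom key used for illustration.
        loss = context.symptoms.get("loss_by_device", {})
        if not loss:
            # No usable observations: do not guess.
            return {"verdict": "inconclusive", "findings": {}, "confidence": 0.0}
        device, worst = max(loss.items(), key=lambda item: item[1])
        if worst < LOSS_THRESHOLD:
            return {"verdict": "network_healthy", "findings": {}, "confidence": 0.6}
        return {
            "verdict": "fault_detected",
            "findings": {
                "location": {"device": device},
                "evidence": [f"Loss on paths through {device} is {worst:.0%}."],
            },
            "confidence": 0.7,
        }


# Stand-in for DiagnosticContext so the sketch runs without NetOpsBench.
demo_context = SimpleNamespace(
    symptoms={"loss_by_device": {"leaf1": 0.4, "leaf2": 0.01}}
)
demo_result = ThresholdAgent().diagnose(demo_context)
```

Note that the sketch reads only symptoms and never touches ground_truth.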

Output: DiagnosisResult

Field                         Requirement
agent_name                    Stable identifier used in reports.
verdict                       One of fault_detected, network_healthy, or inconclusive.
success                       False when the agent failed to complete.
findings.fault_type           Canonical fault label when a fault is detected.
findings.location.device      Predicted faulty device when applicable.
findings.location.interface   Predicted faulty interface when applicable.
findings.evidence             Concise evidence used to support the verdict.
confidence                    Float between 0 and 1.
reasoning                     Short explanation tied to observed evidence.
metadata                      Optional provider, model, tool-call, token, or timing details.

Return structured fields rather than only natural language. Free-form explanations are useful for review, but scoring depends on verdict, fault type, and location fields.
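
A lightweight pre-flight check can catch malformed results before they reach the scorer. The helper below is a hypothetical sketch, not an SDK function: it inspects a plain dict shaped like the fields in the table above and reports violations of the stated requirements (allowed verdict strings, confidence range, and a fault type when a fault is detected).

```python
from typing import List

ALLOWED_VERDICTS = {"fault_detected", "network_healthy", "inconclusive"}


def check_result_fields(result: dict) -> List[str]:
    """Return a list of problems found in a dict shaped like DiagnosisResult."""
    problems = []

    verdict = result.get("verdict")
    if verdict not in ALLOWED_VERDICTS:
        problems.append(f"unknown verdict: {verdict!r}")

    confidence = result.get("confidence")
    # bool is a subclass of int, so exclude it explicitly.
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        problems.append("confidence must be a number between 0 and 1")

    findings = result.get("findings") or {}
    if verdict == "fault_detected" and not findings.get("fault_type"):
        problems.append("fault_detected requires findings.fault_type")

    return problems
```

An empty list means the structural requirements are met; it says nothing about whether the diagnosis itself is correct.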

For reproducibility, NetOpsBench saves a per-case runtime trace beside the raw scenario result. The bundled reference agent captures private LLM and tool events by attaching context.trace.langchain_callback() to its LangChain-compatible runtime. Custom non-LangChain agents can use advanced manual recorder methods such as context.trace.record_llm_request(...) and context.trace.record_llm_response(...) when they need private model calls in the trace.

The harness writes ATIF v1.7 trajectory.atif.json artifacts for Harbor-style inspection while keeping ground truth out of the agent trajectory; scoring details are linked separately through traces/results.jsonl.

Use netopsbench trace view to sync trace-enabled runs into the local Harbor viewer cache, or netopsbench trace view <run_id> to ensure a specific saved run is available in the viewer.

Trace storage preserves visible agent-environment interactions with secret redaction and per-field size limits. NetOpsBench does not monkeypatch arbitrary LLM SDKs, so fully private model prompts and responses are captured only when the agent uses a supported framework callback or the manual recorder methods.

Reference agent

examples/agents/minimal_deepagent/agent.py defines MinimalDeepAgent, the runnable LLM-backed reference implementation used by the public examples. It:

  1. selects a provider preset such as openai, minimax, deepseek, zhipu, or kimi;
  2. builds the built-in MCP server configuration;
  3. summarizes DiagnosticContext into a prompt;
  4. asks the model for a schema-constrained diagnosis;
  5. normalizes the model output into DiagnosisResult.

The schema in examples/agents/minimal_deepagent/schema.py constrains verdict values to the scorer-compatible labels. Keep the same discipline in custom agents.
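
One way to keep that discipline in a custom agent is to normalize raw model text before building the result. The sketch below is an assumption-laden illustration, not the logic in schema.py: the ALIASES table is hypothetical, and treating unparseable output as success=False is a design choice made here rather than documented behavior.

```python
import json

ALLOWED = ("fault_detected", "network_healthy", "inconclusive")
# Hypothetical aliases a model might emit; extend to match your own prompts.
ALIASES = {
    "fault": "fault_detected",
    "healthy": "network_healthy",
    "unknown": "inconclusive",
}


def normalize_model_output(raw: str) -> dict:
    """Coerce raw model text into scorer-compatible fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # Unparseable output is treated as an incomplete diagnosis.
        return {"verdict": "inconclusive", "success": False,
                "findings": {}, "confidence": 0.0}

    # Accept variants such as "Fault Detected" or "fault-detected".
    label = str(data.get("verdict", "")).strip().lower()
    label = label.replace(" ", "_").replace("-", "_")
    verdict = label if label in ALLOWED else ALIASES.get(label, "inconclusive")

    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    confidence = min(1.0, max(0.0, confidence))  # clamp into [0, 1]

    findings = data.get("findings")
    return {
        "verdict": verdict,
        "success": True,
        "findings": findings if isinstance(findings, dict) else {},
        "confidence": confidence,
    }
```

Any label outside the allowed set and the alias table falls back to inconclusive, so an unexpected model phrasing never reaches the scorer as an unknown verdict.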

Scoring implications

NetOpsBench scores the returned result along these dimensions:

Dimension                What must be correct
Detection                Faulted cases should return fault_detected; healthy negative samples should return network_healthy.
Device localization      findings.location.device should match the ground-truth device or an accepted equivalent location.
Interface localization   findings.location.interface should match when interface-level ground truth exists.
Fault type               findings.fault_type should match the canonical fault label, or a semantically equivalent label when the optional judge is enabled.
Efficiency               Reports may include diagnosis time, tool calls, and token usage from metadata.

For interface-applicable faults, localization score is split between device and interface correctness. Healthy negative samples receive credit only for the correct network_healthy verdict.
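
To make the split concrete, the sketch below computes a localization score for a single faulted case. It is an illustration under stated assumptions, not the NetOpsBench scorer: the even 0.5 weighting is assumed, interface credit is assumed to require the correct device, and accepted equivalent locations are not modeled.

```python
def localization_score(
    predicted: dict,
    truth: dict,
    device_weight: float = 0.5,  # assumed weight; the real split is not stated here
) -> float:
    """Illustrative localization score for one faulted case."""
    device_ok = predicted.get("device") == truth.get("device")
    if truth.get("interface") is None:
        # No interface-level ground truth: the device alone decides the score.
        return 1.0 if device_ok else 0.0
    # Assumption: an interface match only counts on the correct device.
    interface_ok = device_ok and predicted.get("interface") == truth["interface"]
    return device_weight * device_ok + (1.0 - device_weight) * interface_ok
```

Under these assumptions, naming the right device but the wrong interface earns half of the localization credit, which is why the checklist recommends including an interface only when the evidence supports it.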

Practical checklist

  • Use exact verdict strings: fault_detected, network_healthy, inconclusive.
  • Include device and interface when the evidence supports localization.
  • Prefer canonical fault names from the built-in fault catalog.
  • Treat ground_truth as evaluator data, not as diagnosis input.
  • Return success=False and an inconclusive verdict when the agent runtime fails.
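
The last checklist item can be handled once with a wrapper rather than in every agent. The sketch below is a minimal illustration: SafeAgent, FlakyAgent, and FallbackResult are hypothetical names, and FallbackResult is a stand-in dataclass so the example runs without NetOpsBench. In a real agent, the fallback would be a DiagnosisResult, assuming its constructor accepts the success field listed in the output table.

```python
from dataclasses import dataclass, field


@dataclass
class FallbackResult:
    """Stand-in for DiagnosisResult so the sketch runs without NetOpsBench."""
    agent_name: str
    verdict: str
    success: bool = True
    findings: dict = field(default_factory=dict)
    confidence: float = 0.0
    reasoning: str = ""


class SafeAgent:
    """Wrap another agent and convert runtime errors into an inconclusive result."""

    def __init__(self, inner, agent_name: str = "safe-agent"):
        self.inner = inner
        self.agent_name = agent_name

    def diagnose(self, context):
        try:
            return self.inner.diagnose(context)
        except Exception as exc:  # broad on purpose: any failure ends the diagnosis
            return FallbackResult(
                agent_name=self.agent_name,
                verdict="inconclusive",
                success=False,
                reasoning=f"Agent runtime failed: {type(exc).__name__}",
            )


class FlakyAgent:
    """Simulates an agent whose model endpoint times out."""

    def diagnose(self, context):
        raise TimeoutError("model endpoint did not respond")


fallback = SafeAgent(FlakyAgent()).diagnose(context=None)
```

The wrapper records only the exception type in reasoning, which keeps request details and credentials out of reports.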