Custom Troubleshooting Agents
Implement the diagnosis contract that NetOpsBench evaluates.
An agent is any Python object with a diagnose(context) method. NetOpsBench passes a DiagnosticContext; the agent returns a structured DiagnosisResult; the evaluator compares that result with scenario ground truth.
Agent contract
```python
from netopsbench.sdk import DiagnosisResult


class MyAgent:
    def diagnose(self, context):
        return DiagnosisResult(
            agent_name="my-agent",
            verdict="fault_detected",
            findings={
                "fault_type": "link_down",
                "location": {"device": "leaf1", "interface": "Ethernet0"},
                "evidence": ["Pingmesh loss is concentrated on paths crossing leaf1."],
            },
            confidence=0.82,
            reasoning="Path symptoms and interface counters point to a leaf1 uplink.",
        )
```

Pass the agent to a session API:
```python
from netopsbench.sdk import NetOpsBench

with NetOpsBench(workspace=".") as bench:
    run = bench.sessions.run_scenario(
        scenario="scenarios/generated/xs/generated_link_down_xs_001.yaml",
        agent=MyAgent(),
    )
    report = run.wait(raise_on_failure=True)
```

Input: DiagnosticContext
| Field | Meaning |
|---|---|
| `scenario_id` | Scenario or testcase identifier. |
| `topology` | Switches, clients, links, and scale-specific topology metadata. |
| `symptoms` | Episode observations, including Pingmesh summaries when available. |
| `tools` | Optional runtime or MCP-backed tool access. |
| `metadata` | Runtime metadata such as worker-specific environment values. |
| `ground_truth` | Optional ground truth. Benchmark agents should not use it for diagnosis. |
The reference LLM agent builds a compact prompt from topology size, symptom keys, episode data, Pingmesh windows, and high-signal anomalies. A custom agent may instead use rules, symbolic logic, retrieval, additional tool calls, or another model wrapper.
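As one illustration of the rule-based option, the sketch below flags the device whose paths show the most loss. It rests on several assumptions: the `loss_by_device` key inside `symptoms`, the 0.2 threshold, the confidence values, and the mapping of high loss to `link_down` are all hypothetical, and the `DiagnosisResult` dataclass is a local stand-in so the snippet runs without NetOpsBench installed. A real agent would import `DiagnosisResult` from `netopsbench.sdk` and read whichever symptom keys the context actually provides.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class DiagnosisResult:
    """Local stand-in mirroring the documented DiagnosisResult fields."""
    agent_name: str
    verdict: str
    findings: dict = field(default_factory=dict)
    confidence: float = 0.0
    reasoning: str = ""


class LossThresholdAgent:
    """Rule-based agent: blames the device whose paths show the most loss."""

    LOSS_THRESHOLD = 0.2  # hypothetical cut-off, not a NetOpsBench default

    def diagnose(self, context):
        # ASSUMPTION: symptoms is a dict with a hypothetical "loss_by_device" map.
        loss = (context.symptoms or {}).get("loss_by_device", {})
        if not loss:
            return DiagnosisResult("loss-threshold", "inconclusive",
                                   reasoning="No per-device loss data in symptoms.")
        device, worst = max(loss.items(), key=lambda item: item[1])
        if worst < self.LOSS_THRESHOLD:
            return DiagnosisResult("loss-threshold", "network_healthy",
                                   confidence=0.6,
                                   reasoning=f"Peak loss {worst:.0%} is below threshold.")
        return DiagnosisResult(
            agent_name="loss-threshold",
            verdict="fault_detected",
            findings={
                "fault_type": "link_down",  # illustrative label only
                "location": {"device": device},
                "evidence": [f"{worst:.0%} loss on paths crossing {device}."],
            },
            confidence=0.7,
            reasoning=f"Loss is concentrated on {device}.",
        )


# SimpleNamespace stands in for a DiagnosticContext in this demo.
ctx = SimpleNamespace(symptoms={"loss_by_device": {"leaf1": 0.45, "leaf2": 0.01}})
result = LossThresholdAgent().diagnose(ctx)
print(result.verdict, result.findings["location"]["device"])
```

Because the contract only requires a `diagnose(context)` method, the same class can be passed as `agent=` once the stand-in is swapped for the SDK type.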
Output: DiagnosisResult
| Field | Requirement |
|---|---|
| `agent_name` | Stable identifier used in reports. |
| `verdict` | One of `fault_detected`, `network_healthy`, or `inconclusive`. |
| `success` | `False` when the agent failed to complete. |
| `findings.fault_type` | Canonical fault label when a fault is detected. |
| `findings.location.device` | Predicted faulty device when applicable. |
| `findings.location.interface` | Predicted faulty interface when applicable. |
| `findings.evidence` | Concise evidence used to support the verdict. |
| `confidence` | Float between 0 and 1. |
| `reasoning` | Short explanation tied to observed evidence. |
| `metadata` | Optional provider, model, tool-call, token, or timing details. |
Return structured fields rather than only natural language. Free-form explanations are useful for review, but scoring depends on verdict, fault type, and location fields.
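A quick pre-return check can catch results that would not score. The helper below is a hypothetical sketch and not part of the NetOpsBench SDK: it works on plain values and encodes only the requirements listed in the table above.

```python
ALLOWED_VERDICTS = {"fault_detected", "network_healthy", "inconclusive"}


def check_result_fields(verdict, findings, confidence):
    """Return a list of problems; an empty list means the fields look scorable."""
    problems = []
    if verdict not in ALLOWED_VERDICTS:
        problems.append(f"unknown verdict: {verdict!r}")
    if not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence out of range: {confidence}")
    # A detected fault needs a fault label for the fault-type dimension.
    if verdict == "fault_detected" and not findings.get("fault_type"):
        problems.append("fault_detected without findings.fault_type")
    return problems


good = check_result_fields("fault_detected", {"fault_type": "link_down"}, 0.82)
bad = check_result_fields("Fault Detected", {}, 1.4)
print(good)      # no problems found
print(len(bad))  # the verdict and the confidence are both rejected
```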
For reproducibility, NetOpsBench saves a per-case runtime trace beside the raw scenario result. The bundled reference agent captures private LLM and tool events by attaching context.trace.langchain_callback() to its LangChain-compatible runtime. Custom non-LangChain agents can use advanced manual recorder methods such as context.trace.record_llm_request(...) and context.trace.record_llm_response(...) when they need private model calls in the trace. The harness writes ATIF v1.7 trajectory.atif.json artifacts for Harbor-style inspection while keeping ground truth out of the agent trajectory; scoring details are linked separately through traces/results.jsonl. Use netopsbench trace view to sync trace-enabled runs into the local Harbor viewer cache, or netopsbench trace view <run_id> to ensure a specific saved run is available in the viewer.
Trace storage preserves visible agent-environment interactions with secret redaction and per-field size limits. NetOpsBench does not monkeypatch arbitrary LLM SDKs, so fully private model prompts and responses are captured only when the agent uses a supported framework callback or the manual recorder methods.
Reference agent
examples/agents/minimal_deepagent/agent.py defines MinimalDeepAgent, the runnable LLM-backed reference implementation used by the public examples. It:
- selects a provider preset such as `openai`, `minimax`, `deepseek`, `zhipu`, or `kimi`;
- builds the built-in MCP server configuration;
- summarizes `DiagnosticContext` into a prompt;
- asks the model for a schema-constrained diagnosis;
- normalizes the model output into `DiagnosisResult`.
The schema in examples/agents/minimal_deepagent/schema.py constrains verdict values to the scorer-compatible labels. Keep the same discipline in custom agents.
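For agents that parse raw model text rather than relying on a schema-constrained runtime, the normalization step might look like the sketch below. It is not the code in `schema.py`: the function name, the fallback to `inconclusive`, and the clamping of `confidence` are assumptions chosen for illustration.

```python
import json

ALLOWED_VERDICTS = ("fault_detected", "network_healthy", "inconclusive")


def normalize_model_output(raw_text):
    """Coerce a model's JSON reply into scorer-compatible fields."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}

    # Tolerate case and spacing differences, but never invent a verdict.
    verdict = str(data.get("verdict", "")).strip().lower().replace(" ", "_")
    if verdict not in ALLOWED_VERDICTS:
        verdict = "inconclusive"

    # Clamp confidence into [0, 1]; fall back to 0.0 on non-numeric values.
    try:
        confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
    except (TypeError, ValueError):
        confidence = 0.0

    findings = data.get("findings") if isinstance(data.get("findings"), dict) else {}
    return {"verdict": verdict, "confidence": confidence, "findings": findings}


reply = '{"verdict": "Fault Detected", "confidence": 1.7, "findings": {"fault_type": "link_down"}}'
print(normalize_model_output(reply)["verdict"])  # → fault_detected
```

Unparseable replies fall through to `inconclusive` with zero confidence, so a malformed answer is never scored as a confident diagnosis.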
Scoring implications
NetOpsBench scores the returned result along these dimensions:
| Dimension | What must be correct |
|---|---|
| Detection | Faulted cases should return fault_detected; healthy negative samples should return network_healthy. |
| Device localization | findings.location.device should match the ground-truth device or an accepted equivalent location. |
| Interface localization | findings.location.interface should match when interface-level ground truth exists. |
| Fault type | findings.fault_type should match the canonical fault label, or a semantically equivalent label when the optional judge is enabled. |
| Efficiency | Reports may include diagnosis time, tool calls, and token usage from metadata. |
For interface-applicable faults, localization score is split between device and interface correctness. Healthy negative samples receive credit only for the correct network_healthy verdict.
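The split can be pictured with a small function. This is only a sketch of the idea, not NetOpsBench's scorer: the equal 0.5/0.5 weighting, the exact-match comparison, and the rule that an interface counts only on the correct device are all assumptions, and the real evaluator may weight the parts differently or accept equivalent locations.

```python
def localization_score(predicted, truth):
    """Illustrative localization score in [0, 1] with an assumed 50/50 split."""
    device_ok = predicted.get("device") == truth.get("device")
    if truth.get("interface") is None:
        # No interface-level ground truth: the device carries the whole score.
        return 1.0 if device_ok else 0.0
    # ASSUMPTION: an interface match only counts on the correct device.
    interface_ok = device_ok and predicted.get("interface") == truth.get("interface")
    return 0.5 * device_ok + 0.5 * interface_ok


truth = {"device": "leaf1", "interface": "Ethernet0"}
print(localization_score({"device": "leaf1", "interface": "Ethernet0"}, truth))  # → 1.0
print(localization_score({"device": "leaf1", "interface": "Ethernet4"}, truth))  # → 0.5
print(localization_score({"device": "leaf2", "interface": "Ethernet0"}, truth))  # → 0.0
```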
Practical checklist
- Use exact verdict strings: `fault_detected`, `network_healthy`, `inconclusive`.
- Include device and interface when the evidence supports localization.
- Prefer canonical fault names from the built-in fault catalog.
- Treat `ground_truth` as evaluator data, not as diagnosis input.
- Return `success=False` and an `inconclusive` verdict when the agent runtime fails.
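The last item can be handled with a thin wrapper around any agent. The sketch below is one possible approach: `SafeAgent` and `BrokenAgent` are hypothetical names, and `DiagnosisResult` is a local stand-in for the SDK class, assumed here to accept `success` as a keyword as the output table suggests.

```python
from dataclasses import dataclass, field


@dataclass
class DiagnosisResult:
    """Local stand-in; use netopsbench.sdk.DiagnosisResult in a real agent."""
    agent_name: str
    verdict: str
    success: bool = True
    findings: dict = field(default_factory=dict)
    confidence: float = 0.0
    reasoning: str = ""


class SafeAgent:
    """Wraps another agent so runtime errors become an inconclusive result."""

    def __init__(self, inner, agent_name="safe-agent"):
        self.inner = inner
        self.agent_name = agent_name

    def diagnose(self, context):
        try:
            return self.inner.diagnose(context)
        except Exception as exc:  # broad on purpose: any failure must yield a result
            return DiagnosisResult(
                agent_name=self.agent_name,
                verdict="inconclusive",
                success=False,
                reasoning=f"Agent runtime failed: {type(exc).__name__}: {exc}",
            )


class BrokenAgent:
    """Demo agent that always fails, standing in for a model timeout."""

    def diagnose(self, context):
        raise TimeoutError("model endpoint did not respond")


result = SafeAgent(BrokenAgent()).diagnose(context=None)
print(result.success, result.verdict)  # → False inconclusive
```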