NetOpsBench: Open Arena for NetOps in AI Infrastructure

External integrations should import through the SDK root:

from netopsbench.sdk import NetOpsBench

Use the checked-in examples for runnable scripts and the SDK for application code. Treat netopsbench.platform.* as internal implementation detail.

Run one scenario

Prepare scenarios first with netopsbench benchmark prepare --scales xs, then call run_scenario(...):

from examples.agents import MinimalDeepAgent
from netopsbench.sdk import NetOpsBench, RunFailedError

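with NetOpsBench(workspace=".") as bench:
    # wrap() gives the SDK one lifecycle boundary for closing agent resources.
    agent = bench.agents.wrap(MinimalDeepAgent(vendor="openai"))
    run = bench.sessions.run_scenario(
        scenario="scenarios/generated/xs/generated_link_down_xs_001.yaml",
        agent=agent,
    )
    try:
        # Session calls are synchronous; wait() loads the saved BenchmarkReport.
        report = run.wait(raise_on_failure=True)
    except RunFailedError as exc:
        # The exception still carries the report for inspection before re-raising.
        report = exc.report
        raise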

print(report.summary)
print(run.report_path)

run is a RunHandle. It records the run id, mode, runtime id, artifact directory, scenario ids, status, and persisted report path. run.wait() loads the saved BenchmarkReport.

bench.agents.wrap(...) is recommended for provider-backed or asynchronous agents because it gives the SDK one lifecycle boundary to close resources.

:::note NetOpsBench 0.2
Session calls are synchronous. The placeholder RunHandle.cancel() method and the ignored wait(timeout=...) argument were removed; call run.wait() after a session call returns. Runtime objects are SDK facades: use status() or describe() instead of private platform payload methods.
:::

Session APIs

API	Use when	Runtime ownership
`run_scenario(...)`	One scenario for environment validation or agent debugging.	SDK provisions and tears down a runtime unless `keep_runtime=True`.
`run_suite(...)`	One agent over multiple scenarios.	SDK provisions a runtime pool; `workers=N` enables parallel execution.
`run_on_runtime_scenario(...)`	You already provisioned a runtime and want one case.	Caller owns teardown.
`run_on_runtime_suite(...)`	You already provisioned a runtime pool and want a suite.	Caller owns teardown.

Common options:

Option	Meaning
`workers`	Number of isolated runtime workers for automatic suite runs.
`keep_runtime`	Preserve an automatically provisioned runtime for inspection.
`artifacts_dir`	Override where `report.json`, metadata, and raw scenario outputs are written.
`scale`	Explicit topology scale when it cannot be inferred from scenario paths.
`trace`	Save per-case agent runtime traces. Defaults to `True`.

Artifacts

When artifacts_dir is omitted, session artifacts are written under the workspace-managed artifact root. A run directory contains:

report.json for the final BenchmarkReport;
metadata.json for run-level metadata;
raw/ for worker-local scenario outputs;
traces/ for per-attempt agent runtime traces, including uniquely named ATIF v1.7 trajectory-<trace-id>.atif.json files, a run-level index.jsonl, and the scoring sidecar results.jsonl.
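The layout above can be checked with a few lines of standard-library Python before a run directory is handed to other tooling. This is a minimal sketch, not part of the SDK: `missing_run_artifacts` is a hypothetical helper, and it assumes `traces/` may be absent when a run was made with `trace=False`.

```python
from pathlib import Path
from typing import List, Union

# Entries named in the run-directory layout described above.
EXPECTED_FILES = ("report.json", "metadata.json")
EXPECTED_DIRS = ("raw", "traces")


def missing_run_artifacts(run_dir: Union[str, Path], expect_traces: bool = True) -> List[str]:
    """Return the expected entries that are absent from a run directory."""
    root = Path(run_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    for name in EXPECTED_DIRS:
        if name == "traces" and not expect_traces:
            # Assumption: traces/ is not created for runs made with trace=False.
            continue
        if not (root / name).is_dir():
            missing.append(name + "/")
    return missing
```

An empty list means every expected entry is present; anything else names what to look for before debugging further.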

The scenario_summaries[*].raw_result_path fields point to raw JSON files for case-level debugging.
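For case-level debugging without the SDK, the raw files can be followed straight from `report.json`. The sketch below assumes that the serialized report keeps `scenario_summaries` as a top-level list of objects and that a relative `raw_result_path` is relative to the run directory; neither detail is confirmed here, and `load_raw_results` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_raw_results(run_dir):
    """Map each raw_result_path listed in report.json to its parsed JSON."""
    root = Path(run_dir)
    report = json.loads((root / "report.json").read_text(encoding="utf-8"))
    results = {}
    for summary in report.get("scenario_summaries", []):
        raw_path = summary.get("raw_result_path")
        if not raw_path:
            continue  # summary without a raw result pointer
        path = Path(raw_path)
        if not path.is_absolute():
            # Assumption: relative paths are relative to the run directory.
            path = root / path
        if path.is_file():
            results[raw_path] = json.loads(path.read_text(encoding="utf-8"))
    return results
```

Paths that do not resolve to a file are skipped rather than raised, which keeps the helper usable on partially copied run directories.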

Agent traces are saved by default and can be disabled for a run with trace=False. Disabling trace prevents private runtime trace collection and sidecar artifact creation. Ground truth and score details are written to traces/results.jsonl, not into the agent trajectory.
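The trace index and scoring sidecar are JSON Lines files, so they can be read offline with the standard library. This is a rough sketch for inspection outside the SDK (inside it, `bench.artifacts.get_run_traces(...)` and `bench.artifacts.get_run_trace_results(...)` are the documented accessors). The record fields are not described in this guide, so the helper returns plain dictionaries and treats missing files as an empty result, which is what a `trace=False` run is assumed to look like.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def read_trace_sidecars(run_dir):
    """Return (index, results) for a run; a missing file yields an empty list."""
    traces = Path(run_dir) / "traces"
    index_path = traces / "index.jsonl"
    results_path = traces / "results.jsonl"
    index = read_jsonl(index_path) if index_path.is_file() else []
    results = read_jsonl(results_path) if results_path.is_file() else []
    return index, results
```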

NetOpsBench stores visible prompts, model messages, tool calls, and observations with secret redaction and a fixed per-field safety limit. The bundled MinimalDeepAgent attaches context.trace.langchain_callback() to its LangChain-compatible runtime so private LLM messages and tool events flow into the same recorder. Non-LangChain agents can use the advanced manual recorder methods, such as context.trace.record_llm_request(...) and context.trace.record_llm_response(...), when they need to capture private model calls.
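To make "secret redaction and a fixed per-field safety limit" concrete, here is an illustrative sketch of that kind of sanitizer. It is not NetOpsBench's implementation: the pattern, the limit value, and the `sanitize_field` name are all hypothetical and chosen only to show the redact-then-truncate order.

```python
import re

# Hypothetical values for illustration; the real patterns and limit are not documented here.
MAX_FIELD_CHARS = 2000
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|password)\b(\s*[=:]\s*)(\S+)")


def sanitize_field(text: str, limit: int = MAX_FIELD_CHARS) -> str:
    """Redact secret-looking values, then cap the field length."""
    redacted = SECRET_PATTERN.sub(r"\1\2[REDACTED]", text)
    if len(redacted) > limit:
        return redacted[:limit] + "...[truncated]"
    return redacted
```

Redacting before truncating matters: cutting first could slice a secret in half and leave a fragment that no longer matches the pattern.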

Open a completed run directly in the Harbor viewer:

netopsbench trace list
netopsbench trace view
netopsbench trace view run-20260605T124040Z

The trace view command exports Harbor-compatible viewer files under <workspace>/.netopsbench/harbor-jobs and starts the local viewer. Export traces from Python when you need a reusable Harbor jobs directory:

bench.artifacts.export_traces(run.id, output="harbor-jobs")
trace_index = bench.artifacts.get_run_traces(run.id)
trace_results = bench.artifacts.get_run_trace_results(run.id)

The CLI also supports an explicit export path for CI or offline inspection:

netopsbench trace export run-20260605T124040Z --output harbor-jobs
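In a CI job written in Python, the same export can be driven through `subprocess`. The sketch below only assembles the documented command line; `trace_export_command` and `export_trace` are hypothetical wrapper names, and treating a non-zero exit code as failure is an assumption about the CLI.

```python
import subprocess
from typing import List


def trace_export_command(run_id: str, output: str = "harbor-jobs") -> List[str]:
    """Build the argv for `netopsbench trace export <run-id> --output <dir>`."""
    return ["netopsbench", "trace", "export", run_id, "--output", output]


def export_trace(run_id: str, output: str = "harbor-jobs") -> None:
    """Run the export; raises CalledProcessError on a non-zero exit code."""
    # Assumption: the CLI signals failure through its exit code.
    subprocess.run(trace_export_command(run_id, output), check=True)
```

Passing the arguments as a list avoids shell quoting issues when the output path contains spaces.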

NetOpsBench validates exported job/result.json and trial/result.json with Harbor's own models before writing them.

SDK managers

NetOpsBench groups public operations under managers:

Manager	Use
`bench.scales`	Inspect the resolved scale registry and custom profiles.
`bench.scenarios`	Create, load, validate, and save canonical scenarios.
`bench.agents`	Wrap or manage objects implementing `diagnose(context)`.
`bench.sessions`	Run scenarios and suites.
`bench.runtimes`	Provision, attach, list, and tear down runtime pools.
`bench.faults`	Register or inspect fault definitions.
`bench.evaluators`	Score `DiagnosisResult` objects and produce canonical reports.
`bench.simulators`	Create interactive diagnostic environments over benchmark scenarios.
`bench.artifacts`	Resolve run artifacts and report files.
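The `bench.agents` row describes agents structurally: any object with a `diagnose(context)` method. A minimal sketch of that shape follows. `DiagnosingAgent` and `StaticAgent` are hypothetical names, and the returned dictionary is a placeholder because the real `DiagnosisResult` fields are not shown in this guide.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DiagnosingAgent(Protocol):
    """Structural type for the one method the table names."""

    def diagnose(self, context: Any) -> Any: ...


class StaticAgent:
    """Toy agent that always returns the same placeholder diagnosis."""

    def diagnose(self, context: Any) -> dict:
        # Placeholder payload; the real DiagnosisResult shape is not shown here.
        return {"diagnosis": "unknown", "context_seen": context is not None}
```

With `runtime_checkable`, `isinstance(obj, DiagnosingAgent)` only verifies that a `diagnose` attribute exists, so it is a cheap pre-flight check rather than a guarantee of correct behavior.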

The stable external boundary is netopsbench.sdk, documented example scripts, and documented CLI commands such as benchmark prepare, scenario validate, result show, and runtime teardown.

Migrating to 0.2

Version 0.2 uses one canonical episode per scenario. Replace an episodes: list with a singular episode: object; legacy multi-episode YAML now fails with an explicit migration error. ScenarioManager.create() likewise accepts only episode= and returns ScenarioSpec directly.
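For scenario files that already hold exactly one episode, the rewrite is mechanical. The sketch below works on a scenario that has been parsed into a dictionary; it assumes `episodes` and `episode` are top-level keys, and `migrate_episodes` is a hypothetical helper rather than an SDK function. Scenarios with several episodes are rejected because choosing which one survives is a manual decision.

```python
from copy import deepcopy


def migrate_episodes(scenario: dict) -> dict:
    """Return a copy with a single-item `episodes` list rewritten to `episode`."""
    if "episodes" not in scenario:
        return deepcopy(scenario)  # nothing to migrate
    if "episode" in scenario:
        raise ValueError("scenario defines both 'episodes' and 'episode'")
    episodes = scenario["episodes"]
    if not isinstance(episodes, list) or len(episodes) != 1:
        raise ValueError("expected exactly one episode; migrate this scenario by hand")
    # Copy every other key unchanged, then add the singular episode object.
    migrated = {key: deepcopy(value) for key, value in scenario.items() if key != "episodes"}
    migrated["episode"] = deepcopy(episodes[0])
    return migrated
```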

Reports now use only the canonical id, summary, scenario_summaries, detailed_results, artifact_paths, and raw fields. The 0.1 ScenarioHandle, PlatformDefaults, ScenarioEvaluator, report_id, and payload compatibility names are not part of the 0.2 SDK.
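Downstream tooling that parses saved reports can flag leftovers from 0.1 with a small check. This sketch assumes the serialized report exposes the canonical fields as top-level keys, which this guide does not state explicitly; `check_report_fields` is a hypothetical helper.

```python
CANONICAL_REPORT_FIELDS = (
    "id", "summary", "scenario_summaries", "detailed_results", "artifact_paths", "raw",
)
# 0.1 compatibility names that appear as report fields in the migration notes.
LEGACY_REPORT_FIELDS = ("report_id", "payload")


def check_report_fields(report: dict) -> dict:
    """List missing canonical fields and any 0.1 names still present."""
    return {
        "missing": [name for name in CANONICAL_REPORT_FIELDS if name not in report],
        "legacy": [name for name in LEGACY_REPORT_FIELDS if name in report],
    }
```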

Fat-tree topology payloads now expose only the real cores, aggs, edges, and clients groups. The former compatibility projection that duplicated cores as spines, edges as leafs, and a client's edge as leaf was incorrect and has been removed. CLOS payloads continue to expose spines, leafs, and clients.
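Code that used to rely on the `spines`/`leafs` aliases for fat-tree payloads now has to branch on the group names that are actually present. A minimal sketch, assuming the payload is a mapping whose top-level keys are the group names and whose values are lists (the real payload shape may differ); `topology_groups` is a hypothetical helper.

```python
FAT_TREE_GROUPS = ("cores", "aggs", "edges", "clients")
CLOS_GROUPS = ("spines", "leafs", "clients")


def topology_groups(payload: dict) -> dict:
    """Return the groups in a payload, keyed by the topology family's own names."""
    if all(name in payload for name in ("cores", "aggs", "edges")):
        names = FAT_TREE_GROUPS
    elif all(name in payload for name in ("spines", "leafs")):
        names = CLOS_GROUPS
    else:
        raise ValueError("payload matches neither the fat-tree nor the CLOS group names")
    # Missing optional groups (for example, no clients) default to an empty list.
    return {name: payload.get(name, []) for name in names}
```

Because the result never invents `spines` or `leafs` for a fat-tree, a consumer that still indexes those keys fails loudly instead of silently reading duplicated data.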

Raw Pingmesh and BGP measurements are internal observability contracts, so 0.2 also removes fields that had no detector, tool, dashboard, or health consumer.

Runtime ownership

Automatic session methods are best for comparable benchmark runs because the SDK owns provisioning, observability startup, execution, and teardown. Existing-runtime methods are useful when debugging Containerlab, SONiC, Pingmesh, Telegraf, Grafana, or repeated agent iterations against a preserved lab.
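When the caller owns teardown, wrapping the preserved lab in a context manager keeps cleanup from being skipped on errors. The sketch below uses a stand-in class so it runs without a lab: `FakeRuntime`, its `teardown()` method, and `owned_runtime` are hypothetical names, and the real runtime object and teardown call come from `bench.runtimes`.

```python
from contextlib import contextmanager


class FakeRuntime:
    """Stand-in for a provisioned runtime; the real object comes from bench.runtimes."""

    def __init__(self):
        self.torn_down = False

    def teardown(self):  # hypothetical method name
        self.torn_down = True


@contextmanager
def owned_runtime(provision):
    """Provision a runtime and guarantee teardown, even if a run raises."""
    runtime = provision()
    try:
        yield runtime
    finally:
        runtime.teardown()
```

Repeated agent iterations would go inside the `with owned_runtime(...)` block, so one provisioned lab serves every attempt and is released exactly once.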
