NetOpsBench: Open Arena for NetOps in AI Infrastructure

NetOpsBench has one execution path with several entrypoints. Start with a single scenario while you are still changing an agent, then move to suites and scale runs once the agent contract is stable.

Execution modes

Mode	Script or API	Use
One scenario	`examples/01_run_scenario.py` or `run_scenario(...)`	Environment validation, agent contract checks, one-case debugging.
Small suite	`examples/02_run_suite.py` or `run_suite(...)`	Several selected cases with aggregate metrics.
One full scale	`examples/03_run_scale_benchmark.py`	All generated scenarios for one topology scale.
Multi-scale batch	`scripts/run_all_benchmarks.sh`	Repeated scale runs with logs and CSV summary.

Prepare scenario assets before these runs:

# prepare benchmark scenarios for all topology scales (xs, small, medium, large)
netopsbench benchmark prepare

Use `--seed` when you need reproducible scenario generation; the default seed is 42:

netopsbench benchmark prepare --scales xs,small --seed 42
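The reason a fixed seed gives reproducible output can be shown with a seeded random number generator. This is only a sketch of the general principle, not the NetOpsBench generator: `pick_cases` and the case names are hypothetical, and the real generator's internals are not described here.

```python
import random


def pick_cases(candidates: list[str], count: int, seed: int = 42) -> list[str]:
    """Sample `count` items with a private, seeded RNG (hypothetical helper)."""
    rng = random.Random(seed)  # A local RNG avoids touching global random state.
    return rng.sample(candidates, count)


if __name__ == "__main__":
    pool = [f"case_{i:03d}" for i in range(1, 21)]
    # The same seed and the same inputs always produce the same selection.
    assert pick_cases(pool, 5, seed=42) == pick_cases(pool, 5, seed=42)
    print(pick_cases(pool, 5, seed=42))
```

Reusing one seed across machines is what makes two prepared corpora comparable, provided the generator version and inputs are also the same.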

One scenario

PYTHONPATH=. python examples/01_run_scenario.py --vendor <vendor>

The script selects one generated scenario and calls:

run = bench.sessions.run_scenario(scenario=scenario, agent=agent)
report = run.wait(raise_on_failure=True)

Use this mode when Docker, Containerlab, provider credentials, or the agent output schema are still changing.

Small suite

PYTHONPATH=. python examples/02_run_suite.py --vendor <vendor>

The script passes a list of scenarios and requests multiple workers:

run = bench.sessions.run_suite(
    scenarios=scenarios,
    agent=agent,
    scale="xs",
    workers=3,
)

Use a small suite to check whether an agent generalizes beyond one selected case before paying the cost of a full generated corpus.

Full scale run

PYTHONPATH=. python examples/03_run_scale_benchmark.py \
  --scale xs \
  --workers 3 \
  --vendor <vendor>

`examples/03_run_scale_benchmark.py` discovers all generated scenario YAML files for the selected scale, then runs them with the same `run_suite(...)` API. The resulting run is the main basis for comparing agents at one topology size.
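The discovery step can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the script's actual code: it assumes generated scenarios live under `scenarios/generated/<scale>/` (the layout suggested by the scenario path used in the CLI examples on this page), and `discover_scenarios` is a hypothetical helper name.

```python
from pathlib import Path


def discover_scenarios(scale: str, root: Path = Path("scenarios/generated")) -> list[Path]:
    """Return the sorted scenario YAML files for one scale (hypothetical helper)."""
    scale_dir = root / scale
    if not scale_dir.is_dir():
        # Assumption: a missing directory means scenarios were not prepared for this scale.
        return []
    return sorted(p for p in scale_dir.iterdir() if p.suffix in {".yaml", ".yml"})


if __name__ == "__main__":
    for path in discover_scenarios("xs"):
        print(path)
```

Sorting the paths keeps the scenario order stable between runs, which makes logs from different agents easier to line up.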

Scale choice changes both runtime cost and diagnosis difficulty:

Scale	Use
`xs`	Fastest feedback and provider validation.
`small`	More topology variety with moderate runtime cost.
`medium`	Broader fabric behavior and more generated scenarios.
`large`	Stress test for runtime stability, observability volume, and agent efficiency.

Worker count controls concurrency. Each worker provisions its own lab resources, so if Docker or Containerlab becomes unstable, or you start hitting provider rate limits, reduce `--workers`.
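The effect of the worker count can be pictured as a bounded worker pool. The sketch below is a generic illustration using Python's `concurrent.futures`, not NetOpsBench's scheduler; `run_one` is a hypothetical stand-in for provisioning a lab and running one scenario.

```python
from concurrent.futures import ThreadPoolExecutor


def run_one(scenario_name: str) -> str:
    # Hypothetical stand-in: a real worker would provision lab resources here.
    return f"{scenario_name}: done"


def run_bounded(scenario_names: list[str], workers: int) -> list[str]:
    """Run scenarios with at most `workers` in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, scenario_names))


if __name__ == "__main__":
    print(run_bounded(["case_a", "case_b", "case_c", "case_d"], workers=2))
```

Lowering the worker count reduces how many labs exist at the same time, at the cost of a longer wall-clock run.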

Multi-scale batches

BENCH_VENDOR=<vendor> BENCH_SCALES="xs small" bash scripts/run_all_benchmarks.sh

For long-running jobs:

nohup bash scripts/run_all_benchmarks.sh &> benchmark_run.log &

Useful environment overrides:

Variable	Meaning
`BENCH_VENDOR`	Provider passed to `examples/03_run_scale_benchmark.py`.
`BENCH_SCALES`	Space-separated scale list such as `xs small`.
`BENCH_WORKERS_<SCALE>`	Worker count override for one scale, for example `BENCH_WORKERS_LARGE=2`.
`BENCH_CLEAN_RUNS=1`	Explicitly remove previous `.netopsbench/runs/` artifacts before the batch. By default, previous reports and traces are preserved.
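Per-scale overrides of this kind are commonly resolved by uppercasing the scale name and falling back to a default. The sketch below shows one way to do that in Python. The batch script itself is a shell script whose lookup logic and default worker count are not shown here, so `DEFAULT_WORKERS` and `workers_for_scale` are assumptions for illustration only.

```python
import os

DEFAULT_WORKERS = 3  # Assumption: mirrors the --workers value used in the examples above.


def workers_for_scale(scale: str, env=None) -> int:
    """Resolve the worker count for a scale from BENCH_WORKERS_<SCALE>."""
    env = os.environ if env is None else env
    raw = env.get(f"BENCH_WORKERS_{scale.upper()}")
    if raw is None or not raw.strip():
        return DEFAULT_WORKERS
    value = int(raw)
    if value < 1:
        raise ValueError(f"worker count must be positive, got {value}")
    return value


if __name__ == "__main__":
    env = {"BENCH_SCALES": "xs small", "BENCH_WORKERS_LARGE": "2"}
    for scale in env["BENCH_SCALES"].split() + ["large"]:
        print(scale, workers_for_scale(scale, env))
```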

The script writes logs under `scenario_results/benchmark_logs_<timestamp>/`, records the run ids for the batch in `benchmark_runs_<timestamp>.jsonl`, and writes a CSV summary to `scenario_results/benchmark_summary_<timestamp>.csv`. Run artifacts use timestamp ids such as `run-20260605T124040Z`. Use `netopsbench trace view` to sync trace-enabled runs into the local Harbor viewer cache and inspect saved traces.
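Because run ids embed a timestamp, they can be ordered or filtered by time in post-processing scripts. The following is a minimal sketch that assumes every id follows the single example format shown in this guide and that the trailing `Z` denotes UTC; `parse_run_id` is a hypothetical helper, not part of the NetOpsBench API.

```python
from datetime import datetime, timezone


def parse_run_id(run_id: str) -> datetime:
    """Parse an id shaped like 'run-20260605T124040Z' into an aware UTC datetime."""
    # Assumption: the id format is inferred from one example and may not cover all runs.
    parsed = datetime.strptime(run_id, "run-%Y%m%dT%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    ids = ["run-20260605T124040Z", "run-20260604T090000Z"]
    newest = max(ids, key=parse_run_id)
    print(newest, parse_run_id(newest).isoformat())
```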

Inspect reports and runtimes

The CLI supports preparation, scenario checks, report inspection, and cleanup:

netopsbench scenario validate scenarios/generated/xs/generated_link_down_xs_001.yaml
netopsbench result list
netopsbench result show .netopsbench/runs/<run-id>/report.json
netopsbench runtime list
netopsbench runtime teardown <runtime-name>
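For scripting outside the CLI, saved reports can also be located directly on disk. This sketch assumes only the `.netopsbench/runs/<run-id>/report.json` layout shown above; the report schema is not documented in this guide, so the example lists top-level keys rather than reading specific fields. `find_reports` is a hypothetical helper.

```python
import json
from pathlib import Path


def find_reports(runs_dir: Path = Path(".netopsbench/runs")) -> list[Path]:
    """Return report.json paths for runs that have one, sorted by run directory name."""
    if not runs_dir.is_dir():
        return []
    return sorted(runs_dir.glob("*/report.json"))


if __name__ == "__main__":
    for report_path in find_reports():
        data = json.loads(report_path.read_text(encoding="utf-8"))
        # The schema is unknown here, so only the top-level keys are printed.
        keys = sorted(data) if isinstance(data, dict) else []
        print(report_path.parent.name, keys)
```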

Failed runs that preserve Containerlab resources should be inspected as described in Operations.

Running Benchmarks