Nvbenjo Documentation

Nvbenjo is a utility for benchmarking the inference of deep learning models on NVIDIA GPUs. It supports models in ONNX format as well as PyTorch models, including models compiled with torch.compile and ahead-of-time (AOT) compiled models.

Nvbenjo writes its benchmark results as:

A CSV file with all measurement data (latency, throughput, memory usage, etc.)
Plots that visualize the benchmark results


Installing

pip install nvbenjo

If you need a specific version of PyTorch or want to benchmark ONNX models, adapt your install accordingly:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install onnx onnxruntime-gpu
pip install nvbenjo
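
After installing, it can help to confirm which of the optional backends Python can actually locate. The following is a minimal sketch using only the standard library; the helper name `installed_packages` is hypothetical and not part of Nvbenjo. It checks import names, so `onnxruntime-gpu` appears as `onnxruntime`.

```python
from importlib.util import find_spec

# Import names of the packages installed above.
# Note: the `onnxruntime-gpu` distribution is imported as `onnxruntime`.
PACKAGES = ("nvbenjo", "torch", "torchvision", "onnx", "onnxruntime")


def installed_packages(names=PACKAGES):
    """Map each import name to True if Python can locate the package."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in installed_packages().items():
        print(f"{name:12s} {'found' if found else 'missing'}")
```

This only tells you that a package is importable; it does not verify that the installed build matches your CUDA version.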

Usage

Nvbenjo can be used as a command-line tool and is configured with Hydra.

Specify configuration directly from the command line:

nvbenjo \
"+nvbenjo.models={\
    efficientnet: {type_or_path: 'torchvision:efficientnet_b0',  shape: ['B',3,224,224], batch_sizes: [16,32]},\
    resnet:       {type_or_path: 'torchvision:wide_resnet101_2', shape: ['B',3,224,224], batch_sizes: [16,32]}\
}"
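
Writing the nested override by hand is easy to get wrong because of the quoting. As a rough sketch, the same argument can be generated from a Python dictionary; `to_override_value` and `models_override` are hypothetical helpers written for this illustration, not Nvbenjo functions, and they do not escape quotes inside strings.

```python
def to_override_value(value):
    """Render a Python value in the inline dict/list style used in the command above."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, dict):
        items = ", ".join(f"{key}: {to_override_value(val)}" for key, val in value.items())
        return "{" + items + "}"
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(to_override_value(val) for val in value) + "]"
    if isinstance(value, str):
        return f"'{value}'"  # assumption: the string contains no single quotes
    return str(value)


def models_override(models):
    """Build the `+nvbenjo.models=...` argument from a dict of model settings."""
    return "+nvbenjo.models=" + to_override_value(models)


models = {
    "efficientnet": {
        "type_or_path": "torchvision:efficientnet_b0",
        "shape": ["B", 3, 224, 224],
        "batch_sizes": [16, 32],
    },
}
arg = models_override(models)
print(arg)
```

Passing the resulting string as a single list element, for example `subprocess.run(["nvbenjo", arg])`, avoids an extra layer of shell quoting.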

Usage with Config File

Or better, specify your own config file (or use one of the predefined ones):

nvbenjo -cn small
nvbenjo -cn="/my/config/path/myconfig.yaml"

Override individual arguments of your config:

nvbenjo -cn="/my/config/path/myconfig.yaml" nvbenjo.models.mymodel.num_batches=10
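
To illustrate what such an override does conceptually, the dotted key walks down the nested configuration and replaces a single leaf. The sketch below mimics this on a plain dictionary; it is not Hydra's implementation, `apply_override` is a hypothetical helper, and the configuration values are made up.

```python
import copy


def apply_override(config, override):
    """Return a copy of `config` with one `dotted.key=value` override applied."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    node = updated
    for part in parents:
        node = node[part]  # raises KeyError if an intermediate key is missing
    # Hydra parses values with its own grammar; this sketch only handles ints and strings.
    node[leaf] = int(raw_value) if raw_value.lstrip("-").isdigit() else raw_value
    return updated


# Made-up configuration mirroring the structure of the command above
config = {"nvbenjo": {"models": {"mymodel": {"num_batches": 2, "batch_sizes": [1, 2]}}}}
updated = apply_override(config, "nvbenjo.models.mymodel.num_batches=10")
print(updated["nvbenjo"]["models"]["mymodel"])
```

Only `num_batches` changes; every other setting of `mymodel` keeps the value from the config file.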

Example Configuration

defaults:
  - default
  - _self_

nvbenjo:
  models:
    shufflenet_torch:
      # Uses a model that ships with torchvision; you may also specify a local path or a Hugging Face model identifier
      type_or_path: torchvision:shufflenet_v2_x0_5
      kwargs: {}
      shape: [B, 3, 224, 224]
      num_warmup_batches: 1
      num_batches: 2
      # You may specify multiple batch sizes to benchmark
      batch_sizes: [1, 2]
      # You may also specify multiple devices to benchmark
      devices: ["cuda:0"]

      # Specify different runtime options to benchmark with different precisions or settings
      # You may also use `amp`, `amp_fp16`, `amp_bfloat16` for automatic mixed precision
      runtime_options:
        run_FP32:
          compile: false
          precision: FP32
          matmul_precision: highest
        run_FP16_compiled:
          compile: torch_compile
          cuda_graphs: false
          precision: fp16
        run_FP32_profiled:
          precision: FP32
          enable_profiling: true
          profiling_prefix: "shufflenet_profile/shufflenet"

      custom_batchmetrics:
        # Define a custom metric, frames per second (fps), computed as
        # `1 / time_total_batch_normalized`
        fps: 1

    # Specify an ONNX model
    resnet_onnx:
      type_or_path: ~/Downloads/resnet50-v2-7.onnx

      # Input shape can also be specified using a dictionary with input type and min-max values
      shape: [{"name": "data", "shape": [B, 3, 224, 224], "type": "float", "min_max": [0, 1]}]

      num_warmup_batches: 3
      num_batches: 2
      batch_sizes: [1, 8, 16, 32]
      devices: ["cuda:0"]
      runtime_options:
        DEFAULT:
          # onnx runtime session options
          intra_op_num_threads: 2
          graph_optimization_level: ORT_ENABLE_BASIC
          enable_profiling: true
          profiling_prefix: "resnet_profile"

      custom_batchmetrics:
        # Define a custom metric, frames per second (fps), computed as
        # `1 / time_total_batch_normalized`
        fps: 1
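
The `shape` entries above use `B` as a placeholder for the batch dimension. Assuming that each value in `batch_sizes` is substituted for `B` in turn (an assumption drawn from the configuration, not from Nvbenjo's source), the concrete input shapes can be sketched as follows. The helpers `resolve_shape` and `input_shapes` and the fallback input name `input` are hypothetical.

```python
def resolve_shape(shape, batch_size):
    """Replace the "B" placeholder in one input shape with a concrete batch size."""
    return [batch_size if dim == "B" else dim for dim in shape]


def input_shapes(spec, batch_size):
    """Return {input_name: concrete_shape} for either form of `shape` shown above."""
    # Dictionary form: a list of {"name": ..., "shape": ...} entries
    if spec and isinstance(spec[0], dict):
        return {entry["name"]: resolve_shape(entry["shape"], batch_size) for entry in spec}
    # Plain form: a single unnamed input; "input" is a hypothetical placeholder name
    return {"input": resolve_shape(spec, batch_size)}


onnx_spec = [{"name": "data", "shape": ["B", 3, 224, 224], "type": "float", "min_max": [0, 1]}]
for batch_size in [1, 8, 16, 32]:
    print(batch_size, input_shapes(onnx_spec, batch_size))
```

Under this assumption, the four batch sizes of `resnet_onnx` each produce one benchmark input shape for `data`.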

See Examples for more configuration file examples.

Usage with Python API

See the Python API Reference for examples and detailed documentation of all available functions and classes.

Basic PyTorch benchmark via Python API

"""Basic PyTorch benchmark comparing precision modes."""

import torch

from nvbenjo import benchmark, cfg
from nvbenjo.utils import PrecisionType

device = "cuda" if torch.cuda.is_available() else "cpu"

model_cfg = cfg.TorchModelConfig(
    name="resnet50",
    type_or_path="torchvision:resnet50",
    shape=(("B", 3, 224, 224),),
    devices=(device,),
    batch_sizes=(1, 8),
    num_warmup_batches=2,
    num_batches=5,
    runtime_options={
        "fp32": cfg.TorchRuntimeConfig(
            precision=PrecisionType.FP32,
            matmul_precision="high",
            cuda_graphs=True,
            compile="torch_compile",
            enable_profiling=False,
        ),
    },
)
results = benchmark.benchmark_models({"resnet50": model_cfg})
# results is a pandas DataFrame with latency, throughput, and memory columns
print(results[["model", "runtime_options", "batch_size", "time_inference"]].to_string())
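
Once the results are available, a typical next step is to average the timings per configuration. The sketch below works on plain row dictionaries, such as those returned by `results.to_dict("records")`, so it runs without a GPU. It assumes the frame holds one row per measurement; the rows and their timing values here are made up for illustration, and `mean_time_inference` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def mean_time_inference(rows):
    """Average `time_inference` per (model, runtime_options, batch_size) group."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["model"], row["runtime_options"], row["batch_size"])
        grouped[key].append(row["time_inference"])
    return {key: mean(values) for key, values in grouped.items()}


# Made-up rows using the column names printed in the example above
rows = [
    {"model": "resnet50", "runtime_options": "fp32", "batch_size": 1, "time_inference": 1.0},
    {"model": "resnet50", "runtime_options": "fp32", "batch_size": 1, "time_inference": 3.0},
    {"model": "resnet50", "runtime_options": "fp32", "batch_size": 8, "time_inference": 6.0},
]
summary = mean_time_inference(rows)
for key, value in summary.items():
    print(key, value)
```

With pandas, the equivalent would be a `groupby` over the same three columns followed by `mean()` on `time_inference`.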