Nvbenjo Documentation¶
Nvbenjo is a utility for benchmarking inference of deep learning models on NVIDIA GPUs.
It supports models in ONNX format as well as PyTorch models, including torch.compile and ahead-of-time (AOT) compiled models.
Nvbenjo generates comprehensive benchmark results including:
CSV file with all measurement data (latency, throughput, memory usage, etc.)
Plots visualizing the benchmark results
Installing¶
pip install nvbenjo
If you need a specific version of PyTorch or want to benchmark ONNX models, adapt your install:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install onnx onnxruntime-gpu
pip install nvbenjo
Usage¶
Nvbenjo can be used as a command-line tool and is configured with Hydra.
Specify configuration directly from the command line:
nvbenjo \
  "+nvbenjo.models={\
    efficientnet: {type_or_path: 'torchvision:efficientnet_b0', shape: ['B',3,224,224], batch_sizes: [16,32]},\
    resnet: {type_or_path: 'torchvision:wide_resnet101_2', shape: ['B',3,224,224], batch_sizes: [16,32]}\
  }"
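When the model list is generated by a script, the inline dictionary in the command above can be assembled programmatically instead of typed by hand. The following is a minimal sketch, not part of Nvbenjo: `to_hydra_literal` is a hypothetical helper that only mirrors the dictionary and list syntax shown in the command, and it deliberately rejects strings containing quotes or backslashes rather than guessing at escaping rules.

```python
def to_hydra_literal(value):
    """Render a Python value in the inline syntax used by the command above.

    Hypothetical helper for illustration; it is not provided by Nvbenjo or Hydra.
    """
    # bool must be tested before int because bool is a subclass of int
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        if "'" in value or "\\" in value:
            raise ValueError(f"refusing to quote string with special characters: {value!r}")
        return f"'{value}'"
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(to_hydra_literal(v) for v in value) + "]"
    if isinstance(value, dict):
        items = (f"{key}: {to_hydra_literal(val)}" for key, val in value.items())
        return "{" + ", ".join(items) + "}"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


# Same two models as in the command-line example
models = {
    "efficientnet": {
        "type_or_path": "torchvision:efficientnet_b0",
        "shape": ["B", 3, 224, 224],
        "batch_sizes": [16, 32],
    },
    "resnet": {
        "type_or_path": "torchvision:wide_resnet101_2",
        "shape": ["B", 3, 224, 224],
        "batch_sizes": [16, 32],
    },
}

override = "+nvbenjo.models=" + to_hydra_literal(models)
print(override)
```

The resulting string can be passed as a single argument, for example with `subprocess.run(["nvbenjo", override], check=True)`. Because no shell is involved in that call, the outer double quotes and line continuations from the shell example are not needed.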
Usage with Config File¶
Or better, specify your own config file (or use one of the predefined ones):
nvbenjo -cn small
nvbenjo -cn="/my/config/path/myconfig.yaml"
Override individual arguments of your config:
nvbenjo -cn="/my/config/path/myconfig.yaml" nvbenjo.models.mymodel.num_batches=10
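To make the effect of a dotted override such as `nvbenjo.models.mymodel.num_batches=10` concrete, the sketch below applies it to a plain nested dictionary. This is a simplified model of the idea, assuming the config behaves like nested mappings; Hydra's real override grammar is considerably richer, and `apply_override` is a hypothetical name used only for illustration.

```python
import copy


def apply_override(config, override):
    """Return a copy of `config` with one `a.b.c=value` override applied.

    Simplified illustration only; this is not how Hydra is implemented.
    """
    path, sep, raw = override.partition("=")
    if not sep:
        raise ValueError(f"override must look like key.path=value, got {override!r}")
    keys = path.split(".")

    updated = copy.deepcopy(config)  # leave the caller's config untouched
    node = updated
    for key in keys[:-1]:
        node = node[key]  # raises KeyError if the path does not exist

    # Interpret the value as an int or float when possible, else keep the string
    try:
        value = int(raw)
    except ValueError:
        try:
            value = float(raw)
        except ValueError:
            value = raw
    node[keys[-1]] = value
    return updated


config = {"nvbenjo": {"models": {"mymodel": {"num_batches": 2, "batch_sizes": [1, 2]}}}}
updated = apply_override(config, "nvbenjo.models.mymodel.num_batches=10")
print(updated["nvbenjo"]["models"]["mymodel"])
```

Only the addressed leaf changes; sibling keys such as `batch_sizes` keep the values from the config file.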
defaults:
  - default
  - _self_
nvbenjo:
  models:
    shufflenet_torch:
      # Uses a model that ships with torchvision; you may also specify a local path or a Hugging Face model identifier
      type_or_path: torchvision:shufflenet_v2_x0_5
      kwargs: {}
      shape: [B, 3, 224, 224]
      num_warmup_batches: 1
      num_batches: 2
      # You may specify multiple batch sizes to benchmark
      batch_sizes: [1, 2]
      # You may also specify multiple devices to benchmark
      devices: ["cuda:0"]
      # Specify different runtime options to benchmark with different precisions or settings
      # You may also use `amp`, `amp_fp16`, `amp_bfloat16` for automatic mixed precision
      runtime_options:
        run_FP32:
          compile: false
          precision: FP32
          matmul_precision: highest
        run_FP16_compiled:
          compile: torch_compile
          cuda_graphs: false
          precision: fp16
        run_FP32_profiled:
          precision: FP32
          enable_profiling: true
          profiling_prefix: "shufflenet_profile/shufflenet"
      custom_batchmetrics:
        # Define a custom metric that computes frames per second (fps)
        # `1 / time_total_batch_normalized`
        fps: 1
    # Specify an ONNX model
    resnet_onnx:
      type_or_path: ~/Downloads/resnet50-v2-7.onnx
      # Input shape can also be specified using a dictionary with input type and min-max values
      shape: [{"name": "data", "shape": [B, 3, 224, 224], "type": "float", "min_max": [0, 1]}]
      num_warmup_batches: 3
      num_batches: 2
      batch_sizes: [1, 8, 16, 32]
      devices: ["cuda:0"]
      runtime_options:
        DEFAULT:
          # ONNX Runtime session options
          intra_op_num_threads: 2
          graph_optimization_level: ORT_ENABLE_BASIC
          enable_profiling: true
          profiling_prefix: "resnet_profile"
      custom_batchmetrics:
        # Define a custom metric that computes frames per second (fps)
        # `1 / time_total_batch_normalized`
        fps: 1
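The `B` entry in `shape` is a placeholder for the batch dimension, which is filled in once per value in `batch_sizes`. The sketch below shows one plausible reading of that expansion, including the dictionary form with `min_max`. It is an assumption about the semantics, not Nvbenjo's actual implementation; `resolve_shape` and `sample_values` are hypothetical helpers, and the small `[B, 3, 4, 4]` shape is chosen only to keep the example cheap.

```python
import math
import random


def resolve_shape(shape, batch_size):
    """Replace every 'B' placeholder with the concrete batch size."""
    return tuple(batch_size if dim == "B" else int(dim) for dim in shape)


def sample_values(spec, batch_size, seed=0):
    """Draw uniform random values for a dictionary-style input spec.

    Returns the resolved shape and a flat list with one value per element,
    each drawn from the closed range given by spec["min_max"].
    """
    shape = resolve_shape(spec["shape"], batch_size)
    low, high = spec["min_max"]
    rng = random.Random(seed)
    count = math.prod(shape)
    return shape, [rng.uniform(low, high) for _ in range(count)]


# List form: one concrete shape per configured batch size
for batch_size in (1, 2):
    print(batch_size, resolve_shape(["B", 3, 224, 224], batch_size))

# Dictionary form, using a deliberately tiny shape for illustration
spec = {"name": "data", "shape": ["B", 3, 4, 4], "type": "float", "min_max": [0, 1]}
shape, values = sample_values(spec, batch_size=2)
print(shape, len(values))
```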
See Examples for more configuration file examples.
Usage with Python API¶
See the Python API Reference for examples and detailed documentation of all available functions and classes.
"""Basic PyTorch benchmark comparing precision modes."""
import torch

from nvbenjo import benchmark, cfg
from nvbenjo.utils import PrecisionType

device = "cuda" if torch.cuda.is_available() else "cpu"

model_cfg = cfg.TorchModelConfig(
    name="resnet50",
    type_or_path="torchvision:resnet50",
    shape=(("B", 3, 224, 224),),
    devices=(device,),
    batch_sizes=(1, 8),
    num_warmup_batches=2,
    num_batches=5,
    runtime_options={
        "fp32": cfg.TorchRuntimeConfig(
            precision=PrecisionType.FP32,
            matmul_precision="high",
            cuda_graphs=True,
            compile="torch_compile",
            enable_profiling=False,
        ),
    },
)

results = benchmark.benchmark_models({"resnet50": model_cfg})

# results is a pandas DataFrame with latency, throughput, and memory columns
print(results[["model", "runtime_options", "batch_size", "time_inference"]].to_string())