zml-smi: a universal monitoring tool for GPUs, TPUs, and NPUs

zml-smi is a universal diagnostic and monitoring tool for GPUs, TPUs, and NPUs. It provides real-time insight into the performance and health of your hardware, and is best described as a cross between nvidia-smi and nvtop. It transparently supports every platform that ZML supports: NVIDIA, AMD, Google TPU, and AWS Trainium devices. More platforms will follow as ZML expands its hardware support.
Getting started
You can download zml-smi from the official mirror.
$ curl -LO 'https://mirror.zml.ai/zml-smi/zml-smi-v0.2.tar.zst'
$ tar -xf zml-smi-v0.2.tar.zst
$ ./zml-smi/zml-smi
Listing devices
Running zml-smi without arguments lists the detected devices.
$ zml-smi
Monitoring devices
The --top flag provides real-time monitoring of device performance, including utilization, temperature, and memory usage.
$ zml-smi --top
Completely sandboxed
zml-smi requires no software on the target machine other than the device driver and glibc.
Metrics
Host
zml-smi displays host-level metrics such as CPU model and utilization, memory usage, and temperature.
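How zml-smi gathers these host metrics is not described here. As a rough illustration of where such numbers can come from on Linux, the sketch below derives memory usage from the text of /proc/meminfo. The function name `memory_usage` and the returned keys are hypothetical and are not part of zml-smi.

```python
def memory_usage(meminfo_text: str) -> dict:
    """Parse /proc/meminfo-style text into total/used KiB and percent used."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # the kernel reports these in kB

    total = fields["MemTotal"]
    # MemAvailable is the kernel's estimate of memory usable without swapping.
    used = total - fields["MemAvailable"]
    return {
        "total_kib": total,
        "used_kib": used,
        "used_percent": 100.0 * used / total,
    }
```

On a Linux host, the function can be fed the contents of /proc/meminfo read as text. CPU utilization and temperature would need other sources (for example /proc/stat), which this sketch does not cover.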
Processes
zml-smi also provides insights into the processes utilizing the devices, including their resource usage and command lines.
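The source does not say how zml-smi maps processes to devices or reads their command lines. On Linux, a process's command line is exposed as NUL-separated bytes in /proc/<pid>/cmdline, so one plausible building block looks like the sketch below. The helper name `parse_cmdline` is hypothetical.

```python
def parse_cmdline(raw: bytes) -> list:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments."""
    if not raw:
        # Kernel threads and zombie processes expose an empty cmdline.
        return []
    # Arguments are terminated by NUL bytes; drop the trailing terminator first.
    args = raw.rstrip(b"\x00").split(b"\x00")
    # Arguments are arbitrary bytes, so decode defensively.
    return [arg.decode("utf-8", errors="replace") for arg in args]
```

Given a PID reported by a device driver, the raw bytes would be read from /proc/<pid>/cmdline in binary mode and passed to this helper.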
NVIDIA
Metrics come from NVML (the NVIDIA Management Library), which ships with the driver.
AMD
Metrics are provided through the AMD SMI library. To support the latest AMD GPUs, zml-smi downloads amdgpu.ids files at build time and merges them. This allows support for models like the Ryzen AI Max+ 395 (Strix Halo) even before official ROCm releases. We created a shared object named zmlxrocm.so that intercepts fopen64 calls and redirects them to the sandboxed file.
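The merge step can be pictured with a minimal sketch. It assumes the libdrm amdgpu.ids layout: `#` comment lines, a format-version line, and entries of the form `device_id, revision_id, product_name`. The function names, the "later file wins" rule, and the emitted `1.0.0` version line are assumptions made for illustration, not zml-smi's actual build logic.

```python
def parse_ids(text: str) -> dict:
    """Map (device_id, revision_id) -> product name for one amdgpu.ids file."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3:
            continue  # e.g. the format-version line, which has no commas
        device_id, revision_id, name = parts
        # Normalize hex IDs so "15dd" and "15DD" refer to the same entry.
        entries[(device_id.upper(), revision_id.upper())] = name
    return entries


def merge_ids(*texts: str) -> str:
    """Merge several amdgpu.ids files; entries from later files win on conflict."""
    merged = {}
    for text in texts:
        merged.update(parse_ids(text))
    lines = ["1.0.0"]  # assumed format-version line
    for (device_id, revision_id), name in sorted(merged.items()):
        lines.append(f"{device_id},\t{revision_id},\t{name}")
    return "\n".join(lines) + "\n"
```

Keying on the (device ID, revision ID) pair lets a newer file add or rename individual models while everything else carries over from the older one.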
TPU
Metrics, including TensorCore duty cycle and HBM usage, are read from the local gRPC endpoint exposed by the TPU runtime.
AWS Trainium
Metrics, including core utilization and HBM usage, are read through a private API found in libnrt.so.
Source: Hacker News











