jido (Korean for "map") is an ML systems toolkit that maps the relationship between hardware, software backends, and kernel performance. The goal is to remove the setup friction that comes with benchmarking ML workloads across different machines: detecting what hardware is available, discovering which inference backends are installed, and running standardized benchmarks that produce comparable results regardless of where they execute.
Hardware Detection
jido scans CPU capabilities (core counts, ISA flags like AVX-512 and FMA), system memory, and GPUs across three vendors: NVIDIA via NVML with nvidia-smi fallback, AMD via rocm-smi and amd-smi, and Intel via sycl-ls. For NVIDIA GPUs it also pulls compute capability, SM count, clock rates, and computes theoretical peak FP32 FLOPS and memory bandwidth. Each machine gets a deterministic SHA-256 fingerprint derived from its hardware profile, so benchmark results from the same physical system are always grouped together.
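The fingerprinting step can be sketched as follows. This is a minimal illustration, not jido's actual implementation: the function name `hardware_fingerprint` and the profile fields are hypothetical, but the core idea is the same, serializing the profile canonically (sorted keys, fixed separators) so the SHA-256 digest is stable across runs on unchanged hardware.

```python
import hashlib
import json

def hardware_fingerprint(profile: dict) -> str:
    """Derive a deterministic SHA-256 fingerprint from a hardware profile.

    Canonical JSON (sorted keys, fixed separators) guarantees the same
    digest for the same hardware regardless of dict insertion order.
    """
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical profile shape for illustration only.
profile = {
    "cpu": {"cores": 16, "isa": ["avx512f", "fma"]},
    "memory_gb": 64,
    "gpus": [{"vendor": "nvidia", "name": "RTX 4090", "sm_count": 128}],
}
print(hardware_fingerprint(profile))  # 64-char hex digest
```

Volatile fields (clock rates, free memory) would need to be excluded from the serialized profile, otherwise the "same physical system" guarantee breaks.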
Backend Discovery
On every run, jido probes the runtime environment for seven inference backends—PyTorch, ONNX Runtime, vLLM, TensorRT, llama.cpp, OpenVINO, and DeepSpeed—along with compilers (gcc, clang, nvcc, hipcc, icx) and vendor tools. This lets the toolkit know what is actually available to benchmark against, without requiring manual configuration.
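A zero-configuration probe like this can be built almost entirely from the standard library. The sketch below is an assumption about the approach, not jido's code: Python backends are detected by checking whether their module is importable, and compilers by looking them up on `PATH`. The module names (e.g. `llama_cpp` for llama.cpp) are my best guesses at the installable package names.

```python
import importlib.util
import shutil

# Importable module name per backend (assumed names, for illustration).
BACKEND_MODULES = {
    "pytorch": "torch",
    "onnxruntime": "onnxruntime",
    "vllm": "vllm",
    "tensorrt": "tensorrt",
    "llama.cpp": "llama_cpp",
    "openvino": "openvino",
    "deepspeed": "deepspeed",
}
COMPILERS = ["gcc", "clang", "nvcc", "hipcc", "icx"]

def probe_environment() -> dict:
    """Report which backends are importable and which compilers are on PATH."""
    backends = {
        name: importlib.util.find_spec(module) is not None
        for name, module in BACKEND_MODULES.items()
    }
    # shutil.which returns the executable's full path, or None if absent.
    compilers = {name: shutil.which(name) for name in COMPILERS}
    return {"backends": backends, "compilers": compilers}
```

`find_spec` checks importability without actually importing the package, which keeps the probe fast and side-effect free even when heavy backends like vLLM are installed.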
Benchmarking
The benchmark runner covers three core operations—matrix multiplication, scaled dot-product attention, and 2D convolution—each with multiple default problem sizes and three data types (fp32, fp16, bf16). Every run records timing statistics (mean, median, p95, p99), peak memory allocation, GPU utilization via NVML, TFLOPS throughput, and correctness deltas against a reference kernel. Results are written to a structured JSON schema that includes the full hardware scan, environment snapshot, and machine fingerprint, making runs reproducible and cross-machine comparisons straightforward.
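The statistics and throughput bookkeeping can be illustrated with a simple matmul microbenchmark. This is a CPU-side NumPy stand-in under stated assumptions (jido's kernels and GPU timing are not shown, and `bench_matmul` is a hypothetical name); it demonstrates the timing-statistics and TFLOPS arithmetic, where an n×n×n matmul costs 2n³ floating-point operations.

```python
import statistics
import time

import numpy as np

def bench_matmul(n: int = 512, iters: int = 20) -> dict:
    """Time an n x n fp32 matmul and summarize the run.

    Returns mean/median/p95/p99 latencies in seconds plus TFLOPS
    computed from the median latency.
    """
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        np.matmul(a, b)
        times.append(time.perf_counter() - t0)
    times.sort()

    def pct(p: float) -> float:
        # Nearest-rank percentile over the sorted samples.
        return times[min(len(times) - 1, int(p * len(times)))]

    flops = 2 * n**3  # one multiply and one add per inner-product term
    return {
        "mean_s": statistics.mean(times),
        "median_s": statistics.median(times),
        "p95_s": pct(0.95),
        "p99_s": pct(0.99),
        "tflops": flops / statistics.median(times) / 1e12,
    }
```

A real runner would add warmup iterations and, on GPU, device synchronization around the timed region; without those, the first-iteration and async-launch effects dominate the tail percentiles.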
What's Next
- Parameter sweep expansion across problem sizes, dtypes, and batch dimensions
- Cross-backend kernel benchmarks (Triton, TensorRT, ONNX Runtime execution providers)
- Recommended-config generation from detected hardware profiles
- Terminal-based reporting and visualization
- NPU and FPGA detection