Deep-learning throughput hinges on how well a compiler stack maps tensor programs to GPU execution: thread/block schedules, memory movement, and instruction selection (e.g., Tensor Core MMA pipelines). This article looks at four dominant stacks (CUDA, ROCm, Triton, and TensorRT) from the compiler's perspective and explains which optimizations move the needle in practice.
What actually determines performance on modern GPUs
Across vendors, the same levers recur:
- Operator scheduling & fusion: reduce kernel launches and round-trips to HBM; expose longer producer→consumer chains for register/shared-memory reuse. TensorRT and cuDNN "runtime fusion engines" exemplify this for attention and conv blocks.
- Tiling & data layout: match tile shapes to Tensor Core/WGMMA/WMMA native fragment sizes; avoid shared-memory bank conflicts and partition camping. CUTLASS documents warp-level GEMM tiling for both Tensor Cores and CUDA cores.
- Precision & quantization: FP16/BF16/FP8 for training/inference; INT8/INT4 (calibrated or QAT) for inference. TensorRT automates calibration and kernel selection under these precisions (a framework-level sketch follows this list).
- Graph capture & runtime specialization: graph execution to amortize launch overheads; dynamic fusion of common subgraphs (e.g., attention). cuDNN 9 added graph support for attention fusion engines.
- Autotuning: search tile sizes, unroll factors, and pipelining depths per arch/SKU. Triton and CUTLASS expose explicit autotune hooks; TensorRT performs builder-time tactic selection.
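To make the precision lever concrete, here is a minimal framework-level sketch (shapes are illustrative, PyTorch assumed) in which a matmul runs in FP16 under autocast and is dispatched to a Tensor Core GEMM:

```python
# Minimal sketch of the precision lever: under autocast, the matmul below is
# dispatched as an FP16 Tensor Core GEMM instead of an FP32 one.
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b  # executed in half precision on Tensor Cores when available

print(c.dtype)  # torch.float16
```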
With that lens, here is how each stack implements the above.
CUDA: nvcc/ptxas, cuDNN, CUTLASS, and CUDA Graphs
Compiler path. CUDA code compiles through nvcc into PTX, and ptxas then lowers PTX to SASS (arch-specific machine code). Controlling optimization requires feeding flags to both the host and device compilation phases; for kernels the key flag is -Xptxas. Developers often miss that -O3 alone affects only host code.
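As an illustration of how those device-side flags get wired in, here is a hedged sketch (the scale kernel and extension name are hypothetical) that JIT-compiles a small CUDA extension from Python and forwards -Xptxas options to ptxas; the same flags apply on a plain nvcc command line:

```python
# Hypothetical example: pass device-side optimization flags through to ptxas
# when JIT-compiling a CUDA extension with PyTorch's cpp_extension helper.
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
__global__ void scale_kernel(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}

torch::Tensor scale(torch::Tensor x, float a) {
    auto y = torch::empty_like(x);
    int n = x.numel();
    scale_kernel<<<(n + 255) / 256, 256>>>(
        x.data_ptr<float>(), y.data_ptr<float>(), a, n);
    return y;
}
"""

ext = load_inline(
    name="scale_ext",
    cpp_sources="torch::Tensor scale(torch::Tensor x, float a);",
    cuda_sources=cuda_src,
    functions=["scale"],
    # Host-side -O3 alone does not touch the device pass; ptxas flags must be
    # forwarded explicitly via -Xptxas.
    extra_cuda_cflags=["-O3", "-Xptxas", "-O3,-v"],
)
# usage: y = ext.scale(torch.randn(1024, device="cuda"), 2.0)
```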
Kernel generation & libraries.
- CUTLASS provides parametric templates for GEMM/conv, implementing warp-level tiling, Tensor Core MMA pipelines, and smem iterators designed for conflict-free access; these are canonical references for writing peak kernels, including Hopper's WGMMA path.
- cuDNN 9 introduced runtime fusion engines (notably for attention blocks), native CUDA Graph integration for those engines, and updates for new compute capabilities, materially reducing dispatch overheads and improving memory locality in Transformer workloads.
Performance implications.
- Moving from unfused PyTorch ops to cuDNN attention fusion typically cuts kernel launches and global memory traffic; combined with CUDA Graphs, it reduces CPU bottlenecks in short-sequence inference (see the sketch after this list).
- On Hopper/Blackwell, aligning tile shapes to WGMMA/Tensor Core native sizes is decisive; CUTLASS tutorials quantify how mis-sized tiles waste Tensor Core throughput.
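Here is a minimal sketch of that combination, assuming fixed tensor shapes (graph capture requires them) and PyTorch's fused scaled_dot_product_attention, which dispatches to a fused attention backend when one is available:

```python
# Sketch: fused attention captured into a CUDA Graph to amortize launch cost.
import torch
import torch.nn.functional as F

q = torch.randn(8, 16, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Warm up on a side stream (the pattern PyTorch documents for graph capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        F.scaled_dot_product_attention(q, k, v)
torch.cuda.current_stream().wait_stream(s)

# Capture one fused attention call; q/k/v become the graph's static inputs.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = F.scaled_dot_product_attention(q, k, v)

# Replay: refresh the static inputs in place, then launch the whole graph once.
q.copy_(torch.randn_like(q))
g.replay()  # static_out now holds the result for the new q
```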
When CUDA is the right tool. You need maximum control over instruction selection, occupancy, and smem choreography, or you are extending kernels beyond library coverage while staying on NVIDIA GPUs.
ROCm: HIP/Clang toolchain, rocBLAS/MIOpen, and the 6.x series
Compiler path. ROCm uses Clang/LLVM to compile HIP (CUDA-like) code into GCN/RDNA ISA. The 6.x series has focused on performance and framework coverage; the release notes track component-level optimizations and HW/OS support.
Libraries and kernels.
- rocBLAS and MIOpen implement GEMM/conv primitives with arch-aware tiling and algorithm selection similar in spirit to cuBLAS/cuDNN. The consolidated changelog highlights iterative performance work across these libraries.
- Recent ROCm workstreams include better Triton enablement on AMD GPUs, enabling Python-level kernel authoring while still lowering through LLVM to AMD backends.
Performance implications.
- On AMD GPUs, matching LDS (shared memory) bank widths and vectorized global loads to matrix tile shapes is as pivotal as smem bank alignment on NVIDIA. Compiler-assisted fusion in frameworks (e.g., attention) plus library autotuning in rocBLAS/MIOpen typically closes a large fraction of the gap to handwritten kernels, contingent on architecture and driver. The release documentation indicates steady tuner improvements across 6.0–6.4.x.
When ROCm is the right tool. You need native support and optimization on AMD accelerators, with HIP portability from existing CUDA-style kernels and a transparent LLVM toolchain.
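As a small framework-level sketch of that portability (assuming a ROCm build of PyTorch), the same torch.cuda code path runs unchanged on AMD GPUs, with GEMMs dispatched to rocBLAS rather than cuBLAS:

```python
# On ROCm builds of PyTorch, the "cuda" device APIs are backed by HIP and
# torch.version.hip is set; on NVIDIA builds it is None.
import torch

if torch.version.hip is not None:
    print("ROCm/HIP backend:", torch.version.hip)
else:
    print("CUDA backend:", torch.version.cuda)

x = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
y = x @ x  # rocBLAS GEMM on AMD, cuBLAS on NVIDIA, same source code
print(y.shape, y.dtype)
```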
Triton: a DSL and compiler for custom kernels
Compiler path. Triton is a Python-embedded DSL that lowers through LLVM; it handles vectorization, memory coalescing, and register allocation while giving explicit control over block sizes and program IDs. The build docs show the LLVM dependency and custom builds; NVIDIA's developer materials discuss Triton's tuning for newer architectures (e.g., Blackwell) with FP16/FP8 GEMM improvements.
Optimizations.
- Autotuning over tile sizes, num_warps, and pipelining stages; static masking for boundary conditions without scalar fallbacks; shared-memory staging and software pipelining to overlap global loads with compute (see the sketch after this list).
- Triton's design aims to automate the error-prone parts of CUDA-level optimization while leaving block-level tiling choices to the author; the original announcement outlines that separation of concerns.
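A minimal sketch of those pieces, assuming a hypothetical fused multiply-add kernel: autotuning over block size and num_warps, plus static masking for the ragged tail.

```python
# Sketch: a fused elementwise x * y + z kernel with Triton autotuning.
# The config values are illustrative, not tuned for any particular GPU.
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 1024}, num_warps=4),
        triton.Config({"BLOCK": 2048}, num_warps=8),
    ],
    key=["n_elements"],  # re-tune when the problem size changes
)
@triton.jit
def fused_mul_add_kernel(x_ptr, y_ptr, z_ptr, out_ptr, n_elements,
                         BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements  # static masking for the boundary block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    z = tl.load(z_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * y + z, mask=mask)

def fused_mul_add(x, y, z):
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    fused_mul_add_kernel[grid](x, y, z, out, n)
    return out
```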
Performance implications.
- Triton shines when you need a fused, shape-specialized kernel outside library coverage (e.g., bespoke attention variants, normalization-activation-matmul chains). On modern NVIDIA parts, vendor collaborations report architecture-specific improvements in the Triton backend, reducing the penalty versus CUTLASS-style kernels for common GEMMs.
When Triton is the right tool. You want near-CUDA performance for custom fused ops without writing SASS/WMMA, and you value Python-first iteration with autotuning.
TensorRT (and TensorRT-LLM): builder-time graph optimization for inference
Compiler path. TensorRT ingests ONNX or framework graphs and emits a hardware-specific engine. During the build, it performs layer/tensor fusion, precision calibration (INT8, FP8/FP16), and kernel tactic selection; the best-practices docs describe these builder phases. TensorRT-LLM extends this with LLM-specific runtime optimizations.
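A minimal builder-time sketch (assuming an ONNX model at model.onnx; API details vary across TensorRT versions) that parses the graph, enables reduced precision, and serializes a hardware-specific engine:

```python
# Sketch of the TensorRT build flow: ONNX in, serialized engine plan out.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    assert parser.parse(f.read()), "ONNX parse failed"

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# config.set_flag(trt.BuilderFlag.INT8)  # needs a calibrator or Q/DQ nodes

engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```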
Optimizations.
- Graph-level: constant folding, concat-slice canonicalization, conv-bias-activation fusion, attention fusion.
- Precision: post-training calibration (entropy/percentile/MSE) and per-tensor quantization, plus SmoothQuant/QAT workflows in TensorRT-LLM.
- Runtime: paged-KV cache, in-flight batching, and scheduling for multi-stream/multi-GPU deployments (TensorRT-LLM docs).
Performance implications.
- The largest wins typically come from end-to-end INT8 (or FP8 on Hopper/Blackwell where supported), removing framework overhead via a single engine, and aggressive attention fusion. TensorRT's builder produces per-arch engine plans to avoid generic kernels at runtime.
When TensorRT is the right tool. Production inference on NVIDIA GPUs where you can pre-compile an optimized engine and benefit from quantization and large-graph fusion.
Practical guidance: choosing and tuning the stack
- Training vs. inference.
- Training/experimental kernels → CUDA + CUTLASS (NVIDIA) or ROCm + rocBLAS/MIOpen (AMD); Triton for custom fused ops.
- Production inference on NVIDIA → TensorRT/TensorRT-LLM for global graph-level gains.
- Exploit architecture-native instructions.
- On NVIDIA Hopper/Blackwell, ensure tiles map to WGMMA/WMMA sizes; the CUTLASS materials show how warp-level GEMM and smem iterators should be structured.
- On AMD, align LDS usage and vector widths to CU datapaths; leverage ROCm 6.x autotuners and Triton-on-ROCm for shape-specialized ops.
- Fuse first, then quantize.
- Kernel/graph fusion reduces memory traffic; quantization reduces bandwidth and increases math density. TensorRT's builder-time fusions plus INT8/FP8 often deliver multiplicative gains.
- Use graph execution for short sequences.
- CUDA Graphs integrated with cuDNN attention fusions amortize launch overheads in autoregressive inference.
- Treat compiler flags as first-class.
- For CUDA, remember device-side flags: for example, -Xptxas -O3,-v (and -Xptxas -O0 when diagnosing). Host-only -O3 isn't sufficient.