Native CUDA Model in Safe Rust Without C++ or DSLs
cuda-oxide targets CUDA's SIMT execution model natively in Rust, closer to C++ __global__ functions than to generic Rust async code. Mark kernels with the #[kernel] proc macro, which namespaces them as cuda_oxide_kernel_
Supports match/if let, generics like fn scale<T: Copy>, intrinsics (shfl_sync, ballot_sync), and thread::index_1d() for unique hardware indices. DisjointSlice
Disables rustc's JumpThreading MIR pass on device code to preserve bar.sync convergence: the pass duplicates code, and a duplicated barrier breaks GPU barrier semantics. Marks syncs convergent in LLVM IR to block optimizer reordering.
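To make the warp intrinsics named above concrete, here is a minimal CPU-side sketch of what a ballot and an indexed shuffle compute across the 32 lanes of a warp. It is plain host Rust, not cuda-oxide code: ballot_emulated and shfl_emulated are hypothetical helper names, and the semantics follow the usual CUDA definitions of warp vote and shuffle, not the signatures of the cuda-oxide intrinsics.

```rust
/// Number of threads (lanes) in a CUDA warp.
const WARP_SIZE: usize = 32;

/// CPU emulation of a warp ballot (hypothetical helper, not the cuda-oxide API).
/// Bit `i` of the result is set when lane `i` is named in `mask`
/// and its predicate is true.
fn ballot_emulated(mask: u32, predicates: &[bool; WARP_SIZE]) -> u32 {
    let mut result = 0u32;
    for (lane, &pred) in predicates.iter().enumerate() {
        let participating = (mask >> lane) & 1 == 1;
        if participating && pred {
            result |= 1u32 << lane;
        }
    }
    result
}

/// CPU emulation of an indexed warp shuffle (hypothetical helper):
/// every lane receives the value held by `src_lane`.
fn shfl_emulated(values: &[u32; WARP_SIZE], src_lane: usize) -> [u32; WARP_SIZE] {
    [values[src_lane % WARP_SIZE]; WARP_SIZE]
}

fn main() {
    // Even-numbered lanes vote true.
    let mut preds = [false; WARP_SIZE];
    for lane in (0..WARP_SIZE).step_by(2) {
        preds[lane] = true;
    }
    let votes = ballot_emulated(u32::MAX, &preds);
    println!("ballot mask: {votes:#010x}"); // prints "ballot mask: 0x55555555"

    // Each lane holds lane * 10; broadcast lane 3's value to the whole warp.
    let mut values = [0u32; WARP_SIZE];
    for (lane, v) in values.iter_mut().enumerate() {
        *v = lane as u32 * 10;
    }
    let broadcast = shfl_emulated(&values, 3);
    println!("lane 0 now sees {}", broadcast[0]); // prints "lane 0 now sees 30"
}
```

On real hardware all 32 lanes execute these operations together in one instruction, which is why the compiler must keep every participating thread converged at the call site.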
Pure-Rust Pipeline Except One External Tool
Intercepts rustc at CodegenBackend::codegen_crate() and reads MIR through rustc_public (Stable MIR), which insulates the backend from churn in rustc internals. Pipeline: Rust → rustc frontend → Stable MIR → dialect-mir (Rust MIR semantics: places/projections/rvalues/terminators) → mem2reg → dialect-llvm → textual .ll → llc (LLVM 21+ NVPTX) → .ptx.
Uses Pliron (Rust-native MLIR-like IR) for dialects: dialect-mir, dialect-llvm (.ll export), dialect-nvvm (NVIDIA intrinsics like thread indexing/barriers/TMA). Apart from llc, the entire stack builds with cargo: no C++, CMake, or tablegen. To observe the full trace, cargo oxide pipeline vecadd prints every stage from MIR through PTX.
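As a reading aid for the stage order described in this section, the pipeline can be written down as an ordered list. This is a minimal sketch: the stage names are copied from the text, while PIPELINE and stage_after are hypothetical names that do not come from cuda-oxide.

```rust
/// Stages in the order the text lists them (names copied from the description).
const PIPELINE: [&str; 9] = [
    "Rust source",
    "rustc frontend",
    "Stable MIR",
    "dialect-mir",
    "mem2reg",
    "dialect-llvm",
    "textual .ll",
    "llc (NVPTX)",
    ".ptx",
];

/// Returns the stage that follows `stage`, or None for the last or an unknown stage.
fn stage_after(stage: &str) -> Option<&'static str> {
    let pos = PIPELINE.iter().position(|&s| s == stage)?;
    PIPELINE.get(pos + 1).copied()
}

fn main() {
    println!("{}", PIPELINE.join(" -> "));
    // llc is the only stage that runs outside the cargo-built stack.
    println!("after textual .ll: {:?}", stage_after("textual .ll"));
}
```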
Complements rust-cuda (Rust ergonomics/async on GPU) by focusing on the CUDA-native model. The vecadd example adds 1024 f32s on the GPU and verifies the result on the host: cargo oxide run vecadd outputs ✓ SUCCESS: All 1024 elements correct!
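The host-side verification step can be sketched in plain Rust. This is an illustration under assumptions, not the example's actual source: vecadd_reference and count_mismatches are hypothetical helpers, the input values are made up, and a sequential loop stands in for the GPU launch, with one loop iteration playing the role of one thread.

```rust
/// Element-wise reference addition computed on the host.
fn vecadd_reference(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// Counts elements of `got` that differ from `expected` by more than `tol`.
fn count_mismatches(got: &[f32], expected: &[f32], tol: f32) -> usize {
    assert_eq!(got.len(), expected.len(), "outputs must have equal length");
    got.iter()
        .zip(expected)
        .filter(|(g, e)| (**g - **e).abs() > tol)
        .count()
}

fn main() {
    const N: usize = 1024;
    let a: Vec<f32> = (0..N).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..N).map(|i| 2.0 * i as f32).collect();

    // Stand-in for the GPU result: each iteration mimics one thread
    // writing the single element selected by its unique index.
    let mut device_out = vec![0.0f32; N];
    for tid in 0..N {
        device_out[tid] = a[tid] + b[tid];
    }

    let expected = vecadd_reference(&a, &b);
    let bad = count_mismatches(&device_out, &expected, 1e-6);
    if bad == 0 {
        println!("all {N} elements match the host reference");
    } else {
        println!("{bad} elements differ from the host reference");
    }
}
```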
Linux-Only Setup with Pinned Nightly and LLVM 21
Requires Ubuntu 24.04, Rust nightly-2026-04-03 (rustup toolchain install nightly-2026-04-03, then add the rust-src and rustc-dev components), LLVM 21+ with the NVPTX target (sudo apt install llvm-21; verify with llc-21 --version | grep nvptx), and Clang 21 (sudo apt install clang-21, used by bindgen to parse cuda.h). Pin llc: export CUDA_OXIDE_LLC=/usr/bin/llc-21.
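The llc check above can also be scripted. The following sketch mirrors llc-21 --version | grep nvptx from Rust using only the standard library. It is not part of cuda-oxide: lists_nvptx and llc_path are hypothetical helpers, and falling back to llc-21 when CUDA_OXIDE_LLC is unset is an assumption made for this example.

```rust
use std::env;
use std::process::Command;

/// Returns true when `llc --version` output mentions an NVPTX target,
/// mirroring the `grep nvptx` check.
fn lists_nvptx(version_output: &str) -> bool {
    version_output.lines().any(|line| line.contains("nvptx"))
}

/// Picks the llc binary: CUDA_OXIDE_LLC if set, otherwise "llc-21"
/// (the fallback is an assumption made for this sketch).
fn llc_path() -> String {
    env::var("CUDA_OXIDE_LLC").unwrap_or_else(|_| "llc-21".to_string())
}

fn main() {
    let llc = llc_path();
    match Command::new(&llc).arg("--version").output() {
        Ok(out) => {
            let text = String::from_utf8_lossy(&out.stdout);
            if lists_nvptx(&text) {
                println!("{llc}: NVPTX target available");
            } else {
                println!("{llc}: no NVPTX target listed");
            }
        }
        Err(err) => eprintln!("could not run {llc}: {err}"),
    }
}
```

Keeping the string check in its own function makes it testable without llc installed.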
Clone the repo or install with cargo install --git https://github.com/NVlabs/cuda-oxide.git cargo-oxide. cargo oxide doctor checks dependencies. Debug: cargo oxide debug vecadd --tui with cuda-gdb on target/debug/vecadd + .ptx.
Next: generics (cargo oxide run generic), host closures (cuda_launch_async! with .await/.sync()), async_mlp, gemm_sol (cta_group::2, index_2d(stride), unsafe for &mut T). Docs: nvlabs.github.io/cuda-oxide.