
Custom kernels for everyone with Codex and Claude

By versatileai · February 13, 2026

tl;dr: We built an agent skill that teaches coding agents how to write production CUDA kernels. We then gave Claude and Codex two real targets: a diffusers pipeline and a transformers model. The agents produced working kernels for both, end to end, with correct PyTorch bindings and benchmarks.

Writing CUDA kernels is difficult. Writing a CUDA kernel that integrates correctly with transformers and diffusers is even more difficult. There are architecture-specific memory access patterns, vectorization strategies, warp shuffle reductions, and numerous integration pitfalls that trip up even experienced developers. This is exactly the kind of specialized, high-stakes problem that agent skills are made for.

We gave our coding agents the domain knowledge they needed: which GPU architectures to target, how to set up kernel-builder projects, when to use shared memory versus registers, and how to write PyTorch bindings. The agents did the rest. If you’ve used the LLM Training Skills or read “Claude Taught Me the Open Model,” you’ll recognize the pattern: package domain expertise into a skill, point an agent at a problem, and let it work.

Why do kernels need a skill?

The Kernel Hub solves distribution of custom hardware kernels: you can load a precompiled kernel from the Hub with a single get_kernel call. No builds, no flags. But someone still has to write the kernel. That’s the gap this skill fills.

CUDA kernel development has a demanding surface area:

  • Hardware: each GPU generation needs its own optimization guide. H100, A100, and T4 have different compute capabilities, shared memory sizes, and bandwidth profiles.
  • Libraries: diffusers and transformers have different module hierarchies, normalization conventions, and integration patterns.
  • Compiler: your custom kernel must be registered with PyTorch so that torch.compile can recognize it (sketched below).
  • Distribution: a compiled kernel depends on specific CUDA, PyTorch, and Python versions, creating a large environment matrix.

This is domain knowledge that gets lost across documentation tabs and Stack Overflow answers. The agent skill packages it into context that is loaded on demand.
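On the compiler point, here is a minimal sketch of what registering a custom op looks like with PyTorch’s torch.library API (available since PyTorch 2.4). The op name my_kernels::rmsnorm and the eager fallback are illustrative assumptions, not code from the skill:

```python
# Hypothetical sketch: expose a kernel as a PyTorch custom op so that
# torch.compile can trace through it.
import torch

@torch.library.custom_op("my_kernels::rmsnorm", mutates_args=())
def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # A real project would dispatch to the compiled CUDA kernel here;
    # this eager reference implementation stands in for it.
    variance = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

@rmsnorm.register_fake
def _(x, weight, eps):
    # Shape/dtype propagation so torch.compile can plan around the op.
    return torch.empty_like(x)
```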

We’ll start by quickly demonstrating how to use the skill, and then look at the details of how we benchmarked the kernel.

Installing skills

This skill ships with the kernels library. Install it for your coding agent with a single command:

```bash
# You need to install kernels from main for this
pip install git+https://github.com/huggingface/kernels.git

kernels skills add cuda-kernels --claude
```

This drops the skill into .claude/skills/cuda-kernels/, and Claude Code and Cursor will pick it up automatically. For other agents:

```bash
# Codex
kernels skills add cuda-kernels --codex

# OpenCode
kernels skills add cuda-kernels --opencode

# Custom destination
kernels skills add cuda-kernels --dest ./my-agent/skills/

# Install globally (available to all projects)
kernels skills add cuda-kernels --global

# Overwrite an existing installation
kernels skills add cuda-kernels --claude --force
```

Once installed, prompt the agent:

Build a vectorized RMSNorm kernel for H100 targeting the Qwen3-8B model from transformers.

Or go more open-ended:

Build an attention kernel optimized for H100 targeting the Qwen3-8B model from transformers. Benchmark against the PyTorch baseline to verify end-to-end performance improvements.

The agent reads the skill, selects appropriate architecture parameters, generates the CUDA sources, writes the PyTorch bindings, configures build.toml, and creates benchmark scripts.

If you’re working on more complex kernels or architecture-specific optimizations that aren’t covered, the skill still provides the basic building blocks and patterns to get started. We also welcome contributions to the skill itself.

What’s in the skill

The skill packs roughly 550 tokens of structured guidance, plus reference scripts, GPU optimization guides, troubleshooting documentation, and complete working examples. Coding agents like Codex and Claude read this and generate a working kernel project.

The contents:

  • NVIDIA GPU architecture-aware optimizations for H100, A100, and T4 (compute capability, memory bandwidth, shared memory size, block sizing)
  • Integration patterns for both diffusers and transformers, including pitfalls specific to each library
  • Kernel templates with vectorized memory access patterns for BF16, FP16, and FP32
  • Benchmark workflows covering both isolated kernel micro-benchmarks and end-to-end pipeline comparisons
  • Hugging Face Kernel Hub integration via get_kernel to load community kernels

```
.claude/skills/cuda-kernels/
├── SKILL.md                              # Main instructions (~550 tokens)
├── scripts/
│   ├── benchmark_example.py              # End-to-end benchmark template
│   ├── benchmark_rmsnorm.py              # Isolated kernel micro-benchmark
│   ├── ltx_kernel_injection_example.py   # diffusers integration pattern
│   ├── transformers_injection_example.py # transformers integration pattern
│   └── huggingface_kernels_example.py    # Kernel Hub integration
└── references/
    ├── diffusers-integration.md          # diffusers guide with pitfalls
    ├── transformers-integration.md       # transformers guide
    ├── huggingface-kernels-integration.md
    ├── h100-optimization-guide.md
    ├── a100-optimization-guide.md
    ├── t4-optimization-guide.md
    ├── kernel-templates.md
    └── troubleshooting.md
```

When the agent loads this, it has everything it needs to go from “write an RMSNorm kernel” to a buildable, benchmarkable project. Agents grep and glob through the skill to find relevant files and directories, so it’s important to organize a skill so that things are easy to find.

The agent is instructed to follow the templates in references/kernel-templates.md and generate a complete kernel project:

```
examples/your_model/
├── kernel_src/
│   └── rmsnorm.cu               # Vectorized CUDA kernel
├── torch-ext/
│   ├── your_kernels/__init__.py
│   └── torch_binding.cpp        # PyTorch C++ binding
├── benchmark_rmsnorm.py         # Micro-benchmark script
├── build.toml                   # kernel-builder configuration
├── setup.py                     # pip install -e .
└── pyproject.toml
```

We tested this on two real targets.

Kernel Benchmark: Diffusers (LTX-Video on H100)

The agent built RMSNorm, RoPE 3D, GEGLU, and AdaLN kernels for LTX-Video, a video generation pipeline from diffusers, with the RMSNorm kernel optimized for H100. The complete example lives in examples/ltx_video/, where you can inspect the generated kernels. Both benchmarks below were run on an H100 80GB HBM3 in BFloat16 precision.
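To use kernels like these, the skill’s scripts/ltx_kernel_injection_example.py teaches the injection pattern: swap the pipeline’s normalization modules for kernel-backed replacements. Below is a minimal, hypothetical sketch of the idea; the class names and the custom_rmsnorm stand-in are our assumptions, not the actual script:

```python
# Hedged sketch of the kernel-injection pattern: recursively replace each
# RMSNorm module with one whose forward calls the custom kernel.
import torch
import torch.nn as nn

def custom_rmsnorm(x, weight, eps):
    # Stand-in for the agent-built CUDA op (or one loaded via get_kernel).
    v = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(v + eps)).to(x.dtype) * weight

class KernelRMSNorm(nn.Module):
    def __init__(self, weight: torch.Tensor, eps: float):
        super().__init__()
        self.weight = nn.Parameter(weight.detach().clone())
        self.eps = eps

    def forward(self, x):
        return custom_rmsnorm(x, self.weight, self.eps)

def swap_rmsnorm(module: nn.Module):
    for name, child in module.named_children():
        # diffusers and transformers spell these differently
        # (RMSNorm vs. Qwen3RMSNorm, .eps vs. .variance_epsilon).
        if child.__class__.__name__.endswith("RMSNorm") and getattr(child, "weight", None) is not None:
            eps = getattr(child, "eps", getattr(child, "variance_epsilon", 1e-6))
            setattr(module, name, KernelRMSNorm(child.weight, eps))
        else:
            swap_rmsnorm(child)

# e.g. swap_rmsnorm(pipe.transformer) for the LTX-Video transformer
```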


Isolated RMSNorm benchmark

First, we compare the isolated RMSNorm kernel against the PyTorch baseline. This is where the main speedup in the optimized pipeline comes from.
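The timing harness in a script like benchmark_rmsnorm.py typically looks something like the following sketch: CUDA-event timing with warmup against an eager PyTorch baseline. The helper names are ours, and custom_rmsnorm would be the agent-built kernel:

```python
# Minimal micro-benchmark sketch: mean kernel time via CUDA events.
import torch

def bench_ms(fn, *args, warmup=10, iters=100):
    for _ in range(warmup):  # exclude one-time setup/caching costs
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean milliseconds per call

def torch_rmsnorm(x, w, eps=1e-6):  # eager PyTorch baseline
    v = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(v + eps)).to(x.dtype) * w

x = torch.randn(1, 1024, 2048, device="cuda", dtype=torch.bfloat16)
w = torch.ones(2048, device="cuda", dtype=torch.bfloat16)
print(f"baseline: {bench_ms(torch_rmsnorm, x, w):.3f} ms")
# ...and the same call with the custom kernel, e.g. bench_ms(custom_rmsnorm, x, w)
```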

| Shape | Custom (ms) | PyTorch (ms) | Speedup |
| --- | --- | --- | --- |
| 1x1024x2048 | 0.039 | 0.064 | 1.64x |
| 2x1024x2048 | 0.040 | 0.073 | 1.82x |
| 4x1024x2048 | 0.052 | 0.093 | 1.78x |
| 1x4096x2048 | 0.052 | 0.093 | 1.79x |
| 2x4096x3072 | 0.102 | 0.209 | 2.04x |
| 1x8192x2048 | 0.083 | 0.150 | 1.81x |
| 4x4096x3072 | 0.173 | 0.393 | 2.26x |

Average speedup: 1.88x; Bandwidth efficiency: 34.7% of H100 theoretical (3,350 GB/s)

End-to-end video generation (49 frames, 30 steps, H100 80GB)

Next, we compare the end-to-end video generation performance of the optimized kernel with the baseline (no compilation) and the torch.compile baseline.
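As a sketch of how such a comparison can be wired up (the pipeline class and call arguments are assumptions based on the public LTX-Video pipeline in diffusers, not the skill’s benchmark_example.py):

```python
# Hedged end-to-end timing sketch: the same generation workload in three
# configurations. In practice, run each configuration once to warm up
# (especially after torch.compile) before timing.
import time
import torch
from diffusers import LTXPipeline

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

def time_generation(pipe):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    pipe(prompt="a red panda drinking tea", num_frames=49, num_inference_steps=30)
    torch.cuda.synchronize()
    return time.perf_counter() - t0

baseline = time_generation(pipe)
swap_rmsnorm(pipe.transformer)  # inject custom kernels (see the sketch above)
optimized = time_generation(pipe)
pipe.transformer = torch.compile(pipe.transformer)
compiled = time_generation(pipe)  # torch.compile on top of the custom kernels
print(f"baseline {baseline:.2f}s, optimized {optimized:.2f}s, +compile {compiled:.2f}s")
```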

| Configuration | Time (s) | it/s | Speedup |
| --- | --- | --- | --- |
| Baseline (no compilation) | 2.87 | 12.58 | 1.00x |
| Generated optimized kernels | 2.70 | 13.52 | 1.06x |
| Baseline + torch.compile | 2.14 | 19.05 | 1.34x |
| Optimized + torch.compile | 2.01 | 18.45 | 1.43x |

RMSNorm accounts for approximately 5% of total compute in LTX-Video; the rest goes to attention, linear projections, and VAE decoding. A 6% end-to-end speedup from a single kernel type is consistent with that profile.

Kernel Benchmark: Transformers (Qwen3-8B on H100)

The agent built an RMSNorm kernel for Qwen3-8B, a large language model from transformers with 65 RMSNorm modules across its 32 layers. The complete example lives in examples/qwen3_8b/, where you can inspect the generated kernel. Again the kernel is optimized for H100, and both benchmarks were run on an H100 80GB HBM3 in BFloat16 precision.


Isolated RMSNorm benchmark

Once again, we compare the performance of the isolated RMSNorm kernel to the PyTorch baseline.

| Shape | Custom (ms) | PyTorch (ms) | Speedup |
| --- | --- | --- | --- |
| 1x128x4096 | 0.040 | 0.062 | 1.58x |
| 1x512x4096 | 0.038 | 0.064 | 1.69x |
| 1x1024x4096 | 0.037 | 0.071 | 1.90x |
| 1x2048x4096 | 0.045 | 0.091 | 2.03x |
| 1x4096x4096 | 0.071 | 0.150 | 2.12x |
| 4x512x4096 | 0.056 | 0.093 | 1.67x |
| 8x256x4096 | 0.045 | 0.092 | 2.06x |
| 1x8192x4096 | 0.109 | 0.269 | 2.47x |

Average speedup: 1.94x; Bandwidth efficiency: 22.3% of H100 theoretical (3,350 GB/s)

The speedup scales with sequence length: 1.58x at 128 tokens up to 2.47x at 8192 tokens. For long-context inference, the custom kernel roughly halves RMSNorm latency.

Publishing the kernel to the Hub

The agent leaves you with a working kernel. The Kernel Hub lets you share it so anyone can load it without compiling anything. Here is the full path from agent output to published kernel.

1. Check the project structure

The agent generates a project that already follows the kernel-builder layout.

```
your_kernel/
├── build.toml               # Build configuration
├── kernel_src/
│   └── rmsnorm.cu           # CUDA kernel source
└── torch-ext/
    ├── torch_binding.cpp    # Registers torch ops
    └── your_kernels/
        └── __init__.py      # Python API wrapping _ops
```

build.toml tells kernel-builder what to build. The agent generates it with the correct CUDA capabilities for the target GPU:

```toml
[general]
name = "your_kernels"
backends = ["cuda"]

[torch]
src = ["torch-ext/torch_binding.cpp"]

[kernel.rmsnorm]
backend = "cuda"
src = ["kernel_src/rmsnorm.cu"]
depends = ["torch"]
cuda-capabilities = ["9.0"]  # H100
```

2. Build all variants using Nix

Kernel Hub kernels must support all modern PyTorch and CUDA configurations. The kernel-builder Nix flakes handle this automatically. Copy the sample flake.nix into your project and run:

```bash
nix flake update
nix run .#build-and-copy -L
```

This builds the kernel for all required PyTorch/CUDA variants and places the results in build/. To speed up builds, enable the Hugging Face Nix cache:

```bash
nix run nixpkgs#cachix -- use huggingface
```

3. Create and push the Hub repository

Create a model repository on the Hub and upload the built kernels:

```bash
huggingface-cli repo create your-org/your-kernel --type model
huggingface-cli upload your-org/your-kernel ./build
```

4. Anyone can load it in one line

Once published, anyone can use the kernel without any compilation.

```python
from kernels import get_kernel

rmsnorm = get_kernel("your-org/your-kernel")
```

get_kernel detects the user’s Python, PyTorch, and CUDA versions and downloads matching precompiled binaries. No builds or flags are required, and it’s typically ready in seconds.

Skills and the Hub are complementary: the skill handles development, the Hub handles distribution. Use the skill to build a kernel, validate it with the benchmark scripts, publish it to the Hub, and it becomes a one-liner for everyone else.

Conclusion

We built an agent skill that teaches coding agents how to write production CUDA kernels, then gave Claude and Codex two real targets: a diffusers pipeline and a transformers model. The agents produced working kernels end to end, with correct PyTorch bindings and benchmarks, and the optimized kernels showed speedups in both isolated and end-to-end benchmarks.
