
Custom kernels for everyone with Codex and Claude

By versatileai · February 13, 2026

tl;dr: We built an agent skill that teaches coding agents how to write production CUDA kernels. We then gave Claude and Codex two real targets: a diffusers pipeline and a transformers model. The agents produced working kernels for both, end to end, with correct PyTorch bindings and benchmarks.

Writing CUDA kernels is difficult. Writing a CUDA kernel that integrates correctly with transformers and diffusers is even more difficult. There are architecture-specific memory access patterns, vectorization strategies, warp shuffle reductions, and numerous integration pitfalls that trip up even experienced developers. This is exactly the kind of specialized, high-stakes problem that agent skills are made for.

We gave our coding agents the domain knowledge they needed: which GPU architectures to target, how to set up kernel-builder projects, when to use shared memory versus registers, and how to write PyTorch bindings. The agents did the rest. If you’ve used the LLM Training Skills or read “Claude Taught Me the Open Model,” you’ll recognize the pattern: package domain expertise into a skill, point an agent at a problem, and let it work.

Why do kernels need a skill?

The Kernel Hub solves distribution of custom hardware kernels: you can load a precompiled kernel from the Hub with a single get_kernel call. No builds, no flags. But someone still has to write the kernel. That’s the gap this skill fills.

CUDA kernel development has a demanding surface area:

  • Hardware: each GPU generation needs its own optimization guide. H100, A100, and T4 have different compute capabilities, shared memory sizes, and bandwidth profiles.
  • Libraries: diffusers and transformers have different module hierarchies, normalization conventions, and integration patterns.
  • Compiler: your custom kernel must be registered with PyTorch so that torch.compile can recognize it (sketched below).
  • Distribution: a compiled kernel depends on specific CUDA, PyTorch, and Python versions, creating a large environment matrix.

This is domain knowledge that gets lost across documentation tabs and Stack Overflow answers. The agent skill packages it into context that is loaded on demand.
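On the compiler point, here is a minimal sketch of what registering a custom op looks like with PyTorch’s torch.library API (available since PyTorch 2.4). The op name my_kernels::rmsnorm and the eager fallback are illustrative assumptions, not code from the skill:

```python
# Hypothetical sketch: expose a kernel as a PyTorch custom op so that
# torch.compile can trace through it.
import torch

@torch.library.custom_op("my_kernels::rmsnorm", mutates_args=())
def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # A real project would dispatch to the compiled CUDA kernel here;
    # this eager reference implementation stands in for it.
    variance = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

@rmsnorm.register_fake
def _(x, weight, eps):
    # Shape/dtype propagation so torch.compile can plan around the op.
    return torch.empty_like(x)
```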

We’ll start by quickly demonstrating how to use the skill, and then look at the details of how we benchmarked the kernel.

Installing skills

This skill ships with the kernels library. Install it for your coding agent with a single command:

```bash
# You need to install kernels from main for this
pip install git+https://github.com/huggingface/kernels.git

kernels skills add cuda-kernels --claude
```

This drops the skill into .claude/skills/cuda-kernels/, and Claude Code and Cursor will pick it up automatically. For other agents:

```bash
# Codex
kernels skills add cuda-kernels --codex

# OpenCode
kernels skills add cuda-kernels --opencode

# Custom destination
kernels skills add cuda-kernels --dest ./my-agent/skills/

# Install globally (available to all projects)
kernels skills add cuda-kernels --global

# Overwrite an existing installation
kernels skills add cuda-kernels --claude --force
```

Once installed, prompt the agent:

Build a vectorized RMSNorm kernel for H100 targeting the Qwen3-8B model from transformers.

Or go more open-ended:

Build an attention kernel optimized for H100 targeting the Qwen3-8B model from transformers. Benchmark against the PyTorch baseline to verify end-to-end performance improvements.

The agent reads the skill, selects appropriate architecture parameters, generates the CUDA sources, writes the PyTorch bindings, configures build.toml, and creates benchmark scripts.

If you’re working on more complex kernels or architecture-specific optimizations that aren’t covered, the skill still provides the basic building blocks and patterns to get started. We also welcome contributions to the skill itself.

What’s in the skill

The skill packs roughly 550 tokens of structured guidance, plus reference scripts, GPU optimization guides, troubleshooting documentation, and complete working examples. Coding agents like Codex and Claude read this and generate a working kernel project.

The contents:

  • NVIDIA GPU architecture-aware optimizations for H100, A100, and T4 (compute capability, memory bandwidth, shared memory size, block sizing)
  • Integration patterns for both diffusers and transformers, including pitfalls specific to each library
  • Kernel templates with vectorized memory access patterns for BF16, FP16, and FP32
  • Benchmark workflows covering both isolated kernel micro-benchmarks and end-to-end pipeline comparisons
  • Hugging Face Kernel Hub integration via get_kernel to load community kernels

```
.claude/skills/cuda-kernels/
├── SKILL.md                              # Main instructions (~550 tokens)
├── scripts/
│   ├── benchmark_example.py              # End-to-end benchmark template
│   ├── benchmark_rmsnorm.py              # Isolated kernel micro-benchmark
│   ├── ltx_kernel_injection_example.py   # diffusers integration pattern
│   ├── transformers_injection_example.py # transformers integration pattern
│   └── huggingface_kernels_example.py    # Kernel Hub integration
└── references/
    ├── diffusers-integration.md          # diffusers guide with pitfalls
    ├── transformers-integration.md       # transformers guide
    ├── huggingface-kernels-integration.md
    ├── h100-optimization-guide.md
    ├── a100-optimization-guide.md
    ├── t4-optimization-guide.md
    ├── kernel-templates.md
    └── troubleshooting.md
```

When the agent loads this, it has everything it needs to go from “write an RMSNorm kernel” to a buildable, benchmarkable project. Agents grep and glob through the skill to find relevant files and directories, so it’s important to organize a skill so that things are easy to find.

The agent is instructed to follow the templates in references/kernel-templates.md and generate a complete kernel project:

```
examples/your_model/
├── kernel_src/
│   └── rmsnorm.cu               # Vectorized CUDA kernel
├── torch-ext/
│   ├── your_kernels/__init__.py
│   └── torch_binding.cpp        # PyTorch C++ binding
├── benchmark_rmsnorm.py         # Micro-benchmark script
├── build.toml                   # kernel-builder configuration
├── setup.py                     # pip install -e .
└── pyproject.toml
```

We tested this on two real targets.

Kernel Benchmark: Diffusers (LTX-Video on H100)

The agent built RMSNorm, RoPE 3D, GEGLU, and AdaLN kernels for LTX-Video, a video generation pipeline from diffusers, with the RMSNorm kernel optimized for H100. The complete example lives in examples/ltx_video/, where you can inspect the generated kernels. Both benchmarks below were run on an H100 80GB HBM3 in BFloat16 precision.
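To use kernels like these, the skill’s scripts/ltx_kernel_injection_example.py teaches the injection pattern: swap the pipeline’s normalization modules for kernel-backed replacements. Below is a minimal, hypothetical sketch of the idea; the class names and the custom_rmsnorm stand-in are our assumptions, not the actual script:

```python
# Hedged sketch of the kernel-injection pattern: recursively replace each
# RMSNorm module with one whose forward calls the custom kernel.
import torch
import torch.nn as nn

def custom_rmsnorm(x, weight, eps):
    # Stand-in for the agent-built CUDA op (or one loaded via get_kernel).
    v = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(v + eps)).to(x.dtype) * weight

class KernelRMSNorm(nn.Module):
    def __init__(self, weight: torch.Tensor, eps: float):
        super().__init__()
        self.weight = nn.Parameter(weight.detach().clone())
        self.eps = eps

    def forward(self, x):
        return custom_rmsnorm(x, self.weight, self.eps)

def swap_rmsnorm(module: nn.Module):
    for name, child in module.named_children():
        # diffusers and transformers spell these differently
        # (RMSNorm vs. Qwen3RMSNorm, .eps vs. .variance_epsilon).
        if child.__class__.__name__.endswith("RMSNorm") and getattr(child, "weight", None) is not None:
            eps = getattr(child, "eps", getattr(child, "variance_epsilon", 1e-6))
            setattr(module, name, KernelRMSNorm(child.weight, eps))
        else:
            swap_rmsnorm(child)

# e.g. swap_rmsnorm(pipe.transformer) for the LTX-Video transformer
```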


Isolated RMSNorm benchmark

First, we compare the isolated RMSNorm kernel against the PyTorch baseline. This is where the main speedup in the optimized pipeline comes from.
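The timing harness in a script like benchmark_rmsnorm.py typically looks something like the following sketch: CUDA-event timing with warmup against an eager PyTorch baseline. The helper names are ours, and custom_rmsnorm would be the agent-built kernel:

```python
# Minimal micro-benchmark sketch: mean kernel time via CUDA events.
import torch

def bench_ms(fn, *args, warmup=10, iters=100):
    for _ in range(warmup):  # exclude one-time setup/caching costs
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean milliseconds per call

def torch_rmsnorm(x, w, eps=1e-6):  # eager PyTorch baseline
    v = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(v + eps)).to(x.dtype) * w

x = torch.randn(1, 1024, 2048, device="cuda", dtype=torch.bfloat16)
w = torch.ones(2048, device="cuda", dtype=torch.bfloat16)
print(f"baseline: {bench_ms(torch_rmsnorm, x, w):.3f} ms")
# ...and the same call with the custom kernel, e.g. bench_ms(custom_rmsnorm, x, w)
```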

| Shape | Custom (ms) | PyTorch (ms) | Speedup |
| --- | --- | --- | --- |
| 1x1024x2048 | 0.039 | 0.064 | 1.64x |
| 2x1024x2048 | 0.040 | 0.073 | 1.82x |
| 4x1024x2048 | 0.052 | 0.093 | 1.78x |
| 1x4096x2048 | 0.052 | 0.093 | 1.79x |
| 2x4096x3072 | 0.102 | 0.209 | 2.04x |
| 1x8192x2048 | 0.083 | 0.150 | 1.81x |
| 4x4096x3072 | 0.173 | 0.393 | 2.26x |

Average speedup: 1.88x; Bandwidth efficiency: 34.7% of H100 theoretical (3,350 GB/s)

End-to-end video generation (49 frames, 30 steps, H100 80GB)

Next, we compare the end-to-end video generation performance of the optimized kernel with the baseline (no compilation) and the torch.compile baseline.
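As a sketch of how such a comparison can be wired up (the pipeline class and call arguments are assumptions based on the public LTX-Video pipeline in diffusers, not the skill’s benchmark_example.py):

```python
# Hedged end-to-end timing sketch: the same generation workload in three
# configurations. In practice, run each configuration once to warm up
# (especially after torch.compile) before timing.
import time
import torch
from diffusers import LTXPipeline

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

def time_generation(pipe):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    pipe(prompt="a red panda drinking tea", num_frames=49, num_inference_steps=30)
    torch.cuda.synchronize()
    return time.perf_counter() - t0

baseline = time_generation(pipe)
swap_rmsnorm(pipe.transformer)  # inject custom kernels (see the sketch above)
optimized = time_generation(pipe)
pipe.transformer = torch.compile(pipe.transformer)
compiled = time_generation(pipe)  # torch.compile on top of the custom kernels
print(f"baseline {baseline:.2f}s, optimized {optimized:.2f}s, +compile {compiled:.2f}s")
```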

| Configuration | Time (s) | it/s | Speedup |
| --- | --- | --- | --- |
| Baseline (no compilation) | 2.87 | 12.58 | 1.00x |
| Generated optimized kernels | 2.70 | 13.52 | 1.06x |
| Baseline + torch.compile | 2.14 | 19.05 | 1.34x |
| Optimized + torch.compile | 2.01 | 18.45 | 1.43x |

RMSNorm accounts for approximately 5% of total compute in LTX-Video; the rest goes to attention, linear projections, and VAE decoding. A 6% end-to-end speedup from a single kernel type is consistent with that profile.

Kernel Benchmark: Transformers (Qwen3-8B on H100)

The agent built an RMSNorm kernel for Qwen3-8B, a large language model from transformers with 65 RMSNorm modules across its 32 layers. The complete example lives in examples/qwen3_8b/, where you can inspect the generated kernel. Again the kernel is optimized for H100, and both benchmarks were run on an H100 80GB HBM3 in BFloat16 precision.


Isolated RMSNorm benchmark

Once again, we compare the performance of the isolated RMSNorm kernel to the PyTorch baseline.

| Shape | Custom (ms) | PyTorch (ms) | Speedup |
| --- | --- | --- | --- |
| 1x128x4096 | 0.040 | 0.062 | 1.58x |
| 1x512x4096 | 0.038 | 0.064 | 1.69x |
| 1x1024x4096 | 0.037 | 0.071 | 1.90x |
| 1x2048x4096 | 0.045 | 0.091 | 2.03x |
| 1x4096x4096 | 0.071 | 0.150 | 2.12x |
| 4x512x4096 | 0.056 | 0.093 | 1.67x |
| 8x256x4096 | 0.045 | 0.092 | 2.06x |
| 1x8192x4096 | 0.109 | 0.269 | 2.47x |

Average speedup: 1.94x; Bandwidth efficiency: 22.3% of H100 theoretical (3,350 GB/s)

The speedup scales with sequence length: 1.58x at 128 tokens up to 2.47x at 8192 tokens. For long-context inference, the custom kernel roughly halves RMSNorm latency.

Publishing the kernel to the Hub

The agent leaves you with a working kernel. The Kernel Hub lets you share it so anyone can load it without compiling anything. Here is the full path from agent output to published kernel.

1. Check the project structure

The agent generates a project that already follows the kernel-builder layout.

```
your_kernel/
├── build.toml               # Build configuration
├── kernel_src/
│   └── rmsnorm.cu           # CUDA kernel source
└── torch-ext/
    ├── torch_binding.cpp    # Registers torch ops
    └── your_kernels/
        └── __init__.py      # Python API wrapping _ops
```

build.toml tells kernel-builder what to build. The agent generates it with the correct CUDA capabilities for the target GPU:

```toml
[general]
name = "your_kernels"
backends = ["cuda"]

[torch]
src = ["torch-ext/torch_binding.cpp"]

[kernel.rmsnorm]
backend = "cuda"
src = ["kernel_src/rmsnorm.cu"]
depends = ["torch"]
cuda-capabilities = ["9.0"]  # H100
```

2. Build all variants using Nix

Kernel Hub kernels must support all modern PyTorch and CUDA configurations. The kernel-builder Nix flakes handle this automatically. Copy the sample flake.nix into your project and run:

```bash
nix flake update
nix run .#build-and-copy -L
```

This builds the kernel for all required PyTorch/CUDA variants and places the results in build/. To speed up builds, enable the Hugging Face Nix cache:

```bash
nix run nixpkgs#cachix -- use huggingface
```

3. Create and push the Hub repository

Create a model repository on the Hub and upload the built kernels:

```bash
huggingface-cli repo create your-org/your-kernel --type model
huggingface-cli upload your-org/your-kernel ./build
```

4. Anyone can load it in one line

Once published, anyone can use the kernel without any compilation.

```python
from kernels import get_kernel

rmsnorm = get_kernel("your-org/your-kernel")
```

get_kernel detects the user’s Python, PyTorch, and CUDA versions and downloads matching precompiled binaries. No builds or flags are required, and it’s typically ready in seconds.

Skills and the Hub are complementary: the skill handles development, the Hub handles distribution. Use the skill to build a kernel, validate it with the benchmark scripts, publish it to the Hub, and it becomes a one-liner for everyone else.

Conclusion

We built an agent skill that teaches coding agents how to write production CUDA kernels, then gave Claude and Codex two real targets: a diffusers pipeline and a transformers model. The agents produced working kernels end to end, with correct PyTorch bindings and benchmarks, and the optimized kernels showed speedups in both isolated and end-to-end benchmarks.
