Applying custom policies through inference: Faster, more secure AI applications

By versatileai · December 3, 2025 · 8 Mins Read

Most safety models enforce a single, generalized policy that blocks obviously harmful content and jailbreak attempts. This covers a wide range of categories, but real-world applications demand much more: common content safety mechanisms break down when rules are nuanced or context matters.

Consider an e-commerce chatbot that must avoid culturally sensitive topics such as religion and politics. A telco support bot should block PII requests, prevent fraudulent billing advice, and refuse unsafe technical instructions such as disabling firewalls. Healthcare applications face similar challenges with HIPAA compliance and avoiding unvetted medical advice. These requirements do not fit a one-size-fits-all policy, and developers often fall back on brittle prompt engineering and manual rule sets that fail as complexity grows.

This is why NVIDIA introduced Nemotron Content Safety Reasoning, a model designed to combine the flexibility of reasoning with the speed needed for production environments. In this blog, we explore why reasoning matters for AI safety, what makes this model unique, how it's built, and the evidence behind its performance.

Why reasoning is important for content safety

Static classifiers label content as safe or unsafe, but have a hard time with domain-specific policies. Developers need content safety that adapts dynamically, such as avoiding comparisons with competitors, restricting certain legal advice, or blocking sensitive topics in certain regions.

Reasoning-based safety models solve this by interpreting policies in context rather than relying on fixed logic. They analyze intent, apply nuanced rules, and catch subtle violations that general-purpose models miss. This flexibility makes reasoning essential for applying complex, evolving policies without retraining. The challenge is performance: traditional reasoning models generate long chains of thought, introducing latency that makes real-time deployment impractical. Developers need the benefits of reasoning without the cost.

NVIDIA Nemotron Content Safety Reasoning

Nemotron Content Safety Reasoning provides dynamic, policy-driven safety and topic moderation for LLM-powered applications, allowing organizations to apply both standard and fully custom policies at inference time without retraining. It combines nuanced, domain-aware reasoning with low-latency execution, providing developers with a flexible and robust solution for tailoring AI output to their unique requirements.

Unlike static guardrails that rely on strict rule sets, or general safety guard models that rely on predefined global safety policies, this model dynamically interprets nuanced policies and adapts across geographies, industries, and domains. This flexibility comes with production-ready performance: the model avoids the latency penalties inherent in reasoning models by delivering its decision in a single sentence of optimized inference. Developers can define policies in natural language, load them into the model, and apply them immediately. Whether it's a chatbot, an AI agent, or a customer-facing application, Nemotron Content Safety Reasoning combines domain-aware reasoning with low-latency execution to keep your AI tailored to your unique requirements.
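To make this concrete, here is a minimal sketch of what a natural-language custom policy might look like for the e-commerce scenario above. The exact policy format the model expects is documented in its model card; the wording and rule structure below are purely illustrative.

```python
# Illustrative only: a natural-language custom policy for an e-commerce
# chatbot. The exact policy format expected by the model is defined in its
# model card; these rules are hypothetical examples.
ECOMMERCE_POLICY = """\
Allowed: product questions, order status, shipping, returns, sizing advice.
Prohibited:
- Discussion of religion or politics.
- Comparisons with or recommendations of competitor stores.
- Requests for or disclosure of personally identifiable information (PII).
An interaction is non-compliant if it violates any prohibited rule.
"""

def policy_rules(policy: str) -> list[str]:
    """Extract the prohibited-rule lines for quick inspection."""
    return [line.lstrip("- ").strip() for line in policy.splitlines()
            if line.startswith("- ")]
```

Because the policy is plain text, updating it is an edit-and-reload operation rather than a retraining job.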

NVIDIA has long invested in open technology for LLM safety and guardrails. NeMo Guardrails is one of the first open source frameworks for integrating safety into AI applications, complemented by shared training datasets and research papers to promote transparency and reproducibility. NVIDIA has also released Nemotron models focused on content safety, topic control, and jailbreak detection. These model endpoints are also available as NVIDIA NIM™ for easy deployment on GPU-accelerated systems.

How it's built

The Nemotron Content Safety Reasoning model accepts three inputs: a policy that defines allowed and prohibited content, a user prompt, and optionally an assistant response. It predicts whether the interaction complies with the policy and provides a brief reason why. The model is trained for dual-mode inference, letting developers turn reasoning traces on and off and so choose between maximum flexibility (reasoning on) and minimal latency (reasoning off).
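As a rough sketch of how the three inputs and the reasoning toggle might be assembled into a single prompt. The real chat template ships with the model's tokenizer; the section markers and the "/think" mode switch below are assumptions for illustration, not the actual format.

```python
from typing import Optional

# Sketch of assembling the model's three inputs into one prompt. The real
# chat template is defined by the model's tokenizer; this hand-rolled
# format and the "/think" switch are illustrative only.
def build_guard_prompt(policy: str, user_prompt: str,
                       assistant_response: Optional[str] = None,
                       reasoning: bool = True) -> str:
    parts = [
        "/think" if reasoning else "/no_think",  # hypothetical on/off switch
        f"POLICY:\n{policy}",
        f"USER:\n{user_prompt}",
    ]
    if assistant_response is not None:
        parts.append(f"ASSISTANT:\n{assistant_response}")
    parts.append("Is this interaction compliant with the policy?")
    return "\n\n".join(parts)
```

When the assistant response is omitted, the same template serves as a prompt-only (input moderation) check.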

Figure 1: An integrated pipeline for efficient content safety inference in four stages: distillation, difficulty-aware refinement, dual-mode shortened inference, and custom policy adaptation.

Our training pipeline consists of four main stages.

  1. Inference trace extraction and supervised fine-tuning
  2. Difficulty-aware refinement
  3. Faster inference and increased efficiency with dual mode
  4. Custom policy adaptation

Inference trace extraction and supervised fine-tuning. The first stage uses powerful reasoning models (e.g., DeepSeek-R1-0528, Qwen3-32B, gpt-oss-120b) to distill a dataset of reasoning traces that judge whether a user prompt or assistant response is harmful according to a standard safety taxonomy. In our case, we used the Nemotron Content Safety Dataset V2 with its underlying safety policy. We also found it important to provide ground-truth labels at this stage, as even a strong reasoning model can misclassify some safety prompts. Using the extracted traces, we trained a small model, starting from Gemma-3-4b-it, with supervised fine-tuning (SFT); this acts as the reasoning guard model. Although the final model was trained on traces from Qwen3-32B only, we are releasing the entire dataset on Hugging Face (see Nemotron Content Safety Reasoning Dataset).

Difficulty-aware refinement. In our experiments, we observed that the reasoning guard model requires only a fraction of the training data needed by a non-reasoning model. We therefore trained an initial reasoning guard model on a subset of 5,000 random samples and used it to predict labels for the rest of the original training set. Using an approach similar to best-of-N sampling, we discard samples that are always predicted correctly (too easy) and samples that are always predicted incorrectly (most likely noisy annotations), and treat the remainder as difficult. This process extracts only a small set of samples, and running continued SFT on this data further improves model performance.
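The selection rule can be sketched in a few lines. This is one plausible reading of the filtering step, with the sample format made up for illustration:

```python
# Sketch of difficulty-aware filtering: for each sample we take N
# predictions from the initial guard model (best-of-N style) and keep only
# samples it sometimes gets right and sometimes gets wrong. Always-correct
# samples are too easy; always-incorrect ones are likely noisy annotations.
def select_difficult(samples):
    """samples: list of (ground_truth_label, [N predicted labels])."""
    keep = []
    for truth, preds in samples:
        n_correct = sum(p == truth for p in preds)
        if 0 < n_correct < len(preds):  # neither always right nor always wrong
            keep.append((truth, preds))
    return keep
```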

Faster inference and increased efficiency with dual mode. Guard models typically run alongside the main LLM to ensure interactions follow the desired policy, so they need to be fast. To improve the efficiency of the Nemotron Content Safety Reasoning model, we distilled each reasoning chain into a one-sentence summary, limiting the number of output tokens and improving latency. We found that this process did not reduce the model's accuracy. At the same time, training in dual mode (reasoning on/off) improves the performance of the reasoning-off mode, which can be used for common safety tasks.
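A naive stand-in for the trace-shortening step might simply keep the final sentence of a reasoning chain; the actual pipeline distills the summary with an LLM, so this is only a sketch of the idea:

```python
import re

# Toy stand-in for trace shortening: keep only the last sentence of a
# chain of thought as the one-sentence summary. The real pipeline
# distills this summary with an LLM rather than truncating.
def one_sentence_summary(trace: str) -> str:
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", trace.strip())
                 if s.strip()]
    return sentences[-1] if sentences else ""
```

The point is the output budget: a single-sentence target caps generated tokens, which is what keeps per-request latency low.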

Custom policy adaptation. Even when trained only on the standard safety dataset, the reasoning guard model performs well on custom safety policies, but we observe that adding custom policies to training improves robustness and overall performance. In our case, we want the model to handle topic and dialogue moderation alongside safety moderation, so we train it on CantTalkAboutThis, a topic moderation dataset NVIDIA introduced last year. We augment this dataset with reasoning traces and add it to the general safety data before applying SFT.

Benchmark: Ultra-efficient inference and dynamic policy enforcement

The Nemotron Content Safety Reasoning model delivers accurate policy reasoning in a single sentence, up to 40% faster than traditional reasoning safety models. It supports custom and evolving policies at inference time without retraining, and achieves strong results with fewer training examples. The benchmarks show:

  • Higher accuracy on custom policies than comparable models.
  • 2-3x lower latency than large reasoning models.
  • Production-ready performance on any GPU with 8 GB or more of VRAM.
  • Dual-mode operation:
      • Reasoning off: a low-latency mode for standard, fast classification; very effective for general safety.
      • Reasoning on: an advanced mode that provides explicit reasoning traces for decisions, improving performance on complex or novel custom policies.
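In an application, the dual-mode choice can be a simple routing decision. A minimal sketch, assuming reasoning-off is reserved for the standard general safety policy and reasoning-on for anything custom (the policy text is a placeholder):

```python
# Sketch: route each check to the cheaper reasoning-off mode when the
# standard general safety policy applies, and to reasoning-on for custom
# policies. GENERAL_SAFETY is a placeholder for the model's standard
# policy text.
GENERAL_SAFETY = "standard content safety policy"

def choose_mode(policy: str) -> str:
    return "reasoning_off" if policy == GENERAL_SAFETY else "reasoning_on"
```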

The evaluation focused on the performance of the reasoning model and its latency cost. We used both general and custom safety datasets to measure the model's effectiveness under different guardrail policies. For general safety, we compute harmful F1 scores on prompts and responses over a combination of datasets that use similar safety policies: WildguardMix-Test, Aegis (Nemotron Content Safety) 2.0 Test, OpenAI Moderation, ToxicChat, XSTest, SimpleSafetyTests, and JailbreakBench. For custom safety, we selected the CoSApien and Dyanguardrail datasets, which have more realistic custom policies and user prompts. We compare Nemotron Content Safety Reasoning to leading open source safety guard models (Nemotron Content Safety v2, an alternative 7B classifier guard model, and two alternative 20B and 120B reasoning guard MoE models) on both harmful F1 and latency.
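For reference, the harmful F1 metric used throughout treats "harmful" as the positive class. A pure-Python version for illustration:

```python
# Harmful F1: F1 score with "harmful" as the positive class, the metric
# used throughout the evaluation.
def harmful_f1(y_true, y_pred, positive="harmful"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```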

Figure 2: Comparison of harmful F1 scores of NVIDIA Nemotron content safety inference and alternative safety inference models on mixed datasets with similar safety policies.

Figure 3: Average latency comparison of NVIDIA Nemotron Content Safety Reasoning and alternative safety reasoning models.

Complete benchmark results and ablation studies are available in the Findings of EMNLP 2025 paper. For more information about the training and evaluation datasets, see the model and dataset cards.

Get started: policy, speed, and control

Real-world AI systems require safety "guardrails" that adapt to brand guidelines, regulatory requirements, and evolving domain rules. Consider an in-car assistant that must adhere to strict safety and brand policies: responses should stay limited to navigation and infotainment while avoiding comparisons with, or recommendations of, competitors. These scenarios require both flexibility and speed, which is exactly what this reasoning-based Nemotron content safety model provides. Get access to the models and datasets you need to train and evaluate on Hugging Face today.

All artifacts are published under the NVIDIA Open Model License Agreement, which permits modification and redistribution. Latency benchmarks were run on an H100 GPU, but the model has low VRAM requirements and can run on any GPU with more than 8 GB of VRAM. Finally, Nemotron Content Safety Reasoning is supported by all major inference toolkits: Hugging Face Inference, vLLM, TensorRT-LLM, and SGLang. The model is a fine-tuned Gemma-3-4B-it, so any inference engine that supports that base model will work.
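A minimal integration sketch shows where the guard sits relative to the main LLM. The `guard` and `llm` callables are stand-ins for your actual model calls (via vLLM, TensorRT-LLM, etc.); they are stubbed here so the control flow is visible.

```python
# Minimal integration sketch: run the guard model before and after the
# main LLM. `guard` and `llm` are stand-ins for real model calls (e.g.
# via vLLM or Hugging Face); here they are plain callables so the flow
# is visible without a serving stack.
def moderated_reply(user_prompt, policy, guard, llm,
                    refusal="Sorry, I can't help with that."):
    # Input moderation: check the user prompt alone.
    if guard(policy, user_prompt, None) == "non-compliant":
        return refusal
    response = llm(user_prompt)
    # Output moderation: check the full interaction.
    if guard(policy, user_prompt, response) == "non-compliant":
        return refusal
    return response
```

Running the guard on both the prompt and the response mirrors the model's optional third input: prompt-only checks catch bad requests early, while the full-interaction check catches policy violations in the generated answer.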
