Today, we are excited to announce the launch of the Hugging Face NVIDIA NIM API (serverless), a new service on the Hugging Face Hub available to Enterprise Hub organizations. This new service makes it easy to use open models with the NVIDIA DGX Cloud accelerated compute platform for inference serving. We built this solution so that Enterprise Hub users can easily access the latest NVIDIA AI technology in a serverless way to run inference on popular generative AI models, including Llama and Mistral, using standardized APIs and a few lines of code within the Hugging Face Hub.

Serverless inference with NVIDIA NIM
This new experience builds on our collaboration with NVIDIA to simplify access to and use of open generative AI models on NVIDIA accelerated computing. One of the main challenges developers and organizations face is the upfront cost of infrastructure and the complexity of optimizing LLM inference workloads. With the Hugging Face NVIDIA NIM API (serverless), we offer an easy solution to these challenges, providing instant access to state-of-the-art open generative AI models optimized for NVIDIA infrastructure. The pay-as-you-go pricing model makes it an economical option for businesses of all sizes, since you pay only for the request time you use.
The NVIDIA NIM API (serverless) complements Train on DGX Cloud, an AI training service already available on Hugging Face.
How it works
Running serverless inference with a Hugging Face model has never been easier. Here’s a step-by-step guide to get you started:
Note: You need access to an organization with a Hugging Face Enterprise Hub subscription to run inference.
Before you begin, make sure you meet the following requirements:
- You are a member of an Enterprise Hub organization.
- You have created a fine-grained token for your organization. Follow the steps below to create one.
Create a fine-grained token
Fine-grained tokens allow users to create tokens with specific permissions for precise access control to resources and namespaces. First, go to your Hugging Face Access Tokens settings, click “Create new token,” and select “Fine-grained.”

Enter a token name, select your Enterprise organization as the scope under Org permissions, and click “Create token.” There is no need to select any additional scopes.

Next, save the token value to authenticate your requests later.
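Before moving on, you can optionally verify that the token authenticates correctly. Here is a minimal sketch using the `whoami` helper from the `huggingface_hub` library; this check is our suggestion, not a required part of the setup:

```python
from huggingface_hub import whoami

# Optional sanity check: confirm the token authenticates and that
# your Enterprise Hub organization is visible to it.
info = whoami(token="YOUR_FINE_GRAINED_TOKEN_HERE")
print(info["name"])  # your username
print([org["name"] for org in info.get("orgs", [])])  # organizations visible to this token
```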
Find your NIM
You can find “NVIDIA NIM API (serverless)” on the model page of supported generative AI models. You can find all supported models in this NVIDIA NIM collection, and in the pricing section below.
We will use meta-llama/Meta-Llama-3-8B-Instruct. Open the meta-llama/Meta-Llama-3-8B-Instruct model card, open the “Deploy” menu, and select “NVIDIA NIM API (serverless).”

Submit a request
The NVIDIA NIM API (serverless) is standardized on the OpenAI API. This allows you to use the OpenAI SDK for inference. Replace YOUR_FINE_GRAINED_TOKEN_HERE with your fine-grained token, and you are ready to run inference.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://huggingface.co/api/integrations/dgx/v1",
    api_key="YOUR_FINE_GRAINED_TOKEN_HERE",
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 500"},
    ],
    stream=True,
    max_tokens=1024,
)

# Stream the response to stdout as tokens arrive
for message in chat_completion:
    print(message.choices[0].delta.content, end="")
```
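If you don’t need token-by-token streaming, the same call can be made with `stream=False` to get the full completion in a single response. This is a minimal sketch, assuming the endpoint follows the standard OpenAI SDK behavior:

```python
# Non-streaming variant: returns the full completion in one response
chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Count to 500"}],
    stream=False,
    max_tokens=1024,
)
print(chat_completion.choices[0].message.content)
```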
Congratulations! You can now start building your generative AI applications using open models. 🔥
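Because the API follows the OpenAI convention, you can also call it without the SDK. Here is a sketch using plain `requests`, assuming the standard OpenAI-style `/chat/completions` route and bearer-token authentication:

```python
import requests

# Same request without the OpenAI SDK, assuming the standard
# OpenAI-compatible /chat/completions route and bearer-token auth.
response = requests.post(
    "https://huggingface.co/api/integrations/dgx/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_FINE_GRAINED_TOKEN_HERE"},
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Count to 500"}],
        "max_tokens": 1024,
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```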
The NVIDIA NIM API (serverless) currently supports only the chat.completions.create and models.list APIs. We are working on extending this while adding more models. You can use models.list to see which models are currently available for inference.

```python
models = client.models.list()
for m in models.data:
    print(m.id)
```
Supported models and pricing
Usage of the Hugging Face NVIDIA NIM API (serverless) is billed based on the compute time spent per request. We use exclusively NVIDIA H100 Tensor Core GPUs, which are priced at $8.25 per hour. To make per-request pricing easier to understand, we can convert this to a per-second rate:
$8.25 per hour ≈ $0.0023 per second (rounded to four decimal places)
The total cost of a request depends on the model size, the number of GPUs required, and the time it takes to process the request. Here is a breakdown of our current model offerings, their GPU requirements, typical response times, and estimated cost per request:
| Model ID | Number of NVIDIA H100 GPUs | Typical Response Time (500 input tokens, 100 output tokens) | Estimated Cost per Request |
|---|---|---|---|
| meta-llama/Meta-Llama-3-8B-Instruct | 1 | 1 second | $0.0023 |
| meta-llama/Meta-Llama-3-70B-Instruct | 4 | 2 seconds | $0.0184 |
| meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 | 8 | 5 seconds | $0.0917 |
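As a sanity check, each estimate in the table is simply the number of GPUs multiplied by the response time and the per-second GPU rate; small differences come from rounding the per-second rate:

```python
# Estimated cost per request = GPUs x response time (s) x per-second H100 rate
H100_RATE_PER_SECOND = 8.25 / 3600  # $8.25 per hour, ~$0.0023 per second

def estimated_cost(num_gpus: int, response_seconds: float) -> float:
    return num_gpus * response_seconds * H100_RATE_PER_SECOND

print(f"${estimated_cost(1, 1):.4f}")  # Meta-Llama-3-8B-Instruct        -> $0.0023
print(f"${estimated_cost(4, 2):.4f}")  # Meta-Llama-3-70B-Instruct       -> $0.0183
print(f"${estimated_cost(8, 5):.4f}")  # Meta-Llama-3.1-405B-Instruct    -> $0.0917
```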
Usage fees accrue to your Enterprise Hub organization’s current monthly billing cycle. You can check your current and past usage at any time in the billing settings of your Enterprise Hub organization.
Accelerating AI inference with NVIDIA TensorRT-LLM
We are pleased to continue our collaboration with NVIDIA to push the boundaries of AI inference performance and accessibility. A key focus of our ongoing efforts is the integration of the NVIDIA TensorRT-LLM library into Hugging Face’s Text Generation Inference (TGI) framework.
We will share more details, benchmarks, and best practices for using TGI with NVIDIA TensorRT-LLM in the near future. Stay tuned for even more exciting developments as we continue to expand our collaboration with NVIDIA and bring powerful AI capabilities to developers and organizations around the world!