Today, we are excited to announce the launch of Hugging Face Generative AI Services, also known as HUGS: optimized, zero-configuration inference microservices designed to simplify and accelerate the development of AI applications built on open models. Built on open-source Hugging Face technologies such as Text Generation Inference and Transformers, HUGS offers the best solution for efficiently building and scaling Generative AI applications on your own infrastructure. HUGS is optimized to run open models on a variety of hardware accelerators, including NVIDIA GPUs and AMD GPUs, with AWS Inferentia and Google TPUs coming soon.
Zero-configuration, optimized inference for open models
HUGS simplifies the optimized deployment of open models on your own infrastructure and on a variety of hardware. A key challenge for developers and organizations is the engineering complexity of optimizing LLM inference workloads for a given GPU or AI accelerator. HUGS enables maximum-throughput deployments of the most popular open LLMs with no configuration required. Every deployment configuration provided by HUGS is fully tested and maintained to work out of the box.
HUGS model deployment provides an OpenAI-compatible API for drop-in replacement of existing Generative AI applications built on model provider APIs. Simply point your code to a HUGS deployment and power your applications using an open model hosted on your own infrastructure.
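For instance, if your application already talks to a model provider through the openai Python SDK, the switch can be as small as changing the base URL. A minimal sketch, assuming a HUGS deployment reachable at a placeholder localhost address (TGI-based endpoints serve a single fixed model, so the model name below is effectively a placeholder too):

```python
from openai import OpenAI

# Placeholder URL: substitute the address of your own HUGS deployment.
client = OpenAI(base_url="http://localhost:8080/v1/", api_key="-")

response = client.chat.completions.create(
    model="tgi",  # the served model is fixed by the deployment; this field is a placeholder
    messages=[{"role": "user", "content": "What is deep learning?"}],
    temperature=0.7,
    max_tokens=128,
)
print(response.choices[0].message.content)
```

No other application code needs to change: requests, parameters, and responses follow the same OpenAI chat completions schema.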
Why HUGS?
HUGS provides an easy way to build AI applications using open models hosted on your own infrastructure, with the following benefits:
- Within your infrastructure: Deploy open models inside your own secure environment, keeping your data and models off the internet.
- Zero-configuration deployment: HUGS reduces deployment time from weeks to minutes, automatically optimizing the model and serving configuration for NVIDIA GPUs, AMD GPUs, or AI accelerators.
- Hardware-optimized inference: Built on Hugging Face's Text Generation Inference (TGI), HUGS is tuned for peak performance across a variety of hardware configurations.
- Hardware flexibility: Run HUGS on a variety of accelerators, including NVIDIA GPUs and AMD GPUs, with support for AWS Inferentia and Google TPUs coming soon.
- Model flexibility: HUGS is compatible with a wide range of open-source models, ensuring flexibility and choice for your AI applications.
- Industry-standard API: Easily deploy HUGS on Kubernetes with OpenAI API-compatible endpoints and minimal code changes.
- Enterprise distribution: HUGS is an enterprise distribution of Hugging Face open-source technology, offering long-term support, rigorous testing, and SOC2 compliance.
- Enterprise compliance: Minimize compliance risks by including the necessary licenses and terms of use.
We gave some Enterprise Hub customers early access to HUGS. Here is what they had to say:
HUGS saves us a lot of time deploying ready-to-use models locally with great performance. What used to take a week can now be done in less than an hour. For customers with sovereign AI requirements, this is a game changer. – Henri Jouhaud, CTO, Polyconseil
I tried deploying Gemma 2 on GCP with HUGS, using an L4 GPU. There was no need to modify the library, version, or parameters; it worked as is. HUGS gives you the confidence to extend your internal use of open models. – Ghislain Putois, Research Engineer, Orange
How to get started with HUGS
Using HUGS is easy. Here’s how to get started:
Note: Depending on the deployment method you choose, you will need access to the appropriate subscription or marketplace product.
Where to find HUGS
HUGS is available through several channels:

- Cloud Service Provider (CSP) marketplaces: Find and deploy HUGS on Amazon Web Services (AWS) and Google Cloud Platform (GCP). Support for Microsoft Azure is coming soon.
- DigitalOcean: HUGS is available natively within DigitalOcean as a new 1-Click Model service, powered by Hugging Face HUGS and GPU Droplets.
- Enterprise Hub: If your organization has upgraded to Enterprise Hub, contact your sales team to gain access to HUGS.
Please refer to the related documentation linked above for specific deployment instructions for each platform.
Pricing
HUGS offers on-demand pricing based on uptime for each container, except for deployments on DigitalOcean.
- AWS Marketplace and Google Cloud Platform Marketplace: $1 per hour per container, with no minimum charge (compute usage is billed separately by the CSP). AWS offers a 5-day free trial during which you can test HUGS at no cost.
- DigitalOcean: The 1-Click Models powered by Hugging Face HUGS are available at no additional charge; regular GPU Droplet compute costs apply.
- Enterprise Hub: Custom HUGS access for Enterprise Hub organizations. Please contact our sales team for more information.
Performing inference
HUGS is based on Text Generation Inference (TGI) and provides a seamless inference experience. For detailed instructions and examples, see the Performing Inference with HUGS guide. HUGS exposes the OpenAI-compatible Messages API, so you can send requests with familiar tools and libraries such as cURL, the huggingface_hub SDK, and the openai SDK.
```python
from huggingface_hub import InferenceClient

# Replace with the URL of your HUGS deployment
ENDPOINT_URL = "REPLACE"

client = InferenceClient(base_url=ENDPOINT_URL, api_key="-")

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "What is deep learning?"},
    ],
    temperature=0.7,
    top_p=0.95,
    max_tokens=128,
)
print(chat_completion.choices[0].message.content)
```
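Streaming also works through the same client, since HUGS inherits token streaming from TGI. A minimal sketch, reusing the client and endpoint above:

```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is deep learning?"}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the assistant's message.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```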
Supported models and hardware
HUGS supports a wide range of open models and a growing ecosystem of hardware platforms. Please see the Supported Models and Supported Hardware pages for the latest information.
Today we are releasing 13 popular open LLMs.
See the documentation for details on which models are supported on which hardware.
Get started with HUGS now
HUGS makes it easy to harness the power of open models, with zero-configuration, optimized inference running within your own infrastructure. With HUGS, you can take control of your AI applications and easily move proof-of-concept applications built on closed models to open models you host yourself.
Get started today and deploy HUGS on AWS, Google Cloud, or DigitalOcean.