We are pleased to announce that native support for Intel Gaudi hardware is now integrated directly into Text Generation Inference (TGI), our production-ready serving solution for Large Language Models (LLMs). This integration brings the power of Intel’s specialized AI accelerators to our high-performance inference stack, expanding deployment options for the open-source AI community.
✨ What’s new?
We have fully integrated Gaudi support into TGI’s main codebase in PR#3091. Previously, we maintained a separate fork, tgi-gaudi, for Gaudi devices. This was cumbersome for users and prevented us from supporting the latest TGI features at launch. Now, using TGI’s new multi-backend architecture, we support Gaudi directly in TGI.
This integration supports Intel’s full line of Gaudi hardware.
You can find more details about Gaudi hardware on Intel’s Gaudi product page.
Why is this important?
TGI’s Gaudi backend offers several key benefits:
Hardware versatility: more options for deploying LLMs in production beyond traditional GPUs
Cost efficiency: Gaudi hardware often offers attractive price-performance for specific workloads
Production readiness: all the robustness of TGI’s features (dynamic batching, streaming responses, etc.)
Advanced features: support for multi-card inference (sharding), vision-language models, and FP8 precision
Getting started with TGI on Gaudi
The easiest way to run TGI on Gaudi is to use the official Docker image. You need to run the image on a machine with Gaudi hardware. Here is a basic example to get you started:
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data
hf_token=YOUR_HF_ACCESS_TOKEN

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
  -p 8080:80 \
  -v $volume:/data \
  -e HF_TOKEN=$hf_token \
  -e HABANA_VISIBLE_DEVICES=all \
  ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
  --model-id $model
Once the server is running, you can send an inference request:
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 32}}' \
  -H 'Content-Type: application/json'
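Streaming responses are available as well. As a quick example against the same server started above, TGI’s /generate_stream endpoint returns tokens incrementally as server-sent events instead of a single response:

# streams tokens back as server-sent events while they are generated
curl 127.0.0.1:8080/generate_stream \
  -X POST \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 32}}' \
  -H 'Content-Type: application/json'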
For comprehensive documentation on using TGI with Gaudi, including how-to guides and advanced configuration, see the new dedicated Gaudi backend documentation.
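For instance, multi-card inference uses TGI’s standard launcher sharding flags. The sketch below extends the basic command from above; the shard count is illustrative and should match the number of Gaudi cards you want to use:

# sketch: sharded launch across multiple Gaudi cards (shard count is illustrative)
docker run --runtime=habana --cap-add=sys_nice --ipc=host \
  -p 8080:80 \
  -v $volume:/data \
  -e HF_TOKEN=$hf_token \
  -e HABANA_VISIBLE_DEVICES=all \
  ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
  --model-id $model \
  --sharded true --num-shard 8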
Top features
We have optimized the following models for both single-card and multi-card configurations. This means these models run as fast as possible on Intel Gaudi. We have specifically optimized the modeling code to target Intel Gaudi hardware, ensuring the best performance and taking full advantage of Gaudi’s capabilities:
Llama 3.1 (8B and 70B)
Llama 3.3 (70B)
Llama 3.2 Vision (11B)
Mistral (7B)
Mixtral (8x7B)
CodeLlama (13B)
Falcon (180B)
Qwen2 (72B)
Starcoder and Starcoder2
Gemma (7B)
Llava-v1.6-Mistral-7B
Phi-2
We also offer many advanced features on Gaudi hardware, such as FP8 quantization thanks to Intel Neural Compressor (INC), enabling even greater performance optimization.
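The full FP8 workflow (calibration and quantization with INC) is described in the Gaudi backend documentation. As an illustrative sketch only, assuming a quantization config has already been generated with INC and that the server picks it up from a QUANT_CONFIG environment variable (treat the variable name and the path below as assumptions):

# illustrative: point the server at a pre-generated INC quantization config
# (variable name and path are assumptions; see the Gaudi backend docs for the exact workflow)
docker run --runtime=habana --cap-add=sys_nice --ipc=host \
  -p 8080:80 \
  -v $volume:/data \
  -e HF_TOKEN=$hf_token \
  -e HABANA_VISIBLE_DEVICES=all \
  -e QUANT_CONFIG=/data/quant_config/maxabs_quant.json \
  ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
  --model-id $model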
✨ Coming soon! We look forward to expanding our model lineup with cutting-edge additions, including DeepSeek-R1/V3, QWen-VL, and more powerful models to power your AI applications.
Getting involved
We invite the community to try out TGI on Gaudi hardware and share feedback. The complete documentation is available in the TGI Gaudi backend documentation. If you are interested in contributing, check out the contribution guidelines or open an issue on GitHub with your feedback. By bringing Intel Gaudi support directly into TGI, we continue our mission of providing flexible, efficient, production-ready tools for deploying LLMs. We can’t wait to see what you will build with this new capability!