Versa AI hub
Tools

Accelerating LLM inference with TGI in Intel Gaudi

By versatileai | March 28, 2025 | 3 min read

We are pleased to announce that support for Intel Gaudi hardware is now natively integrated into Text Generation Inference (TGI), a production-ready serving solution for large language models (LLMs). This integration brings the power of Intel’s specialized AI accelerators to a high-performance inference stack, expanding deployment options for the open-source AI community.

✨ What’s new?

Gaudi support is now fully integrated into TGI’s main codebase in PR #3091. Previously, we maintained a separate fork, tgi-gaudi, for Gaudi devices. This was cumbersome for users and meant the latest TGI features were not supported at launch. We now support Gaudi directly in TGI through its new multi-backend architecture.

This integration supports the full line of Intel Gaudi hardware.

You can find more details about Gaudi hardware on Intel’s Gaudi product page.

🌟 Why is this important?

TGI’s Gaudi backend offers several important benefits:

  • Hardware versatility 🔄: more options for deploying LLMs in production beyond traditional GPUs
  • Cost performance 💰: Gaudi hardware often offers attractive price-performance for specific workloads
  • Full feature set ⚙️: all of TGI’s robust features, such as dynamic batching and streamed responses
  • Advanced capabilities 🔥: support for multi-card inference (sharding), vision-language models, and FP8 precision

Getting started with TGI on Gaudi

The easiest way to run TGI on Gaudi is to use the official Docker image. The image must run on a machine with Gaudi hardware. Here is a basic example to get you started:

model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data
hf_token=YOUR_HF_ACCESS_TOKEN

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
  -p 8080:80 \
  -v $volume:/data \
  -e HF_TOKEN=$hf_token \
  -e HABANA_VISIBLE_DEVICES=all \
  ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
  --model-id $model

Once the server is running, you can send an inference request:

curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 32}}' \
  -H 'Content-Type: application/json'
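The same request can be sent from Python. A minimal sketch using only the standard library, assuming the local host/port and JSON schema from the curl example above:

```python
import json
from urllib import request

def build_payload(prompt, max_new_tokens=32):
    # JSON body matching the /generate schema used in the curl example.
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}

def generate(prompt, max_new_tokens=32, url="http://127.0.0.1:8080/generate"):
    # POST to a running TGI server and return the generated text.
    req = request.Request(
        url,
        data=json.dumps(build_payload(prompt, max_new_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]
```

The `build_payload` helper keeps the request body in one place, so the same dict can be reused for batch or streaming requests.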

For comprehensive documentation on using TGI with Gaudi, including how-to guides and advanced configuration, see the new dedicated Gaudi backend documentation.
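TGI can also stream tokens as they are produced. A hedged Python sketch, assuming TGI’s standard `/generate_stream` endpoint, which emits server-sent events as `data:`-prefixed JSON lines (one per token):

```python
import json
from urllib import request

def parse_sse_event(line):
    # Parse one "data:{...}" line from an SSE stream; return None for
    # blank or non-data lines.
    line = line.strip()
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):])

def stream_generate(prompt, max_new_tokens=32,
                    url="http://127.0.0.1:8080/generate_stream"):
    # Yield token texts as a running TGI server produces them.
    body = json.dumps(
        {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    ).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        for raw in resp:  # HTTPResponse iterates line by line
            event = parse_sse_event(raw.decode("utf-8"))
            if event is not None:
                yield event["token"]["text"]
```

Printing the yielded tokens as they arrive gives the familiar incremental-generation experience without waiting for the full completion.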

🎉 Top features

We have optimized the following models for both single-card and multi-card configurations, so they run as fast as possible on Intel Gaudi: their modeling code has been specifically optimized for Gaudi hardware to deliver the best performance and take full advantage of Gaudi’s capabilities.

  • Llama 3.1 (8B and 70B)
  • Llama 3.3 (70B)
  • Llama 3.2 Vision (11B)
  • Mistral (7B)
  • Mixtral (8x7B)
  • CodeLlama (13B)
  • Falcon (180B)
  • Qwen 2 (72B)
  • StarCoder and StarCoder2
  • Gemma (7B)
  • Llava-v1.6-Mistral-7B
  • Phi-2

Many advanced features are also available on Gaudi hardware, such as FP8 quantization through Intel Neural Compressor (INC), enabling even greater performance optimization.

✨ Coming soon! We look forward to expanding the model lineup with cutting-edge additions, including DeepSeek-R1/V3, Qwen-VL, and more top models, to power your AI applications. 🚀

Getting involved

We invite the community to try TGI on Gaudi hardware and share feedback. Complete documentation is available in the TGI Gaudi backend documentation. If you’re interested in contributing, check out the contribution guidelines and open an issue on GitHub with your feedback. By bringing Intel Gaudi support directly into TGI, we continue our mission of providing flexible, efficient, production-ready tools for deploying LLMs. We can’t wait to see what you’ll build with this new capability! 🎉
© 2025 Versa AI Hub. All Rights Reserved.