Since its first release in 2022, Text Generation Inference (TGI) has provided Hugging Face and the AI community with a performance-focused solution for easily deploying large language models (LLMs). TGI originally offered an almost code-free way to load models from the Hugging Face Hub and serve them in production on NVIDIA GPUs. Over time, support has expanded to include AMD Instinct GPUs, Intel GPUs, AWS Trainium/Inferentia, Google TPUs, and Intel Gaudi.
Over the years, multiple inference solutions have emerged, including vLLM, SGLang, llama.cpp, and TensorRT-LLM, fragmenting the ecosystem. Different models, hardware, and use cases may require specific backends to achieve optimal performance. However, correctly configuring each backend, managing licenses, and integrating them into existing infrastructure can be difficult for users.
To address this, we are pleased to introduce TGI backends. This new architecture gives you the flexibility to plug any of the solutions above into TGI, which acts as a single, unified front-end layer. It makes it easier for the community to switch backends depending on modeling, hardware, and performance requirements, and to get the best performance for production workloads.
The Hugging Face team contributes to and collaborates with the teams building vLLM, llama.cpp, and TensorRT-LLM, as well as the teams at AWS, Google, NVIDIA, AMD, and Intel, to provide a robust and consistent user experience for TGI users, no matter what backend or hardware they use.
TGI backends: a look at the internals
TGI consists of multiple components and is primarily written in Rust and Python. Rust powers the HTTP and scheduling layers, and Python remains the go-to for modeling.
Simply put, Rust improves the overall robustness of the serving layer through static analysis and compiler-enforced memory safety. It also makes it easier to scale across multiple cores while preserving those same guarantees. By leveraging Rust's powerful type system for the HTTP layer and scheduler, TGI sidesteps the memory issues and Global Interpreter Lock (GIL) contention that Python-based serving layers face, while maximizing concurrency.
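To make that concrete, here is a minimal, self-contained Rust sketch (not TGI's actual code) of the pattern in play: an HTTP handler hands typed requests to a scheduler task over a channel, the compiler rules out data races at compile time, and no GIL limits how many requests are in flight. The names (`InferRequest`, the echo "engine") are illustrative assumptions; only the tokio crate is assumed as a dependency.

```rust
// Illustrative sketch only: a typed request queue feeding a scheduler task.
use tokio::sync::{mpsc, oneshot};

// Hypothetical request type, standing in for the router's own.
struct InferRequest {
    prompt: String,
    // Channel on which the scheduler sends the generated text back.
    respond_to: oneshot::Sender<String>,
}

#[tokio::main]
async fn main() {
    // Bounded queue between the HTTP layer and the scheduler.
    let (tx, mut rx) = mpsc::channel::<InferRequest>(128);

    // Scheduler task: owns the queue and dispatches work.
    tokio::spawn(async move {
        while let Some(req) = rx.recv().await {
            // In a real engine this is where batching and model execution happen.
            let _ = req.respond_to.send(format!("echo: {}", req.prompt));
        }
    });

    // An HTTP handler would do roughly this for each incoming request.
    let (resp_tx, resp_rx) = oneshot::channel();
    tx.send(InferRequest { prompt: "Hello".into(), respond_to: resp_tx })
        .await
        .expect("scheduler is running");
    println!("{}", resp_rx.await.expect("scheduler replied"));
}
```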
Speaking of Rust… surprisingly enough, it is also the entry point for integrating new backends into TGI 🤗
Earlier this year, the TGI team worked on exposing the foundational knobs needed to disentangle how the HTTP server and the scheduler are coupled together. This work introduced a new Rust trait, Backend, to interface between the current inference engine and future ones.
Having this new backend interface (a trait, in Rust terminology) opens the door to modularity, allowing us to route incoming requests to different modeling and execution engines.
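As an illustration of what such an interface can look like, here is a simplified sketch of a backend trait. The names and signatures below (`Backend`, `schedule`, `health`, the request/response/error types) are illustrative stand-ins rather than TGI's exact definitions, and the `futures` crate is assumed for the streaming return type.

```rust
use std::pin::Pin;

use futures::Stream; // assumed dependency for streaming token responses

// Hypothetical types standing in for the router's own request/response/error types.
pub struct GenerateRequest {
    pub prompt: String,
    pub max_new_tokens: u32,
}
pub struct TokenChunk {
    pub text: String,
}
#[derive(Debug)]
pub struct BackendError(pub String);

/// Anything that can execute generation requests can be plugged into the router
/// by implementing this trait: an in-process engine, llama.cpp, TensorRT-LLM,
/// vLLM, and so on.
pub trait Backend: Send + Sync {
    /// Hand a validated request to the engine and get back a stream of tokens.
    fn schedule(
        &self,
        request: GenerateRequest,
    ) -> Result<Pin<Box<dyn Stream<Item = Result<TokenChunk, BackendError>> + Send>>, BackendError>;

    /// Report whether the underlying engine is healthy and ready to serve.
    fn health(&self) -> bool;
}
```

Any engine that can turn a request into a stream of tokens, whether running in-process or wrapping an external runtime, can sit behind the same HTTP server and scheduler once it implements an interface along these lines.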
Looking to the future: 2025
TGI's new multi-backend capability opens up many impactful roadmap opportunities. Looking ahead to 2025, we want to share some of the TGI developments we are most excited about:
- NVIDIA TensorRT-LLM backend: We are working with the NVIDIA TensorRT-LLM team to bring all of the optimized NVIDIA GPU and TensorRT performance to the community. We will discuss this work in more detail in an upcoming blog post. TGI + TRT-LLM closely relates to our mission of providing AI builders with open access to optimized, TensorRT-compatible quantization, build, and evaluation artifacts, making it easy to deploy, run, and scale on NVIDIA GPUs.
- llama.cpp backend: We are working with the llama.cpp team to extend support for server production use cases. TGI's llama.cpp backend provides a strong CPU-based option for anyone wishing to deploy on Intel, AMD, or ARM CPU servers.
- vLLM backend: We are contributing to the vLLM project and plan to integrate vLLM as a TGI backend in Q1 2025.
- AWS Neuron backend: We are working with the Neuron team at AWS to enable native support for Inferentia 2 and Trainium 2 in TGI.
- Google TPU backend: We are working with the Google Jetstream and TPU teams to deliver the best possible performance through TGI.
We believe TGI backends will help simplify LLM deployments and bring versatility and performance to all TGI users. You will soon be able to use TGI backends directly within Inference Endpoints, so customers can easily deploy models on a variety of hardware with the highest-performing and most reliable TGI backend for the job.
Stay tuned for upcoming blog posts detailing the technical internals and performance benchmarks of these new backends.