Large language models (LLMs) have revolutionized natural language processing and are increasingly deployed to solve complex problems at scale. Achieving optimal performance with these models is notoriously challenging because of their unique and intense computational demands. Optimized LLM performance is extremely valuable for end users looking for a snappy, responsive experience, as well as for scaled deployments, where improved throughput translates directly into dollars saved.
That's where the Optimum-NVIDIA inference library comes in. Available on Hugging Face, Optimum-NVIDIA dramatically accelerates LLM inference on the NVIDIA platform through an extremely simple API. By changing a single line of code, you can unlock up to 28x faster inference and 1,200 tokens/second on the NVIDIA platform.
Optimum-NVIDIA is the first Hugging Face inference library to benefit from the new float8 format supported on the NVIDIA Ada Lovelace and Hopper architectures. FP8, in addition to the advanced compilation capabilities of NVIDIA TensorRT-LLM software, dramatically accelerates LLM inference.
How to do it
You can start running Llama with blazing-fast inference speeds in just three lines of code using Optimum-NVIDIA's pipeline. If you already have a pipeline set up with Hugging Face's Transformers library to run Llama, you only need to change a single line of code to unlock peak performance!
- from transformers.pipelines import pipeline
+ from optimum.nvidia.pipelines import pipeline

# everything else is the same as in transformers!
pipe = pipeline('text-generation', 'meta-llama/Llama-2-7b-chat-hf', use_fp8=True)

pipe("Describe a real-world application of AI in sustainable energy.")
You can also enable FP8 quantization with a single flag, which allows you to run a bigger model on a single GPU, at faster speeds, and without sacrificing accuracy. The flag shown in this example uses a predefined calibration strategy by default, though you can provide your own calibration dataset and customized tokenization to tailor the quantization to your use case.
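To make the "bigger model on a single GPU" point concrete, here is a small sketch that reuses the same pipeline and use_fp8 flag shown above, this time with the 13B chat model; the prompt string is just an illustration.

from optimum.nvidia.pipelines import pipeline

# FP8 weights and activations take roughly half the memory of FP16, which is
# what lets the larger 13B chat model fit and run quickly on a single GPU.
pipe = pipeline('text-generation', 'meta-llama/Llama-2-13b-chat-hf', use_fp8=True)
pipe("Explain how FP8 quantization reduces memory usage during inference.")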
The pipeline interface is great for getting up and running quickly, but power users who want fine-grained control over sampling parameters can use the Model API.
- from transformers import AutoModelForCausalLM
+ from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
+   use_fp8=True,
)

model_inputs = tokenizer(
    ["How is autonomous vehicle technology transforming the future of transportation and urban planning?"],
    return_tensors="pt",
).to("cuda")

# sampling parameters are now fully under your control (values here are illustrative)
generated_ids, generated_length = model.generate(**model_inputs, top_k=40, top_p=0.7, repetition_penalty=10)

tokenizer.batch_decode(generated_ids[0], skip_special_tokens=True)
For more details, see the documentation.
Performance evaluation
When evaluating the performance of an LLM, we consider two metrics: First Token Latency and Throughput. First Token Latency (also known as Time to First Token or prefill latency) measures how long you wait from the moment you enter your prompt until you begin receiving output, so it tells you how responsive the model will feel. Optimum-NVIDIA delivers up to 3.3x faster First Token Latency compared to stock transformers.
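The post does not prescribe a measurement harness, but as a minimal sketch, First Token Latency can be approximated by timing a generation capped at a single new token. This reuses the model and model_inputs from the Model API example above and assumes generate accepts the standard max_new_tokens argument.

import time

# Approximate time to first token: prompt processing plus a single decode step.
start = time.perf_counter()
model.generate(**model_inputs, max_new_tokens=1)
first_token_latency_s = time.perf_counter() - start
print(f"First token latency: {first_token_latency_s * 1000:.1f} ms")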
Throughput, on the other hand, measures how fast the model can generate tokens and is particularly relevant when you want to batch generations together. While there are a few ways to calculate throughput, we adopted a standard method of dividing the end-to-end latency by the total sequence length, including both input and output tokens, summed over all batches. Optimum-NVIDIA delivers up to 28x better throughput compared to stock transformers.
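As a small illustration of that definition (the helper function and the numbers below are purely illustrative assumptions, not part of the library), throughput works out like this:

def tokens_per_second(prompt_tokens: int, generated_tokens: int, end_to_end_latency_s: float) -> float:
    """Total sequence length (prompt + generated tokens, summed over the batch)
    divided by end-to-end latency."""
    return (prompt_tokens + generated_tokens) / end_to_end_latency_s

# Example numbers only: a batch of 8 prompts, 128 input and 256 output tokens each,
# completing in 2.4 seconds end to end.
print(tokens_per_second(8 * 128, 8 * 256, 2.4))  # 1280.0 tokens/s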
Initial evaluations of the recently announced NVIDIA H200 Tensor Core GPU show up to a 2x boost in throughput for Llama models compared to the NVIDIA H100 Tensor Core GPU. As H200 GPUs become more readily available, we will share performance data for Optimum-NVIDIA running on them.
Next Steps
Optimum-NVIDIA currently provides peak performance for the LlamaForCausalLM architecture and task, so any Llama-based model, including fine-tuned versions, should work with Optimum-NVIDIA out of the box today. We are actively expanding support to include other text-generation model architectures and tasks.
We continue to push the boundaries of performance and plan to incorporate cutting-edge optimization techniques such as in-flight batching, to improve throughput when streaming prompts, and INT4 quantization, to run even bigger models on a single GPU.
Give it a try: we are releasing the Optimum-NVIDIA repository with instructions on how to get started. Please share your feedback with us! 🤗