Hugging Face on AMD Instinct MI300 GPUs

By versatileai · May 9, 2025

Join us on June 6th for the next Hugging Cast to ask questions to the post's authors, watch a live demo of Llama 3 running on MI300X on Azure, and catch a bonus demo of deploying models locally on Ryzen AI PCs!

Register at https://streammyard.com/watch/imzuvjnmz8bv

Introduction

At Hugging Face, we want to make it easy to build AI with open models and open source, whichever framework, cloud, and stack you choose. A key component of this is the ability to deploy AI models on a versatile selection of hardware. Through our collaboration with AMD, for about a year now we have been helping ensure there will always be devices able to run the largest models in the community on the AMD fleet, including AMD Instinct™ and Radeon™ GPUs, EPYC™ and Ryzen™ CPUs, and Ryzen AI NPUs. Today, we are delighted to announce that Hugging Face and AMD have been working together to give the latest generation of AMD GPU servers, namely the AMD Instinct MI300, first-class citizen integration across the Hugging Face platform. From prototyping in a local environment to running models in production on Azure ND MI300X V5 VMs, there is no need to modify your code when using transformers (1), text-generation-inference, or our other libraries. Let's jump in!
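To make the "no code changes" point concrete, here is a minimal sketch of standard transformers code that runs as-is on a ROCm build of PyTorch (AMD GPUs are exposed through the usual "cuda" device). The model name is a placeholder assumption, not something prescribed by this post.

```python
# Minimal sketch: the same transformers code runs unchanged on an AMD GPU,
# provided PyTorch is a ROCm build. The model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello from an MI300X!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```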

Open Source and Production Enablement

Maintaining support for AMD Instinct GPUs in transformers and text-generation-inference

With so much going on in AI right now, we had to make sure the MI300 lineup was properly tested and monitored over the long term. To achieve this, we worked closely with the infrastructure team here at Hugging Face to make sure we had robust building blocks available for whoever needs to enable continuous integration and deployment (CI/CD), and that enabling it would be painless and without affecting anything already in place.

To make that possible, we worked with AMD and the Microsoft Azure team to leverage the recently released Azure ND MI300X V5 as a building block targeting the MI300. Within a few hours, the infrastructure team was able to deploy, set up, and get everything running on the MI300.

We also moved from our old infrastructure to a managed Kubernetes cluster, scheduling all the Hugging Face GitHub workflows we want to run on hardware-specific pods. This migration lets us run the exact same CI/CD pipelines on a variety of hardware platforms, transparently for developers. We were able to get CI/CD up and running on the Azure MI300X VM within a few days without much effort.

As a result, transformers and text-generation-inference are now regularly tested on both the previous generation of AMD Instinct GPUs, the MI250, and the latest MI300. In practice, tens of thousands of unit tests regularly verify the state of these repositories, ensuring the correctness and robustness of the integration over the long term.

Improved performance for production AI workloads

Inference performance

As mentioned in the introduction, we have been working to make sure the new AMD Instinct MI300 GPUs can efficiently execute inference workloads through our open source inference solution, Text Generation Inference (TGI). TGI can be seen as three different components:

- A transport layer, mostly HTTP, exposing and receiving API requests from clients
- A scheduling layer, ensuring requests are potentially batched together (i.e., continuous batching) to increase the density of hardware computations without affecting the user experience
- A modeling layer, running the actual computations on the device and leveraging the highly optimized routines involved in the model
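To make this concrete, here is a minimal sketch of what the client side of those layers looks like: a single text-generation call against a running TGI server, using huggingface_hub's InferenceClient. The endpoint URL is a placeholder assumption.

```python
# Minimal sketch: querying a running TGI server from Python.
# Assumes TGI is already serving a model at this (placeholder) URL.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # placeholder endpoint

# The scheduling layer batches concurrent requests (continuous batching);
# from the client side, it is a single text-generation call.
output = client.text_generation(
    "What is the AMD Instinct MI300?",
    max_new_tokens=64,
)
print(output)
```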

Here, with the help of AMD engineers, we focused on this last component, the modeling, to effectively set up, run, and optimize the workloads for serving models such as the Meta Llama family. In particular, we focused on:

- FlashAttention v2
- PagedAttention
- GPTQ/AWQ compression techniques
- PyTorch integration of ROCm TunableOp
- Integration of optimized fused kernels

Most of these, FlashAttention v2, PagedAttention, and the GPTQ/AWQ compression methods (especially their optimized routines/kernels), have been around for quite some time. We won't go into the details of those three here, and we recommend visiting their original implementation pages to learn more.

Still, with a brand-new hardware platform and a new SDK release, it was important to carefully validate, profile, and optimize every bit to extract all the power from this new platform.

Last but not least, as part of this TGI release, we are integrating the recently released AMD TunableOp, part of PyTorch 2.3. TunableOp provides a versatile mechanism to find the most efficient way to compute general matrix multiplications (i.e., GEMMs) with respect to the shapes and data types involved. TunableOp is integrated into PyTorch and still under active development, but, as shown below, it can improve the performance of GEMM operations without significantly affecting the user experience. Specifically, we use TunableOp for the small input sequences corresponding to the decoding phase of autoregressive model generation, obtaining an 8-10% latency speedup.

In fact, when a new TGI instance is created, it begins with a warm-up step that runs dummy payloads through the model, making sure the model and its memory are allocated and ready to shine.

TunableOp lets us enable the GEMM routine tuner for this allotted warm-up time, searching for the most efficient setup with respect to the parameters the user provided, such as sequence length and maximum batch size. Once the warm-up phase is over, we disable the tuner for the rest of the server's life and leverage the optimized routines.
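As a rough illustration of that flow, TunableOp can be driven through PyTorch environment variables; the sketch below assumes PyTorch 2.3+ on ROCm, and the file name and tensor shapes are illustrative placeholders, not TGI's actual configuration.

```python
# Minimal sketch: enabling PyTorch TunableOp around a warm-up GEMM.
# Assumes PyTorch 2.3+ built for ROCm; all values are illustrative.
import os

# Must be set before GEMMs run; tuning results are cached to a CSV file.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"    # use tuned GEMMs
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"     # search during warm-up
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"  # placeholder

import torch

# Warm-up: a dummy GEMM with the shapes expected at decode time,
# so the tuner can pick the fastest kernel for them.
a = torch.randn(2, 8192, device="cuda", dtype=torch.float16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
_ = a @ b

# After warm-up, disable tuning but keep using the tuned kernels.
torch.cuda.tunable.tuning_enable(False)
```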

As mentioned before, we ran all the benchmarks on the Azure ND MI300X V5 recently made available by Microsoft. Compared with a server built on the previous generation AMD Instinct MI250, deploying Meta Llama 3 70B, we observed a 2x-3x speedup, both in the first token latency (the prefill step) and in the latency of every subsequent decoding step.

Meta Llama 3 70B TGI latency results, comparing the AMD Instinct MI300X on an Azure VM with the previous generation AMD Instinct MI250

Model fine-tuning performance

Hugging Face libraries can also be used to fine-tune models. We use transformers and the PEFT library to fine-tune with low-rank adapters (LoRA), and the accelerate library to handle parallelism across several devices.

On Llama 3 70B, the workload consists of batches of 448 tokens with a batch size of 2. Using low-rank adapters, the 70,570,090,496 parameters of the original model are frozen, and we instead train an additional subset of 16,384,000 parameters; a minimal sketch of this setup follows below.
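As an illustration, here is what such a LoRA setup looks like with PEFT; the adapter rank and target modules are assumptions for illustration, not the exact configuration behind these numbers.

```python
# Minimal sketch: wrapping a causal LM with LoRA adapters via PEFT.
# Rank and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B")

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumption)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumption)
    task_type="CAUSAL_LM",
)

# Freezes the base model's parameters and adds small trainable adapters.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameters
```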

These comparisons on Llama 3 70B show training roughly twice as fast on the Azure VM with MI300X as on an HPC server using the previous generation AMD Instinct MI250.

PEFT fine-tuning with MI300 vs. MI250

Additionally, the MI300X benefits from its 192 GB of HBM3 memory (compared to 128 GB on the MI250), which allows it to load and fine-tune Meta Llama 3 70B entirely on a single device, whereas the MI250 cannot fit the ~140 GB model on a single device. Because it is always important to be able to replicate and challenge benchmarks, we are releasing a companion GitHub repository containing all the artifacts and source code used to collect the performance numbers featured in this post.
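The ~140 GB figure follows directly from the parameter count; here is a quick back-of-the-envelope check, assuming 2 bytes per parameter (float16/bfloat16) and counting weights only, excluding activations and optimizer state.

```python
# Back-of-the-envelope memory check for Meta Llama 3 70B weights.
# Assumes 2 bytes per parameter (float16/bfloat16), weights only.
params = 70_570_090_496
bytes_per_param = 2

weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~141 GB

# Fits in the MI300X's 192 GB of HBM3, but not in the MI250's 128 GB.
print(weights_gb < 192, weights_gb < 128)  # True False
```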

What’s next?

We have many exciting features in the pipeline for these new AMD Instinct MI300 GPUs. One of the key areas where we will invest a lot of effort in the coming weeks is minifloats (i.e., float8 and below). These data layouts have the inherent advantage of compressing information in a non-uniform way, which mitigates some of the problems faced by integer formats.

In scenarios such as LLM inference, they would halve the size of the key-value cache usually used when serving LLMs. Storing the key-value cache in float8, combined with float8/float8 matrix multiplications, would then bring performance benefits on top of the reduced memory footprint, as the small calculation below illustrates.
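To see where the factor of two comes from, here is a small sketch of the key-value cache arithmetic; all dimensions below are illustrative assumptions, not measured values from this post.

```python
# Rough KV-cache size estimate; all dimensions are illustrative assumptions.
layers = 80        # transformer layers
kv_heads = 8       # grouped-query attention KV heads
head_dim = 128
seq_len = 8192
batch = 2

# Two tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim].
elements = 2 * layers * batch * kv_heads * seq_len * head_dim

for name, bytes_per_el in [("float16", 2), ("float8", 1)]:
    print(f"{name}: {elements * bytes_per_el / 1e9:.1f} GB")
# float8 halves the cache size compared to float16.
```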

Conclusion

As you can see, the AMD Instinct MI300 brings a significant performance boost for AI use cases, covering end-to-end workloads from training to inference. At Hugging Face, we are extremely excited to see what the community and enterprises will be able to achieve with this new hardware and these integrations, and we would love to hear from you and help with your use cases.

Stop by the optimum-amd and text-generation-inference GitHub repositories to get the latest performance optimizations for your AMD GPU!
