Hugging Face on AMD Instinct MI300 GPUs

By versatileai · May 9, 2025

Join us on June 6th for the next Hugging Cast to ask questions to the post's authors, watch a live demo of Llama 3 running on MI300X on Azure, and catch a bonus demo of deploying models locally on Ryzen AI PCs!

Register at https://streammyard.com/watch/imzuvjnmz8bv

Introduction

At Hugging Face, we want to make it easy to build AI with open models and open source, whichever framework, cloud, and stack you choose. A key component of this is the ability to deploy AI models on a versatile selection of hardware. Through our collaboration with AMD, for about a year now we have been helping ensure there will always be devices able to run the largest models in the community on the AMD fleet, including AMD Instinct™ and Radeon™ GPUs, EPYC™ and Ryzen™ CPUs, and Ryzen AI NPUs. Today, we are delighted to announce that Hugging Face and AMD have been working together to give the latest generation of AMD GPU servers, namely the AMD Instinct MI300, first-class citizen integration across the Hugging Face platform. From prototyping in a local environment to running models in production on Azure ND MI300X V5 VMs, there is no need to modify your code when using transformers (1), text-generation-inference, or our other libraries. Let's jump in!
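To make the "no code changes" point concrete, here is a minimal sketch of standard transformers code that runs as-is on a ROCm build of PyTorch (AMD GPUs are exposed through the usual "cuda" device). The model name is a placeholder assumption, not something prescribed by this post.

```python
# Minimal sketch: the same transformers code runs unchanged on an AMD GPU,
# provided PyTorch is a ROCm build. The model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello from an MI300X!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```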

Open Source and Production Enablement

Maintaining support for AMD Instinct GPUs in transformers and text-generation-inference

With so much going on in AI right now, we had to make sure the MI300 lineup was properly tested and monitored over the long term. To achieve this, we worked closely with the infrastructure team here at Hugging Face to make sure we had robust building blocks available for whoever needs to enable continuous integration and deployment (CI/CD), and that enabling it would be painless and without affecting anything already in place.

To make that possible, we worked with AMD and the Microsoft Azure team to leverage the recently released Azure ND MI300X V5 as a building block targeting the MI300. Within a few hours, the infrastructure team was able to deploy, set up, and get everything running on the MI300.

We also moved from our old infrastructure to a managed Kubernetes cluster, scheduling all the Hugging Face GitHub workflows we want to run on hardware-specific pods. This migration lets us run the exact same CI/CD pipelines on a variety of hardware platforms, transparently for developers. We were able to get CI/CD up and running on the Azure MI300X VM within a few days without much effort.

As a result, transformers and text-generation-inference are now regularly tested on both the previous generation of AMD Instinct GPUs, the MI250, and the latest MI300. In practice, tens of thousands of unit tests regularly verify the state of these repositories, ensuring the correctness and robustness of the integration over the long term.

Improved performance for production AI workloads

Inference performance

As mentioned in the introduction, we have been working to make sure the new AMD Instinct MI300 GPUs can efficiently execute inference workloads through our open source inference solution, Text Generation Inference (TGI). TGI can be seen as three different components:

- A transport layer, mostly HTTP, exposing and receiving API requests from clients
- A scheduling layer, ensuring requests are potentially batched together (i.e., continuous batching) to increase the density of hardware computations without affecting the user experience
- A modeling layer, running the actual computations on the device and leveraging the highly optimized routines involved in the model
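To make this concrete, here is a minimal sketch of what the client side of those layers looks like: a single text-generation call against a running TGI server, using huggingface_hub's InferenceClient. The endpoint URL is a placeholder assumption.

```python
# Minimal sketch: querying a running TGI server from Python.
# Assumes TGI is already serving a model at this (placeholder) URL.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # placeholder endpoint

# The scheduling layer batches concurrent requests (continuous batching);
# from the client side, it is a single text-generation call.
output = client.text_generation(
    "What is the AMD Instinct MI300?",
    max_new_tokens=64,
)
print(output)
```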

Here, with the help of AMD engineers, we focused on this last component, the modeling, to effectively set up, run, and optimize the workloads for serving models such as the Meta Llama family. In particular, we focused on:

- FlashAttention v2
- PagedAttention
- GPTQ/AWQ compression techniques
- PyTorch integration of ROCm TunableOp
- Integration of optimized fused kernels

Most of these, FlashAttention v2, PagedAttention, and the GPTQ/AWQ compression methods (especially their optimized routines/kernels), have been around for quite some time. We won't go into the details of those three here, and we recommend visiting their original implementation pages to learn more.

Still, with a brand-new hardware platform and a new SDK release, it was important to carefully validate, profile, and optimize every bit to extract all the power from this new platform.

Last but not least, as part of this TGI release, we are integrating the recently released AMD TunableOp, part of PyTorch 2.3. TunableOp provides a versatile mechanism to find the most efficient way to compute general matrix multiplications (i.e., GEMMs) with respect to the shapes and data types involved. TunableOp is integrated into PyTorch and still under active development, but, as shown below, it can improve the performance of GEMM operations without significantly affecting the user experience. Specifically, we use TunableOp for the small input sequences corresponding to the decoding phase of autoregressive model generation, obtaining an 8-10% latency speedup.

In fact, when a new TGI instance is created, it begins with a warm-up step that runs dummy payloads through the model, making sure the model and its memory are allocated and ready to shine.

TunableOp lets us enable the GEMM routine tuner for this allotted warm-up time, searching for the most efficient setup with respect to the parameters the user provided, such as sequence length and maximum batch size. Once the warm-up phase is over, we disable the tuner for the rest of the server's life and leverage the optimized routines.
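As a rough illustration of that flow, TunableOp can be driven through PyTorch environment variables; the sketch below assumes PyTorch 2.3+ on ROCm, and the file name and tensor shapes are illustrative placeholders, not TGI's actual configuration.

```python
# Minimal sketch: enabling PyTorch TunableOp around a warm-up GEMM.
# Assumes PyTorch 2.3+ built for ROCm; all values are illustrative.
import os

# Must be set before GEMMs run; tuning results are cached to a CSV file.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"    # use tuned GEMMs
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"     # search during warm-up
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"  # placeholder

import torch

# Warm-up: a dummy GEMM with the shapes expected at decode time,
# so the tuner can pick the fastest kernel for them.
a = torch.randn(2, 8192, device="cuda", dtype=torch.float16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
_ = a @ b

# After warm-up, disable tuning but keep using the tuned kernels.
torch.cuda.tunable.tuning_enable(False)
```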

As mentioned before, we ran all the benchmarks on the Azure ND MI300X V5 recently made available by Microsoft. Compared with a server built on the previous generation AMD Instinct MI250, deploying Meta Llama 3 70B, we observed a 2x-3x speedup, both in the first token latency (the prefill step) and in the latency of every subsequent decoding step.

Meta Llama 3 70B TGI latency results, comparing the AMD Instinct MI300X on an Azure VM with the previous generation AMD Instinct MI250

Model fine-tuning performance

Hugging Face libraries can also be used to fine-tune models. We use transformers and the PEFT library to fine-tune with low-rank adapters (LoRA), and the accelerate library to handle parallelism across several devices.

On Llama 3 70B, the workload consists of batches of 448 tokens with a batch size of 2. Using low-rank adapters, the 70,570,090,496 parameters of the original model are frozen, and we instead train an additional subset of 16,384,000 parameters; a minimal sketch of this setup follows below.
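As an illustration, here is what such a LoRA setup looks like with PEFT; the adapter rank and target modules are assumptions for illustration, not the exact configuration behind these numbers.

```python
# Minimal sketch: wrapping a causal LM with LoRA adapters via PEFT.
# Rank and target modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B")

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumption)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumption)
    task_type="CAUSAL_LM",
)

# Freezes the base model's parameters and adds small trainable adapters.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameters
```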

These comparisons on Llama 3 70B show training roughly twice as fast on the Azure VM with MI300X as on an HPC server using the previous generation AMD Instinct MI250.

PEFT fine-tuning with MI300 vs. MI250

Additionally, the MI300X benefits from its 192 GB of HBM3 memory (compared to 128 GB on the MI250), which allows it to load and fine-tune Meta Llama 3 70B entirely on a single device, whereas the MI250 cannot fit the ~140 GB model on a single device. Because it is always important to be able to replicate and challenge benchmarks, we are releasing a companion GitHub repository containing all the artifacts and source code used to collect the performance numbers featured in this post.
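The ~140 GB figure follows directly from the parameter count; here is a quick back-of-the-envelope check, assuming 2 bytes per parameter (float16/bfloat16) and counting weights only, excluding activations and optimizer state.

```python
# Back-of-the-envelope memory check for Meta Llama 3 70B weights.
# Assumes 2 bytes per parameter (float16/bfloat16), weights only.
params = 70_570_090_496
bytes_per_param = 2

weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~141 GB

# Fits in the MI300X's 192 GB of HBM3, but not in the MI250's 128 GB.
print(weights_gb < 192, weights_gb < 128)  # True False
```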

What’s next?

We have many exciting features in the pipeline for these new AMD Instinct MI300 GPUs. One of the key areas where we will invest a lot of effort in the coming weeks is minifloats (i.e., float8 and below). These data layouts have the inherent advantage of compressing information in a non-uniform way, which mitigates some of the problems faced by integer formats.

In scenarios such as LLM inference, they would halve the size of the key-value cache usually used when serving LLMs. Storing the key-value cache in float8, combined with float8/float8 matrix multiplications, would then bring performance benefits on top of the reduced memory footprint, as the small calculation below illustrates.
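To see where the factor of two comes from, here is a small sketch of the key-value cache arithmetic; all dimensions below are illustrative assumptions, not measured values from this post.

```python
# Rough KV-cache size estimate; all dimensions are illustrative assumptions.
layers = 80        # transformer layers
kv_heads = 8       # grouped-query attention KV heads
head_dim = 128
seq_len = 8192
batch = 2

# Two tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim].
elements = 2 * layers * batch * kv_heads * seq_len * head_dim

for name, bytes_per_el in [("float16", 2), ("float8", 1)]:
    print(f"{name}: {elements * bytes_per_el / 1e9:.1f} GB")
# float8 halves the cache size compared to float16.
```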

Conclusion

As you can see, the AMD Instinct MI300 brings a significant performance boost for AI use cases, covering end-to-end workloads from training to inference. At Hugging Face, we are extremely excited to see what the community and enterprises will be able to achieve with this new hardware and these integrations, and we would love to hear from you and help with your use cases.

Stop by the optimum-amd and text-generation-inference GitHub repositories to get the latest performance optimizations for your AMD GPU!
