Out-of-the-Box Acceleration of Large Language Models on AMD GPUs

By versatileai · August 25, 2025 · 7 Mins Read

Earlier this year, AMD and Hugging Face announced a partnership to accelerate AI models during AMD's AI Day event. We have been hard at work to bring this vision to reality, making it easy for the Hugging Face community to run the latest AI models on AMD hardware with the best possible performance.

AMD powers some of the world's most powerful supercomputers, including LUMI, the fastest in Europe, which operates over 10,000 of AMD's MI250X GPUs. At this event, AMD also revealed its latest generation of server GPUs, the AMD Instinct™ MI300 series accelerators.

This blog post provides an update on our progress towards providing great out-of-the-box support for AMD GPUs and improving interoperability with the latest server-grade AMD Instinct GPUs.

Out-of-the-box acceleration

Can you spot the AMD-specific code changes below? Don't strain your eyes: there are none compared to running on NVIDIA GPUs.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "01-ai/Yi-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inp = tokenizer(["I'm in Paris today."], padding=True, return_tensors="pt").to("cuda")
res = model.generate(**inp, max_new_tokens=30)
print(tokenizer.batch_decode(res))

One of the main aspects we have been working on is the ability to run Hugging Face transformers models without any code change. All transformers models and tasks are now supported on AMD Instinct GPUs. And the collaboration does not stop here: we are investigating out-of-the-box support for diffusers models, other libraries, and other AMD GPUs.

Achieving this milestone has been a significant effort and collaboration between our teams and companies. To maintain support and performance for the Hugging Face community, we have built integration tests of Hugging Face open-source libraries on AMD Instinct GPUs in our data center, and we were able to minimize the carbon impact of these new workloads by working with Verne Global to deploy the AMD Instinct servers in Iceland.

On top of native support, another major aspect of our collaboration is providing integration for the latest innovations and features available on AMD GPUs. Through the collaboration of the Hugging Face team, AMD engineers, and open-source community members, we are happy to announce support for several of these acceleration features.

We are very excited to make these state-of-the-art acceleration tools available and easy to use for Hugging Face users, with support and performance directly integrated and tested in our new continuous integration and development pipeline for AMD Instinct GPUs.

One AMD Instinct MI250 GPU with 128 GB of high-bandwidth memory actually appears as two distinct ROCm devices (GPU 0 and 1), each with 64 GB of high-bandwidth memory.

Figure: the two devices of a single MI250 as displayed by rocm-smi.

This means that with just one MI250 GPU card, you effectively have two PyTorch devices that can easily be used with tensor parallelism and data parallelism for higher throughput and lower latency.
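As a quick sanity check, both devices show up through PyTorch's standard device APIs, since ROCm builds of PyTorch expose AMD GPUs under the familiar torch.cuda namespace. A minimal sketch, assuming a machine with a single MI250 card:

import torch

# On ROCm builds of PyTorch, AMD GPUs are exposed through torch.cuda,
# so no AMD-specific API is needed.
print(torch.cuda.device_count())  # expect 2 for a single MI250 card

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # Each ROCm device of an MI250 has 64 GB of high-bandwidth memory.
    print(i, props.name, round(props.total_memory / 1024**3), "GiB")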

The rest of this blog post reports performance results for the two relevant steps of text generation with a large language model:

  • Prefill latency: the time it takes for the model to compute the representation of the user's provided input or prompt (also known as "time to first token").
  • Per-token decode latency: the time it takes to generate each new token, autoregressively, after the prefill step.
  • Decode throughput: the number of tokens generated per second during the decoding phase.
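To make these definitions concrete, here is a rough, hypothetical way to estimate the three metrics with transformers. The numbers reported below were produced with a proper benchmarking harness; this sketch only illustrates what each metric measures:

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "01-ai/Yi-6B"  # same model as above, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inp = tokenizer(["I'm in Paris today."], return_tensors="pt").to("cuda")

def timed_generate(n_tokens):
    torch.cuda.synchronize()
    start = time.perf_counter()
    # min_new_tokens prevents an early stop at an end-of-sequence token.
    model.generate(**inp, min_new_tokens=n_tokens, max_new_tokens=n_tokens)
    torch.cuda.synchronize()
    return time.perf_counter() - start

prefill_latency = timed_generate(1)           # time to first token
n = 64
per_token_latency = (timed_generate(n) - prefill_latency) / (n - 1)
decode_throughput = 1.0 / per_token_latency   # tokens per second

print(f"prefill: {prefill_latency * 1e3:.0f} ms, "
      f"decode: {per_token_latency * 1e3:.1f} ms/token, "
      f"throughput: {decode_throughput:.1f} tokens/s")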

Using optimum-benchmark to run inference benchmarks on an MI250 and an A100 GPU, with and without optimizations, we get the following results:

Figure: inference benchmarks using the transformers and PEFT libraries. FA2 stands for "Flash Attention 2", TP for "Tensor Parallelism", and DDP for "Distributed Data Parallel".

In the plots above, you can see that the MI250 performs especially well in a production setting where requests are processed in large batches, delivering over 2.33x more tokens per second (decode throughput) and halving the time to first token (prefill latency) compared to an A100 card.

Running training benchmarks, as shown below, one MI250 card fits larger batches of training samples and reaches higher training throughput.

Figure: training benchmark using the transformers library, at the maximum batch size (power of two) that fits on a given card.

Production Solutions

Another important focus of our collaboration is building support for Hugging Face production solutions, starting with Text Generation Inference (TGI). TGI provides an end-to-end solution for deploying large language models for inference at scale.

Initially, TGI was mostly developed for NVIDIA GPUs, leveraging most of the recent optimizations made for post-Ampere architectures, such as Flash Attention v1 and v2, GPTQ weight quantization, and Paged Attention.

Today we are announcing initial support for AMD Instinct MI210 and MI250 GPUs in TGI, leveraging all the great open-source work mentioned above and integrating it into a complete end-to-end solution, ready to be deployed.

On the performance side, we spent significant time benchmarking Text Generation Inference on AMD Instinct GPUs to validate the results and discover where we should focus our optimizations. With the support of AMD engineers, we were able to achieve performance on par with what TGI already offered.

In this context, and given the long-term relationship being built between AMD and Hugging Face, we integrated and tested AMD's GEMM tuner tool, which allows us to tune the GEMM (matrix multiplication) kernels used in TGI and find the best setup for improved performance. The GEMM tuner tool is expected to be released as part of PyTorch in a coming release.
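The intuition behind GEMM tuning is that the fastest kernel configuration depends heavily on the exact matrix shapes an inference server sees, which vary with batch size and sequence length. The following hypothetical sketch in plain PyTorch is not the AMD tool itself; it only illustrates the kind of per-shape timing such a tuner automates:

import time
import torch

def time_gemm(m, k, n, iters=50):
    # Time an (m x k) @ (k x n) half-precision matrix multiplication.
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Shapes like these appear in an LLM's attention and MLP projections;
# a GEMM tuner searches for the fastest kernel setup for each of them.
for m in (1, 8, 64):  # effective batch dimension during decoding
    print(f"m={m}: {time_gemm(m, 4096, 4096) * 1e3:.3f} ms")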

With all of the above, we are thrilled to show the first performance numbers demonstrating that the latest AMD technologies put Text Generation Inference on AMD GPUs at the forefront of efficient inferencing solutions for the Llama model family.

Figure: Llama 34B TGI latency results, comparing one A100-SXM4-80GB with one AMD Instinct MI250. As explained above, a single MI250 corresponds to two PyTorch devices.

Figure: Llama 70B TGI latency results, comparing two A100-SXM4-80GB with two AMD Instinct MI250 (using tensor parallelism).

The missing bars for the A100 correspond to out-of-memory errors: Llama 70B weighs 138 GB in float16, and enough free memory is needed for intermediate activations, KV cache buffers (>5 GB at 2048 sequence length and batch size 8), the CUDA context, and so on. The MI250, with its larger memory, fits longer sequences and bigger batches.

Text Generation Inference is ready to be deployed in production on AMD Instinct GPUs through the Docker image ghcr.io/huggingface/text-generation-inference:1.2-rocm. Make sure to refer to the documentation concerning support and its limitations.
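Once a container from this image is serving a model (on port 8080 of localhost, say; both are assumptions for illustration), it can be queried like any other TGI deployment, for example with the huggingface_hub client:

from huggingface_hub import InferenceClient

# Point the client at the locally running TGI container.
client = InferenceClient("http://localhost:8080")

out = client.text_generation("I'm in Paris today.", max_new_tokens=30)
print(out)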

What’s next?

We hope this blog post got you as excited as we are at Hugging Face about this partnership with AMD. Of course, this is just the beginning of our journey, and we look forward to enabling more use cases on more AMD hardware.

In the coming months, we will be working on bringing support and validation for AMD Radeon GPUs, the same GPUs you can put in your own desktop for local usage, lowering the accessibility barrier and paving the way for even more versatility for our users.

We will, of course, work on performance optimization for the MI300 lineup right away, ensuring that both our open-source libraries and our solutions deliver the latest innovations with the level of stability and reliability our users expect.

Another area of focus for us is AMD Ryzen AI technology, which powers the latest generation of AMD laptop CPUs and allows AI to run at the edge, on-device. As coding assistants, image generation tools, and personal assistants become more and more widely available, it is important to offer solutions that can meet the privacy needs of users of these powerful tools. In this context, Ryzen AI-compatible models are already available on the Hugging Face Hub, and we are working closely with AMD to bring more of them in the coming months.
