Updates from December 24, 2024:
SemiAnalysis founder Dylan Patel spoke with AMD CEO Lisa Su for 90 minutes to discuss the software issues detailed in this report. Patel said Su recognized the gaps in AMD’s software stack and took the team’s recommendations seriously. Patel said a number of changes are already in development, but did not provide details.
Original article published on December 23, 2024:
A five-month investigation by SemiAnalysis reveals that AMD’s new MI300X AI chip is falling short of its potential due to significant software issues, and Nvidia’s market dominance remains unchallenged.
advertisement
The study found that AMD’s software is riddled with bugs, making it nearly impossible to train AI models without extensive debugging. While AMD struggles with quality assurance and ease of use, Nvidia continues to widen its lead by rolling out new features, libraries, and performance updates.
Analysts ran extensive tests, including GEMM benchmarks and single-node training, but found that AMD was unable to overcome what they called the “CUDA moat,” or Nvidia’s strong software advantage. .
On paper, the MI300X looks impressive, offering 1,307 TeraFLOPS in FP16 calculations and 192 GB of HBM3 memory. This compares to Nvidia’s H100 with 989 TeraFLOPS and 80 GB memory, while Nvidia’s new H200 closes this memory gap with a 141 GB configuration. AMD systems also offer lower total cost of ownership thanks to lower prices and more affordable Ethernet networks.
Hardware advantages overshadow software issues
However, these benefits mean little in practice. According to SemiAnalysis, comparing these specs is like “simply comparing cameras by looking at their megapixel counts,” which means AMD is playing numbers games without providing sufficient real-world performance. It suggests that.
Analysts had to work directly with AMD engineers to fix numerous bugs in order to obtain useful benchmark results. In contrast, Nvidia’s system ran smoothly right out of the box.
recommendation
“AMD’s Out of the Box Experience is very tricky and can require considerable patience and effort to get to a usable state,” they wrote.
A particularly important detail is that Tensorwave, AMD’s largest GPU cloud provider, is offering AMD’s own team free access to GPUs (the same hardware that Tensorwave purchased from AMD) just to resolve software issues. It became clear that we had to provide.
SemiAnalysis recommends that AMD CEO Lisa Su invest heavily in software development and testing. Specifically, we are proposing to follow Nvidia’s approach and allocate thousands of MI300X chips to automated testing, simplifying complex environment variables while implementing better default settings. “Enable an out-of-the-box experience!” they write.
SemiAnarise hopes to see AMD succeed as a competitor to Nvidia, but says, “Unfortunately, there is still a lot of work to do.” As Nvidia prepares to launch its next-generation Blackwell chips, AMD risks falling further behind without significant software improvements, but reports indicate that Nvidia’s next-generation rollout isn’t entirely smooth sailing either. Suggested.