Google Cloud C4 improves TCO of GPT OSS by 70% using Intel and Hugging Face

Intel and Hugging Face have teamed up to demonstrate the real value of upgrading to Google’s latest C4 virtual machines (VMs) running on Intel® Xeon® 6 processors, codenamed Granite Rapids (GNR). We specifically wanted to benchmark the text generation performance improvements of the OpenAI GPT OSS Large Language Model (LLM).

The results are impressive: a 1.7x improvement in total cost of ownership (TCO) compared to the previous-generation Google C3 VM instances. Specifically, Google Cloud C4 VM instances delivered:

- 1.4x to 1.7x TPOT throughput/vCPU/$
- Lower hourly price than C3 VMs

Introduction

GPT OSS is the common name for the open source Mixture of Experts (MoE) model released by OpenAI. An MoE model is a deep neural network architecture that uses specialized “expert” subnetworks and a “gating network” that decides which experts to use for a given input. MoE models let you scale model capacity efficiently without scaling compute cost linearly. They also enable specialization: different “experts” learn different skills, which lets the model adapt to diverse data distributions.

Although the total parameter count is very large, only a small subset of experts is activated per token, which makes CPU inference viable.

Intel and Hugging Face worked together to integrate an expert execution optimization (PR #40304) into transformers. It eliminates the redundant computation in which every expert processed all tokens. With this optimization, each expert runs only on the tokens routed to it, which removes wasted FLOPs and increases utilization.
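To make the idea concrete, here is a minimal toy sketch of dense versus routed expert execution. It is not the code from PR #40304: tokens are plain floats, experts are simple scaling functions, and the names top_k_route, moe_dense, and moe_routed are hypothetical. Normalizing the gate weights with a softmax over the selected experts is also an assumption of this sketch.

```python
import math
from collections import defaultdict


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and return [(expert_index, weight), ...]."""
    top = sorted(range(len(gate_logits)), key=lambda e: gate_logits[e], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1 (sketch assumption).
    peak = max(gate_logits[e] for e in top)
    exps = [math.exp(gate_logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top, exps)]


def moe_dense(tokens, experts, routes):
    """Naive path: every expert processes every token; unrouted results get weight 0."""
    calls = 0
    outputs = [0.0] * len(tokens)
    for e, expert in enumerate(experts):
        for t, x in enumerate(tokens):
            y = expert(x)  # computed even if token t is not routed to expert e
            calls += 1
            outputs[t] += dict(routes[t]).get(e, 0.0) * y
    return outputs, calls


def moe_routed(tokens, experts, routes):
    """Routed path: each expert only runs on the tokens assigned to it."""
    assigned = defaultdict(list)  # expert index -> [(token index, gate weight), ...]
    for t, route in enumerate(routes):
        for e, w in route:
            assigned[e].append((t, w))
    calls = 0
    outputs = [0.0] * len(tokens)
    for e in sorted(assigned):
        for t, w in assigned[e]:
            outputs[t] += w * experts[e](tokens[t])
            calls += 1
    return outputs, calls


# Toy data: 3 tokens, 4 experts, 2 active experts per token (all values hypothetical).
tokens = [0.5, -1.0, 2.0]
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [[2.0, 0.1, 1.0, -1.0], [0.0, 3.0, 0.5, 2.5], [-1.0, 0.2, 0.3, 4.0]]
routes = [top_k_route(g, k=2) for g in gate_logits]

dense_out, dense_calls = moe_dense(tokens, experts, routes)
routed_out, routed_calls = moe_routed(tokens, experts, routes)
print(f"dense expert calls: {dense_calls}, routed expert calls: {routed_calls}")
```

Both paths produce the same outputs, but the routed path performs tokens × k expert evaluations instead of tokens × number of experts. The gap widens as the number of experts grows relative to k.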

Benchmark scope and hardware

We benchmarked GPT OSS under a controlled, repeatable generation workload to isolate architectural differences (GCP C4 VM on Intel Xeon 6 Processor (GNR) vs. GCP C3 VM on 4th Generation Intel Xeon Processor (SPR)) and MoE execution efficiency. The focus is on steady-state decoding (per-token latency) and end-to-end normalized throughput as the batch size increases while the sequence length stays fixed. All runs use a static KV cache and SDPA attention for determinism.

Configuration overview

- Model: unsloth/gpt-oss-120b-BF16
- Precision: bfloat16
- Task: Text generation
- Input length: 1024 tokens (left-padded)
- Output length: 1024 tokens
- Batch sizes: 1, 2, 4, 8, 16, 32, 64
- Enabled features: static KV cache, SDPA attention backend
- Reported metric: throughput (total generated tokens per second, aggregated over the batch)

Hardware under test

Instance    Architecture                                  vCPUs
C3          4th Generation Intel Xeon Processor (SPR)     176
C4          Intel Xeon 6 Processor (GNR)                  144

Creating an instance

C3

Go to Google Cloud Console and click Create VM under your project. Follow these steps to create a 176 vCPU instance.

Select C3 in the Machine Configuration tab and specify the machine type as c3-standard-176. You also need to set the CPU platform and turn on All-Core Turbo for more stable performance.

Set the OS and storage tabs as below.

Leave the other settings as default and click the Create button.

C4

Go to Google Cloud Console and click Create VM under your project. Follow these steps to create a 144 vCPU instance.

Select C4 in the Machine Configuration tab and specify the machine type as c4-standard-144. You can also set the CPU platform and turn on All-Core Turbo for more consistent performance.

Configure the OS and storage tabs the same way as for C3. Leave the other settings as default and click the Create button.

Set up the environment

Log in to your instance with SSH and install Docker. You can then create the environment by following the steps below. For better reproducibility, the commands pin the versions and commits we used.

$ git clone https://github.com/huggingface/transformers.git
$ cd transformers/
$ git checkout 26b65fb5168f324277b85c558ef8209bfceae1fe
$ cd docker/transformers-intel-cpu/
$ sudo docker build . -t
$ sudo docker run -it --rm --privileged -v /home/:/workspace /bin/bash

Now that you’re in a container, do the following steps:

$ pip install git+https://github.com/huggingface/transformers.git@26b65fb5168f324277b85c558ef8209bfceae1fe
$ pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Benchmark procedure

For each batch size:

1. Construct a left-padded batch of fixed length 1024 tokens.
2. Perform one warm-up round.
3. Set max_new_tokens=1024, measure the total latency, and compute $\text{throughput} = (\text{OUTPUT\_TOKENS} \times \text{batch\_size}) / \text{total\_latency}$.

Save the following code as benchmark.py and run it with numactl -l python benchmark.py.

import os
import time
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

INPUT_TOKENS = 1024
OUTPUT_TOKENS = 1024

def get_inputs(tokenizer, batch_size):
    dataset = load_dataset("ola13/small-the_pile", split="train")
    tokenizer.padding_side = "left"
    selected_texts = []
    for sample in dataset:
        input_ids = tokenizer(sample["text"], return_tensors="pt").input_ids
        # The first sample must fill the whole input window; later ones are padded or truncated.
        if len(selected_texts) == 0 and input_ids.shape[-1] >= INPUT_TOKENS:
            selected_texts.append(sample["text"])
        elif len(selected_texts) > 0:
            selected_texts.append(sample["text"])
        if len(selected_texts) == batch_size:
            break

    return tokenizer(selected_texts, max_length=INPUT_TOKENS, padding="max_length", truncation=True, return_tensors="pt")

def run_generate(model, inputs, generation_config):
    inputs["generation_config"] = generation_config
    model.generate(**inputs)  # warm-up round, not timed
    pre = time.time()
    model.generate(**inputs)
    latency = time.time() - pre
    return latency

def benchmark(model, tokenizer, batch_size, generation_config):
    inputs = get_inputs(tokenizer, batch_size)
    # Generating a single token approximates the prefill latency.
    generation_config.max_new_tokens = 1
    generation_config.min_new_tokens = 1
    prefill_latency = run_generate(model, inputs, generation_config)
    # Force exactly OUTPUT_TOKENS new tokens for the full run.
    generation_config.max_new_tokens = OUTPUT_TOKENS
    generation_config.min_new_tokens = OUTPUT_TOKENS
    total_latency = run_generate(model, inputs, generation_config)
    decoding_latency = (total_latency - prefill_latency) / (OUTPUT_TOKENS - 1)
    throughput = OUTPUT_TOKENS * batch_size / total_latency

    return prefill_latency, decoding_latency, throughput

if __name__ == "__main__":
    model_id = "unsloth/gpt-oss-120b-BF16"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model_kwargs = {"dtype": torch.bfloat16}
    model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
    model.config._attn_implementation = "sdpa"
    generation_config = model.generation_config
    generation_config.do_sample = False
    generation_config.cache_implementation = "static"

    for batch_size in [1, 2, 4, 8, 16, 32, 64]:
        print(f"---------- Run generation with batch size = {batch_size} ----------", flush=True)
        prefill_latency, decoding_latency, throughput = benchmark(model, tokenizer, batch_size, generation_config)
        print(f"throughput = {throughput}", flush=True)
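As a worked example of the arithmetic in the benchmark, the sketch below derives the per-token decoding latency (TPOT) and the throughput from two timings. The helper name derive_metrics and the timing values are hypothetical and are not measured results; only the formulas come from the benchmark script.

```python
def derive_metrics(prefill_latency, total_latency, batch_size, output_tokens):
    """Turn a prefill-only timing and a full-generation timing into metrics."""
    if output_tokens < 2:
        raise ValueError("need at least 2 output tokens to estimate decoding latency")
    # The first token comes out of prefill; the remaining output_tokens - 1 are decode steps.
    decoding_latency = (total_latency - prefill_latency) / (output_tokens - 1)
    # Throughput counts every generated token across the whole batch.
    throughput = output_tokens * batch_size / total_latency
    return decoding_latency, throughput


# Hypothetical timings in seconds, chosen only to illustrate the formulas.
tpot, tput = derive_metrics(prefill_latency=2.0, total_latency=104.3, batch_size=8, output_tokens=1024)
print(f"TPOT = {tpot * 1000:.1f} ms, throughput = {tput:.1f} tokens/s")
```

Subtracting the prefill timing isolates the steady-state decoding cost, while the throughput figure deliberately includes prefill because it is an end-to-end metric.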

Results

Normalized throughput per vCPU

C4 with Intel Xeon 6 processors consistently outperforms C3, delivering 1.4x to 1.7x the throughput per vCPU across batch sizes up to 64. The normalization is:

$\text{normalized\_throughput\_per\_vCPU} = \frac{\text{throughput\_C4} / \text{vCPUs\_C4}}{\text{throughput\_C3} / \text{vCPUs\_C3}}$
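The normalization can be sketched in a few lines of Python. The function name is hypothetical, the vCPU counts follow the machine types c3-standard-176 and c4-standard-144, and the throughput values are made-up placeholders rather than measured results.

```python
def normalized_throughput_per_vcpu(throughput_c4, vcpus_c4, throughput_c3, vcpus_c3):
    """Ratio of C4 throughput per vCPU to C3 throughput per vCPU."""
    return (throughput_c4 / vcpus_c4) / (throughput_c3 / vcpus_c3)


# Placeholder throughputs in tokens/s, for illustration only.
ratio = normalized_throughput_per_vcpu(
    throughput_c4=216.0, vcpus_c4=144,
    throughput_c3=176.0, vcpus_c3=176,
)
print(f"normalized throughput per vCPU: {ratio:.2f}x")
```

Dividing by the vCPU count first matters because the two instances differ in size: C4 can come out ahead per vCPU even where its raw throughput is close to that of the larger C3 instance.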

Cost and TCO

At batch size 64, C4 delivers 1.7x the throughput per vCPU of C3. Because the price per vCPU is roughly the same on both (hourly cost scales linearly with the number of vCPUs), the TCO is 1.7x better: C3 requires 1.7x more spending to generate the same number of tokens.

Throughput per vCPU ratio:
$\frac{\text{throughput\_C4} / \text{vCPUs\_C4}}{\text{throughput\_C3} / \text{vCPUs\_C3}} = 1.7 \Rightarrow \frac{\text{TCO\_C3}}{\text{TCO\_C4}} \approx 1.7$
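The step from throughput per vCPU to TCO can be checked with a small cost-per-token calculation. This is a sketch under stated assumptions: the function name is hypothetical, the price per vCPU-hour is an invented placeholder that is identical for both instances, and the throughputs are chosen so that the per-vCPU ratio is 1.7.

```python
def cost_per_million_tokens(throughput_tokens_per_s, vcpus, price_per_vcpu_hour):
    """Cost of generating one million tokens on an instance billed per vCPU-hour."""
    hourly_cost = vcpus * price_per_vcpu_hour
    tokens_per_hour = throughput_tokens_per_s * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


PRICE = 0.05  # placeholder price per vCPU-hour, assumed equal for C3 and C4

# Placeholder throughputs: 1.0 token/s per vCPU on C3, 1.7 tokens/s per vCPU on C4.
cost_c3 = cost_per_million_tokens(176 * 1.0, vcpus=176, price_per_vcpu_hour=PRICE)
cost_c4 = cost_per_million_tokens(144 * 1.7, vcpus=144, price_per_vcpu_hour=PRICE)
print(f"TCO_C3 / TCO_C4 = {cost_c3 / cost_c4:.2f}")
```

With equal per-vCPU pricing the vCPU count cancels out, so the cost ratio equals the throughput per vCPU ratio. Any difference in per-vCPU price between the two instance families would scale the result proportionally.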

Conclusion

Google Cloud C4 VMs, powered by Intel Xeon 6 processors (GNR), deliver both significant performance improvements and greater cost efficiency for large-scale MoE inference compared to the previous generation Google Cloud C3 VMs (powered by 4th generation Intel Xeon processors). We observed a combination of increased throughput, lower latency, and lower cost for GPT OSS MoE inference. These results highlight that large-scale MoE models can be efficiently delivered on next-generation general-purpose CPUs thanks to targeted framework optimizations by Intel and Hugging Face.
