Intel and Hugging Face have teamed up to demonstrate the real value of upgrading to Google’s latest C4 virtual machines (VMs) running on Intel® Xeon® 6 processors, codenamed Granite Rapids (GNR). We specifically wanted to benchmark the text generation performance improvements of the OpenAI GPT OSS Large Language Model (LLM).
The results are impressive: C4 delivers a 1.7x improvement in total cost of ownership (TCO) over the previous-generation Google C3 VM instances. Compared with C3, Google Cloud C4 VM instances provide:
- 1.4x to 1.7x TPOT throughput/vCPU/$
- Lower hourly price than C3 VMs
Introduction
GPT OSS is the common name for the open source Mixture of Experts (MoE) model released by OpenAI. An MoE model is a deep neural network architecture that uses specialized "expert" subnetworks and a "gating network" that decides which experts to use for a given input. MoE models let you scale model capacity efficiently without scaling compute cost linearly. They also enable specialization: different experts learn different skills, which lets the model adapt to diverse data distributions.
Although the total parameter count is very large, only a small subset of experts is activated per token, which makes CPU inference viable.
Intel and Hugging Face collaborated to merge an expert execution optimization (PR #40304) into transformers. It removes the redundant computation in which every expert processed all tokens. With this optimization, each expert runs only on the tokens routed to it, which eliminates wasted FLOPs and improves utilization.
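To make the routing idea concrete, here is a minimal toy sketch in plain Python. It is not the code from the PR: tokens are single floats, experts are simple scaling functions, and the gate weights are renormalized over the chosen experts, all of which are simplifying assumptions made for illustration. The names `top_k_experts` and `moe_routed` are hypothetical.

```python
from collections import defaultdict


def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def moe_routed(tokens, gate_scores, experts, k=2):
    """Run each expert only on the tokens routed to it (toy version)."""
    # Group (token index, gate weight) pairs by the expert they are routed to.
    routed = defaultdict(list)
    for t, scores in enumerate(gate_scores):
        chosen = top_k_experts(scores, k)
        norm = sum(scores[e] for e in chosen)
        for e in chosen:
            routed[e].append((t, scores[e] / norm))

    outputs = [0.0] * len(tokens)
    expert_calls = 0  # number of (expert, token) evaluations actually performed
    for e, assigned in routed.items():
        for t, weight in assigned:
            outputs[t] += weight * experts[e](tokens[t])
            expert_calls += 1
    return outputs, expert_calls


# Four toy experts that scale their input by 1, 2, 3 and 4 (illustrative only).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [1.0, 2.0, 3.0]
gate_scores = [
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.3, 0.5],
    [0.25, 0.25, 0.25, 0.25],
]

outputs, expert_calls = moe_routed(tokens, gate_scores, experts, k=2)
dense_calls = len(tokens) * len(experts)  # cost if every expert saw every token
print(f"routed evaluations: {expert_calls}, dense evaluations: {dense_calls}")
```

In this toy setup, routing with k=2 performs half as many expert evaluations as the dense variant, and the saving grows with the number of experts.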
Benchmark scope and hardware
We benchmarked GPT OSS under a controlled, repeatable generation workload to isolate the effects of architectural differences (GCP C4 VMs on Intel Xeon 6 processors (GNR) vs. GCP C3 VMs on 4th Gen Intel Xeon processors (SPR)) and of MoE execution efficiency. The focus is on steady-state decoding (per-token latency) and end-to-end throughput, normalized per vCPU, as the batch size increases while the sequence length stays fixed. All runs use a static KV cache and the SDPA attention backend for determinism.
Configuration overview
- Model: unsloth/gpt-oss-120b-BF16
- Precision: bfloat16
- Task: Text generation
- Input length: 1024 tokens (left-padded)
- Output length: 1024 tokens
- Batch sizes: 1, 2, 4, 8, 16, 32, 64
- Enabled features:
  - Static KV cache
  - SDPA attention backend
- Reported metric: Throughput (total generated tokens per second, aggregated over the batch)
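As an illustration of what a fixed-length, left-padded input batch looks like, here is a small sketch in plain Python. It only mimics the shape of the tokenizer output: the pad id of 0 and the helper name `build_left_padded_batch` are assumptions for this example, and a real tokenizer supplies its own pad token.

```python
def build_left_padded_batch(sequences, length, pad_id=0):
    """Truncate each sequence to `length` tokens and pad it on the left."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        kept = seq[:length]              # keep the first `length` tokens
        n_pad = length - len(kept)       # padding goes in front of the tokens
        input_ids.append([pad_id] * n_pad + kept)
        attention_mask.append([0] * n_pad + [1] * len(kept))
    return input_ids, attention_mask


# Tiny example with length 4 instead of 1024.
ids, mask = build_left_padded_batch([[5, 6, 7], [1, 2, 3, 4, 5, 6]], length=4)
print(ids)
print(mask)
```

Left padding keeps the last position of every row a real token, which is what a decoder-only model needs when it starts generating from the end of the prompt.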
Hardware under test
| Instance | Architecture | vCPUs |
| C3 | 4th Gen Intel Xeon processor (SPR) | 176 |
| C4 | Intel Xeon 6 processor (GNR) | 144 |
Creating an instance
C3
Go to Google Cloud Console and click (Create VM) under your project. Follow these steps to create a 176 vCPU instance.
Select C3 in the machine configuration and specify the machine type as c3-standard-176. You also need to configure your CPU platform and turn on All-Core Turbo for more stable performance.

Set the OS and storage tabs as below.

Leave the other settings at their defaults and click the Create button.
C4
Go to Google Cloud Console and click (Create VM) under your project. Follow these steps to create a 144 vCPU instance.
Select C4 on the (Machine Configuration) tab and specify the machine type as c4-standard-144. You can also configure your CPU platform to turn on All-Core Turbo for more consistent performance.

Configure the OS and storage tabs the same way as for C3. Leave the other settings at their defaults and click the Create button.
Set up the environment
Log in to your instance over SSH and install Docker. You can then set up the environment with the steps below. For better reproducibility, the commands pin the versions and the commit we used.
$ git clone https://github.com/huggingface/transformers.git
$ cd transformers/
$ git checkout 26b65fb5168f324277b85c558ef8209bfceae1fe
$ cd docker/transformers-intel-cpu/
$ sudo docker build . -t YOUR_IMAGE_TAG
$ sudo docker run -it --rm --privileged -v /home/:/workspace YOUR_IMAGE_TAG /bin/bash
Now that you’re in a container, do the following steps:
$ pip install git+https://github.com/huggingface/transformers.git@26b65fb5168f324277b85c558ef8209bfceae1fe
$ pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Benchmark procedure
For each batch size, we:
1. Build a left-padded batch with a fixed length of 1024 tokens.
2. Run one warm-up round.
3. Measure the total latency with max_new_tokens=1024 and compute $\text{throughput} = (\text{OUTPUT\_TOKENS} \times \text{batch\_size}) / \text{total\_latency}$.
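The arithmetic behind the reported metrics can be sketched in a few lines. The helper name `generation_metrics` and the timings below are hypothetical and chosen only to make the numbers easy to follow; they are not measured results.

```python
def generation_metrics(prefill_latency_s, total_latency_s, batch_size, output_tokens=1024):
    """Derive per-token decoding latency and aggregate throughput from two timings."""
    # The first token comes from the prefill pass; the remaining tokens are decode steps.
    decoding_latency_s = (total_latency_s - prefill_latency_s) / (output_tokens - 1)
    # Throughput counts every generated token across the whole batch.
    throughput = output_tokens * batch_size / total_latency_s
    return decoding_latency_s, throughput


# Hypothetical timings in seconds (not benchmark results).
decode_s, tput = generation_metrics(prefill_latency_s=2.0, total_latency_s=104.3, batch_size=8)
print(f"decoding latency = {decode_s:.3f} s/token, throughput = {tput:.1f} tokens/s")
```

With these assumed timings the decode phase takes about 0.1 s per token, since 102.3 s of decoding is spread over 1023 steps.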
Save the following code as benchmark.py and run it with numactl -l python benchmark.py.
import os
import time
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

INPUT_TOKENS = 1024
OUTPUT_TOKENS = 1024

def get_inputs(tokenizer, batch_size):
    dataset = load_dataset("ola13/small-the_pile", split="train")
    tokenizer.padding_side = "left"
    selected_texts = []
    for sample in dataset:
        input_ids = tokenizer(sample["text"], return_tensors="pt").input_ids
        # The first sample must fill the whole input length; later samples are padded or truncated.
        if len(selected_texts) == 0 and input_ids.shape[-1] >= INPUT_TOKENS:
            selected_texts.append(sample["text"])
        elif len(selected_texts) > 0:
            selected_texts.append(sample["text"])
        if len(selected_texts) == batch_size:
            break

    return tokenizer(selected_texts, max_length=INPUT_TOKENS, padding="max_length", truncation=True, return_tensors="pt")

def run_generate(model, inputs, generation_config):
    inputs["generation_config"] = generation_config
    model.generate(**inputs)  # warm-up run, not timed
    pre = time.time()
    model.generate(**inputs)
    latency = time.time() - pre
    return latency

def benchmark(model, tokenizer, batch_size, generation_config):
    inputs = get_inputs(tokenizer, batch_size)
    # Generating a single token measures the prefill latency.
    generation_config.max_new_tokens = 1
    generation_config.min_new_tokens = 1
    prefill_latency = run_generate(model, inputs, generation_config)
    # Generating the full output measures the total latency.
    generation_config.max_new_tokens = OUTPUT_TOKENS
    generation_config.min_new_tokens = OUTPUT_TOKENS
    total_latency = run_generate(model, inputs, generation_config)
    decoding_latency = (total_latency - prefill_latency) / (OUTPUT_TOKENS - 1)
    throughput = OUTPUT_TOKENS * batch_size / total_latency

    return prefill_latency, decoding_latency, throughput

if __name__ == "__main__":
    model_id = "unsloth/gpt-oss-120b-BF16"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model_kwargs = {"dtype": torch.bfloat16}
    model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
    model.config._attn_implementation = "sdpa"
    generation_config = model.generation_config
    generation_config.do_sample = False
    generation_config.cache_implementation = "static"

    for batch_size in [1, 2, 4, 8, 16, 32, 64]:
        print(f"---------- Run generation with batch size = {batch_size} ----------", flush=True)
        prefill_latency, decoding_latency, throughput = benchmark(model, tokenizer, batch_size, generation_config)
        print(f"throughput = {throughput}", flush=True)
Results
Normalized throughput per vCPU
Across batch sizes from 1 to 64, C4 with Intel Xeon 6 processors consistently delivers 1.4x to 1.7x the throughput per vCPU of C3. The formula is:
$$\text{normalized\_throughput\_per\_vCPU} = \frac{\text{throughput\_C4} / \text{vCPUs\_C4}}{\text{throughput\_C3} / \text{vCPUs\_C3}}$$
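The normalization can be written as a small helper. The vCPU counts below come from the instance types used in this post (c3-standard-176 and c4-standard-144); the throughput values are hypothetical placeholders picked to give a round ratio, not the measured numbers.

```python
def normalized_throughput_per_vcpu(throughput_c4, vcpus_c4, throughput_c3, vcpus_c3):
    """Ratio of C4 to C3 throughput after dividing each by its vCPU count."""
    return (throughput_c4 / vcpus_c4) / (throughput_c3 / vcpus_c3)


# Hypothetical throughputs in tokens per second (illustrative only).
ratio = normalized_throughput_per_vcpu(
    throughput_c4=153.0, vcpus_c4=144,
    throughput_c3=110.0, vcpus_c3=176,
)
print(f"normalized throughput per vCPU = {ratio:.2f}x")
```

Normalizing per vCPU matters here because the two instances differ in size: C4 has fewer vCPUs, so a raw throughput comparison would understate its per-core advantage.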
Cost and TCO
At batch size 64, C4 provides 1.7x the throughput per vCPU of C3. Because the price per vCPU is roughly the same on both instance types (hourly cost scales linearly with the number of vCPUs), TCO is about 1.7x better on C4: C3 costs about 1.7x more for the same number of generated tokens.
Throughput ratio per vCPU:
$$\frac{\text{throughput\_C4} / \text{vCPUs\_C4}}{\text{throughput\_C3} / \text{vCPUs\_C3}} = 1.7 \Rightarrow \frac{\text{TCO\_C3}}{\text{TCO\_C4}} \approx 1.7$$
Conclusion
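The step from throughput per vCPU to TCO can be checked numerically. In the sketch below, the price per vCPU-hour and the throughputs are hypothetical; the only assumption taken from the text is that both instance types cost the same per vCPU-hour, in which case the price cancels out of the ratio.

```python
def cost_per_million_tokens(throughput_tokens_per_s, vcpus, price_per_vcpu_hour):
    """Cost of generating one million tokens on an instance billed per vCPU-hour."""
    hourly_cost = vcpus * price_per_vcpu_hour
    tokens_per_hour = throughput_tokens_per_s * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


PRICE = 0.05  # hypothetical price per vCPU-hour, assumed equal for C3 and C4

# Hypothetical throughputs in tokens per second (illustrative only).
cost_c3 = cost_per_million_tokens(110.0, vcpus=176, price_per_vcpu_hour=PRICE)
cost_c4 = cost_per_million_tokens(153.0, vcpus=144, price_per_vcpu_hour=PRICE)
tco_ratio = cost_c3 / cost_c4
print(f"TCO_C3 / TCO_C4 = {tco_ratio:.2f}")
```

Changing `PRICE` leaves `tco_ratio` unchanged, which is why the throughput-per-vCPU ratio can stand in for the TCO ratio when per-vCPU pricing is equal.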
Google Cloud C4 VMs, powered by Intel Xeon 6 processors (GNR), deliver both significant performance improvements and greater cost efficiency for large-scale MoE inference compared to the previous generation Google Cloud C3 VMs (powered by 4th generation Intel Xeon processors). We observed a combination of increased throughput, lower latency, and lower cost for GPT OSS MoE inference. These results highlight that large-scale MoE models can be efficiently delivered on next-generation general-purpose CPUs thanks to targeted framework optimizations by Intel and Hugging Face.

