Versa AI hub

Google Cloud C4 improves TCO of GPT OSS by 70% using Intel and Hugging Face

By versatileai, October 16, 2025

Intel and Hugging Face have teamed up to demonstrate the real value of upgrading to Google's latest C4 virtual machines (VMs) running on Intel® Xeon® 6 processors, codenamed Granite Rapids (GNR). Specifically, we benchmarked the text generation performance of the OpenAI GPT OSS Large Language Model (LLM) on both VM generations.

The results are impressive, demonstrating a 1.7x improvement in total cost of ownership (TCO) compared to previous-generation Google C3 VM instances. Compared with C3, Google Cloud C4 VM instances provided:

  • 1.4x to 1.7x TPOT throughput/vCPU/$
  • Lower hourly price than C3 VMs

Introduction

GPT OSS is the common name for the open source Mixture of Experts (MoE) model released by OpenAI. An MoE model is a deep neural network architecture that uses specialized "expert" subnetworks and a "gating network" that decides which experts to use for a given input. MoE models let you scale model capacity efficiently without scaling compute cost linearly. They also enable specialization: different experts learn different skills, which helps the model adapt to diverse data distributions.

Even though the total parameter count is very large, only a small subset of experts is activated per token, which makes CPU inference viable.

Intel and Hugging Face collaborated to merge an expert execution optimization (PR #40304) into transformers. It eliminates the redundant computation in which every expert processed every token. With this optimization, each expert runs only on the tokens routed to it, eliminating wasted FLOPs and increasing utilization.

[Figure: gpt_oss_expert]
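To make the idea of routed expert execution concrete, here is a minimal pure-Python sketch. It is not the code from PR #40304: the function names, the router scores, and the top-k value are all illustrative assumptions. It only counts how many expert-token evaluations are needed when each expert sees every token versus only the tokens routed to it.

```python
from collections import defaultdict


def route_tokens(router_scores, top_k):
    """For each token, pick the indices of its top_k highest-scoring experts."""
    routes = []
    for scores in router_scores:
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        routes.append(ranked[:top_k])
    return routes


def group_tokens_by_expert(routes):
    """Invert the routing: map each expert to the token indices sent to it."""
    groups = defaultdict(list)
    for token_idx, experts in enumerate(routes):
        for expert_idx in experts:
            groups[expert_idx].append(token_idx)
    return dict(groups)


# Hypothetical router scores: 4 tokens x 4 experts (illustrative values only).
router_scores = [
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.10, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
    [0.30, 0.40, 0.20, 0.10],
]
TOP_K = 2  # assumed number of active experts per token

routes = route_tokens(router_scores, TOP_K)
groups = group_tokens_by_expert(routes)

# Naive execution: every expert processes every token.
dense_calls = len(router_scores) * len(router_scores[0])
# Routed execution: each expert processes only the tokens assigned to it.
routed_calls = sum(len(tokens) for tokens in groups.values())

print(groups)
print(f"dense expert-token evaluations:  {dense_calls}")   # 16
print(f"routed expert-token evaluations: {routed_calls}")  # 8
```

In this toy setup the routed path does half the expert work of the dense path. The saving grows with the ratio of total experts to active experts per token.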

Benchmark scope and hardware

We benchmarked GPT OSS under a controlled, repeatable generation workload to isolate architectural differences (GCP C4 VMs on Intel Xeon 6 processors (GNR) vs. GCP C3 VMs on 4th Generation Intel Xeon processors (SPR)) and MoE execution efficiency. The focus is on steady-state decoding (per-token latency) and end-to-end normalized throughput as the batch size increases while the sequence length stays fixed. All runs use a static KV cache and SDPA attention for determinism.

Configuration overview

  • Model: unsloth/gpt-oss-120b-BF16
  • Precision: bfloat16
  • Task: Text generation
  • Input length: 1024 tokens (left-padded)
  • Output length: 1024 tokens
  • Batch size: 1, 2, 4, 8, 16, 32, 64
  • Enabled features: static KV cache, SDPA attention backend
  • Reported metrics: Throughput (total generated tokens per second, aggregated over the batch)
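As an illustration of what a fixed-length, left-padded input batch looks like, the sketch below pads plain lists of token IDs. The helper name, the pad ID of 0, and the tiny target length are assumptions chosen for readability; in the benchmark itself the tokenizer does this work with a length of 1024.

```python
def left_pad_batch(sequences, target_len, pad_id):
    """Truncate or left-pad each token-ID list to exactly target_len.

    Returns (input_ids, attention_mask) as lists of lists; the mask is 0 on
    padding positions and 1 on real tokens.
    """
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target_len]  # keep at most the first target_len tokens
        n_pad = target_len - len(seq)
        input_ids.append([pad_id] * n_pad + seq)
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# Hypothetical token IDs; 0 is used as the pad ID purely for illustration.
batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27, 28]]
ids, mask = left_pad_batch(batch, target_len=6, pad_id=0)

for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Padding on the left keeps the real tokens adjacent to the position where generation continues, which is why decoder-only models are usually left-padded for batched generation.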

Hardware under test

Instance   Architecture                                 vCPU
C3         4th Generation Intel Xeon Processor (SPR)    172
C4         Intel Xeon 6 Processor (GNR)                 144

Creating an instance

C3

Go to the Google Cloud Console and click Create VM under your project. Follow these steps to create a 176 vCPU instance.

1. Select C3 on the Machine configuration tab and specify the machine type as c3-standard-176. You also need to set the CPU platform and turn on all-core turbo for more stable performance.
   [Screenshot: C3 machine configuration]
2. Configure the OS and storage tab as shown in the screenshot.
   [Screenshot: OS and storage settings]
3. Leave the other settings as default and click the Create button.

C4

Go to the Google Cloud Console and click Create VM under your project. Follow these steps to create a 144 vCPU instance.

1. Select C4 on the Machine configuration tab and specify the machine type as c4-standard-144. You can also set the CPU platform and turn on all-core turbo for more consistent performance.
   [Screenshot: C4 machine configuration]
2. Configure the OS and storage tab the same way as for C3.
3. Leave the other settings as default and click the Create button.

Set up the environment

Log in to your instance over SSH and install Docker. You can then create the environment by following the steps below. For better reproducibility, the commands pin the versions and commit we used (substitute your own image tag for <image_tag>).

$ git clone https://github.com/huggingface/transformers.git
$ cd transformers/
$ git checkout 26b65fb5168f324277b85c558ef8209bfceae1fe
$ cd docker/transformers-intel-cpu/
$ sudo docker build . -t <image_tag>
$ sudo docker run -it --rm --privileged -v /home/:/workspace <image_tag> /bin/bash

Once you are inside the container, run the following:

$ pip install git+https://github.com/huggingface/transformers.git@26b65fb5168f324277b85c558ef8209bfceae1fe
$ pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Benchmark procedure

For each batch size, we:

1. Construct a left-padded batch with a fixed length of 1024 tokens.
2. Perform one warm-up round.
3. Set max_new_tokens=1024, measure the total latency, and compute $\text{throughput} = (\text{OUTPUT\_TOKENS} \times \text{batch\_size}) / \text{total\_latency}$.

Save the following code as benchmark.py and run it with numactl -l python benchmark.py.

import os
import time
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

INPUT_TOKENS = 1024
OUTPUT_TOKENS = 1024


def get_inputs(tokenizer, batch_size):
    """Build a left-padded batch of exactly INPUT_TOKENS tokens per sample."""
    dataset = load_dataset("ola13/small-the_pile", split="train")
    tokenizer.padding_side = "left"
    selected_texts = []
    for sample in dataset:
        input_ids = tokenizer(sample["text"], return_tensors="pt").input_ids
        # The first sample must be long enough to fill the whole input window.
        if len(selected_texts) == 0 and input_ids.shape[-1] >= INPUT_TOKENS:
            selected_texts.append(sample["text"])
        elif len(selected_texts) > 0:
            selected_texts.append(sample["text"])
        if len(selected_texts) == batch_size:
            break

    return tokenizer(selected_texts, max_length=INPUT_TOKENS, padding="max_length", truncation=True, return_tensors="pt")


def run_generate(model, inputs, generation_config):
    """Run one warm-up generation, then time a second one (seconds)."""
    inputs["generation_config"] = generation_config
    model.generate(**inputs)  # warm-up
    pre = time.time()
    model.generate(**inputs)
    latency = time.time() - pre
    return latency


def benchmark(model, tokenizer, batch_size, generation_config):
    inputs = get_inputs(tokenizer, batch_size)
    # Generating a single token approximates the prefill latency.
    generation_config.max_new_tokens = 1
    generation_config.min_new_tokens = 1
    prefill_latency = run_generate(model, inputs, generation_config)
    # Full run: prefill plus OUTPUT_TOKENS decoding steps.
    generation_config.max_new_tokens = OUTPUT_TOKENS
    generation_config.min_new_tokens = OUTPUT_TOKENS
    total_latency = run_generate(model, inputs, generation_config)
    decoding_latency = (total_latency - prefill_latency) / (OUTPUT_TOKENS - 1)
    throughput = OUTPUT_TOKENS * batch_size / total_latency

    return prefill_latency, decoding_latency, throughput


if __name__ == "__main__":
    model_id = "unsloth/gpt-oss-120b-BF16"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model_kwargs = {"dtype": torch.bfloat16}
    model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
    model.config._attn_implementation = "sdpa"
    generation_config = model.generation_config
    generation_config.do_sample = False
    generation_config.cache_implementation = "static"

    for batch_size in [1, 2, 4, 8, 16, 32, 64]:
        print(f"---------- Run generation with batch size = {batch_size} ----------", flush=True)
        prefill_latency, decoding_latency, throughput = benchmark(model, tokenizer, batch_size, generation_config)
        print(f"throughput = {throughput}", flush=True)
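The arithmetic behind the reported metrics can be checked in isolation. The sketch below applies the same two formulas as the benchmark script to made-up timings; the helper name and the latency values are hypothetical and are not measurements from this benchmark.

```python
OUTPUT_TOKENS = 1024


def summarize(prefill_latency, total_latency, batch_size, output_tokens=OUTPUT_TOKENS):
    """Derive per-token decoding latency and batch throughput from two timings.

    prefill_latency: seconds to generate the first token (prompt processing).
    total_latency:   seconds to generate all output_tokens tokens.
    """
    # The first token is attributed to prefill, so output_tokens - 1 remain.
    decoding_latency = (total_latency - prefill_latency) / (output_tokens - 1)
    # Every sequence in the batch produces output_tokens tokens.
    throughput = output_tokens * batch_size / total_latency
    return decoding_latency, throughput


# Hypothetical timings in seconds (placeholders, not measured results).
prefill = 2.0
total = 104.3

tpot, tput = summarize(prefill, total, batch_size=8)
print(f"time per output token: {tpot:.3f} s")       # about 0.1 s
print(f"throughput:            {tput:.1f} tokens/s")  # about 78.5 tokens/s
```

Note that throughput is computed from the total latency, so it includes the prefill cost, while the per-token decoding latency excludes it.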

Results

Normalized throughput per vCPU

Across batch sizes from 1 to 64, C4 with Intel Xeon 6 processors consistently delivers 1.4x to 1.7x the throughput per vCPU of C3. The normalized ratio is computed as:

$$\text{normalized\_throughput\_per\_vCPU} = \frac{\text{throughput\_C4} / \text{vCPUs\_C4}}{\text{throughput\_C3} / \text{vCPUs\_C3}}$$

[Figure: Throughput per vCPU, gpt-oss]
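The normalization is a ratio of two per-vCPU throughputs, as the short sketch below shows. The function name is our own, and the throughput and vCPU values are placeholders picked so the arithmetic is easy to follow; substitute the numbers from your own runs and instances.

```python
def normalized_throughput_per_vcpu(tp_c4, vcpus_c4, tp_c3, vcpus_c3):
    """Ratio of C4 throughput per vCPU to C3 throughput per vCPU."""
    return (tp_c4 / vcpus_c4) / (tp_c3 / vcpus_c3)


# Placeholder inputs (tokens/s and vCPU counts), not measured results.
ratio = normalized_throughput_per_vcpu(
    tp_c4=288.0, vcpus_c4=144,  # 2.00 tokens/s per vCPU
    tp_c3=220.0, vcpus_c3=176,  # 1.25 tokens/s per vCPU
)
print(f"normalized throughput per vCPU: {ratio:.2f}x")  # 1.60x
```

A ratio above 1 means C4 generates more tokens per second per vCPU than C3, regardless of how many vCPUs each instance has.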

Cost and TCO

At a batch size of 64, C4 provides 1.7 times the throughput per vCPU of C3. Because the price per vCPU is roughly equivalent (cost per hour scales linearly with the number of vCPUs), the cost per generated token is 1.7x lower on C4. In other words, C3 costs about 1.7x as much as C4 to generate the same number of tokens.

Throughput ratio per vCPU:

$$\frac{\text{throughput\_C4} / \text{vCPUs\_C4}}{\text{throughput\_C3} / \text{vCPUs\_C3}} = 1.7 \Rightarrow \frac{\text{TCO\_C3}}{\text{TCO\_C4}} \approx 1.7$$

[Figure: Throughput per dollar, gpt-oss]
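The step from a per-vCPU throughput ratio to a TCO ratio can be worked through numerically. This sketch assumes, as the text does, that both instance types cost the same per vCPU-hour; the price, throughputs, and vCPU counts are hypothetical placeholders, and the helper names are our own.

```python
def cost_per_million_tokens(price_per_hour, throughput_tokens_per_s):
    """Instance cost to generate one million tokens at a steady throughput."""
    tokens_per_hour = throughput_tokens_per_s * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


def tco_ratio(price_per_vcpu_hour, tp_c3, vcpus_c3, tp_c4, vcpus_c4):
    """Cost per token on C3 divided by cost per token on C4.

    Assumes the same price per vCPU-hour on both instance types.
    """
    cost_c3 = cost_per_million_tokens(price_per_vcpu_hour * vcpus_c3, tp_c3)
    cost_c4 = cost_per_million_tokens(price_per_vcpu_hour * vcpus_c4, tp_c4)
    return cost_c3 / cost_c4


# Placeholder inputs: C4 at 1.7 tokens/s per vCPU, C3 at 1.0 tokens/s per vCPU.
ratio = tco_ratio(
    price_per_vcpu_hour=0.05,   # hypothetical price, cancels out of the ratio
    tp_c3=176.0, vcpus_c3=176,
    tp_c4=244.8, vcpus_c4=144,
)
print(f"TCO_C3 / TCO_C4 = {ratio:.2f}")  # 1.70
```

Because the per-vCPU price appears in both numerator and denominator, it cancels, and the cost ratio reduces to the normalized throughput-per-vCPU ratio. If the two instance types were priced differently per vCPU, the ratio would need to be scaled by that price difference.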

Conclusion

Google Cloud C4 VMs, powered by Intel Xeon 6 processors (GNR), deliver both significant performance improvements and greater cost efficiency for large-scale MoE inference compared to the previous-generation Google Cloud C3 VMs (powered by 4th Generation Intel Xeon processors). We observed a combination of increased throughput, lower latency, and lower cost for GPT OSS MoE inference. These results show that large-scale MoE models can be served efficiently on next-generation general-purpose CPUs thanks to targeted framework optimizations by Intel and Hugging Face.
