Tools

Reduce AI inference costs with NVIDIA and Google infrastructure

By versatileai | April 23, 2026 | 6 Mins Read

[Image: Google Cloud and NVIDIA logos side by side on a black background with a thin divider, indicating a partnership.]
At the Google Cloud Next conference, Google and NVIDIA outlined a hardware roadmap designed to address the cost of AI inference at scale.

The companies detailed the new A5X bare metal instances running on NVIDIA Vera Rubin NVL72 rack-scale systems. Through co-design of hardware and software, this architecture aims to reduce inference cost per token by up to 10x compared to previous generations, while increasing token throughput per megawatt by 10x.
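The up-to-10x cost-per-token figure can be put in back-of-envelope terms. The sketch below is purely illustrative: the per-million-token rate and the monthly token volume are hypothetical, not vendor pricing, and it simply applies the claimed reduction to a fixed serving workload.

```python
# Back-of-envelope: what an up-to-10x cost-per-token reduction implies
# for a fixed monthly serving workload. All figures are hypothetical.

def monthly_inference_cost(tokens_per_month: float,
                           cost_per_million_tokens: float) -> float:
    """Total serving cost for a month at a per-million-token rate."""
    return tokens_per_month / 1_000_000 * cost_per_million_tokens

baseline_rate = 2.00                  # $ per million tokens (assumed)
improved_rate = baseline_rate / 10    # the claimed up-to-10x reduction
tokens = 50_000_000_000               # 50B tokens/month (assumed workload)

before = monthly_inference_cost(tokens, baseline_rate)
after = monthly_inference_cost(tokens, improved_rate)
print(f"before: ${before:,.0f}/mo, after: ${after:,.0f}/mo")
```

Note that the companion claim, 10x more tokens per megawatt, compounds with this: the same power envelope serves more traffic even as each token gets cheaper.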

Connecting thousands of processors requires enormous bandwidth to prevent processing stalls. A5X instances address this hardware challenge by combining NVIDIA ConnectX-9 SuperNICs with Google's Virgo networking technology.

This configuration can scale up to 80,000 NVIDIA Rubin GPUs in a single-site cluster and up to 960,000 GPUs across multi-site deployments. Operating at this scale requires advanced workload management. Routing data across nearly a million parallel processors requires precise synchronization to avoid idle computing time.

Mark Lohmeyer, vice president and general manager of AI and compute infrastructure at Google Cloud, said: “At Google Cloud, we believe the next decade of AI will be shaped by our customers’ ability to run their most demanding workloads on a truly integrated, AI-optimized infrastructure stack.

“Google Cloud’s scalable infrastructure and managed AI services, combined with NVIDIA’s industry-leading platforms, systems, and software, give us the flexibility to train, tune, and deliver everything from frontier and open models to agentic and physical AI workloads while optimizing performance, cost, and sustainability.”

Sovereign data governance and cloud security requirements

Beyond raw processing power, data governance remains a key issue for enterprise deployments. In highly regulated sectors such as finance and healthcare, machine learning efforts are often bogged down by data sovereignty requirements and the risk of sensitive information being compromised.

To address these compliance obligations, Google Gemini models running on NVIDIA Blackwell and Blackwell Ultra GPUs are entering preview on Google Distributed Cloud. This deployment method allows organizations to run frontier models entirely within a managed environment alongside their most sensitive data stores.

This architecture includes NVIDIA Confidential Computing, a hardware-level security protocol that ensures models operate within a protected environment where prompts and fine-tuning data remain encrypted. The encryption prevents unauthorized parties, including the cloud infrastructure operator itself, from viewing or modifying the underlying data.

For multi-tenant public cloud environments, the preview of Confidential G4 VMs with NVIDIA RTX PRO 6000 Blackwell GPUs introduces these same cryptographic protections, giving regulated industries access to high-performance hardware without violating data privacy standards. This release represents the first cloud-based confidential computing product for NVIDIA Blackwell GPUs.

Operational overhead for agentic AI training

Building multi-step agent systems requires connecting large language models to complex application programming interfaces, keeping vector databases continuously synchronized, and actively mitigating hallucinations in model output.

To streamline this heavy engineering requirement, NVIDIA Nemotron 3 Super is now available on the Gemini Enterprise Agent Platform. The platform gives developers tools to customize and deploy reasoning and multimodal models designed specifically for agentic tasks. The broad NVIDIA platform on Google Cloud is optimized for a variety of models, including Google's Gemini and Gemma families, and gives developers the tools to build systems that reason, plan, and execute.
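The engineering pattern behind such agent platforms can be sketched in a few lines. The code below is a minimal, self-contained illustration, not the Gemini Enterprise or Nemotron API: the keyword lookup stands in for a vector-database retrieval, and the prompt template shows the common tactic of constraining the model to retrieved context to reduce hallucination risk.

```python
# Minimal sketch of one retrieval-grounded agent step. Every function
# here is a hypothetical stand-in; a real deployment would call the
# platform's model endpoint and vector-store services instead.

def retrieve(query: str, store: dict[str, str], k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval standing in for a vector lookup."""
    words = query.lower().split()
    scored = sorted(store.items(),
                    key=lambda kv: -sum(w in kv[1].lower() for w in words))
    return [doc for _, doc in scored[:k]]

def grounded_prompt(query: str, context: list[str]) -> str:
    """Constrain generation to cited context to curb hallucinations."""
    ctx = "\n".join(f"- {c}" for c in context)
    return (f"Answer ONLY from the context below.\n"
            f"Context:\n{ctx}\nQuestion: {query}")

store = {"doc1": "Rubin GPUs target inference throughput.",
         "doc2": "Blackwell adds confidential computing."}
prompt = grounded_prompt("rubin inference",
                         retrieve("rubin inference", store))
print(prompt.splitlines()[0])
```

A production agent would loop this step: plan, retrieve, generate, then validate the output before acting, which is exactly the synchronization and mitigation burden the managed platform is meant to absorb.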

Training these models at scale incurs significant operational overhead, especially when managing cluster sizing and hardware failures during long reinforcement learning cycles.

Google Cloud and NVIDIA have introduced managed training clusters on the Gemini Enterprise Agent Platform, including managed reinforcement learning APIs built on NVIDIA NeMo RL. The system automates cluster sizing, failure recovery, and job execution, allowing data science teams to focus on model quality rather than low-level infrastructure management.

CrowdStrike actively leverages NVIDIA NeMo open libraries, such as NeMo Data Designer and NeMo Megatron Bridge, to generate synthetic data and fine-tune models for domain-specific cybersecurity applications. Running these models on managed training clusters with Blackwell GPUs accelerates automated threat detection and response capabilities.

Legacy architecture integration and physics simulation

Integrating machine learning into heavy industry and manufacturing poses a different kind of engineering challenge. Connecting digital models to the physical factory floor requires accurate physical simulation, massive computational power, and standardization across legacy data formats. NVIDIA's AI infrastructure and physical AI libraries are now available on Google Cloud, providing a foundation for organizations to simulate and automate real-world manufacturing workflows.

Leading industrial software providers such as Cadence and Siemens are making their solutions available on Google Cloud, accelerated by NVIDIA infrastructure. These tools power the engineering and manufacturing of heavy equipment, aerospace platforms, and autonomous vehicles.

Manufacturing companies often operate product lifecycle management systems that are decades old, making geometry and physics data difficult to convert between formats. By leveraging the NVIDIA Omniverse libraries and the open-source NVIDIA Isaac Sim framework via Google Cloud Marketplace, developers can avoid some of these translation issues and build physically accurate digital twins to train robotics simulation pipelines prior to physical deployment.

Deploying NVIDIA NIM microservices, such as the Cosmos Reason 2 model, on Google Vertex AI and Google Kubernetes Engine enables vision-based agents and robots to interpret and navigate the physical environment. Together, these platforms allow developers to move directly from computer-aided design data to operational industrial digital twins.

Impact on the entire accelerated computing ecosystem

To translate these hardware specifications into quantifiable financial benefits, it helps to examine how early adopters are using the infrastructure.

The broad portfolio includes options to scale down from full NVL72 racks to fractional G4 VMs offering as little as one-eighth of a GPU. This allows customers to precisely provision acceleration capacity for mixture-of-experts inference and data processing tasks.
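Fractional provisioning turns capacity planning into simple arithmetic. In the sketch below, the one-eighth-GPU slice size comes from the announcement, but the per-GPU throughput and the target are hypothetical numbers chosen only to illustrate the right-sizing calculation.

```python
import math

# Sketch of right-sizing fractional GPU capacity. The 1/8-GPU slice is
# from the G4 announcement; the throughput figures are hypothetical.

FULL_GPU_TOKENS_PER_SEC = 12_000   # assumed throughput of one full GPU
SLICE_FRACTION = 1 / 8             # smallest fractional G4 VM slice

def slices_needed(target_tokens_per_sec: float) -> int:
    """Smallest number of 1/8-GPU slices meeting a throughput target."""
    per_slice = FULL_GPU_TOKENS_PER_SEC * SLICE_FRACTION  # 1,500 tok/s
    return math.ceil(target_tokens_per_sec / per_slice)

print(slices_needed(4_000))  # 4000 / 1500 -> 2.67, so 3 slices
```

The point of the fractional option is the granularity: a workload needing 4,000 tokens/s rents three slices rather than a whole GPU, paying for roughly the capacity it uses.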

Thinking Machines Lab is extending its Tinker API to A4X Max VMs to accelerate training. OpenAI uses large-scale inference on NVIDIA GB300 and GB200 NVL72 systems on Google Cloud to handle demanding workloads such as ChatGPT operations.

Snap moved its data pipeline to GPU-accelerated Spark on Google Cloud to reduce the huge costs associated with large-scale A/B testing. In the pharmaceutical space, Schrödinger leveraged NVIDIA accelerated computing on Google Cloud to compress drug discovery simulations that previously took weeks into hours.

The developer ecosystem extending these tools has expanded rapidly. Within a year, over 90,000 developers joined the NVIDIA and Google Cloud joint developer community.

Startups like CodeRabbit and Factory run NVIDIA Nemotron-based models on Google Cloud to perform code reviews and operate autonomous software development agents. Aible, Mantis AI, Photoroom, and Baseten use the full-stack platform to build enterprise data, video intelligence, and generative imagery solutions.

Together, NVIDIA and Google Cloud aim to provide a computing foundation designed to evolve experimental agents and simulations into production systems that secure real-world fleets and optimize factories.

See also: Reversing enterprise security costs with AI vulnerability detection

Want to learn more about AI and big data from industry leaders? Check out the AI & Big Data Expos in Amsterdam, California, and London. This comprehensive event is part of TechEx and co-located with other major technology events, such as Cyber Security & Cloud Expo. Click here for more information.

AI News is brought to you by TechForge Media. Learn about other upcoming enterprise technology events and webinars.

© 2026 Versa AI Hub. All Rights Reserved.