SoundHound gives its AI the power of vision

Already a leading player in voice assistants, SoundHound AI is now giving its technology a pair of eyes.

Imagine driving past a landmark and, without pulling out a phone, simply asking your car, “What is that building over there?” and getting an instant answer. That’s what SoundHound AI is building.

With the launch of Vision AI, SoundHound’s new system combines vision and sound to create a smarter, more natural way to interact with technology. The idea is to mimic how we operate as humans. We don’t just listen to someone, we also see their gestures and what they are looking at.

By bringing this same contextual understanding to AI, SoundHound wants to smooth out the clumsy and often frustrating experiences we have with many of today’s smart devices. The company is targeting real-world applications where it feels this combination can make a huge difference: in your next car, at restaurant drive-throughs, and on factory floors.

“At SoundHound, the AI future is not just multimodal, it is deeply integrated, responsive and built for real-world impact,” said Keyvan Mohajer, CEO of SoundHound AI.

“With Vision AI, we are extending our leadership in voice and conversational AI to redefine how humans interact with the products and services offered and used by businesses.”

So, how does it work? Vision AI takes a live feed from a camera and blends it with the company’s audio technology, which is already good at understanding natural speech. By processing what it hears at exactly the same moment as what it sees, the system can grasp the user’s true intent in a way that a simple voice assistant never could.

Think of a mechanic wearing smart glasses who can simply look at an engine part and ask for instructions, receiving instant visual and audio guidance without putting down their tools. In a shop, staff could scan shelves just by looking at them to get real-time inventory counts. For the rest of us, it might mean a drive-through kiosk that visually confirms our order on screen the moment we say it.

One of the biggest technical issues when creating such a system is ensuring that the audio and visual elements are perfectly synchronized. Any delay will shatter the illusion of natural conversation.

Pranav Singh, Vice President of Engineering at SoundHound AI, commented: “With Vision AI, visual recognition and speech intelligence are fused into a single synchronous flow. Every frame, every utterance, every intention is interpreted within the same ecosystem.

“This is innovation at the intersection of intelligence and execution, providing you with AI that sees, listens and responds in the moment.”
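To make the synchronization problem concrete, here is a minimal Python sketch of one common way to line up two streams: pair each spoken utterance with the camera frame closest to it in time, and reject the pairing if the gap is too large. This is not SoundHound's implementation, which the company has not described in detail. The Frame and Utterance types, the nearest_frame helper and the 100-millisecond tolerance are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    timestamp: float  # seconds since the start of capture (hypothetical field)
    label: str        # what a vision model saw in this frame (hypothetical field)


@dataclass
class Utterance:
    timestamp: float  # seconds since the start of capture
    text: str


def nearest_frame(frames: List[Frame], t: float, max_skew: float = 0.1) -> Optional[Frame]:
    """Return the frame closest in time to t, or None if the gap exceeds max_skew.

    Assumes frames is sorted by timestamp. The 0.1 s default is an
    illustrative tolerance, not a figure from the article.
    """
    if not frames:
        return None
    times = [f.timestamp for f in frames]
    i = bisect_left(times, t)
    # Only the frame just before and the frame at or after t can be nearest.
    candidates = frames[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda f: abs(f.timestamp - t))
    return best if abs(best.timestamp - t) <= max_skew else None


# Example: a question spoken 0.52 s into the drive is matched to the
# frame captured at 0.5 s, which shows the landmark.
frames = [Frame(0.0, "road"), Frame(0.5, "landmark"), Frame(1.0, "tunnel")]
question = Utterance(0.52, "What is that building over there?")
match = nearest_frame(frames, question.timestamp)
print(match.label if match else "no frame close enough")  # prints "landmark"
```

Returning None when no frame is close enough is a deliberate choice in this sketch: answering from a stale frame would produce exactly the kind of mismatch that breaks the feeling of a natural conversation.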

For businesses that adopt this technology, the promise is faster service, fewer mistakes, and happier customers. The aim is to remove friction so that the technology feels less like a tool you have to operate and more like a partner that helps you get things done.

This new visual capability is not the only upgrade SoundHound has rolled out. The company recently improved the system’s “brain” with a new update, Amelia 7.1. This enhancement is intended to make AI agents faster and more accurate, while giving companies more control over, and transparency into, how those agents work.

By combining vision and sound, SoundHound aims to bring us closer to a world where interacting with AI feels as easy and intuitive as talking to another person.

(Photo by Christian Lu)

See: Alan Turing Institute: The Humanities are the Key to the Future of AI

Want to learn more about AI and big data from industry leaders? Check out the AI & Big Data Expo taking place in Amsterdam, California and London. The comprehensive event is co-located with other leading events, including the Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

