Baidu’s latest ERNIE model is an efficiency-focused multimodal AI that, according to the company’s benchmarks, outperforms GPT and Gemini on key tests, and it targets the enterprise data that text-centric models often ignore.
For many companies, valuable insights are locked away in engineering drawings, factory floor video feeds, medical scans, and logistics dashboards. Baidu’s new model ERNIE-4.5-VL-28B-A3B-Thinking is designed to fill this gap.
What’s interesting to enterprise architects is not just its multimodal capabilities, but its architecture. It is described as a “lightweight” model: of its 28 billion total parameters, only about 3 billion are active during inference. This approach targets the high inference costs that often stall AI scaling projects. Baidu is betting on efficiency as the path to adoption, positioning the model as the basis for “multimodal agents” that can not just recognize, but reason and act.
Complex visual data analysis capabilities supported by AI benchmarks
Baidu’s multimodal ERNIE AI model excels at handling dense non-textual data. For example, it can interpret a “Peak Hours Reminder” chart to find the best visiting time, a task that mirrors resource-scheduling challenges in logistics and retail.
ERNIE 4.5 also works in technical areas, such as applying Ohm’s Law and Kirchhoff’s Law to solve bridge circuit diagrams. In research and development or engineering departments, future assistants may validate designs or explain complex schematics to new employees.
These capabilities are supported by Baidu’s own benchmarks, which show ERNIE-4.5-VL-28B-A3B-Thinking outperforming competitors such as GPT-5-High and Gemini 2.5 Pro on several key tests:
- MathVista: ERNIE (82.5) vs Gemini 2.5 Pro (82.3) and GPT-5-High (81.3)
- ChartQA: ERNIE (87.1) vs Gemini 2.5 Pro (76.3) and GPT-5-High (78.2)
- VLMs are Blind: ERNIE (77.3) vs Gemini 2.5 Pro (76.5) and GPT-5-High (69.6)
Of course, AI benchmarks provide guidance, but they can have flaws. Before deploying any model in mission-critical applications, run internal tests against your own data and requirements.
Baidu moves from recognition to automation with latest ERNIE AI model
The main hurdle for enterprise AI is moving from recognition (“What is this?”) to automation (“What should happen next?”). ERNIE 4.5 claims to address this by combining visual grounding with tool use.
If you ask the model to find all the people wearing suits in an image and return their coordinates in JSON format, it returns structured data rather than a prose description. That capability transfers readily to a production line doing visual inspection, or to a system that audits site images for safety compliance.
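Structured output is what makes this usable downstream. A minimal sketch of consuming such a response, where the JSON schema (the "label" and "bbox" fields) is an assumption for illustration and the text is a made-up stand-in for real model output:

```python
import json

# Hypothetical response to a grounding prompt such as
# "Find all people wearing suits and return their coordinates as JSON."
# The field names ("label", "bbox") are assumptions; the actual schema
# depends on how the prompt is phrased.
response_text = """
[
  {"label": "person in suit", "bbox": [120, 45, 310, 480]},
  {"label": "person in suit", "bbox": [420, 60, 600, 470]}
]
"""

detections = json.loads(response_text)

def count_matches(dets, label_substring):
    """Count detections whose label contains the given substring."""
    return sum(1 for d in dets if label_substring in d["label"])

# Structured output can feed straight into a compliance rule,
# e.g. flag frames where the expected number of matches is not met.
print(count_matches(detections, "suit"))  # 2 in this sample
```

Because the output is machine-readable, the same pattern works for defect detection or safety audits: parse once, then apply ordinary business rules.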
This model also manages external tools and can autonomously enlarge photos and read small text. When faced with an unknown object, it can trigger an image search to identify it. This represents a less passive form of AI, allowing agents to not only flag errors in the data center, but also zoom in on the code, search internal knowledge bases, and suggest fixes.
Unlock business intelligence with multimodal AI
Baidu’s latest ERNIE AI model also targets enterprise video archives, from training sessions and meetings to security footage. It can extract all on-screen subtitles and map them to precise timestamps.
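Once subtitles are paired with timestamps, a thin layer on top makes an archive searchable. A minimal sketch, assuming the model’s output has already been parsed into (timestamp, text) pairs; the sample data below is invented for illustration:

```python
# Timestamped subtitles as a made-up stand-in for parsed model output.
subtitles = [
    ("00:03:12", "Welcome to the Q3 planning webinar."),
    ("00:41:07", "Our GPU budget doubles next quarter."),
    ("01:22:45", "Action items: finalize the vendor shortlist."),
]

def find_moments(subs, query):
    """Return the timestamp of every subtitle containing the query (case-insensitive)."""
    q = query.lower()
    return [ts for ts, text in subs if q in text.lower()]

print(find_moments(subtitles, "gpu"))  # ['00:41:07']
```

In practice the index would live in a search engine rather than a list, but the point is the same: the model turns opaque video into rows you can query.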
It also exhibits temporal awareness, finding specific scenes (such as a scene “shot on a bridge”) by analyzing visual cues. The clear end goal is to make the vast video library searchable, allowing employees to find the exact moment a particular topic was discussed, even though they may have fallen asleep several times during a two-hour webinar.
Baidu provides deployment guidance for several paths, including Transformers, vLLM, and FastDeploy. However, hardware requirements are a significant barrier: single-card deployment requires 80GB of GPU memory. This is not a tool for casual experimentation, but for organizations with existing high-performance AI infrastructure.
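As a rough sketch of one of those paths, serving the model through vLLM’s OpenAI-compatible server might look like the following. The repository ID mirrors the model’s published name and the flags are assumptions; check Baidu’s deployment docs and the official model card for the exact repository, supported vLLM version, and required options.

```shell
# Sketch only: serve the model with vLLM (assumed repository ID and flags).
pip install vllm

vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
  --trust-remote-code          # model uses custom code, per common practice
```

Once the server is up, any OpenAI-compatible client can send image-plus-text requests to it, which is what makes the grounding and video use cases above scriptable.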
Those with the hardware can use Baidu’s ERNIEKit toolkit to fine-tune the model on their own data, which most high-value use cases will require. Baidu releases the latest ERNIE AI model under an Apache 2.0 license that allows commercial use, which is essential for adoption.
The market is finally moving toward multimodal AI that can see, read, and act within a specific business context, and the benchmarks suggest this model does so competitively. The challenge now is to identify high-value visual reasoning jobs within your business and weigh them against the significant costs of hardware and governance.
AI News is brought to you by TechForge Media. Learn about other upcoming enterprise technology events and webinars.

