The value of AI systems that can perform human tasks, particularly AI agents, is clear: they open real opportunities for productivity gains. However, the inconsistent performance of large language models (LLMs) can hinder effective agent deployment. Salesforce AI Research is trying to address that issue.
Also: 60% of AI agents work in IT departments – here's what they do every day
On Thursday, Salesforce released its first AI Research in Review report, highlighting the company's innovations from the past quarter, including new model developments and research papers. Salesforce hopes this work will support the development of trusted, capable AI agents that perform well in business environments.
"At Salesforce, we call these 'boring breakthroughs.' Not because they are unremarkable, but because they are quietly capable, reliably scalable, and built to last," says Silvio Savarese, Salesforce's chief scientist. "They're so seamless that some people may take them for granted."
Also: 4 types of people interested in AI agents – and what businesses can learn from them
Below, dive into some of the biggest breakthroughs and takeaways from the report.
The problem: jagged intelligence
If you've ever used AI models for simple, everyday tasks, you might be surprised by how rudimentary their mistakes can be. What's even more inexplicable is that the same model that gets a basic question wrong can perform remarkably well on benchmarks that test its abilities on complex topics like mathematics, STEM, coding, and more. This paradox is what Salesforce calls "jagged intelligence."
Salesforce points out that this "jaggedness," or the gap between an LLM's raw intelligence and its real-world consistency, is particularly challenging for businesses that require reliable operational performance, especially in unpredictable environments. Addressing the problem, however, means quantifying it first, which highlights another issue.
"AI today is jagged, so we need to work on that. But can we work on something without measuring it first?" says Shelby Heinecke, senior AI research manager at Salesforce.
Also: Why ignoring AI ethics is such risky business – and how to do AI right
That's exactly the problem that Salesforce's new SIMPLE benchmark addresses.
The SIMPLE benchmark
SIMPLE is a public Salesforce dataset of questions that are easy for humans to answer but that trip up AI models, letting researchers benchmark, or quantify, an LLM's jaggedness. To give you an idea of how basic the questions are, the Hugging Face dataset card describes each problem as one that "at least 10% of high school students who are given pens, unlimited paper, and an hour of time" could solve.
Although it doesn't test super-complex tasks, the SIMPLE benchmark should help people understand how models reason in real environments and applications, especially when developing enterprise general intelligence (EGI): competent AI systems that businesses can trust to handle their applications.
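Conceptually, scoring a model on a dataset like this comes down to plain accuracy over easy questions. The sketch below is a minimal illustration of that idea; the questions and the `toy_model` are invented stand-ins for illustration, not items from Salesforce's actual dataset.

```python
# Minimal sketch of scoring a model on easy questions: run the model
# over (question, answer) pairs and report plain accuracy.
from typing import Callable, Iterable


def score_easy_questions(
    model: Callable[[str], str],
    items: Iterable[tuple[str, str]],
) -> float:
    """Return the fraction of (question, answer) pairs the model gets right."""
    items = list(items)
    correct = sum(
        1 for question, answer in items
        if model(question).strip().lower() == answer.strip().lower()
    )
    return correct / len(items)


# Toy "model" with canned answers, standing in for a real LLM call.
def toy_model(question: str) -> str:
    canned = {"What is 2 + 2?": "4", "How many days are in a week?": "7"}
    return canned.get(question, "I don't know")


accuracy = score_easy_questions(
    toy_model,
    [("What is 2 + 2?", "4"),
     ("How many days are in a week?", "7"),
     ("What color is the sky on a clear day?", "blue")],
)
print(accuracy)  # 2 of 3 correct
```

A jagged model is exactly one whose accuracy on a set like this is surprisingly far below its score on harder benchmarks.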
Another advantage of the benchmark is that it gives a better sense of a model's performance consistency, which should build trust among business leaders weighing whether to implement AI systems, such as AI agents, in their businesses.
Another benchmark developed by Salesforce, ContextualJudgeBench, takes a different approach: it evaluates AI judges rather than the models themselves. Because AI model benchmarks often rely on evaluations from other AI models, ContextualJudgeBench focuses on the LLMs that evaluate other models, on the premise that if the judge is reliable, its evaluations will be, too. The benchmark tests judges across more than 2,000 response pairs.
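The judge-of-judges idea can be illustrated with a small sketch: give a candidate judge pairs of responses with a known-better answer and measure how often it picks the right one. Everything below (the `length_judge` heuristic and the labeled pairs) is a toy stand-in, not ContextualJudgeBench data.

```python
# Sketch of evaluating an AI judge: the judge picks the better of two
# responses, and we check its picks against known-correct labels.
from typing import Callable


def judge_accuracy(
    judge: Callable[[str, str, str], str],  # (prompt, resp_a, resp_b) -> "A" or "B"
    labeled_pairs: list[tuple[str, str, str, str]],  # (prompt, a, b, gold)
) -> float:
    hits = sum(1 for p, a, b, gold in labeled_pairs if judge(p, a, b) == gold)
    return hits / len(labeled_pairs)


# Toy judge that naively prefers the longer response.
def length_judge(prompt: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"


pairs = [
    ("Summarize the meeting.",
     "Short note.", "A detailed, faithful summary of the meeting.", "B"),
    ("Translate 'bonjour'.",
     "Hello", "An essay about French greetings.", "A"),
]
print(judge_accuracy(length_judge, pairs))  # the naive judge gets 1 of 2 right
```

A real judge model would replace `length_judge` with an LLM call; the point of the benchmark is that its accuracy on labeled pairs tells you how much to trust its downstream evaluations.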
CRMArena
In the past quarter, Salesforce also launched CRMArena, a benchmark framework that evaluates how well AI agents perform customer relationship management (CRM) tasks, such as summarizing sales emails and transcripts or making commerce recommendations.
"These agents don't need to solve theorems; they don't need to turn my prose into Shakespearean sonnets," says Savarese.
Also: How the "agentic internet" could help AI agents work together
CRMArena aims to address the problem of organizations not knowing how well a model performs on real business tasks. By offering comprehensive testing, the framework should help improve AI agent development and performance.
Other notable mentions
The complete report includes further research aimed at improving the efficiency and reliability of AI models. Here is a quick summary of some of the highlights:
SFR-Embedding: Salesforce enhanced its SFR-Embedding model, which converts text-based information into structured data for AI systems such as agents. The company also added SFR-Embedding-Code, a specialized family of code-embedding models.
SFR-Guard: A family of models trained on data to safeguard AI agents in key business areas, including toxicity detection and prompt-injection detection.
xLAM: Salesforce updated its xLAM (Large Action Model) family with "multi-turn conversation support and a wider range of smaller models for improved accessibility."
TACO: This multimodal model family generates chains of thought-and-action (CoTA) to tackle complex, multi-step problems.
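To make the embedding idea concrete: an embedding model maps text to vectors so that related texts land close together, which is what lets an AI system retrieve the right structured context. The bag-of-words vectorizer below is only a toy stand-in for a trained model like SFR-Embedding, used here to show how embeddings are compared with cosine similarity.

```python
# Toy illustration of embedding-based similarity: represent each text as
# a word-count vector and compare vectors with cosine similarity.
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


query = toy_embed("summarize the sales call transcript")
doc_a = toy_embed("transcript of the sales call")
doc_b = toy_embed("quarterly revenue projections")

# The sales-call document scores higher than the unrelated one.
print(cosine(query, doc_a) > cosine(query, doc_b))  # True
```

A production system would swap `toy_embed` for the trained model's vectors, but the retrieval logic (embed, then rank by similarity) stays the same.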