Close Menu
Versa AI hub
  • AI Ethics
  • AI Legislation
  • Business
  • Cybersecurity
  • Media and Entertainment
  • Content Creation
  • Art Generation
  • Research
  • Tools
  • Resources

Subscribe to Updates

Subscribe to our newsletter and stay updated with the latest news and exclusive offers.

What's Hot

Discovering new solutions to centuries-old problems in fluid mechanics

January 24, 2026

Anthropic usage statistics paint a detailed picture of AI success

January 24, 2026

YouTube vows to fight ‘AI slop’ in 2026

January 23, 2026
Facebook X (Twitter) Instagram
Versa AI hubVersa AI hub
Sunday, January 25
Facebook X (Twitter) Instagram
Login
  • AI Ethics
  • AI Legislation
  • Business
  • Cybersecurity
  • Media and Entertainment
  • Content Creation
  • Art Generation
  • Research
  • Tools
  • Resources
Versa AI hub
Home»Research»Salesforce AI Research Introduces CodeXEmbed (SFR-Embedding-Code): A Family of Code Search Models Ranked #1 in CoIR Benchmarks and Supports 12 Programming Languages
Research

Salesforce AI Research Introduces CodeXEmbed (SFR-Embedding-Code): A Family of Code Search Models Ranked #1 in CoIR Benchmarks and Supports 12 Programming Languages

By January 19, 2025No Comments5 Mins Read
Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
Share
Facebook Twitter LinkedIn Pinterest Email

Code search has become essential in modern software development, allowing you to efficiently access relevant code snippets and documentation. Unlike traditional text search, which effectively handles natural language queries, code search must deal with unique challenges such as changing structure of programming languages, dependencies, and contextual relevance. As tools like GitHub Copilot grow in popularity, advanced code search systems become increasingly important to increase productivity and reduce errors.

Existing search models often struggle to capture programming-specific nuances such as syntax, control flow, and variable dependencies. These limitations hinder problem solving in code summarization, debugging, and translation between languages. Although text search models have made significant progress, they do not meet the specific requirements of code search, highlighting the need for specialized models that improve accuracy and efficiency across a variety of programming tasks. Models such as CodeBERT, CodeGPT, and UniXcoder have addressed aspects of code search using pre-trained architectures. Still, their small size and task-specificity limit their extensibility and versatility. Although Voyage-Code introduced extensive functionality, its closed-source nature has limited widespread adoption. This highlights the critical need for open source, scalable code search systems that generalize across multiple tasks.

Researchers at Salesforce AI Research have introduced CodeXEmbed, an open source family of embedded models specifically designed for code and text search. These models are released in three sizes and 7 billion parameters: SFR-Embedding-Code-400M_R, SFR-Embedding-Code-2B_R, to accommodate a variety of programming languages ​​and search tasks. CodeXEmbed’s innovative training pipeline unifies 12 programming languages ​​and transforms 5 different code search categories into a unified framework. This model expands the scope that search systems can achieve by supporting a variety of tasks such as text-to-code search, code-to-text search, and hybrid search, providing unprecedented flexibility and performance. I will.

CodeXEmbed takes an innovative approach to converting code-related tasks into a unified query and answer framework, enabling versatility in a variety of scenarios. Text-to-code retrieval maps natural language queries to relevant code snippets, streamlining tasks such as code generation and debugging. Code-to-text search generates code descriptions and summaries to enhance documentation and knowledge sharing. Hybrid search integrates text and code data to effectively address complex queries that require technical and descriptive insights. Model training leverages contrastive loss to optimize the consistency of queries and answers while mitigating the influence of irrelevant data. Advanced techniques such as low-rank adaptation and token pooling increase efficiency without sacrificing performance.

Tests have been evaluated across a variety of benchmarks. On the CoIR benchmark, a comprehensive code search evaluation dataset covering 10 subsets and over 2 million entries, the 7 billion parameter model outperformed the previous state-of-the-art Voyage-Code model by more than 20%. Achieved improvement. . Notably, the 400-billion-parameter model and the 2-billion-parameter model also outperform Voyage-Code, demonstrating the scalability of the architecture across different sizes. CodeXEmbed also excels in text search tasks, with a 7 billion parameter model achieving an average score of 60 on the BEIR benchmark. The BEIR benchmark is a collection of 15 datasets covering a variety of search tasks, including question answering and fact checking.

This model can capture code and power end-to-end acquisition augmentation generation (RAG) systems. For example, when applied to repository-level tasks such as code completion and problem solving, the 7 billion parameter model achieved remarkable results in benchmarks such as RepoEval and SWE-Bench-Lite. RepoEval, which focuses on repository-level code completion, improved top-1 accuracy when the model retrieved context-relevant snippets. CodeXEmbed outperformed traditional search systems on SWE-Bench-Lite, a curated dataset for solving GitHub problems.

Key takeaways from the study highlight the contribution and impact of CodeXEmbed in the advancement of code search.

The 7 billion parameter model achieved state-of-the-art performance with over 20% improvement on the CoIR benchmark and competitive results on BEIR. Demonstrated versatility across code and text tasks. The 400 million and 2 billion parameter models provide practical alternatives for environments with limited computational resources. This model addresses a wide range of code-related applications by integrating 12 programming languages ​​and 5 search categories. Unlike closed systems such as Voyage-Code, CodeXEmbed fosters community-driven research and innovation. Integration with search extension generation systems improves outcomes for tasks such as code completion and problem solving. The use of contrastive loss and token pooling optimizes the retrieval accuracy and model adaptability.

In conclusion, code search has advanced with the introduction of the CodeXEmbed family by Salesforce. These models demonstrate unparalleled versatility and scalability by achieving state-of-the-art performance on CoIR benchmarks and excelling on text retrieval tasks. A multilingual and multitasking integrated framework that supports 12 programming languages ​​positions CodeXEmbed as a vital tool for developers and researchers. Open source accessibility bridges the gap between natural language and code search while fostering community-driven innovation.

Check out the paper, 400M model, and 2B model. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram channel and LinkedIn group. Don’t forget to join the 65,000+ ML SubReddit.

🚨 Recommended open source platform: Parlant is a framework that transforms the way AI agents make decisions in customer-facing scenarios. (promotion)

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His latest endeavor is the launch of Marktechpost, an artificial intelligence media platform. It stands out for its thorough coverage of machine learning and deep learning news that is technically sound and easily understood by a wide audience. The platform boasts over 2 million views per month, which shows its popularity among viewers.

📄 Introducing Height: The Only Autonomous Project Management Tool (Sponsored)

author avatar
See Full Bio
Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
Previous ArticleStanford University Researchers Introduce BIOMEDICA: A Scalable AI Framework for Powering Biomedical Visual Language Models Using Large Multimodal Datasets
Next Article Making cosmetics sustainable with generative AI

Related Posts

Research

New AI research clarifies the origins of Papua New Guineans

July 22, 2025
Research

AI helps prevent medical errors in real clinics

July 22, 2025
Research

No one is surprised, and a new study says that AI overview causes a significant drop in search clicks

July 22, 2025
Add A Comment

Comments are closed.

Top Posts

Gemini achieves gold medal level at International University Programming Contest World Finals

January 21, 20267 Views

Bridging the gap between AI agent benchmarks and industrial reality

January 22, 20266 Views

AI image generation tools will revolutionize content creation in 2026: Latest insights from Sawyer Merritt | AI News Details

January 21, 20266 Views
Stay In Touch
  • YouTube
  • TikTok
  • Twitter
  • Instagram
  • Threads
Latest Reviews

Subscribe to Updates

Subscribe to our newsletter and stay updated with the latest news and exclusive offers.

Most Popular

Gemini achieves gold medal level at International University Programming Contest World Finals

January 21, 20267 Views

Bridging the gap between AI agent benchmarks and industrial reality

January 22, 20266 Views

AI image generation tools will revolutionize content creation in 2026: Latest insights from Sawyer Merritt | AI News Details

January 21, 20266 Views
Don't Miss

Discovering new solutions to centuries-old problems in fluid mechanics

January 24, 2026

Anthropic usage statistics paint a detailed picture of AI success

January 24, 2026

YouTube vows to fight ‘AI slop’ in 2026

January 23, 2026
Service Area
X (Twitter) Instagram YouTube TikTok Threads RSS
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms and Conditions
  • Disclaimer
© 2026 Versa AI Hub. All Rights Reserved.

Type above and press Enter to search. Press Esc to cancel.

Sign In or Register

Welcome Back!

Login to your account below.

Lost password?