Close Menu
Versa AI hub
  • AI Ethics
  • AI Legislation
  • Business
  • Cybersecurity
  • Media and Entertainment
  • Content Creation
  • Art Generation
  • Research
  • Tools

Subscribe to Updates

Subscribe to our newsletter and stay updated with the latest news and exclusive offers.

What's Hot

ClarityCut ​​AI unveils a new creative engine for branded videos

June 7, 2025

The most comprehensive evaluation suite for GUI agents!

June 7, 2025

Japan’s innovative approach to artificial intelligence law – gktoday

June 7, 2025
Facebook X (Twitter) Instagram
Versa AI hubVersa AI hub
Saturday, June 7
Facebook X (Twitter) Instagram
Login
  • AI Ethics
  • AI Legislation
  • Business
  • Cybersecurity
  • Media and Entertainment
  • Content Creation
  • Art Generation
  • Research
  • Tools
Versa AI hub
Home»Research»Salesforce AI Research Introduces CodeXEmbed (SFR-Embedding-Code): A Family of Code Search Models Ranked #1 in CoIR Benchmarks and Supports 12 Programming Languages
Research

Salesforce AI Research Introduces CodeXEmbed (SFR-Embedding-Code): A Family of Code Search Models Ranked #1 in CoIR Benchmarks and Supports 12 Programming Languages

By January 19, 2025No Comments5 Mins Read
Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
Share
Facebook Twitter LinkedIn Pinterest Email

Code search has become essential in modern software development, allowing you to efficiently access relevant code snippets and documentation. Unlike traditional text search, which effectively handles natural language queries, code search must deal with unique challenges such as changing structure of programming languages, dependencies, and contextual relevance. As tools like GitHub Copilot grow in popularity, advanced code search systems become increasingly important to increase productivity and reduce errors.

Existing search models often struggle to capture programming-specific nuances such as syntax, control flow, and variable dependencies. These limitations hinder problem solving in code summarization, debugging, and translation between languages. Although text search models have made significant progress, they do not meet the specific requirements of code search, highlighting the need for specialized models that improve accuracy and efficiency across a variety of programming tasks. Models such as CodeBERT, CodeGPT, and UniXcoder have addressed aspects of code search using pre-trained architectures. Still, their small size and task-specificity limit their extensibility and versatility. Although Voyage-Code introduced extensive functionality, its closed-source nature has limited widespread adoption. This highlights the critical need for open source, scalable code search systems that generalize across multiple tasks.

Researchers at Salesforce AI Research have introduced CodeXEmbed, an open source family of embedded models specifically designed for code and text search. These models are released in three sizes and 7 billion parameters: SFR-Embedding-Code-400M_R, SFR-Embedding-Code-2B_R, to accommodate a variety of programming languages ​​and search tasks. CodeXEmbed’s innovative training pipeline unifies 12 programming languages ​​and transforms 5 different code search categories into a unified framework. This model expands the scope that search systems can achieve by supporting a variety of tasks such as text-to-code search, code-to-text search, and hybrid search, providing unprecedented flexibility and performance. I will.

CodeXEmbed takes an innovative approach to converting code-related tasks into a unified query and answer framework, enabling versatility in a variety of scenarios. Text-to-code retrieval maps natural language queries to relevant code snippets, streamlining tasks such as code generation and debugging. Code-to-text search generates code descriptions and summaries to enhance documentation and knowledge sharing. Hybrid search integrates text and code data to effectively address complex queries that require technical and descriptive insights. Model training leverages contrastive loss to optimize the consistency of queries and answers while mitigating the influence of irrelevant data. Advanced techniques such as low-rank adaptation and token pooling increase efficiency without sacrificing performance.

Tests have been evaluated across a variety of benchmarks. On the CoIR benchmark, a comprehensive code search evaluation dataset covering 10 subsets and over 2 million entries, the 7 billion parameter model outperformed the previous state-of-the-art Voyage-Code model by more than 20%. Achieved improvement. . Notably, the 400-billion-parameter model and the 2-billion-parameter model also outperform Voyage-Code, demonstrating the scalability of the architecture across different sizes. CodeXEmbed also excels in text search tasks, with a 7 billion parameter model achieving an average score of 60 on the BEIR benchmark. The BEIR benchmark is a collection of 15 datasets covering a variety of search tasks, including question answering and fact checking.

This model can capture code and power end-to-end acquisition augmentation generation (RAG) systems. For example, when applied to repository-level tasks such as code completion and problem solving, the 7 billion parameter model achieved remarkable results in benchmarks such as RepoEval and SWE-Bench-Lite. RepoEval, which focuses on repository-level code completion, improved top-1 accuracy when the model retrieved context-relevant snippets. CodeXEmbed outperformed traditional search systems on SWE-Bench-Lite, a curated dataset for solving GitHub problems.

Key takeaways from the study highlight the contribution and impact of CodeXEmbed in the advancement of code search.

The 7 billion parameter model achieved state-of-the-art performance with over 20% improvement on the CoIR benchmark and competitive results on BEIR. Demonstrated versatility across code and text tasks. The 400 million and 2 billion parameter models provide practical alternatives for environments with limited computational resources. This model addresses a wide range of code-related applications by integrating 12 programming languages ​​and 5 search categories. Unlike closed systems such as Voyage-Code, CodeXEmbed fosters community-driven research and innovation. Integration with search extension generation systems improves outcomes for tasks such as code completion and problem solving. The use of contrastive loss and token pooling optimizes the retrieval accuracy and model adaptability.

In conclusion, code search has advanced with the introduction of the CodeXEmbed family by Salesforce. These models demonstrate unparalleled versatility and scalability by achieving state-of-the-art performance on CoIR benchmarks and excelling on text retrieval tasks. A multilingual and multitasking integrated framework that supports 12 programming languages ​​positions CodeXEmbed as a vital tool for developers and researchers. Open source accessibility bridges the gap between natural language and code search while fostering community-driven innovation.

Check out the paper, 400M model, and 2B model. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram channel and LinkedIn group. Don’t forget to join the 65,000+ ML SubReddit.

🚨 Recommended open source platform: Parlant is a framework that transforms the way AI agents make decisions in customer-facing scenarios. (promotion)

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His latest endeavor is the launch of Marktechpost, an artificial intelligence media platform. It stands out for its thorough coverage of machine learning and deep learning news that is technically sound and easily understood by a wide audience. The platform boasts over 2 million views per month, which shows its popularity among viewers.

📄 Introducing Height: The Only Autonomous Project Management Tool (Sponsored)

author avatar
See Full Bio
Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
Previous ArticleStanford University Researchers Introduce BIOMEDICA: A Scalable AI Framework for Powering Biomedical Visual Language Models Using Large Multimodal Datasets
Next Article Making cosmetics sustainable with generative AI

Related Posts

Research

JMU Education Professor was awarded for AI Research

June 3, 2025
Research

Intelligent Automation, Nvidia and Enterprise AI

June 2, 2025
Research

Can AI be your therapist? New research reveals major risks

June 2, 2025
Add A Comment
Leave A Reply Cancel Reply

Top Posts

Deepseek’s latest AI model is a “big step back” for free speech

May 31, 20255 Views

From California to Kentucky: Tracking the rise of state AI laws in 2025 | White & Case LLP

May 29, 20255 Views

Gemini 2.5 Pro Preview: Even better coding performance

May 13, 20254 Views
Stay In Touch
  • YouTube
  • TikTok
  • Twitter
  • Instagram
  • Threads
Latest Reviews

Subscribe to Updates

Subscribe to our newsletter and stay updated with the latest news and exclusive offers.

Most Popular

Deepseek’s latest AI model is a “big step back” for free speech

May 31, 20255 Views

From California to Kentucky: Tracking the rise of state AI laws in 2025 | White & Case LLP

May 29, 20255 Views

Gemini 2.5 Pro Preview: Even better coding performance

May 13, 20254 Views
Don't Miss

ClarityCut ​​AI unveils a new creative engine for branded videos

June 7, 2025

The most comprehensive evaluation suite for GUI agents!

June 7, 2025

Japan’s innovative approach to artificial intelligence law – gktoday

June 7, 2025
Service Area
X (Twitter) Instagram YouTube TikTok Threads RSS
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms and Conditions
  • Disclaimer
© 2025 Versa AI Hub. All Rights Reserved.

Type above and press Enter to search. Press Esc to cancel.

Sign In or Register

Welcome Back!

Login to your account below.

Lost password?