For decades, software developers have designed methodologies, processes, and tools that improve code quality and increase productivity. For example, agile, test-driven development, code reviews, and CI/CD are now staples of the software industry.
In How Google Tests Software (Addison-Wesley, 2012), Google reports that fixing a bug during system testing, the final testing stage, costs 1000 times more than fixing it during unit testing. This puts a lot of pressure on the first link in the chain, the developer, to produce high-quality code from the beginning.
Despite all the hype surrounding generative AI, code generation seems like a promising way to help developers deliver better code faster. In fact, early research shows that managed services like GitHub Copilot and Amazon CodeWhisperer can help improve developer productivity.
However, these services rely on a closed-source model that cannot be customized to fit your technical culture or processes. Hugging Face released SafeCoder a few weeks ago to fix this. SafeCoder is a code assistant solution built for enterprises that offers cutting-edge models, transparency, customizability, IT flexibility, and privacy.
In this post, we compare SafeCoder to closed-source services and highlight the benefits you can expect from our solution.
Cutting-edge model
SafeCoder is currently built on top of StarCoder models, a family of open source models designed and trained within the BigCode collaborative project.
StarCoder is a 15.5 billion parameter model trained for code generation in over 80 programming languages. It uses innovative architectural concepts such as Multi-Query Attention (MQA) to increase throughput and reduce latency. This technique is also found in the Falcon models and has been adapted for the LLaMa 2 models.
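To see why MQA helps with throughput and latency, it is useful to estimate the key/value cache that a decoder keeps in memory while it generates. The sketch below compares standard multi-head attention, where every query head has its own key/value head, with MQA, where all query heads share one. The layer count, head count, head size, sequence length, and 2-byte values are illustrative assumptions for a model of roughly this size, not figures stated in this post.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # The leading 2 accounts for storing both a key and a value per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration (assumed for illustration only).
N_LAYERS = 40
N_HEADS = 48
HEAD_DIM = 128
SEQ_LEN = 8192

# Multi-head attention: one key/value head per query head.
mha_bytes = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
# Multi-query attention: a single key/value head shared by all query heads.
mqa_bytes = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

if __name__ == "__main__":
    print(f"MHA cache: {mha_bytes / 2**30:.2f} GiB")
    print(f"MQA cache: {mqa_bytes / 2**20:.0f} MiB")
    print(f"Reduction: {mha_bytes // mqa_bytes}x")
```

Under these assumptions the cache shrinks by a factor equal to the number of heads, which leaves more memory for batching requests and reduces the amount of data read at each decoding step.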
StarCoder has a context window of 8192 tokens, which lets it take more of your existing code into account when generating new code. It can also do fill-in-the-middle, that is, insert new code within your code rather than simply appending it at the end.
Finally, as with HuggingChat, SafeCoder will introduce new cutting-edge models over time, giving you a seamless upgrade path.
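Fill-in-the-middle is driven by the prompt layout: the code before and after the gap is given to the model, which then generates the missing part. Here is a minimal sketch of building such a prompt. The sentinel strings follow the convention published with the StarCoder models, but treat them as an assumption and check the tokenizer of the model you actually deploy. `build_fim_prompt` is a hypothetical helper written for this example, not part of SafeCoder.

```python
# Sentinel tokens (assumed; verify against your model's tokenizer).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(source: str, cursor: int) -> str:
    """Split `source` at `cursor` and lay it out as prefix, suffix, middle marker."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor must be inside the source text")
    prefix, suffix = source[:cursor], source[cursor:]
    # The model is expected to generate the missing middle after FIM_MIDDLE.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Example: ask for the body of `add`, with the cursor after the indentation.
code = "def add(a, b):\n    \n\nprint(add(1, 2))\n"
cursor = code.index("    ") + 4
prompt = build_fim_prompt(code, cursor)

if __name__ == "__main__":
    print(prompt)
```

The resulting string would be sent to the model as an ordinary prompt, and the generated text would be inserted at the cursor position.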
Unfortunately, closed-source code assistance services do not share information about the underlying model, its features, or training data.
Transparency
In line with the Chinchilla Scaling Law, SafeCoder is a compute-optimal model trained on 1 trillion (1,000 billion) code tokens. These tokens are extracted from The Stack, a 2.7 terabyte dataset built from permissively licensed open source repositories. Every effort has been made to honor opt-out requests, and we have built a tool that lets repository owners check whether their code is part of the dataset.
In the spirit of transparency, our research paper discloses our model architecture, training process, and detailed metrics.
Unfortunately, closed-source services stick to vague information like “(the model) was trained with billions of lines of code.” To our knowledge, there are no metrics available.
Customization
StarCoder models are specifically designed to be customizable, and we’ve already built many different versions.
StarCoderBase: the original model, trained on 80+ languages from The Stack.
StarCoder: StarCoderBase further trained on Python.
StarCoder+: StarCoderBase further trained on English web data for coding conversations.
We also shared the fine-tuning code on GitHub.
Every company has its preferred languages and coding guidelines, for example how to write inline documentation and unit tests, or do's and don'ts regarding security and performance. With SafeCoder, we can help you train models that learn the peculiarities of your software engineering process. Our team will help you prepare high-quality datasets and fine-tune StarCoder on your infrastructure. Your data will not be exposed to anyone.
Unfortunately, closed-source services cannot be customized.
IT flexibility
SafeCoder relies on Docker containers for fine-tuning and deployment. It is easy to run on-premises or in the cloud, on any container management service.
Additionally, SafeCoder includes the Optimum hardware acceleration library. Whether you're using a CPU, GPU, or AI accelerator, Optimum kicks in automatically to help you save time and money on training and inference. Because you control the underlying hardware, you can also tailor the infrastructure cost/performance ratio to your needs.
Unfortunately, closed-source services are only available as managed services.
Security and privacy
Security is always a top concern, especially when source code is involved. Intellectual property and privacy must be protected at all costs.
Whether you run it on-premises or in the cloud, SafeCoder is under your full administrative control. You can apply and monitor your own security checks and maintain strong, consistent compliance across your IT platform.
SafeCoder does not spy on your data. Your prompts and suggestions are yours and yours only. SafeCoder does not call home or send any telemetry data to Hugging Face or anyone else. No one but you needs to know when and how you are using SafeCoder. SafeCoder does not even require an internet connection. It can (and should) be run completely air-gapped.
Closed-source services rely on the security of the underlying cloud. Whether this works for your compliance posture is up to you. For corporate users, prompts and suggestions are not saved (they are for personal users). However, we regret to point out that GitHub collects "user engagement data" without the possibility of opting out. AWS does the same by default, but you can opt out.
Conclusion
We are very excited about the future of SafeCoder, and so are our customers. No one should have to compromise on cutting-edge code generation, transparency, customization, IT flexibility, security, or privacy. We believe that SafeCoder provides all of that, and we will continue to strive to make it even better.
If your company is interested in SafeCoder, please contact us. Our team will contact you shortly to learn more about your use case and discuss your requirements.
Thank you for reading!

