SGLang Transformers backend integration

By versatileai · June 24, 2025

The Hugging Face Transformers library is the standard for working with cutting-edge models, from research experiments to fine-tuning on custom data. Its simplicity, flexibility, and vast model zoo make it a powerful tool for rapid development.

However, once you're ready to move from a notebook to production, inference performance becomes mission-critical. That's where SGLang comes in.

Designed for high-throughput, low-latency inference, SGLang now offers seamless integration with Transformers as a backend. This means you can combine the flexibility of Transformers with the raw performance of SGLang.

Let's walk through how this integration works and how to use it.

SGLang supports Hugging Face Transformers as a backend, allowing you to run any Transformers-compatible model with fast inference out of the box.

import sglang as sgl

llm = sgl.Engine("meta-llama/Llama-3.2-1B-Instruct", impl="transformers")
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0])

No native support is required: SGLang automatically falls back to Transformers when necessary. Alternatively, you can set impl="transformers" explicitly.

To compare both approaches, let's walk through a simple text-generation example using meta-llama/Llama-3.2-1B-Instruct.

Transformers

The Transformers library is ideal for experimentation, small-scale tasks, and training, but it is not optimized for high-volume or low-latency scenarios.

from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
generate_kwargs = {
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 0.8,
    "max_new_tokens": 256,
}
result = pipe("The future of AI is", **generate_kwargs)
print(result[0]["generated_text"])

SGLang

SGLang takes a different track, prioritizing efficiency with features such as RadixAttention (a memory-efficient attention mechanism). Inference with SGLang is noticeably faster and more resource-efficient, especially under load. Here's the same task in SGLang using the offline engine:

import sglang as sgl

if __name__ == "__main__":
    llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")
    prompts = ["The future of AI is"]
    sampling_params = {
        "top_p": 0.95,
        "top_k": 20,
        "temperature": 0.8,
        "max_new_tokens": 256,
    }
    outputs = llm.generate(prompts, sampling_params)
    print(outputs[0])

Alternatively, you can spin up a server and send requests to it:

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 30000

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The future of AI is",
        "sampling_params": {
            "top_p": 0.95,
            "top_k": 20,
            "temperature": 0.8,
            "max_new_tokens": 256,
        },
    },
)
print(response.json())

Note that SGLang also offers an OpenAI-compatible API, making it a drop-in replacement for external services.
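As a quick sketch of what that looks like, the standard openai client can point at a local SGLang server (this assumes a server launched on port 30000 as above; the API key value is arbitrary since the local server does not require one by default):

from openai import OpenAI

# Point the standard OpenAI client at the local SGLang server's /v1 route.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "The future of AI is"}],
    temperature=0.8,
    max_tokens=256,
)
print(response.choices[0].message.content)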

The new Transformers backend integration allows SGLang to automatically fall back to Transformers for models it does not natively support. In practice, this means:

  • Instant access to new models as soon as they are added to Transformers
  • Support for custom models on the Hugging Face Hub

This unlocks faster inference and optimized deployments (e.g., enabling RadixAttention) without sacrificing the simplicity and versatility of the Transformers ecosystem.

Usage

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", impl="transformers")

Note that specifying the impl parameter is optional: if a model is not natively supported by SGLang, it switches to the Transformers implementation on its own.

Models on the Hugging Face Hub that work with Transformers with trust_remote_code=True are also compatible with SGLang. You can find the exact requirements in the official documentation. If your custom model meets these criteria, all you need to do is set trust_remote_code=True when loading it.

llm = sgl.Engine(model_path="new-custom-transformers-model", impl="transformers", trust_remote_code=True)

Example

The Kyutai team's Helium model is not yet natively supported by SGLang. This is where the Transformers backend shines, enabling optimized inference without waiting for native support.

python3 -m sglang.launch_server \
  --model-path kyutai/helium-1-preview-2b \
  --impl transformers \
  --host 0.0.0.0 \
  --port 30000

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "top_p": 0.95,
            "top_k": 20,
            "temperature": 0.8,
            "max_new_tokens": 256,
        },
    },
)
print(response.json())

There are several important areas we are actively working on to strengthen this integration:

Performance improvements: Transformers-backed models currently lag behind native integrations in performance. Our main goal is to optimize and close this gap (see the sketch after this list for one way to measure it yourself).

LoRA support

VLM integration: We are also working to expand capabilities and use cases by adding support for vision-language models (VLMs).
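Since numbers depend heavily on hardware, here is a minimal sketch of how you might measure the performance gap yourself; the batch size, token budget, and model are illustrative assumptions, not an official benchmark:

import time
import sglang as sgl

MODEL = "meta-llama/Llama-3.2-1B-Instruct"

def time_generation(engine_kwargs: dict) -> float:
    # Build an engine, generate a small batch, and return elapsed seconds.
    llm = sgl.Engine(model_path=MODEL, **engine_kwargs)
    prompts = ["The future of AI is"] * 32  # illustrative batch size
    start = time.perf_counter()
    llm.generate(prompts, {"max_new_tokens": 128})
    elapsed = time.perf_counter() - start
    llm.shutdown()  # free GPU memory before the next run
    return elapsed

if __name__ == "__main__":
    native = time_generation({})  # SGLang's native implementation
    fallback = time_generation({"impl": "transformers"})  # Transformers backend
    print(f"native: {native:.2f}s, transformers backend: {fallback:.2f}s")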
