SGLang Transformers backend integration

By versatileai · June 24, 2025

The Hugging Face Transformers library is the standard for working with cutting-edge models, from research experiments to fine-tuning on custom data. Its simplicity, flexibility, and vast model zoo make it a powerful tool for rapid development.

However, once you're ready to move from a notebook to production, inference performance becomes mission-critical. That's where SGLang comes in.

Designed for high-throughput, low-latency inference, SGLang now offers seamless integration with Transformers as a backend. This means you can combine the flexibility of Transformers with the raw performance of SGLang.

Let's walk through how this integration works and how to use it.

SGLang supports Hugging Face Transformers as a backend, allowing you to run any Transformers-compatible model with fast inference out of the box.

import sglang as sgl

llm = sgl.Engine("meta-llama/Llama-3.2-1B-Instruct", impl="transformers")
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0])

No native support is required: SGLang automatically falls back to Transformers when necessary. Alternatively, you can set impl="transformers" explicitly.

To compare both approaches, let's walk through a simple text-generation example using meta-llama/Llama-3.2-1B-Instruct.

Transformers

The Transformers library is ideal for experimentation, small-scale tasks, and training, but it is not optimized for high-volume or low-latency scenarios.

from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
generate_kwargs = {
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 0.8,
    "max_new_tokens": 256,
}
result = pipe("The future of AI is", **generate_kwargs)
print(result[0]["generated_text"])

SGLang

SGLang takes a different track, prioritizing efficiency with features such as RadixAttention (a memory-efficient attention mechanism). Inference with SGLang is noticeably faster and more resource-efficient, especially under load. Here's the same task in SGLang using the offline engine:

import sglang as sgl

if __name__ == "__main__":
    llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")
    prompts = ["The future of AI is"]
    sampling_params = {
        "top_p": 0.95,
        "top_k": 20,
        "temperature": 0.8,
        "max_new_tokens": 256,
    }
    outputs = llm.generate(prompts, sampling_params)
    print(outputs[0])

Alternatively, you can spin up a server and send requests to it:

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 30000

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The future of AI is",
        "sampling_params": {
            "top_p": 0.95,
            "top_k": 20,
            "temperature": 0.8,
            "max_new_tokens": 256,
        },
    },
)
print(response.json())

Note that SGLang also offers an OpenAI-compatible API, making it a drop-in replacement for external services.
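As a quick sketch of what that looks like, the standard openai client can point at a local SGLang server (this assumes a server launched on port 30000 as above; the API key value is arbitrary since the local server does not require one by default):

from openai import OpenAI

# Point the standard OpenAI client at the local SGLang server's /v1 route.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "The future of AI is"}],
    temperature=0.8,
    max_tokens=256,
)
print(response.choices[0].message.content)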

The new Transformers backend integration allows SGLang to automatically fall back to Transformers for models it does not natively support. In practice, this means:

  • Instant access to new models as soon as they are added to Transformers
  • Support for custom models on the Hugging Face Hub

This unlocks faster inference and optimized deployments (e.g., enabling RadixAttention) without sacrificing the simplicity and versatility of the Transformers ecosystem.

Usage

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", impl="transformers")

Note that specifying the impl parameter is optional: if a model is not natively supported by SGLang, it switches to the Transformers implementation on its own.

Models on the Hugging Face Hub that work with Transformers with trust_remote_code=True are also compatible with SGLang. You can find the exact requirements in the official documentation. If your custom model meets these criteria, all you need to do is set trust_remote_code=True when loading it.

llm = sgl.Engine(model_path="new-custom-transformers-model", impl="transformers", trust_remote_code=True)

Example

The Kyutai team's Helium model is not yet natively supported by SGLang. This is where the Transformers backend shines, enabling optimized inference without waiting for native support.

python3 -m sglang.launch_server \
  --model-path kyutai/helium-1-preview-2b \
  --impl transformers \
  --host 0.0.0.0 \
  --port 30000

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "top_p": 0.95,
            "top_k": 20,
            "temperature": 0.8,
            "max_new_tokens": 256,
        },
    },
)
print(response.json())

There are several important areas we are actively working on to strengthen this integration:

Performance improvements: Transformers-backed models currently lag behind native integrations in performance. Our main goal is to optimize and close this gap (see the sketch after this list for one way to measure it yourself).

LoRA support

VLM integration: We are also working to expand capabilities and use cases by adding support for vision-language models (VLMs).
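Since numbers depend heavily on hardware, here is a minimal sketch of how you might measure the performance gap yourself; the batch size, token budget, and model are illustrative assumptions, not an official benchmark:

import time
import sglang as sgl

MODEL = "meta-llama/Llama-3.2-1B-Instruct"

def time_generation(engine_kwargs: dict) -> float:
    # Build an engine, generate a small batch, and return elapsed seconds.
    llm = sgl.Engine(model_path=MODEL, **engine_kwargs)
    prompts = ["The future of AI is"] * 32  # illustrative batch size
    start = time.perf_counter()
    llm.generate(prompts, {"max_new_tokens": 128})
    elapsed = time.perf_counter() - start
    llm.shutdown()  # free GPU memory before the next run
    return elapsed

if __name__ == "__main__":
    native = time_generation({})  # SGLang's native implementation
    fallback = time_generation({"impl": "transformers"})  # Transformers backend
    print(f"native: {native:.2f}s, transformers backend: {fallback:.2f}s")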
