You can launch a private OpenAI-compatible LLM endpoint on your Hugging Face infrastructure with a single command. There are no servers or Kubernetes to provision, and you pay per second. Once launched, you can run queries from your laptop, notebook, or anywhere else.
This is the easiest way to launch a model for testing, evaluation, or batch generation. (Alternatively, if you’re looking for a production-ready managed service, inference endpoints are a good fit for that purpose. We’ll discuss which one you ultimately choose in more detail.)
We will explain everything from end to end here.
Prerequisites
Payment method or positive prepaid credit balance (jobs are billed per minute based on hardware usage). Hug Face Hub >= 1.20.0: pip install -U “Hug Face Hub >= 1.20.0”. Login locally: hf auth login.
start the server
hf jobs run is a docker run for HF infrastructure. Use the official vllm/vllm-openai image, request the GPU with –flavor, and expose the vLLM port with –expose.
n/a work run –flavor a10g-large –expose 8000 —timeout 2h \ vllm/vllm-openai:latest \ vllmserve Qwen/Qwen3-4B –host 0.0.0.0 –port 8000
–expose 8000 routes the container’s port through HF’s public job proxy (see the Serve Models guide for a complete reference). This command outputs the URL where the server can be reached.
✓ Started job ID: 6a381ca1953ed90bfb947332 url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332 Tip: The public port is reachable at (requires an HF token with read access to the job): https://6a381ca1953ed90bfb947332–8000.hf.jobs
6a381ca1953ed90bfb947332 is the job ID. Please record it as you will need it. I’ll use it as a placeholder for the rest of this post.
It will take a few minutes for the weights to download and launch. If you see “Application startup completed” in the log, it’s up and running.
Query from anywhere
vLLM speaks the OpenAI API and all requests require an HF token as a bearer token. The easiest way to hit this is curl.
curl https://–8000.hf.jobs/v1/chat/completions \ -H “Authority: Owner” $(hf authentication token)” \ -H “Content type: application/json” \ -d ‘{
“Model”: “Kwen/Kwen 3-4B”,
“Message”: ({“Role”: “User”, “Content”: “Hello!”}),
“chat_template_kwargs”: {“enable_ Thinking”: false}
}’
This returns regular OpenAI-style JSON with choices(0).message.content containing “Hello! How can I help you today? 😊”.
Alternatively, from Python, specify the published URL to your OpenAI client and pass the token as the API key.
from hug face hub import Get token
from open night import OpenAI client = OpenAI(base_url=“https://–8000.hf.jobs/v1”api_key=get_token(), ) resp = client.chat.completions.create( model=“Kwen/Kwen 3-4B”message =({“role”: “user”, “content”: “Hello!”}), extra_body={“chat_template_kwargs”: {“Make thinking possible”: error}},)
print(Each option(0).message.content) Hello! How can I help you today? 😊
A quick health check before starting: curl https://–8000.hf.jobs/v1/models -H The model will be listed under “Authorization: Bearer $(hf auth token)”.
🔐 Endpoints are gated and not public. All requests must include an HF token with read access to the job’s namespace. Access to regular browsers is denied. In effect, a job proxy is an API gate. The scope of access is limited to you (and your organization). There is no problem with personal use, but please handle the URL appropriately. Don’t share with the expectation that it will be published or paste the token somewhere you don’t trust. If you need more granular or public access, front a suitable gateway instead. Or see HF jobs or inference endpoints. Down below.
cleaning
Jobs are billed per second, so stop the server when you’re done.
n/a work cancel
The –timeout you set is a safety net (it will stop automatically), but it is cheaper to cancel it explicitly. a10g-large runs for $1.50 per hour. Check out hf job hardware for a complete price list and choose the smallest flavor that fits your model.
Go further: larger models
The same command extends to larger models. Choose a stronger –flavor and use –tensor-Parallel-size to tell vLLM to shard the model across GPUs. For example, a 122B Qwen3.5 expert mixture model with 2× H200 would be:
n/a work run –flavor h200x2 –expose 8000 —timeout 2h \ vllm/vllm-openai:latest \ vllmserve Qwen/Qwen3.5-122B-A10B \ –host 0.0.0.0 –port 8000 –tensor-Parallel-size 2 \ –max-model-len 32768 –max-num-seqs 256
–tensor-Parallel-size must match the number of GPUs in the flavor (h200x2 → 2, h200x8 → 8). Run the hf job hardware to see what’s available, and increase the –timeout for large models as they take longer to download and load. For larger models, the H200 flavor is usually the best value.
–max-model-len 32768 –max-num-seqs 256 flags are specific to this model. Qwen3.5-122B is a hybrid Mamba/attention architecture with a default context of 256K tokens, which does not leave enough memory for vLLM’s default batch settings. It is maintained in the memory of the GPU by placing an upper limit on the length of the context and the number of simultaneous sequences. If your model fails to start due to out of memory or cache block errors, these are the two errors you will try first. Everything else (published URL, OpenAI client, token authentication) remains exactly the same.
Go further: Chat in the UI
Prefer chat windows over Curl? Several lines in Gradio point to the same endpoint. Add –reasoning-parser deepseek_r1 to the vllmserve command so that Qwen3’s thoughts are returned as a separate field (not required, but helpful). Then run this code locally (only the job ID is required).
import gladio as grams
from gladio import chat message
from hug face hub import Get token
from open night import OpenAI client = OpenAI(base_url=“https://–8000.hf.jobs/v1”api_key=get_token())
surely chat(messages, history): message = ({“role”:m(“role”), “content”:m(“content”)} for meter in history if do not have m.get(“Metadata”))messages.append({“role”: “user”, “content”: message}) stream = client.chat.completions.create(model=“Kwen/Kwen 3-4B”message = message, stream =truth) Think and answer = “”, “”
for lump in Stream: delta = chunk.choices(0).delta thinking += delta.model_extra.get(“Deduction”, “”) answer += delta.content or “”
out = ()
if Thinking.strip(): status = “end” if Answer.strip() Other than that “Pending”
out.append(ChatMessage(role=“assistant”content=thoughts, metadata={“title”: “💭Thinking”, “situation”: situation}))
if Answer.strip(): out.append(ChatMessage(role=“assistant”content = answer))
yield out gr.ChatInterface(Chat).launch()
Run it and open http://127.0.0.1:7860 to chat. The reasoning flows into a collapsible panel with the answer below.
Go further: SSH into a running server
Need to debug startup failures, monitor GPU memory, or track logs interactively? You can open a shell and access running jobs directly. Start with –ssh and make sure your public key is registered at Huggingface.co/settings/keys.
n/a work run –flavor a10g-large –expose 8000 —timeout 2h –ssh \ vllm/vllm-openai:latest \ vllmserve Qwen/Qwen3-4B –host 0.0.0.0 –port 8000
Then connect using your job ID.
n/a work ssh
Now you can go inside the container and run nvidia-smi, inspect processes, and manipulate the model directly. This makes debugging and monitoring much easier than reading logs externally. SSH support requires huggingface_hub >= 1.20.0.
Go further: Use as a coding agent backend on Pi
The same endpoint can support terminal coding agents. Pi is a provider-independent agent harness. Specifying a job will run a read/write/edit/Bash agent in your own self-hosted model.
The first thing you need to configure is that the agent drives models through tool calls, and vLLM accepts them only if the server is started with tool calls enabled. So restart with –enable-auto-tool-choice and –tool-call-parser that match your model family (hermes in Qwen3). This is a good place to introduce larger models, as agents also benefit from more powerful models.
n/a work run –flavor h200x2 –expose 8000 —timeout 2h \ vllm/vllm-openai:latest \ vllmserve Qwen/Qwen3.5-122B-A10B \ –host 0.0.0.0 –port 8000 –tensor-Parallel-size 2 \ –max-model-len 32768 –max-num-seqs 256 \ –reasoning-parser deepseek_r1 \ –enable-auto-tool-choice –tool-call-parser Hermes
Next, add the job as a custom provider to ~/.pi/agent/models.json.
{
“Provider”: {
“hf job”: {
“Base URL”: “https://–8000.hf.jobs/v1”,
“Api”: “openai-completions”,
“API key”: “!hf authentication token”,
“Model”: (
{ “ID”: “Qwen/Qwen3.5-122B-A10B” }
)
}
}
}
Then start the agent for it.
Pi
Run a few commands to create a model and drive an interactive coding agent in the terminal.
HF job or inference endpoint?
HF Jobs is not the only way to service models on Hugging Face. Inference endpoints are managed products for the same job, and which one is right for you depends on your purpose.
If you want maximum flexibility and control, look no further than HF Job. It just runs Docker on your HF infrastructure, so you choose your image, your exact vllm serve flags, and hardware, and pay by the second as long as your job runs. So it’s perfect for experimentation, one-time evaluation, batch generation, or just trying out a model before committing to something.
If you want something more production-ready, use an inference endpoint. These provide additional operational benefits needed for long-term service. This means fine-grained access control (endpoints can be public, protected, or private), scale to zero, and no charges for periods of inactivity. If you’re standing up durable endpoints rather than running jobs, that’s the tool to leverage.
Read more
Although this post is specific to vLLM, the same port exposure pattern also works for OpenAI compatible servers. To serve GGUF using llama.cpp or run SGLang instead, see the Serving Models in Jobs guide that describes these backends.

