NOW LET US – AI RAG SaaS Studio TP.HCM

Run a vLLM Server on HF Jobs in One Command


Learn how to quickly spin up a vLLM server on Hugging Face Jobs using a single command, complete with security, scaling, and debugging tips.

Running vLLM on HF Jobs is the quickest way to stand up a model for tests, evals, or batch generation. (If you're after a managed, production-ready service instead, that's what Inference Endpoints are for; more on when to pick which at the end.)

Here's the whole thing end to end.

Prerequisites

  • A payment method or a positive prepaid credit balance (Jobs is pay-as-you-go, billed by hardware usage).
  • huggingface_hub >= 1.20.0 installed:
    pip install -U "huggingface_hub>=1.20.0"
    
  • Logged in locally:
    hf auth login
    

Run a vLLM Server in One Command

hf jobs run is docker run for HF infrastructure. We use the official vllm/vllm-openai image, ask for a GPU with --flavor, and expose vLLM's port with --expose:

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
vllm/vllm-openai:latest \
vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

--expose 8000 routes the container's port through HF's public jobs proxy (see the Serve Models guide for the full reference). The command prints the URL your server is reachable at:

✓ Job started
id: 6a381ca1953ed90bfb947332
url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332
Hint: Exposed ports are reachable at (requires an HF token with read access to the job):
https://6a381ca1953ed90bfb947332--8000.hf.jobs

6a381ca1953ed90bfb947332 is your job ID. Keep track of it; every later step needs it. We'll use <job_id> as a placeholder for it in the rest of the post.
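
If you script the launch, the job ID and the exposed URL can be derived from the CLI output instead of being copied by hand. The sketch below assumes the output keeps the `id: ...` line and the `<job_id>--<port>.hf.jobs` URL pattern shown above; the helper names `parse_job_id` and `exposed_url` are made up for this illustration and are not part of `huggingface_hub`.

```python
import re


def parse_job_id(cli_output: str) -> str:
    """Pull the job ID out of the `id: ...` line printed by `hf jobs run`."""
    match = re.search(r"^id:\s*(\S+)", cli_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'id:' line found in the CLI output")
    return match.group(1)


def exposed_url(job_id: str, port: int = 8000) -> str:
    """Build the proxy URL from the <job_id>--<port>.hf.jobs pattern."""
    return f"https://{job_id}--{port}.hf.jobs"


# Sample text copied from the output shown above.
sample_output = """✓ Job started
id: 6a381ca1953ed90bfb947332
url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332
"""

job_id = parse_job_id(sample_output)
print(exposed_url(job_id))  # https://6a381ca1953ed90bfb947332--8000.hf.jobs
```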

Give it a couple of minutes to download weights and boot. When the logs show Application startup complete, you're live.

Accessing the OpenAI-Compatible API

vLLM speaks the OpenAI API, and every request just needs your HF token as a bearer token. The quickest way to hit it is curl:

curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
-H "Authorization: Bearer $(hf auth token)" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-4B",
"messages": [{"role": "user", "content": "Hello!"}],
"chat_template_kwargs": {"enable_thinking": false}
}'

which returns the usual OpenAI-style JSON, with choices[0].message.content holding "Hello! How can I assist you today? 😊".

Or, from Python, point the OpenAI client at the exposed URL and pass the token as the API key:

from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(
    base_url="https://<job_id>--8000.hf.jobs/v1",
    api_key=get_token(),
)
resp = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
# Hello! How can I assist you today? 😊

Quick health check before you start: curl https://<job_id>--8000.hf.jobs/v1/models -H "Authorization: Bearer $(hf auth token)" should list the model.
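
Because the server needs a couple of minutes to boot, it can help to poll that health check rather than retry by hand. Here is a minimal sketch using only the standard library; the function names, the 30 attempts, and the 10-second delay are arbitrary choices for illustration, not documented defaults.

```python
import time
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str, token: str) -> bool:
    """Return True if GET <base_url>/v1/models answers 200 with the bearer token."""
    request = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except OSError:
        # Covers URLError/HTTPError, timeouts, and refused connections.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Offline demo with a fake probe that succeeds on the third try.
fake_answers = iter([False, False, True])
print(wait_until_ready(lambda: next(fake_answers), delay_s=0.0))  # True

# Against a real job (needs a running server and a valid token):
# wait_until_ready(lambda: models_endpoint_ok("https://<job_id>--8000.hf.jobs", token))
```

Passing the probe as a callable keeps the retry loop testable without network access.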

Security and Cost Management

🔐 The endpoint is gated, not public. Every request must carry an HF token with read access to the job's namespace. A plain browser visit will be rejected. In effect, the jobs proxy is your API gate: access is scoped to you (and your org). That's fine for private use, but treat the URL accordingly: don't share it expecting it to be open, and don't paste your token into untrusted places. If you need finer-grained or public access, put a proper gateway in front instead.

Jobs are billed per second, so stop the server when you're done:

hf jobs cancel <job_id>

The --timeout you set is a safety net (the job auto-stops when it expires), but cancelling as soon as you're done stops billing sooner. An a10g-large runs at $1.50/hour; check hf jobs hardware for the full price list and pick the smallest flavor that fits your model.
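
To make the difference concrete, here is the arithmetic as a small sketch. It uses the $1.50/hour a10g-large rate quoted above and ignores billing granularity and rounding, so treat the result as an estimate only.

```python
A10G_LARGE_USD_PER_HOUR = 1.50  # rate quoted above; confirm with `hf jobs hardware`


def cost_usd(runtime_minutes: float, usd_per_hour: float) -> float:
    """Estimated cost of a job that ran for `runtime_minutes`."""
    return runtime_minutes * usd_per_hour / 60


# Worst case: the job runs until the 2h timeout stops it.
print(cost_usd(120, A10G_LARGE_USD_PER_HOUR))  # 3.0

# Cancelling after 30 minutes of actual use.
print(cost_usd(30, A10G_LARGE_USD_PER_HOUR))  # 0.75
```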

Scaling to Larger Models

The same command scales to much larger models — pick a beefier --flavor and tell vLLM to shard the model across the GPUs with --tensor-parallel-size. For example, the 122B Qwen3.5 mixture-of-experts model on 2× H200:

hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
vllm/vllm-openai:latest \
vllm serve Qwen/Qwen3.5-122B-A10B \
--host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
--max-model-len 32768 --max-num-seqs 256

--tensor-parallel-size should match the number of GPUs in the flavor (h200x2 → 2, h200x8 → 8). Run hf jobs hardware to see what's available and give bigger models a longer --timeout, since they take longer to download and load. For large models, H200 flavors are usually the best value.
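
If you launch jobs from a script, the matching rule can be derived from the flavor name. This sketch assumes multi-GPU flavors end in `x<N>` (as in h200x2 and h200x8) and that any flavor without that suffix has a single GPU; that naming convention is an assumption, so check `hf jobs hardware` before relying on it. The helper names are hypothetical.

```python
import re


def gpus_in_flavor(flavor: str) -> int:
    """Guess the GPU count from a trailing `x<N>` in the flavor name (assumed convention)."""
    match = re.search(r"x(\d+)$", flavor)
    return int(match.group(1)) if match else 1


def tensor_parallel_args(flavor: str) -> list:
    """Return the vLLM flag pair to append to `vllm serve`, or nothing for one GPU."""
    gpus = gpus_in_flavor(flavor)
    return ["--tensor-parallel-size", str(gpus)] if gpus > 1 else []


for flavor in ("h200x2", "h200x8", "a10g-large"):
    print(flavor, tensor_parallel_args(flavor))
```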

The --max-model-len 32768 --max-num-seqs 256 flags are specific to this model: Qwen3.5-122B is a hybrid Mamba/attention architecture with a 256K-token default context, which doesn't leave enough memory for vLLM's default batch settings. Capping the context length and concurrent-sequence count keeps it within the GPUs' memory. If a model fails to start with an out-of-memory or cache-block error, dialing these two down is the first thing to try. Everything else (the exposed URL, the OpenAI client, the token auth) stays exactly the same.
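
A back-of-the-envelope calculation shows why memory is tight on this setup. The sketch assumes the weights are stored unquantized at 2 bytes per parameter (bf16), that each H200 has 141 GB of memory, and that vLLM is left at its default `--gpu-memory-utilization` of 0.9. It ignores activations, CUDA graphs, and other overhead, so the real headroom is smaller than what it prints.

```python
H200_MEMORY_GB = 141  # per-GPU memory of an H200


def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight size in GB (1 billion params * N bytes = N GB)."""
    return params_billion * bytes_per_param


def headroom_gb(
    params_billion: float,
    num_gpus: int,
    gpu_memory_gb: float = H200_MEMORY_GB,
    gpu_memory_utilization: float = 0.9,
    bytes_per_param: float = 2.0,
) -> float:
    """Memory left for KV cache and everything else after loading the weights."""
    budget = num_gpus * gpu_memory_gb * gpu_memory_utilization
    return budget - weight_memory_gb(params_billion, bytes_per_param)


# 122B parameters on 2x H200, under the assumptions above.
print(round(weight_memory_gb(122), 1))  # 244.0
print(round(headroom_gb(122, num_gpus=2), 1))  # 9.8
```

With only a few GB to spare under these assumptions, the cache needed for long contexts and many concurrent sequences is what has to shrink, which is what the two flags do.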

Building a Chat UI with Gradio

Prefer a chat window over curl? A few lines of Gradio point at the same endpoint. Add --reasoning-parser deepseek_r1 to the vllm serve command so Qwen3's thinking comes back as a separate field (not necessary, but helpful), then run this code locally (you'll just need the job ID):

import gradio as gr
from gradio import ChatMessage
from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(base_url="https://<job_id>--8000.hf.jobs/v1", api_key=get_token())


def chat(message, history):
    # Replay earlier turns, skipping the "Thinking" panels (they carry metadata).
    messages = [{"role": m["role"], "content": m["content"]} for m in history if not m.get("metadata")]
    messages.append({"role": "user", "content": message})
    stream = client.chat.completions.create(model="Qwen/Qwen3-4B", messages=messages, stream=True)
    thinking, answer = "", ""
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no delta
        delta = chunk.choices[0].delta
        # The reasoning text arrives as an extra field; guard against a missing or null value.
        thinking += (delta.model_extra or {}).get("reasoning") or ""
        answer += delta.content or ""
        out = []
        if thinking.strip():
            status = "done" if answer.strip() else "pending"
            out.append(ChatMessage(role="assistant", content=thinking, metadata={"title": "💭 Thinking", "status": status}))
        if answer.strip():
            out.append(ChatMessage(role="assistant", content=answer))
        yield out


gr.ChatInterface(chat).launch()

Run it, open http://127.0.0.1:7860, and chat — reasoning streams into the collapsible panel, the answer below.
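
The core of that handler is the accumulation loop, which can be tried without Gradio or a running server. In this sketch plain dicts stand in for `chunk.choices[0].delta`, the sample deltas are invented, and the `reasoning` key mirrors the field name used in the code above.

```python
from typing import Iterable, Iterator, Optional, Tuple


def accumulate(deltas: Iterable[dict]) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Yield (thinking, answer, status) after each streamed delta.

    status is None until any reasoning arrives, "pending" while only
    reasoning has been seen, and "done" once answer text starts.
    """
    thinking, answer = "", ""
    for delta in deltas:
        thinking += delta.get("reasoning") or ""
        answer += delta.get("content") or ""
        status = None
        if thinking.strip():
            status = "done" if answer.strip() else "pending"
        yield thinking, answer, status


# Invented deltas: two reasoning fragments, then two answer fragments.
fake_deltas = [
    {"reasoning": "The user "},
    {"reasoning": "said hi."},
    {"content": "Hello"},
    {"content": "!", "reasoning": None},
]
for state in accumulate(fake_deltas):
    print(state)
```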

Interactive Debugging via SSH

Need to debug a startup failure, watch GPU memory, or tail logs interactively? You can open a shell straight into the running job. Launch it with --ssh and make sure your public key is registered at huggingface.co/settings/keys:

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h --ssh \
vllm/vllm-openai:latest \
vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

then connect with the job ID:

hf jobs ssh <job_id>

You're now inside the container, where you can run nvidia-smi, inspect the process, or poke at the model directly — which makes debugging and monitoring much easier than reading logs from the outside. SSH support requires huggingface_hub >= 1.20.0.
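
The launch commands in this post differ only in a handful of flags, so a small helper can assemble them, with --ssh as an opt-in. This is a sketch: `jobs_run_argv` is a hypothetical name, and it uses only the flags that appear in the commands above.

```python
import shlex


def jobs_run_argv(image, command, flavor, port=8000, timeout="2h", ssh=False):
    """Assemble the `hf jobs run` argument list used throughout this post."""
    argv = ["hf", "jobs", "run", "--flavor", flavor, "--expose", str(port), "--timeout", timeout]
    if ssh:
        argv.append("--ssh")  # enables `hf jobs ssh <job_id>` later
    argv.append(image)
    argv.extend(command)
    return argv


argv = jobs_run_argv(
    image="vllm/vllm-openai:latest",
    command=["vllm", "serve", "Qwen/Qwen3-4B", "--host", "0.0.0.0", "--port", "8000"],
    flavor="a10g-large",
    ssh=True,
)
print(shlex.join(argv))

# To actually launch (this starts a billed job):
# import subprocess; subprocess.run(argv, check=True)
```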

Integrating with Coding Agents

The same endpoint can back a terminal coding agent. Pi is a provider-agnostic agent harness. Point it at the job and you get a Read/Write/Edit/Bash agent running on your own self-hosted model.

One thing to set up first: agents drive the model through tool calls, and vLLM only accepts those if the server is launched with tool calling enabled. So relaunch with --enable-auto-tool-choice and a --tool-call-parser matching the model family (hermes for Qwen3). Agents also benefit from a stronger model, so this is a good place to bring in the bigger one:

hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
vllm/vllm-openai:latest \
vllm serve Qwen/Qwen3.5-122B-A10B \
--host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
--max-model-len 32768 --max-num-seqs 256 \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice --tool-call-parser hermes
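
To see what a tool call looks like on the wire, here is a sketch of a request body in the OpenAI `tools` format and a parser for the reply. The `read_file` tool and the sample response are invented for illustration; they are not something Pi or vLLM defines, and the response is hand-written rather than captured from a server.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" format.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a text file.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_request(model: str, prompt: str) -> dict:
    """Chat-completions body that offers the tool and lets the model decide."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [READ_FILE_TOOL],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list:
    """Return (tool name, parsed arguments) pairs from a chat-completions response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string, not a dict.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hand-written sample shaped like an OpenAI-style tool-call response.
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "read_file", "arguments": "{\"path\": \"README.md\"}"},
            }],
        }
    }]
}
print(extract_tool_calls(sample_response))  # [('read_file', {'path': 'README.md'})]
```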

Then add the job as a custom provider in ~/.pi/agent/models.json:

{
  "providers": {
    "hf-jobs": {
      "baseUrl": "https://<job_id>--8000.hf.jobs/v1",
      "apiKey": "YOUR_HF_TOKEN"
    }
  }
}
© 2026 Now Let Us. All rights reserved.

Source: Hugging Face Blog
