# Server Setup Guide
A step-by-step guide to setting up a vLLM inference server on your own hardware — from driver installation to systemd configuration and performance tuning.
**Reference Documentation**
This guide walks through the setup process step by step with explanations. For a concise reference of the server configuration, see the Server Setup Reference.
## Before You Start
### Hardware Requirements
vLLM requires an NVIDIA GPU with sufficient VRAM to hold the quantized model plus the KV cache. Here's what different GPUs can handle:
| GPU | VRAM | Max Model Size (4-bit) | Expected Speed |
|---|---|---|---|
| RTX 3060 | 12GB | ~7B parameters | 30-40 tok/s |
| RTX 3080 | 10GB | ~7B parameters | 35-45 tok/s |
| RTX 3090 | 24GB | ~24B parameters | 45-55 tok/s |
| RTX 4090 | 24GB | ~24B parameters | 60-65 tok/s |
| RTX 5090 | 32GB | ~32B parameters | 70+ tok/s |
Zorac's default model (Mistral-Small-24B-AWQ) requires a 24GB card. If you have less VRAM, use a smaller model like Mistral-7B.
Minimum system requirements:
- NVIDIA GPU with 10+ GB VRAM (24GB recommended)
- NVIDIA drivers version 550.x or higher
- 16GB+ system RAM
- ~30GB disk space (for model download and vLLM installation)
- Linux recommended (Ubuntu 22.04 or 24.04 LTS)
### Software Prerequisites
You need three things installed before setting up vLLM:
- NVIDIA drivers — Proprietary drivers (not Nouveau)
- Python 3.10+ — Ubuntu 24.04 ships with Python 3.12
- uv (recommended) or pip — For managing the Python environment
### Choosing a Model
Your model choice depends on your VRAM budget:
| VRAM | Recommended Model | Parameters | Quantization |
|---|---|---|---|
| 24GB | Mistral-Small-24B-AWQ | 24B | AWQ 4-bit |
| 16GB | Mistral-7B-AWQ | 7B | AWQ 4-bit |
| 12GB | Mistral-7B-AWQ | 7B | AWQ 4-bit |
| 10GB | Qwen2.5-7B-AWQ | 7B | AWQ 4-bit |
Look for models on Hugging Face with "AWQ" in the name — these are pre-quantized and ready for vLLM. The model will be downloaded automatically the first time you start the server.
## Installation
### NVIDIA Drivers
Verify that NVIDIA drivers are installed and working:
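```bash
nvidia-smi
```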
You should see output showing your GPU model, driver version (550.x+), and CUDA version (12.x). If the command is not found:
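```bash
# One way to install the proprietary driver on Ubuntu; the exact package name
# depends on your release, but any 550-series or newer driver works
sudo apt update
sudo apt install nvidia-driver-550
# or let Ubuntu pick the recommended driver:
# sudo ubuntu-drivers install
sudo reboot
```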
After rebooting, run nvidia-smi again to confirm.
### Python Environment with uv
We recommend using uv for managing the Python environment. It's faster than pip and handles virtual environments cleanly.
Create a dedicated directory and virtual environment for vLLM:
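The sketch below installs uv with its official installer and uses the same paths the systemd unit later in this guide expects (a vllm-serve directory in your home folder with the venv inside).

```bash
# Install uv (skip if it's already on your PATH)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Project directory and virtual environment
# (Python 3.12 matches Ubuntu 24.04's default)
mkdir -p ~/vllm-serve && cd ~/vllm-serve
uv venv venv --python 3.12
source venv/bin/activate
```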
### Installing vLLM
Install vLLM inside the virtual environment:
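```bash
# Run inside the activated virtual environment
uv pip install vllm
```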
For Mistral models with tool calling support, also install the Mistral tokenizer:
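```bash
# The Mistral tokenizer ships in the mistral_common package
uv pip install --upgrade mistral_common
```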
Verify the installation:
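```bash
python -c "import vllm; print(vllm.__version__)"
```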
## Configuration
### The vLLM Serve Command
The simplest way to start vLLM:
vllm serve "stelterlab/Mistral-Small-24B-Instruct-2501-AWQ" \
--quantization awq_marlin \
--max-model-len 16384 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 32 \
--host 0.0.0.0 \
--port 8000
The first run will download the model from Hugging Face (~14GB). Subsequent starts load from the local cache.
### Key Flags Explained
| Flag | Purpose | Why This Value |
|---|---|---|
| `--quantization awq_marlin` | Use Marlin-optimized AWQ kernels | 10x faster than generic AWQ on RTX 30/40 series |
| `--max-model-len 16384` | Maximum context length | 16k tokens balances context size vs VRAM usage |
| `--gpu-memory-utilization 0.85` | VRAM fraction to use | Leaves 15% headroom for OS/display driver |
| `--max-num-seqs 32` | Max concurrent request slots | Prevents OOM from vLLM's memory pre-allocation |
| `--host 0.0.0.0` | Listen on all network interfaces | Required for remote access from other machines |
| `--port 8000` | HTTP port | vLLM's default, matches Zorac's default URL |
**Don't Forget awq_marlin**
Using --quantization awq (without _marlin) falls back to a generic kernel that's ~10x slower. This is the single most impactful configuration flag.
### Creating a Systemd Service
For a server that starts automatically on boot and restarts on crashes, create a systemd service:
[Unit]
Description=vLLM Inference Server
After=network.target
[Service]
User=your-username
WorkingDirectory=/home/your-username/vllm-serve
Environment="PATH=/home/your-username/vllm-serve/venv/bin:/usr/bin"
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="PYTHONUNBUFFERED=1"
ExecStart=/home/your-username/vllm-serve/venv/bin/vllm serve \
"stelterlab/Mistral-Small-24B-Instruct-2501-AWQ" \
--quantization awq_marlin \
--max-model-len 16384 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 32 \
--host 0.0.0.0 \
--port 8000
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Replace your-username with your actual username, and adjust CUDA_VISIBLE_DEVICES if you have multiple GPUs (use the GPU ID shown by nvidia-smi).
Enable and start the service:
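The commands below assume the unit file was saved as /etc/systemd/system/vllm.service.

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now vllm
journalctl -u vllm -f   # watch the logs while the model loads
```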
### Environment Variables
Key environment variables in the systemd service:
| Variable | Purpose |
|---|---|
| `CUDA_DEVICE_ORDER=PCI_BUS_ID` | Ensures GPU numbering matches nvidia-smi output. Critical for multi-GPU systems. |
| `CUDA_VISIBLE_DEVICES=0` | Pins vLLM to a specific GPU. Change the number for multi-GPU setups. |
| `PYTHONUNBUFFERED=1` | Ensures log output appears immediately in journalctl. |
## Performance Tuning
### GPU Memory Utilization
The --gpu-memory-utilization flag controls what fraction of total VRAM vLLM can use:
| Value | Available VRAM (24GB card) | Use Case |
|---|---|---|
| `0.95` | 22.8 GB | Dedicated server (no desktop) — may crash on init |
| `0.90` | 21.6 GB | Headless server recommended |
| `0.85` | 20.4 GB | Server with desktop environment (Zorac default) |
| `0.80` | 19.2 GB | Conservative, leaves room for other GPU tasks |
**Don't Set Too High**
Setting --gpu-memory-utilization above 0.90 on a system with a desktop environment often causes initialization crashes. The display driver needs some VRAM.
### Max Model Length vs VRAM
Longer context windows consume more VRAM for the KV cache. For Mistral-Small-24B on a 24GB card:
| Context Length | KV Cache Size | Fits with 0.85 util? |
|---|---|---|
| 8,192 | ~4 GB | Comfortable |
| 16,384 | ~8 GB | Good balance (default) |
| 32,768 | ~16 GB | Tight — may OOM under load |
If you're hitting OOM errors during long conversations, reduce --max-model-len to 8192.
### Concurrent Request Limits
vLLM's V1 engine pre-allocates memory for concurrent request slots at startup. The default (256) assumes a production server with substantial VRAM headroom. On a 24GB consumer card, this can cause an OOM crash before the server even starts.
Setting --max-num-seqs 32 reduces this pre-allocation to a reasonable level for personal use. For a single-user Zorac setup, even --max-num-seqs 8 would be sufficient.
### Monitoring with nvidia-smi
Monitor GPU usage during operation:
# Real-time monitoring (updates every second)
watch -n1 nvidia-smi
# Show just memory usage
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu \
--format=csv -l 1
Healthy metrics during Zorac chat:
- Memory used: ~20-22 GB (model + KV cache)
- GPU utilization: Spikes to 80-100% during generation, drops to 0% between messages
- Temperature: 50-70°C under load (check your cooling)
## Connecting Zorac
### Setting VLLM_BASE_URL
If vLLM is running on the same machine as Zorac, the default URL works:
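```text
http://localhost:8000/v1
```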
For a remote server on your local network, use the server's IP address:
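```text
http://192.168.1.100:8000/v1
```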
Set this in Zorac's .env file, via the /config command, or as an environment variable:
# Environment variable
VLLM_BASE_URL=http://192.168.1.100:8000/v1 zorac
# Or via /config in Zorac
/config set VLLM_BASE_URL http://192.168.1.100:8000/v1
### Testing the Connection
Before starting Zorac, verify the server is responding:
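```bash
# Substitute the server's IP if vLLM runs on another machine
curl http://localhost:8000/v1/models
```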
You should see a JSON response listing the loaded model. When Zorac starts, it also runs a connection check automatically and reports the result.
### Remote Server Access
If the vLLM server is on a different machine (e.g., a homelab server), ensure:
- vLLM is listening on all interfaces — Use `--host 0.0.0.0`, not `--host 127.0.0.1`
- Firewall allows port 8000 — `sudo ufw allow 8000` on Ubuntu
- Network is reachable — The client machine can reach the server's IP
**No Authentication Required**
vLLM doesn't require API keys by default. Zorac uses VLLM_API_KEY=EMPTY as a placeholder. If you're exposing the server to a broader network, consider adding a reverse proxy with authentication.
## Troubleshooting
### OOM Crashes on Startup
Symptom: vLLM crashes immediately with "CUDA out of memory" during initialization.
Common causes:
- `--max-num-seqs` too high — Reduce to 32 or lower
- `--gpu-memory-utilization` too high — Try 0.85 or 0.80
- `--max-model-len` too large — Reduce to 8192
- Another process using GPU memory — Check with `nvidia-smi`
Fix: Start with conservative values and increase gradually:
vllm serve "model" \
--quantization awq_marlin \
--max-model-len 8192 \
--gpu-memory-utilization 0.80 \
--max-num-seqs 8
### Slow Inference (Wrong Kernel)
Symptom: Generation speed is ~6 tok/s instead of the expected 60+ tok/s.
Cause: Using the wrong quantization flag. The generic AWQ kernel is much slower than the Marlin-optimized version.
Fix: Ensure you're using --quantization awq_marlin (not --quantization awq):
# Wrong (slow)
vllm serve "model" --quantization awq
# Right (fast)
vllm serve "model" --quantization awq_marlin
### Connection Refused
Symptom: Zorac shows "Connection failed!" or curl returns "Connection refused."
Checklist:
- Is vLLM running? — `systemctl status vllm` or check for the process
- Is it listening on the right port? — `ss -tlnp | grep 8000`
- Is it bound to the right interface? — `--host 0.0.0.0` for remote access, not `--host 127.0.0.1`
- Is the model loaded? — Check logs with `journalctl -u vllm -f`. Model loading can take 30-60 seconds.
- Firewall? — `sudo ufw status` and ensure port 8000 is allowed