Complete Guide: Self-Host Mistral-24B on RTX 4090¶
Run a ChatGPT-class LLM on your own hardware - zero monthly costs, complete privacy
This guide shows you how to set up a high-performance, self-hosted LLM inference server on consumer hardware using my own AI workstation as an example.
Building your own is perfect for:

- Homelab enthusiasts running private AI infrastructure
- AI engineers who want a local AI coding assistant or are building agents
- Privacy-conscious developers who want data to stay local
- Anyone with a gaming PC (RTX 3090/4090/5090)
Total cost: $0/month after initial setup. No API fees, unlimited queries.
This repository documents the complete configuration for a vLLM inference server on Ubuntu 24.04 LTS, specifically tuned for an NVIDIA RTX 4090 (24GB) serving Mistral-Small-24B for autonomous agentic workflows (LangChain/LangGraph), as well as how to use the dual-GPU setup for fine-tuning.
Hardware Specifications¶
The system utilizes a split-role GPU strategy to manage mixed architectures (Ampere + Ada Lovelace) and power constraints.
- OS: Ubuntu 24.04.3 LTS
- Motherboard: Asus WS x299 Sage
- CPU: Intel® Core™ i9-10940X (14 Cores / 28 Threads, AVX-512)
- RAM: 256GB
- System Storage: 2TB SSD
- RAID Array: 16TB NVMe RAID 0 (HighPoint SSD7101A)
- PSU: Corsair HX1500i (1500W Platinum)
GPU Roles¶
| ID (PCI) | Model | VRAM | Architecture | Role | Notes |
|---|---|---|---|---|---|
| GPU 0 | RTX 3090 Ti | 24GB | Ampere | Training Pool | High transient power spikes (~650W). Idle during inference. |
| GPU 1 | RTX 4090 | 24GB | Ada Lovelace | Inference Host | Dedicated to vLLM. Uses awq_marlin kernels. |
Critical Power Warning: The combined peak transient load of a 3090 Ti + 4090 can exceed 1600W, risking a PSU OCP trip. Strict power limits must be applied before engaging both cards simultaneously (Training Mode).
1. Prerequisites & System Prep¶
BIOS Settings (ASUS WS X299 Sage)¶
Required for addressing 48GB total VRAM and maximizing PCIe throughput.
* Above 4G Decoding: Enabled
* Re-Size BAR Support: Enabled
* Launch CSM: Disabled
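To confirm from the OS that these settings took effect, one quick check (a sketch; the exact nvidia-smi output layout varies by driver version) is the BAR1 aperture size:
# With Resizable BAR active, BAR1 Total is reported in the tens of GB
# instead of the legacy 256 MiB window
nvidia-smi -q -d MEMORY | grep -A 3 "BAR1"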
Prerequisites¶
- NVIDIA Drivers: Version 550.x+
- Python: 3.12 (System Default on Ubuntu 24.04)
- uv: Fast Python package manager (curl -LsSf https://astral.sh/uv/install.sh | sh)
GPU Drivers¶
Ensure proprietary NVIDIA drivers are installed and loaded.
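If nvidia-smi or related commands are missing, install the driver first. A minimal sketch for Ubuntu 24.04 (driver selection via ubuntu-drivers; the explicit nvidia-driver-550 package name is an assumption, adjust to the current 550.x+ release):
# Check whether the proprietary driver is loaded
nvidia-smi
# If missing: let Ubuntu pick the recommended driver, or pin a specific 550.x+ package
sudo ubuntu-drivers install
# sudo apt install nvidia-driver-550   # assumed package name; verify with 'ubuntu-drivers list'
sudo reboot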
Python Environment¶
Ubuntu 24.04 ships with Python 3.12. Do not attempt to use Python 3.11 or older via PPAs, as this causes conflicts with system headers.
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or using Homebrew (macOS/Linux)
brew install uv
2. Inference Service (vLLM)¶
vLLM is configured as a systemd service, pinned strictly to GPU 1 (RTX 4090) to utilize the awq_marlin quantization kernel for maximum throughput (~60-65 t/s).
Installation¶
mkdir -p ~/vllm-serve && cd ~/vllm-serve
python3 -m venv venv
source venv/bin/activate
pip install vllm mistral-common
Note: mistral-common is required for proper Tool Calling support with Mistral models.
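A quick sanity check before wiring up systemd (a sketch; on this build the 4090 should appear as index 1 in PCI bus order):
# Confirm vLLM imports cleanly inside the venv
python -c "import vllm; print(vllm.__version__)"
# Confirm both GPUs are visible and note their indices
nvidia-smi -L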
Configuration (Systemd)¶
- File: /etc/systemd/system/vllm.service
Key Configuration Details:¶
Getting a 24B model to run efficiently on a 24GB card requires specific tuning.
- CUDA_DEVICE_ORDER=PCI_BUS_ID: Critical. Fixes a vLLM crash when using UUIDs.
- CUDA_VISIBLE_DEVICES=1: Pins the process to the RTX 4090.
- --quantization awq_marlin: Optimized 4-bit kernel for Ada Lovelace.
- --max-num-seqs 32: Prevents OOM on startup by reducing the concurrency buffer.
- --tokenizer-mode mistral: Enables native Tool Parsing capabilities (do not use a custom chat template with this mode).
- PYTHONUNBUFFERED=1: Ensures real-time logging.
[Unit]
Description=vLLM Inference Server
After=network.target
[Service]
# UPDATE 'User', 'WorkingDirectory', and the venv paths below to match your actual user and install location (e.g. ~/vllm-serve from the Installation step)
User=commander
WorkingDirectory=/home/commander/Sandbox/vllm-serve
Environment="PATH=/home/commander/Sandbox/vllm-serve/venv/bin:/usr/bin"
# --- GPU PINNING (Force 4090) ---
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
Environment="CUDA_VISIBLE_DEVICES=1"
# --- LOGGING CONFIG ---
Environment="PYTHONUNBUFFERED=1"
Environment="VLLM_LOGGING_LEVEL=INFO"
ExecStart=/home/commander/Sandbox/vllm-serve/venv/bin/vllm serve \
"stelterlab/Mistral-Small-24B-Instruct-2501-AWQ" \
--tokenizer "stelterlab/Mistral-Small-24B-Instruct-2501-AWQ" \
--tokenizer-mode mistral \
--quantization awq_marlin \
--dtype auto \
--max-model-len 16384 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 32 \
--host 0.0.0.0 \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser mistral \
--enable-log-requests \
--enable-log-outputs \
--disable-log-stats
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Explanation of Flags (Tuning Opportunities)¶
- --quantization awq_marlin: The standard awq kernel is slow on 40-series cards; awq_marlin boosts speed from ~6 t/s to ~60-65 t/s.
- --max-model-len 16384: Limits context to 16k tokens. 32k is possible but risks OOM (Out Of Memory) crashes on a 24GB card when the KV cache fills up.
- --gpu-memory-utilization 0.85: Leaves ~15% VRAM for the Desktop Environment and overhead. Setting this to 0.95 will crash during initialization.
- --max-num-seqs 32: The vLLM V1 engine attempts to pre-allocate memory for 256 concurrent users, causing an OOM crash on startup. Reducing this to 32 frees up enough VRAM to load the model.
- --enable-auto-tool-choice & --tool-call-parser mistral: Required for LangChain/LangGraph compatibility. Without these, the API returns a 400 Error when the agent tries to use tools.
Management¶
# Apply changes
sudo systemctl daemon-reload
# Start/Restart
sudo systemctl restart vllm
# View Logs (Real-time)
sudo journalctl -u vllm -f -o cat
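To start the service at boot and confirm the API is answering (port 8000 matches the unit file above; /v1/models is part of vLLM's OpenAI-compatible API):
# Enable at boot and start immediately
sudo systemctl enable --now vllm
# List the served model via the OpenAI-compatible endpoint
curl http://localhost:8000/v1/models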
Model Selection: The Perfect Fit for RTX 4090¶
We are using Mistral-Small-24B-Instruct-2501. This model represents the "sweet spot" for a single RTX 4090: it is the most intelligent model that fits comfortably within the 24GB VRAM limit while maintaining high-speed generation.
- Unquantized (FP16): Requires ~48GB VRAM. This would require dual GPUs or stepping down to a less capable 7B model.
- GGUF: While efficient for RAM, it is not optimized for vLLM's tensor parallelism and results in CPU-like speeds (~6 t/s).
- AWQ (4-bit): Compresses the model to ~14GB VRAM, leaving ~10GB for the KV Cache (Context Window). This allows us to run a 24B model at 60-65 tokens/sec, a feat previously impossible on single consumer cards.
Target Model: stelterlab/Mistral-Small-24B-Instruct-2501-AWQ
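Once the service is running, any OpenAI-compatible client can talk to it. A minimal curl sketch against the chat completions endpoint (the model name matches the one served above):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    "messages": [{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    "max_tokens": 100
  }'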
3. Fine-Tuning Environment (Multi-GPU)¶
Training uses both cards to pool VRAM (48GB Total). This requires a separate environment and strict safety protocols.
Power Safety Protocol (Mandatory)¶
Before starting any training run, you must cap the GPUs to prevent transient spikes from tripping the PSU.
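A sketch of that capping step (350W matches the value used in the job checklist below; adjust to your PSU headroom):
# Keep the driver resident so the limits stay applied between jobs
sudo nvidia-smi -pm 1
# Cap both cards to 350W (GPU 0 = 3090 Ti, GPU 1 = 4090 on this build)
sudo nvidia-smi -i 0 -pl 350
sudo nvidia-smi -i 1 -pl 350
# Verify the enforced limits
nvidia-smi -q -d POWER | grep -i "power limit"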
Installation (via uv)¶
For detailed multi-GPU training setup, environment configuration, and test scripts, see TEST_TRAINING.md.
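As a rough sketch of what that environment looks like (the directory name and package list here are assumptions typical for a LoRA-style fine-tune; the verified list is in TEST_TRAINING.md):
# Separate training environment, created with uv (paths/packages are illustrative)
mkdir -p ~/training && cd ~/training
uv venv --python 3.12
source .venv/bin/activate
uv pip install torch transformers accelerate peft bitsandbytes datasets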
Running a Job¶
- Stop Inference: sudo systemctl stop vllm (Frees up the 4090).
- Cap Power: sudo nvidia-smi -pl 350.
- Launch: accelerate launch train_script.py.
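A sketch of pointing accelerate at both cards explicitly (standard launcher flags; train_script.py is the placeholder from the step above):
# Use both GPUs (3090 Ti + 4090) for the run
accelerate launch --multi_gpu --num_processes 2 train_script.py
# Or configure the launcher interactively once and reuse the saved config
accelerate config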
4. Monitoring & Troubleshooting¶
GPU Monitoring (btop)¶
The standard Ubuntu snap for btop does not support NVIDIA GPUs. A custom binary was compiled from source (v1.3.2) to enable this.
- Run: btop
- Toggle GPU View: Press 5 on the keyboard.
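For reference, a sketch of that source build (the tag name and GPU_SUPPORT make option are assumptions based on btop's usual build flow):
# Build btop v1.3.2 from source with NVIDIA GPU support
sudo apt install -y build-essential git
git clone --branch v1.3.2 https://github.com/aristocratos/btop.git
cd btop
make GPU_SUPPORT=true
sudo make install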
Web Dashboard (Cockpit)¶
- URL: https://<SERVER_IP>:9090
- Usage: Used for viewing system logs (journalctl), storage health, and terminal access without SSH.
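If Cockpit is not yet installed, a minimal setup sketch (standard Ubuntu packaging; the web UI listens on port 9090 by default):
# Install Cockpit and enable its web socket
sudo apt install -y cockpit
sudo systemctl enable --now cockpit.socket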
5. Client Usage¶
For instructions on how to use the interactive chat client, please refer to README.md.