Multi-GPU Training¶
How to use multiple consumer GPUs for fine-tuning LLMs — covering power management, environment setup, and running training jobs alongside an inference server.
Reference Documentation
This guide explains the concepts and reasoning behind multi-GPU training on consumer hardware. For the specific commands and configuration used in Zorac's setup, see the Server Setup Reference.
Why Fine-Tune?¶
When Fine-Tuning Makes Sense¶
Pre-trained models like Mistral-Small-24B are general-purpose — they perform well across a wide range of tasks. But sometimes you need a model that excels at a specific domain:
- Custom writing style — Training on your own writing to match your voice
- Domain-specific knowledge — Legal documents, medical terminology, internal codebases
- Specialized tasks — Structured data extraction, specific code patterns, custom output formats
- Behavior alignment — Adjusting how the model responds to certain types of queries
Fine-tuning takes a pre-trained model and continues training on your specific data, adapting its behavior without starting from scratch.
LoRA vs Full Fine-Tuning¶
There are two main approaches to fine-tuning:
| Approach | Memory Required | Training Time | Quality |
|---|---|---|---|
| Full fine-tuning | ~4x model size | Hours to days | Best, but risks overfitting |
| LoRA (Low-Rank Adaptation) | ~1.2x model size | Minutes to hours | Very good for most tasks |
Full fine-tuning updates all model parameters. For a 24B model at FP16, the weights and gradients alone take roughly 96GB, and optimizer states push the requirement higher still — far more than fits on consumer GPUs even with multi-GPU setups.
LoRA freezes the original model weights and adds small trainable "adapter" layers (typically 0.1-1% of the total parameters). This dramatically reduces memory requirements:
Full fine-tuning (24B FP16): ~96 GB minimum
LoRA fine-tuning (24B FP16): ~48 GB (fits on 2x 24GB GPUs)
LoRA + QLoRA (24B 4-bit): ~24 GB (fits on a single 24GB GPU)
For consumer hardware, LoRA (or QLoRA, which combines LoRA with quantization) is the practical choice.
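As a concrete sketch of what QLoRA looks like in code, the snippet below loads the base model in 4-bit and attaches LoRA adapters with Hugging Face transformers and peft. The rank, alpha, dropout, and target module names are illustrative defaults rather than values from this guide.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit quantization (QLoRA); the quantized weights stay frozen
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-24B-Instruct-2501",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable LoRA adapters (typically 0.1-1% of the total parameters)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only a tiny fraction of weights will train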
Hardware Requirements¶
Multi-GPU fine-tuning pools VRAM from multiple cards. For Zorac's setup:
GPU 0: RTX 3090 Ti (24GB) — Training pool
GPU 1: RTX 4090 (24GB) — Training pool (normally inference)
───────────────────────────
Total: 48GB VRAM
With 48GB pooled VRAM, you can fine-tune models up to ~24B parameters using LoRA at FP16, or full fine-tune models up to ~7B parameters.
You Don't Need Two GPUs
QLoRA (quantized LoRA) can fine-tune a 24B model on a single 24GB GPU. Multi-GPU just gives you more headroom and faster training. Many fine-tuning tasks work perfectly well with a single card.
Power Safety¶
Understanding Transient Power Spikes¶
Consumer GPUs are designed for gaming, where power draw is relatively steady. Training workloads create transient power spikes — brief surges that can exceed the GPU's rated TDP by 50% or more.
A single RTX 4090 has a 450W TDP but can spike to ~600W during training. An RTX 3090 Ti (450W TDP) can spike to ~650W. Running both simultaneously:
RTX 3090 Ti transient peak: ~650W
RTX 4090 transient peak: ~600W
CPU + system: ~200W
──────────────────────────────────
Worst case total: ~1450W
A 1500W PSU has overcurrent protection (OCP) that trips when power draw exceeds its rated capacity. Without power limits, simultaneous training on both GPUs can push transient load past that threshold and trip the PSU, causing an immediate system shutdown.
Setting Power Limits with nvidia-smi¶
Before starting any multi-GPU training run, cap both GPUs:
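# Cap sustained power draw on all installed GPUs (applies to both cards)
sudo nvidia-smi -pl 350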
This limits each GPU's sustained power draw to 350W, keeping the combined load well within PSU capacity:
2x GPUs at 350W cap: 700W
CPU + system: 200W
─────────────────────────
Total: 900W (safe on 1500W PSU)
Always Set Power Limits Before Training
This is not optional. Failing to set power limits on a multi-GPU consumer system risks PSU trips that can corrupt data and potentially damage hardware. Make it the first step of every training session.
To check current power limits:
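# Reports current, default, and max power limits for each GPU
nvidia-smi -q -d POWER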
PSU Sizing for Multi-GPU¶
If you're building a multi-GPU training system, size your PSU with transient spikes in mind:
| GPU Configuration | Minimum PSU | Recommended PSU |
|---|---|---|
| Single RTX 4090 | 850W | 1000W |
| RTX 3090 + RTX 4090 | 1200W | 1500W |
| 2x RTX 4090 | 1300W | 1600W |
Always use a high-quality 80+ Platinum or Titanium rated PSU. Lower-quality units have less headroom in their protection circuitry and are more likely to trip under transient loads.
Environment Setup¶
Separate Virtual Environments¶
Training dependencies (PyTorch, Transformers, PEFT, Accelerate) are different from inference dependencies (vLLM). Keep them in separate virtual environments to avoid version conflicts:
# Inference environment (already set up)
~/vllm-serve/venv/
# Training environment (new)
mkdir -p ~/training && cd ~/training
python3 -m venv venv
source venv/bin/activate
Installing Training Dependencies¶
Install the core training stack:
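# Inside the training venv; a CUDA-enabled torch wheel may require the matching --index-url
pip install torch transformers datasets accelerate peft bitsandbytes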
| Package | Purpose |
|---|---|
| `torch` | PyTorch — the deep learning framework |
| `transformers` | Hugging Face model loading and training utilities |
| `datasets` | Dataset loading and preprocessing |
| `accelerate` | Multi-GPU training orchestration |
| `peft` | Parameter-Efficient Fine-Tuning (LoRA, QLoRA) |
| `bitsandbytes` | 4-bit quantization for QLoRA |
Accelerate Configuration¶
Hugging Face's accelerate library handles multi-GPU coordination. Run the configuration wizard:
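accelerate config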
Answer the prompts:
- Compute environment: This machine
- Machine type: multi-GPU
- Number of GPUs: 2
- Mixed precision: fp16 (or bf16 if supported)
This creates a configuration file at ~/.cache/huggingface/accelerate/default_config.yaml that tells accelerate how to distribute work across your GPUs.
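The exact fields vary by accelerate version, but for the answers above the file looks roughly like this:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: all
machine_rank: 0
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false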
To verify the setup:
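# One way to check: print the accelerate environment and the GPU count PyTorch sees
accelerate env
python -c "import torch; print(torch.cuda.device_count())"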
This should show both GPUs and the selected configuration.
Running Training Jobs¶
Stopping the Inference Server¶
Training uses all available GPU memory. The vLLM inference server must be stopped first to free GPU 1 (the RTX 4090):
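# Stop the inference service (it gets started again after training)
sudo systemctl stop vllm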
Verify both GPUs are free:
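# Both GPUs should show near-zero memory use and no running processes
nvidia-smi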
Launching with Accelerate¶
Use accelerate launch to distribute training across both GPUs:
# Activate the training environment
source ~/training/venv/bin/activate
# Set power limits FIRST
sudo nvidia-smi -pl 350
# Launch training
accelerate launch train_script.py \
--model_name "mistralai/Mistral-Small-24B-Instruct-2501" \
--dataset "your-dataset" \
--output_dir "./output" \
--num_epochs 3 \
--learning_rate 2e-5 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 4
accelerate launch automatically:
- Detects both GPUs from your accelerate config
- Spawns a training process on each GPU
- Handles gradient synchronization between GPUs
- Manages mixed precision (FP16) training
Monitoring Training Progress¶
During training, monitor GPU utilization and power:
# Terminal 1: GPU monitoring
watch -n1 nvidia-smi
# Terminal 2: Training logs (if using wandb or tensorboard)
tensorboard --logdir ./output/runs
Healthy training metrics:
- GPU memory: Both GPUs with 80-95% of VRAM in use
- Power draw: Both under your set power limit (350W)
- Temperature: 70-85°C (check cooling if above 85°C)
- GPU utilization: 90-100% during forward/backward pass, brief drops during data loading
Training Speed
Fine-tuning a 24B model with LoRA across two 24GB GPUs processes roughly 1-5 samples per second depending on sequence length and batch size. A small dataset (1000 samples, 3 epochs) typically completes in under an hour.
Returning to Inference¶
Restarting vLLM¶
After training completes, restart the inference server:
# Remove power limits (return to default)
sudo nvidia-smi -pl 450  # default TDP for both the RTX 4090 and RTX 3090 Ti
# Restart vLLM
sudo systemctl start vllm
# Verify it's running
sudo journalctl -u vllm -f -o cat
Wait for the log message indicating the model has loaded and the server is ready (typically 30-60 seconds).
Testing the Fine-Tuned Model¶
If you fine-tuned with LoRA, you have adapter weights that can be merged with the base model or loaded separately:
Option 1: Merge and serve the merged model
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model in FP16 and attach the trained LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained("base-model-name", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, "./output/checkpoint-final")
# Fold the adapter weights into the base weights and save a standalone model
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
# Save the tokenizer too, so the merged directory can be served directly
AutoTokenizer.from_pretrained("base-model-name").save_pretrained("./merged-model")
Then serve the merged model with vLLM:
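# In Zorac's setup the vllm systemd unit would be pointed at the merged directory instead;
# for a quick manual test, recent vLLM releases also provide a direct CLI:
vllm serve ./merged-model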
Option 2: Update the Zorac model configuration
If you're serving a new model, update Zorac to point to it:
Test the fine-tuned model with prompts from your training domain to verify the fine-tuning worked as expected.