# NVIDIA MPS Guide for Ollama GPU Optimization

## 🚀 Overview

NVIDIA Multi-Process Service (MPS) allows multiple processes to share a single GPU context, eliminating expensive context-switching overhead and dramatically improving concurrent workload performance.

This guide documents our discovery: **MPS transforms the DGX Spark from a single-threaded bottleneck into a high-throughput powerhouse**, achieving **3x concurrent performance** with near-perfect scaling.

## 📊 Performance Results Summary

### Triple Extraction Benchmark (llama3.1:8b)

| System | Mode | Individual Performance | Aggregate Throughput | Scaling Efficiency |
|--------|------|------------------------|----------------------|--------------------|
| **RTX 5090** | Single | ~300 tok/s | 300 tok/s | 100% (baseline) |
| **Mac M4 Pro** | Single | ~45 tok/s | 45 tok/s | 100% (baseline) |
| **DGX Spark** | Single (MPS) | 33.3 tok/s | 33.3 tok/s | 100% (baseline) |
| **DGX Spark** | 2x Concurrent | ~33.2 tok/s each | **66.4 tok/s** | **97% efficiency** |
| **DGX Spark** | 3x Concurrent | ~33.1 tok/s each | **99.4 tok/s** | **99% efficiency** |

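Scaling efficiency here is just aggregate throughput divided by the ideal linear scale-up of the single-stream baseline. A quick sanity check in bash, using the numbers from the table above:

```bash
# efficiency (%) = 100 * aggregate / (streams * single_stream_rate)
single=33.3; aggregate=99.4; streams=3
echo "scale=1; 100 * $aggregate / ($streams * $single)" | bc   # -> 99.4 (~99%)
```
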
### 🏆 Key Achievement

**With MPS, the DGX Spark triples its aggregate throughput under concurrent load, sustaining 99.4 tok/s across three streams at 99% scaling efficiency.**

## 🛠️ MPS Setup Instructions

### 1. Start MPS Server

```bash
# Set the MPS pipe directory
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
mkdir -p /tmp/nvidia-mps

# Start the MPS control daemon
sudo env "CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps" nvidia-cuda-mps-control -d
```

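Once the daemon is up, you can confirm it is accepting control commands. `get_server_list` is a documented `nvidia-cuda-mps-control` command that prints the PIDs of running MPS servers (the list may be empty until the first CUDA client connects):

```bash
# Ask the control daemon for the PIDs of running MPS servers
echo get_server_list | sudo env "CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps" nvidia-cuda-mps-control
```
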
### 2. Restart Ollama with MPS Support

```bash
# Stop the current Ollama stack
cd /path/to/ollama
docker compose down

# Start Ollama with the MPS environment
sudo env "CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps" docker compose up -d
```

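For a containerized client to reach the MPS daemon, the pipe directory must be visible inside the container and the container generally needs to share the host's IPC namespace. A minimal sketch with plain `docker run` (image, port, and paths are illustrative; your `docker-compose.yml` would need the equivalent `volumes`, `environment`, and `ipc` entries):

```bash
# Hypothetical standalone run showing only the MPS-relevant flags
docker run -d --gpus all --ipc=host \
  -v /tmp/nvidia-mps:/tmp/nvidia-mps \
  -e CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
  -p 11434:11434 ollama/ollama
```
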
### 3. Verify MPS is Working

```bash
# Check MPS processes
ps aux | grep mps

# Expected output:
# root nvidia-cuda-mps-control -d
# root nvidia-cuda-mps-server -force-tegra

# Check that Ollama processes show the M+C flag
nvidia-smi
# Look for M+C in the Type column for Ollama processes
```

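`nvidia-smi` also exposes the compute process list in a script-friendly form, which is handy for spot-checking that every Ollama runner appears as a client of the shared context:

```bash
# List GPU compute processes (PID, name, memory) as CSV
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```
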
### 4. Stop MPS (when needed)

```bash
# The control daemon takes commands on stdin
echo quit | sudo env "CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps" nvidia-cuda-mps-control
```

## 🔬 Technical Architecture

### CUDA MPS Architecture

```
┌─────────────────────────────────────────┐
│ GPU (Single CUDA Context)               │
│  ├── MPS Server (Resource Manager)      │
│  ├── Ollama Process 1 ──┐               │
│  ├── Ollama Process 2 ──┼── Shared      │
│  └── Ollama Process 3 ──┘   Context     │
└─────────────────────────────────────────┘
```

### Traditional Multi-Process Architecture

```
┌─────────────────────────────────────────┐
│ GPU                                     │
│  ├── Process 1 (Context 1) ─────────────│
│  ├── Process 2 (Context 2) ─────────────│
│  └── Process 3 (Context 3) ─────────────│
│      ↑ Context Switching Overhead       │
└─────────────────────────────────────────┘
```

## ⚖️ MPS vs Multiple API Servers Comparison

### 🚀 CUDA MPS Advantages

**Performance:**
- ✅ No context-switching overhead (single shared context)
- ✅ Concurrent kernel execution from different processes
- ✅ Lower latency for small requests
- ✅ Better GPU utilization (kernels can overlap)

**Memory Efficiency:**
- ✅ Shared GPU memory management
- ✅ No duplicate driver overhead per process
- ✅ More efficient memory allocation
- ✅ Can fit more models in the same memory

**Resource Management:**
- ✅ Single point of GPU resource control (see the sketch below)
- ✅ Automatic load balancing across processes
- ✅ Better thermal management
- ✅ Unified monitoring and debugging

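One example of that single point of control: the MPS control daemon can cap the fraction of GPU threads available to its clients. A sketch using the documented `set_default_active_thread_percentage` control command (the setting applies to clients that connect after it changes):

```bash
# Limit each new MPS client to ~50% of the GPU's threads
echo "set_default_active_thread_percentage 50" | \
  sudo env "CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps" nvidia-cuda-mps-control
```
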
### 🏢 Multiple API Servers Advantages

**Isolation & Reliability:**
- ✅ Process isolation (one crash doesn't affect the others)
- ✅ Independent scaling per service
- ✅ Different models can have different configurations
- ✅ Easier to update/restart individual services

**Flexibility:**
- ✅ Different frameworks (vLLM, TensorRT-LLM, etc.)
- ✅ Per-service optimization
- ✅ Independent monitoring and logging
- ✅ Service-specific resource limits

**Operational:**
- ✅ Standard container orchestration (K8s, Docker)
- ✅ Familiar DevOps patterns
- ✅ Load balancing at the HTTP level
- ✅ Rolling updates and deployments

## 🎯 Decision Framework

### Use CUDA MPS When:
- 🏆 Maximum GPU utilization is critical
- ⚡ Low latency is paramount
- 💰 Cost optimization matters (more models per GPU)
- 🔄 Everything runs the same framework/runtime (e.g., all Ollama)
- 📊 Workloads are predictable and homogeneous
- 🎮 The environment is single-tenant

### Use Multiple API Servers When:
- 🛡️ High availability/fault tolerance is required
- 🔧 Different models need different optimizations
- 📈 Independent scaling per service is needed
- 🌐 You run multi-tenant production environments
- 🔄 Models are updated/deployed frequently
- 👥 Different teams manage different models

## 📊 Performance Impact Analysis

| Metric | CUDA MPS | Multiple Servers |
|--------|----------|------------------|
| Context Switch Overhead | ~0% | ~5-15% |
| Memory Efficiency | ~95% | ~80-85% |
| Latency (small requests) | Lower | Higher |
| Throughput (concurrent) | Higher | Lower |
| Fault Isolation | Lower | Higher |
| Operational Complexity | Lower | Higher |

## 🔍 Memory Capacity Analysis

### Model Memory Requirements

- **llama3.1:8b (Q4_K_M)**: ~4.9GB per instance

### System Comparison

| System | Total Memory | Theoretical Max | Practical Max |
|--------|--------------|-----------------|---------------|
| **RTX 5090** | 32GB VRAM | 6 models | 3-4 models |
| **DGX Spark** | 120GB Unified | 20+ models | 10+ models |

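The theoretical ceilings are simple division of total memory by the per-instance footprint; a quick check with the numbers above:

```bash
# Instances that fit, ignoring driver/runtime overhead
echo "scale=1; 32 / 4.9" | bc    # RTX 5090: ~6.5 -> 6 models
echo "scale=1; 120 / 4.9" | bc   # DGX Spark: ~24.4 -> 20+ models
```
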
### RTX 5090 Limitations:
- ❌ Limited to 32GB VRAM (hard ceiling)
- ❌ Driver overhead reduces available memory
- ❌ Memory fragmentation issues
- ❌ Thermal throttling under concurrent load
- ❌ Context switching still expensive

### DGX Spark Advantages:
- ✅ Nearly 4x the memory capacity (120GB vs 32GB)
- ✅ Unified memory architecture
- ✅ Better thermal design for sustained loads
- ✅ Can scale to 10+ concurrent models
- ✅ No VRAM bottleneck

## 🧪 Testing Concurrent Performance

### Single Instance Baseline

```bash
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Your prompt here"}],
    "stream": false
  }'
```

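Ollama's non-streaming response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds), so per-request decode throughput can be computed directly. A small sketch, assuming `jq` is installed:

```bash
# Compute tok/s from Ollama's response metadata
curl -s http://localhost:11434/api/chat \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Your prompt here"}], "stream": false}' \
  | jq '.eval_count / .eval_duration * 1e9'
```
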
### Concurrent Testing

```bash
# Launch three identical requests in parallel and wait for all of them
for i in 1 2 3; do
  curl -s http://localhost:11434/api/chat \
    -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Your prompt here"}], "stream": false}' &
done
wait
```

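To turn that into an aggregate throughput number, capture each response and sum the per-request rates. A sketch under the same assumptions as the baseline example (`jq` available, Ollama's documented `eval_count`/`eval_duration` fields):

```bash
# Run 3 concurrent requests, save the responses, then sum tok/s across them
for i in 1 2 3; do
  curl -s http://localhost:11434/api/chat \
    -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Your prompt here"}], "stream": false}' \
    > "resp_$i.json" &
done
wait
jq -s 'map(.eval_count / .eval_duration * 1e9) | add' resp_*.json
```
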
### Expected Results with MPS:
- **1 instance**: 33.3 tok/s
- **2 concurrent**: ~66.4 tok/s total (97% efficiency)
- **3 concurrent**: ~99.4 tok/s total (99% efficiency)

## 🎯 Recommendations

### For Triple Extraction Workloads:

**MPS is the optimal choice because:**

1. **Homogeneous workload** - the same model (llama3.1:8b) everywhere
2. **Performance critical** - maximum throughput is needed
3. **Cost optimization** - more concurrent requests per GPU
4. **Predictable usage** - biomedical triple extraction

### Hybrid Approach:

Consider running:

- **MPS in production** for maximum throughput
- **Separate dev/test servers** for experimentation
- **Different models** on separate instances when needed

## 🚨 Important Notes

1. **MPS requires careful setup** - ensure `CUDA_MPS_PIPE_DIRECTORY` is set identically for the daemon and every client
2. **Monitor GPU temperature** under heavy concurrent load (see the sketch below)
3. **Test thoroughly** before deploying to production
4. **Keep a fallback plan** for standard single-process mode
5. **Consider workload patterns** - MPS excels with consistent concurrent requests

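For the temperature point above, `nvidia-smi` can poll the GPU on an interval; this one-liner is easy to leave running in a spare terminal during load tests:

```bash
# Print timestamp, GPU temperature (°C), and utilization every 5 seconds
nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu --format=csv,noheader --loop=5
```
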
## 🔗 Related Files

- `docker-compose.yml` - Ollama service configuration
- `ollama_gpu_benchmark.py` - Performance testing script
- `clear_cache_and_restart.sh` - Memory optimization script
- `gpu_memory_monitor.sh` - GPU monitoring script

## 📚 Additional Resources

- [NVIDIA MPS Documentation](https://docs.nvidia.com/deploy/mps/index.html)
- [CUDA Multi-Process Service Guide](https://docs.nvidia.com/cuda/mps/index.html)
- [Ollama Documentation](https://ollama.ai/docs)

---

**Last Updated**: October 2, 2025

**Tested On**: DGX Spark with 120GB unified memory, CUDA 13.0, Ollama latest