mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-22 18:13:52 +00:00
chore: Regenerate all playbooks
This commit is contained in:
parent
6278ed97a3
commit
65d43d3ae6
@ -39,6 +39,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
|
||||
- [Open WebUI with Ollama](nvidia/open-webui/)
|
||||
- [Fine-tune with Pytorch](nvidia/pytorch-fine-tune/)
|
||||
- [RAG Application in AI Workbench](nvidia/rag-ai-workbench/)
|
||||
- [SGLang Inference Server](nvidia/sglang/)
|
||||
- [Speculative Decoding](nvidia/speculative-decoding/)
|
||||
- [Set up Tailscale on Your Spark](nvidia/tailscale/)
|
||||
- [TRT LLM for Inference](nvidia/trt-llm/)
|
||||
|
||||
230
nvidia/sglang/README.md
Normal file
230
nvidia/sglang/README.md
Normal file
@ -0,0 +1,230 @@
|
||||
# SGLang Inference Server
|
||||
|
||||
> Install and use SGLang on DGX Spark
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Overview](#overview)
|
||||
- [Time & risk](#time-risk)
|
||||
- [Instructions](#instructions)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
## Basic Idea
|
||||
|
||||
SGLang is a fast serving framework for large language models and vision language models that makes
|
||||
your interaction with models faster and more controllable by co-designing the backend runtime and
|
||||
frontend language. This setup uses the optimized NVIDIA SGLang NGC Container on a single NVIDIA
|
||||
Spark device with Blackwell architecture, providing GPU-accelerated inference with all dependencies
|
||||
pre-installed.
|
||||
|
||||
## What you'll accomplish
|
||||
|
||||
You'll deploy SGLang in both server and offline inference modes on your NVIDIA Spark device,
|
||||
enabling high-performance LLM serving with support for text generation, chat completion, and
|
||||
vision-language tasks using models like DeepSeek-V2-Lite.
|
||||
|
||||
## What to know before starting
|
||||
|
||||
- Working in a terminal environment on Linux systems
|
||||
- Basic understanding of Docker containers and container management
|
||||
- Familiarity with NVIDIA GPU drivers and CUDA toolkit concepts
|
||||
- Experience with HTTP API endpoints and JSON request/response handling
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- NVIDIA Spark device with Blackwell architecture
|
||||
- Docker Engine installed and running: `docker --version`
|
||||
- NVIDIA GPU drivers installed: `nvidia-smi`
|
||||
- NVIDIA Container Toolkit configured: `docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi`
|
||||
- Sufficient disk space (>20GB available): `df -h`
|
||||
- Network connectivity for pulling NGC containers: `ping nvcr.io`
|
||||
|
||||
## Ancillary files
|
||||
|
||||
- An offline inference python script [found here on GitHub](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/sglang/assets/offline-inference.py)
|
||||
|
||||
### Time & risk
|
||||
|
||||
* **Estimated time:** 30 minutes for initial setup and validation
|
||||
* **Risk level:** Low - Uses pre-built, validated SGLang container with minimal configuration
|
||||
* **Rollback:** Stop and remove containers with `docker stop` and `docker rm` commands
|
||||
* **Last Updated:** 11/25/2025
|
||||
* First Publication
|
||||
|
||||
## Instructions
|
||||
|
||||
## Step 1. Verify system prerequisites
|
||||
|
||||
Check that your NVIDIA Spark device meets all requirements before proceeding. This step runs on
|
||||
your host system and ensures Docker, GPU drivers, and container toolkit are properly configured.
|
||||
|
||||
```bash
|
||||
## Verify Docker installation
|
||||
docker --version
|
||||
|
||||
## Check NVIDIA GPU drivers
|
||||
nvidia-smi
|
||||
|
||||
## Verify Docker GPU support
|
||||
docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi
|
||||
|
||||
## Check available disk space
|
||||
df -h /
|
||||
```
|
||||
|
||||
## Step 2. Pull the SGLang Container
|
||||
|
||||
Download the latest SGLang container. This step runs on the host and may take
|
||||
several minutes depending on your network connection.
|
||||
|
||||
|
||||
```bash
|
||||
## Pull the SGLang container
|
||||
docker pull lmsysorg/sglang:spark
|
||||
|
||||
## Verify the image was downloaded
|
||||
docker images | grep sglang
|
||||
```
|
||||
|
||||
## Step 3. Launch SGLang container for server mode
|
||||
|
||||
Start the SGLang container in server mode to enable HTTP API access. This runs the inference
|
||||
server inside the container, exposing it on port 30000 for client connections.
|
||||
|
||||
```bash
|
||||
## Launch container with GPU support and port mapping
|
||||
docker run --gpus all -it --rm \
|
||||
-p 30000:30000 \
|
||||
-v /tmp:/tmp \
|
||||
lmsysorg/sglang:spark \
|
||||
bash
|
||||
```
|
||||
|
||||
## Step 4. Start the SGLang inference server
|
||||
|
||||
Inside the container, launch the HTTP inference server with a supported model. This step runs
|
||||
inside the Docker container and starts the SGLang server daemon.
|
||||
|
||||
```bash
|
||||
## Start the inference server with DeepSeek-V2-Lite model
|
||||
python3 -m sglang.launch_server \
|
||||
--model-path deepseek-ai/DeepSeek-V2-Lite \
|
||||
--host 0.0.0.0 \
|
||||
--port 30000 \
|
||||
--trust-remote-code \
|
||||
--tp 1 \
|
||||
--attention-backend flashinfer \
|
||||
--mem-fraction-static 0.75 &
|
||||
|
||||
## Wait for server to initialize
|
||||
sleep 30
|
||||
|
||||
## Check server status
|
||||
curl http://localhost:30000/health
|
||||
```
|
||||
|
||||
## Step 5. Test client-server inference
|
||||
|
||||
From a new terminal on your host system, test the SGLang server API to ensure it's working
|
||||
correctly. This validates that the server is accepting requests and generating responses.
|
||||
|
||||
```bash
|
||||
## Test with curl
|
||||
curl -X POST http://localhost:30000/generate \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"text": "What does NVIDIA love?",
|
||||
"sampling_params": {
|
||||
"temperature": 0.7,
|
||||
"max_new_tokens": 100
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
## Step 6. Test Python client API
|
||||
|
||||
Create a simple Python script to test programmatic access to the SGLang server. This runs on
|
||||
the host system and demonstrates how to integrate SGLang into applications.
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
## Send prompt to server
|
||||
response = requests.post('http://localhost:30000/generate', json={
|
||||
'text': 'What does NVIDIA love?',
|
||||
'sampling_params': {
|
||||
'temperature': 0.7,
|
||||
'max_new_tokens': 100,
|
||||
},
|
||||
})
|
||||
|
||||
print(f"Response: {response.json()['text']}")
|
||||
```
|
||||
|
||||
## Step 7. Validate installation
|
||||
|
||||
Confirm that both server and offline modes are working correctly. This step verifies the
|
||||
complete SGLang setup and ensures reliable operation.
|
||||
|
||||
```bash
|
||||
## Check server mode (from host)
|
||||
curl http://localhost:30000/health
|
||||
curl -X POST http://localhost:30000/generate -H "Content-Type: application/json" \
|
||||
-d '{"text": "Hello", "sampling_params": {"max_new_tokens": 10}}'
|
||||
|
||||
## Check container logs
|
||||
docker ps
|
||||
docker logs <CONTAINER_ID>
|
||||
```
|
||||
|
||||
## Step 8. Cleanup and rollback
|
||||
|
||||
Stop and remove containers to clean up resources. This step returns your system to its
|
||||
original state.
|
||||
|
||||
> [!WARNING]
|
||||
> This will stop all SGLang containers and remove temporary data.
|
||||
|
||||
```bash
|
||||
## Stop all SGLang containers
|
||||
docker ps | grep sglang | awk '{print $1}' | xargs docker stop
|
||||
|
||||
## Remove stopped containers
|
||||
docker container prune -f
|
||||
|
||||
## Remove SGLang images (optional)
|
||||
docker rmi lmsysorg/sglang:spark
|
||||
```
|
||||
|
||||
## Step 9. Next steps
|
||||
|
||||
With SGLang successfully deployed, you can now:
|
||||
|
||||
- Integrate the HTTP API into your applications using the `/generate` endpoint
|
||||
- Experiment with different models by changing the `--model-path` parameter
|
||||
- Scale up using multiple GPUs by adjusting the `--tp` (tensor parallel) setting
|
||||
- Deploy production workloads using the container orchestration platform of your choice
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
Common issues and their resolutions:
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| Container fails to start with GPU errors | NVIDIA drivers/toolkit missing | Install nvidia-container-toolkit, restart Docker |
|
||||
| Server responds with 404 or connection refused | Server not fully initialized | Wait 60 seconds, check container logs |
|
||||
| Out of memory errors during model loading | Insufficient GPU memory | Use smaller model or increase --tp parameter |
|
||||
| Model download fails | Network connectivity issues | Check internet connection, retry download |
|
||||
| Permission denied accessing /tmp | Volume mount issues | Use full path: -v /tmp:/tmp or create dedicated directory |
|
||||
|
||||
> [!NOTE]
|
||||
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
|
||||
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
|
||||
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
|
||||
```bash
|
||||
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
|
||||
```
|
||||
30
nvidia/sglang/assets/offline-inference.py
Normal file
30
nvidia/sglang/assets/offline-inference.py
Normal file
@ -0,0 +1,30 @@
|
||||
#
|
||||
# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
#
|
||||
|
||||
import sglang as sgl
|
||||
|
||||
def main():
|
||||
llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
|
||||
|
||||
prompt = "What does NVIDIA love?"
|
||||
sampling_params = {"temperature": 0.7, "max_new_tokens": 100}
|
||||
|
||||
output = llm.generate(prompt, sampling_params)
|
||||
print(f"Output: {output}")
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
Loading…
Reference in New Issue
Block a user