diff --git a/README.md b/README.md
index c870460..c13d1ab 100644
--- a/README.md
+++ b/README.md
@@ -39,6 +39,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
 - [Open WebUI with Ollama](nvidia/open-webui/)
 - [Fine-tune with Pytorch](nvidia/pytorch-fine-tune/)
 - [RAG Application in AI Workbench](nvidia/rag-ai-workbench/)
+- [SGLang Inference Server](nvidia/sglang/)
 - [Speculative Decoding](nvidia/speculative-decoding/)
 - [Set up Tailscale on Your Spark](nvidia/tailscale/)
 - [TRT LLM for Inference](nvidia/trt-llm/)
diff --git a/nvidia/sglang/README.md b/nvidia/sglang/README.md
new file mode 100644
index 0000000..a7c97b7
--- /dev/null
+++ b/nvidia/sglang/README.md
@@ -0,0 +1,230 @@
# SGLang Inference Server

> Install and use SGLang on DGX Spark

## Table of Contents

- [Overview](#overview)
  - [Time & risk](#time--risk)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)

---

## Overview

### Basic Idea

SGLang is a fast serving framework for large language models and vision language models. It makes
interaction with models faster and more controllable by co-designing the backend runtime and the
frontend language. This setup uses the SGLang container image built for DGX Spark
(`lmsysorg/sglang:spark`) on a single NVIDIA DGX Spark with Blackwell architecture, providing
GPU-accelerated inference with all dependencies pre-installed.

### What you'll accomplish

You'll deploy SGLang in both server and offline inference modes on your DGX Spark, enabling
high-performance LLM serving with support for text generation, chat completion, and
vision-language tasks, using models such as DeepSeek-V2-Lite.

### What to know before starting

- Working in a terminal environment on Linux systems
- Basic understanding of Docker containers and container management
- Familiarity with NVIDIA GPU drivers and CUDA toolkit concepts
- Experience with HTTP API endpoints and JSON request/response handling

### Prerequisites

- NVIDIA DGX Spark with Blackwell architecture
- Docker Engine installed and running: `docker --version`
- NVIDIA GPU drivers installed: `nvidia-smi`
- NVIDIA Container Toolkit configured: `docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi`
- Sufficient disk space (>20 GB available): `df -h`
- Network connectivity for pulling container images: `ping docker.io`

### Ancillary files

- An offline inference Python script, [found here on GitHub](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/sglang/assets/offline-inference.py)

### Time & risk

* **Estimated time:** 30 minutes for initial setup and validation
* **Risk level:** Low - uses a pre-built, validated SGLang container with minimal configuration
* **Rollback:** Stop and remove containers with `docker stop` and `docker rm` commands
* **Last Updated:** 11/25/2025
  * First Publication

## Instructions

### Step 1. Verify system prerequisites

Check that your DGX Spark meets all requirements before proceeding. This step runs on your host
system and confirms that Docker, the GPU driver, and the NVIDIA Container Toolkit are properly
configured.

```bash
# Verify Docker installation
docker --version

# Check NVIDIA GPU drivers
nvidia-smi

# Verify Docker GPU support
docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi

# Check available disk space
df -h /
```

### Step 2. Pull the SGLang container

Download the latest SGLang container. This step runs on the host and may take several minutes
depending on your network connection.

```bash
# Pull the SGLang container
docker pull lmsysorg/sglang:spark

# Verify the image was downloaded
docker images | grep sglang
```

### Step 3. Launch the SGLang container for server mode

Start the SGLang container so you can run the inference server in the next step. This command
opens an interactive shell inside the container with GPU access and maps port 30000 so clients
on the host can reach the server.

```bash
# Launch the container with GPU support and port mapping
docker run --gpus all -it --rm \
    -p 30000:30000 \
    -v /tmp:/tmp \
    lmsysorg/sglang:spark \
    bash
```

### Step 4. Start the SGLang inference server

Inside the container, launch the HTTP inference server with a supported model. This step runs
inside the Docker container and starts the SGLang server process.

```bash
# Start the inference server with the DeepSeek-V2-Lite model
python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V2-Lite \
    --host 0.0.0.0 \
    --port 30000 \
    --trust-remote-code \
    --tp 1 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.75 &

# Wait for the server to initialize; the first run also downloads the model,
# which can take several minutes
sleep 30

# Check server status (repeat until it reports healthy)
curl http://localhost:30000/health
```

### Step 5. Test client-server inference

From a new terminal on your host system, test the SGLang server API to confirm that it is
accepting requests and generating responses.

```bash
# Test with curl
curl -X POST http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{
      "text": "What does NVIDIA love?",
      "sampling_params": {
        "temperature": 0.7,
        "max_new_tokens": 100
      }
    }'
```

### Step 6. Test the Python client API

Create a simple Python script to test programmatic access to the SGLang server. This runs on
the host system and demonstrates how to integrate SGLang into applications.

```python
import requests

# Send a prompt to the server
response = requests.post('http://localhost:30000/generate', json={
    'text': 'What does NVIDIA love?',
    'sampling_params': {
        'temperature': 0.7,
        'max_new_tokens': 100,
    },
})

print(f"Response: {response.json()['text']}")
```
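
The overview and the validation step below also refer to offline (engine) mode, which the
ancillary `offline-inference.py` script exercises without going through the HTTP server. The
following is a minimal sketch of running that script inside the container from Step 3; it assumes
you save the script to `/tmp` on the host (which Step 3 mounts into the container) and that the
raw-file URL, derived from the GitHub link in Ancillary files, is reachable from your network.

```bash
# On the host: download the ancillary script into /tmp, which Step 3 mounts into the container
curl -L -o /tmp/offline-inference.py \
    https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/main/nvidia/sglang/assets/offline-inference.py

# Inside the container shell from Step 3: stop the Step 4 server first
# (bring it to the foreground with `fg`, then press Ctrl+C) so the two
# workloads do not compete for GPU memory, then run the script
python3 /tmp/offline-inference.py
```

The script builds an `sgl.Engine`, generates a completion for a single prompt, and prints it;
expect the first run to take longer while the model weights download.
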
### Step 7. Validate installation

Confirm that both server and offline modes are working correctly. This step verifies the
complete SGLang setup and ensures reliable operation.

```bash
# Check server mode (from the host)
curl http://localhost:30000/health
curl -X POST http://localhost:30000/generate -H "Content-Type: application/json" \
    -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 10}}'

# Check container status and logs (substitute the container ID reported by docker ps)
docker ps
docker logs <container_id>
```

### Step 8. Cleanup and rollback

Stop and remove containers to clean up resources. This step returns your system to its
original state.

> [!WARNING]
> This will stop all SGLang containers and remove temporary data.

```bash
# Stop all running SGLang containers
docker ps | grep sglang | awk '{print $1}' | xargs -r docker stop

# Remove stopped containers
docker container prune -f

# Remove the SGLang image (optional)
docker rmi lmsysorg/sglang:spark
```

### Step 9. Next steps

With SGLang successfully deployed, you can now:

- Integrate the HTTP API into your applications using the `/generate` endpoint (see the example below)
- Experiment with different models by changing the `--model-path` parameter
- Scale up on multi-GPU systems by adjusting the `--tp` (tensor parallelism) setting
- Deploy production workloads using the container orchestration platform of your choice
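
As a starting point for the first item above, SGLang's server also exposes OpenAI-compatible
routes alongside the native `/generate` endpoint, so existing OpenAI-style clients can talk to it
directly. A minimal sketch against the server from Step 4, assuming the standard
`/v1/chat/completions` path and a `model` value that matches the `--model-path` used at launch:

```bash
# Chat-completions request against the OpenAI-compatible endpoint
curl -X POST http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-V2-Lite",
      "messages": [{"role": "user", "content": "What does NVIDIA love?"}],
      "temperature": 0.7,
      "max_tokens": 100
    }'
```

Because the route mirrors the OpenAI API, the official `openai` Python client can also be pointed
at `http://localhost:30000/v1` with a placeholder API key.
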
## Troubleshooting

Common issues and their resolutions:

| Symptom | Cause | Fix |
|---------|-------|-----|
| Container fails to start with GPU errors | NVIDIA drivers or Container Toolkit missing | Install `nvidia-container-toolkit` and restart Docker |
| Server responds with 404 or connection refused | Server not fully initialized | Wait 60 seconds, then check the container logs |
| Out-of-memory errors during model loading | Insufficient GPU memory | Use a smaller model or lower `--mem-fraction-static` |
| Model download fails | Network connectivity issues | Check the internet connection and retry the download |
| Permission denied accessing /tmp | Volume mount issues | Use the full path (`-v /tmp:/tmp`) or mount a dedicated directory |

> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

diff --git a/nvidia/sglang/assets/offline-inference.py b/nvidia/sglang/assets/offline-inference.py
new file mode 100644
index 0000000..3b91543
--- /dev/null
+++ b/nvidia/sglang/assets/offline-inference.py
@@ -0,0 +1,30 @@
#
# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import sglang as sgl

def main():
    llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)

    prompt = "What does NVIDIA love?"
    sampling_params = {"temperature": 0.7, "max_new_tokens": 100}

    output = llm.generate(prompt, sampling_params)
    print(f"Output: {output}")

if __name__ == '__main__':
    main()
\ No newline at end of file