chore: Regenerate all playbooks

2026-06-19 12:59:34 +00:00 · 2025-11-25 20:28:09 +00:00 · 2025-11-25 20:28:09 +00:00 · 65d43d3ae6
commit 65d43d3ae6
parent 6278ed97a3
3 changed files with 261 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -39,6 +39,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
 - [Open WebUI with Ollama](nvidia/open-webui/)
 - [Fine-tune with Pytorch](nvidia/pytorch-fine-tune/)
 - [RAG Application in AI Workbench](nvidia/rag-ai-workbench/)
+- [SGLang Inference Server](nvidia/sglang/)
 - [Speculative Decoding](nvidia/speculative-decoding/)
 - [Set up Tailscale on Your Spark](nvidia/tailscale/)
 - [TRT LLM for Inference](nvidia/trt-llm/)
--- a/nvidia/sglang/README.md
+++ b/nvidia/sglang/README.md
@@ -0,0 +1,230 @@
+# SGLang Inference Server
+
+> Install and use SGLang on DGX Spark
+
+## Table of Contents
+
+- [Overview](#overview)
+  - [Time & risk](#time-risk)
+- [Instructions](#instructions)
+- [Troubleshooting](#troubleshooting)
+
+---
+
+## Overview
+
+## Basic Idea
+
+SGLang is a fast serving framework for large language models and vision language models. It makes
+model interaction faster and more controllable by co-designing the backend runtime and the frontend
+language. This setup runs the prebuilt `lmsysorg/sglang:spark` container image on a single NVIDIA
+DGX Spark device with Blackwell architecture, providing GPU-accelerated inference with all
+dependencies pre-installed.
+
+## What you'll accomplish
+
+You'll deploy SGLang on your NVIDIA DGX Spark device: the steps below cover server mode, and the
+ancillary script covers offline inference. The walkthrough serves DeepSeek-V2-Lite for text
+generation; SGLang also supports chat completion and vision-language models.
+
+## What to know before starting
+
+- Working in a terminal environment on Linux systems
+- Basic understanding of Docker containers and container management
+- Familiarity with NVIDIA GPU drivers and CUDA toolkit concepts
+- Experience with HTTP API endpoints and JSON request/response handling
+
+## Prerequisites
+
+- NVIDIA Spark device with Blackwell architecture
+- Docker Engine installed and running: `docker --version`
+- NVIDIA GPU drivers installed: `nvidia-smi`
+- NVIDIA Container Toolkit configured: `docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi`
+- Sufficient disk space (more than 20 GB available): `df -h`
+- Network connectivity for pulling the container image from Docker Hub and model weights from Hugging Face
+
+## Ancillary files
+
+- An offline inference Python script [found here on GitHub](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/sglang/assets/offline-inference.py)
+
+### Time & risk
+
+* **Estimated time:** 30 minutes for initial setup and validation
+* **Risk level:** Low - Uses pre-built, validated SGLang container with minimal configuration
+* **Rollback:** Stop and remove containers with `docker stop` and `docker rm` commands
+* **Last Updated:** 11/25/2025
+    * First Publication
+
+## Instructions
+
+## Step 1. Verify system prerequisites
+
+Check that your NVIDIA Spark device meets all requirements before proceeding. This step runs on
+your host system and ensures Docker, GPU drivers, and container toolkit are properly configured.
+
+```bash
+# Verify Docker installation
+docker --version
+
+# Check NVIDIA GPU drivers
+nvidia-smi
+
+# Verify Docker GPU support
+docker run --rm --gpus all lmsysorg/sglang:spark nvidia-smi
+
+# Check available disk space
+df -h /
+```
+
+## Step 2. Pull the SGLang Container
+
+Download the latest SGLang container. This step runs on the host and may take
+several minutes depending on your network connection.
+
+
+```bash
+# Pull the SGLang container
+docker pull lmsysorg/sglang:spark
+
+# Verify the image was downloaded
+docker images | grep sglang
+```
+
+## Step 3. Launch SGLang container for server mode
+
+Start the SGLang container in server mode to enable HTTP API access. This runs the inference
+server inside the container, exposing it on port 30000 for client connections.
+
+```bash
+# Launch container with GPU support and port mapping
+docker run --gpus all -it --rm \
+  -p 30000:30000 \
+  -v /tmp:/tmp \
+  lmsysorg/sglang:spark \
+  bash
+```
+
+## Step 4. Start the SGLang inference server
+
+Inside the container, launch the HTTP inference server with a supported model. This step runs
+inside the Docker container and starts the SGLang server daemon.
+
+```bash
+# Start the inference server with the DeepSeek-V2-Lite model (runs in the background)
+python3 -m sglang.launch_server \
+  --model-path deepseek-ai/DeepSeek-V2-Lite \
+  --host 0.0.0.0 \
+  --port 30000 \
+  --trust-remote-code \
+  --tp 1 \
+  --attention-backend flashinfer \
+  --mem-fraction-static 0.75 &
+
+# Wait for the server to initialize; model weights are downloaded on startup, so this can take longer
+sleep 30
+
+# Check server status
+curl http://localhost:30000/health
+```
+
+## Step 5. Test client-server inference
+
+From a new terminal on your host system, test the SGLang server API to ensure it's working
+correctly. This validates that the server is accepting requests and generating responses.
+
+```bash
+# Test with curl
+curl -X POST http://localhost:30000/generate \
+  -H "Content-Type: application/json" \
+  -d '{
+      "text": "What does NVIDIA love?",
+      "sampling_params": {
+          "temperature": 0.7,
+          "max_new_tokens": 100
+      }
+  }'
+```
+
+## Step 6. Test Python client API
+
+Create a simple Python script to test programmatic access to the SGLang server. This runs on
+the host system and demonstrates how to integrate SGLang into applications.
+
+```python
+import requests
+
+# Send prompt to server
+response = requests.post('http://localhost:30000/generate', json={
+  'text': 'What does NVIDIA love?',
+  'sampling_params': {
+      'temperature': 0.7,
+      'max_new_tokens': 100,
+  },
+})
+
+print(f"Response: {response.json()['text']}")
+```
+
+## Step 7. Validate installation
+
+Confirm that the server is healthy and returns generated text, then inspect the container logs
+for errors. The commands below run on the host system.
+
+```bash
+# Check server mode (from host)
+curl http://localhost:30000/health
+curl -X POST http://localhost:30000/generate -H "Content-Type: application/json" \
+  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 10}}'
+
+# Check container logs
+docker ps
+docker logs <CONTAINER_ID>
+```
+
+## Step 8. Cleanup and rollback
+
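+Stop the SGLang containers and, optionally, remove the image to free disk space. Containers
+started with `--rm` (as in Step 3) are deleted automatically once they stop.
+
+> [!WARNING]
+> This stops all running SGLang containers. `docker container prune` removes every stopped container on the host, not only SGLang ones.
+
+```bash
+# Stop all running containers created from the SGLang image
+docker ps -q --filter ancestor=lmsysorg/sglang:spark | xargs -r docker stop
+
+# Remove all stopped containers (affects every stopped container on the host)
+docker container prune -f
+
+# Remove the SGLang image (optional)
+docker rmi lmsysorg/sglang:spark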
+```
+
+## Step 9. Next steps
+
+With SGLang successfully deployed, you can now:
+
+- Integrate the HTTP API into your applications using the `/generate` endpoint
+- Experiment with different models by changing the `--model-path` parameter
+- Scale up using multiple GPUs by adjusting the `--tp` (tensor parallel) setting
+- Deploy production workloads using the container orchestration platform of your choice
+
+## Troubleshooting
+
+Common issues and their resolutions:
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| Container fails to start with GPU errors | NVIDIA drivers or Container Toolkit missing | Install `nvidia-container-toolkit`, then restart Docker |
+| Server responds with 404 or connection refused | Server not fully initialized | Wait 60 seconds, then check the container logs |
+| Out of memory errors during model loading | Insufficient GPU memory | Use a smaller model or lower `--mem-fraction-static` |
+| Model download fails | Network connectivity issues | Check the internet connection, then retry the download |
+| Permission denied accessing /tmp | Volume mount issues | Use the full path `-v /tmp:/tmp`, or mount a dedicated directory |
+
+> [!NOTE]
+> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. 
+> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within 
+> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
+```bash
+sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
+```
--- a/nvidia/sglang/assets/offline-inference.py
+++ b/nvidia/sglang/assets/offline-inference.py
@@ -0,0 +1,30 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import sglang as sgl
+
+def main():
+    llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
+
+    prompt = "What does NVIDIA love?"
+    sampling_params = {"temperature": 0.7, "max_new_tokens": 100}
+
+    output = llm.generate(prompt, sampling_params)
+    print(f"Output: {output}")
+
+if __name__ == '__main__':
+    main()
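
Step 4 of the README in this commit waits a fixed `sleep 30` before querying `/health`, while its troubleshooting table suggests waiting 60 seconds. One possible alternative is to poll the endpoint until it answers. The sketch below is not part of the commit: the helper names `server_is_healthy` and `wait_for_server`, the 300-second limit, and the 5-second interval are illustrative assumptions, and only the `/health` URL on port 30000 comes from the README. It uses the Python standard library only.

```python
import time
import urllib.error
import urllib.request


def server_is_healthy(url, timeout=5.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_for_server(url="http://localhost:30000/health", max_wait=300.0,
                    interval=5.0, probe=server_is_healthy, sleep=time.sleep):
    """Poll `probe(url)` until it succeeds or `max_wait` seconds have been slept.

    `probe` and `sleep` are injectable (hypothetical hooks) so the loop can be
    exercised without a running server.
    """
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

Calling `wait_for_server()` with its defaults returns `True` as soon as the health endpoint responds and `False` if it never does within the limit, so a script can proceed to the `/generate` request only once the server is up.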