
Use a NIM on Spark

Run an LLM NIM on Spark

Overview

Basic Idea

NVIDIA Inference Microservices (NIMs) provide optimized containers for deploying large language models with simplified APIs. This playbook demonstrates how to run LLM NIMs on DGX Spark devices, enabling GPU-accelerated inference through Docker containers. You'll set up authentication with NVIDIA's registry, launch a containerized LLM service, and perform basic inference testing to verify functionality.

What you'll accomplish

You'll deploy an LLM NIM container on your DGX Spark device, configure it for GPU acceleration, and establish a working inference endpoint that responds to HTTP API calls with generated text completions.

What to know before starting

  • Working in a terminal environment
  • Using Docker commands and GPU-enabled containers
  • Basic familiarity with REST APIs and curl commands
  • Understanding of NVIDIA GPU environments and CUDA

Prerequisites

Time & risk

Estimated time: 15-30 minutes for setup and validation

Risks:

  • Large model downloads may take significant time depending on network speed
  • GPU memory requirements vary by model size
  • Container startup time depends on model loading

Rollback: Stop and remove containers with docker stop <CONTAINER_NAME> && docker rm <CONTAINER_NAME>. Remove cached models from ~/.cache/nim if you need to reclaim disk space.

Instructions

Step 1. Verify environment prerequisites

Check that your system meets the basic requirements for running GPU-enabled containers.

nvidia-smi
docker --version
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi

Step 2. Configure NGC authentication

Set up access to NVIDIA's container registry using your NGC API key.

export NGC_API_KEY="<YOUR_NGC_API_KEY>"
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Step 3. Select and configure NIM container

Choose a specific LLM NIM from NGC and set up local caching for model assets.

export CONTAINER_NAME="nim-llm-demo"
export IMG_NAME="nvcr.io/nim/meta/llama-3.1-8b-instruct-dgx-spark:latest"
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod -R a+w "$LOCAL_NIM_CACHE"
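
The model files for an 8B-parameter NIM typically run to tens of gigabytes (the exact size depends on the profile the container selects), so it is worth confirming free space on the filesystem holding the cache before launching:

df -h "$LOCAL_NIM_CACHE"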

Step 4. Launch NIM container

Start the containerized LLM service with GPU acceleration and proper resource allocation.

docker run -it --rm --name="$CONTAINER_NAME" \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u "$(id -u)" \
  -p 8000:8000 \
  "$IMG_NAME"

The container will download the model on first run and may take several minutes to start. Look for startup messages indicating the service is ready.
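
Rather than watching the logs, you can poll the NIM health endpoint from a second terminal; this sketch assumes the standard NIM /v1/health/ready route on port 8000:

# Wait until the service reports ready (assumes the standard NIM health route)
until curl -sf http://localhost:8000/v1/health/ready > /dev/null; do
  echo "Waiting for the NIM to become ready..."
  sleep 10
done
echo "NIM is ready to serve requests."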

Step 5. Validate inference endpoint

Test the deployed service with a basic completion request to verify functionality. Run the following curl command in a new terminal.

curl -X 'POST' \
    'http://localhost:8000/v1/chat/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta/llama-3.1-8b-instruct",
      "messages": [
        {
          "role": "system",
          "content": "detailed thinking on"
        },
        {
          "role": "user",
          "content": "Can you write me a song?"
        }
      ],
      "top_p": 1,
      "n": 1,
      "max_tokens": 15,
      "frequency_penalty": 1.0,
      "stop": ["hello"]
    }'

Expected output is a JSON response whose choices array contains an assistant message with the generated text.
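
For reference, the response follows the OpenAI chat-completions shape; the values below are illustrative, not actual output:

{
  "id": "chat-abc123",
  "object": "chat.completion",
  "model": "meta/llama-3.1-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Verse 1: Under neon skies we..."
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 27,
    "completion_tokens": 15,
    "total_tokens": 42
  }
}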

Step 6. Troubleshooting

  • Symptom: Container fails to start with a GPU error.
    Cause: NVIDIA Container Toolkit is not configured.
    Fix: Install nvidia-container-toolkit and restart Docker.

  • Symptom: "Invalid credentials" during docker login.
    Cause: Incorrect NGC API key format.
    Fix: Verify the API key in the NGC portal and ensure it contains no extra whitespace.

  • Symptom: Model download hangs or fails.
    Cause: Network connectivity problems or insufficient disk space.
    Fix: Check the internet connection and the available disk space in the cache directory.

  • Symptom: API returns 404 or connection refused.
    Cause: Container not fully started, or wrong port.
    Fix: Wait for container startup to complete and verify that port 8000 is accessible.

  • Symptom: "runtime not found" error.
    Cause: NVIDIA Container Toolkit not properly configured.
    Fix: Run sudo nvidia-ctk runtime configure --runtime=docker and restart Docker (see the check below).
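
After reconfiguring, you can confirm that Docker registered the NVIDIA runtime (output format varies by Docker version):

docker info | grep -i runtime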

Step 7. Cleanup and rollback

Remove the running container and optionally clean up cached model files.

Warning: Removing cached models will require re-downloading on next run.

docker stop "$CONTAINER_NAME"

Because Step 4 launches the container with --rm, stopping it also removes it. If you started the container without --rm, remove it explicitly:

docker rm "$CONTAINER_NAME"
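
To see how much disk space the cached models occupy before deciding whether to delete them:

du -sh "$LOCAL_NIM_CACHE"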

To remove cached models and free disk space:

rm -rf "$LOCAL_NIM_CACHE"

Step 8. Next steps

With a working NIM deployment, you can:

  • Integrate the API endpoint into your applications using the OpenAI-compatible interface
  • Experiment with different models available in the NGC catalog
  • Scale the deployment using container orchestration tools
  • Monitor resource usage and optimize container resource allocation

Test the integration with your preferred HTTP client or SDK to begin building applications.
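
As a quick integration check, the OpenAI-compatible routes can be exercised directly from the shell; the commands below assume the container from Step 4 is still running on port 8000:

# List the model(s) this NIM serves (OpenAI-compatible /v1/models route)
curl -s http://localhost:8000/v1/models

# Request a streamed chat completion; with "stream": true the tokens
# arrive incrementally as server-sent events
curl -N -X POST 'http://localhost:8000/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write one line of a song."}],
    "max_tokens": 30,
    "stream": true
  }'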