---
name: dgx-spark-ollama
description: "Install Ollama on an NVIDIA DGX Spark and expose its API to a local laptop via an NVIDIA Sync SSH tunnel. Use when a user wants to run LLM inference on DGX Spark hardware and call the API from their laptop on localhost:11434 without exposing ports on their network."
---

<!-- GENERATED:BEGIN from nvidia/ollama/README.md -->
# Ollama

> Install and use Ollama

This playbook demonstrates how to set up remote access to an Ollama server running on your NVIDIA
Spark device using NVIDIA Sync's Custom Apps feature. You'll install Ollama on your Spark device,
configure NVIDIA Sync to create an SSH tunnel, and access the Ollama API from your local machine.
This eliminates the need to expose ports on your network while enabling AI inference from your
laptop through a secure SSH tunnel.

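
A quick smoke test of the finished setup, run from the laptop while the tunnel is active. The endpoint and fields are Ollama's standard generate API; the model tag is only an example and must match a model you have pulled:

```bash
# From the laptop: NVIDIA Sync forwards localhost:11434 through the SSH tunnel to the Spark.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "Say hello in one short sentence.",
  "stream": false
}'
```
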
**Outcome**: You will have Ollama running on your Blackwell-architecture NVIDIA Spark and accessible via
API calls from your local laptop. This setup allows you to build applications or use tools on your
local machine that communicate with the Ollama API for large language model inference, leveraging
the powerful GPU capabilities of your Spark device without complex network configuration.

Duration: 10-15 minutes for initial setup, 2-3 minutes for model download (varies by model size) · Risk: Low - no system-level changes, easily reversible by stopping the custom app

**Full playbook**: `/home/runner/work/dgx-spark-playbooks/dgx-spark-playbooks/nvidia/ollama/README.md`
<!-- GENERATED:END -->

## When to use this skill
- User has an NVIDIA DGX Spark and NVIDIA Sync installed on their laptop
- Wants Ollama running on the Spark, with its API accessible from their laptop
- Wants an easy-to-use inference runtime (vs. the complexity of vLLM or TRT-LLM)

## Key decisions to confirm before executing
- **Model choice** — the playbook default is `qwen2.5:32b` (~18GB, optimized for Blackwell). Ask the user if they want a smaller model (`qwen2.5:7b`, `llama3.1:8b`, `phi3.5:3.8b`) for lower VRAM use or a faster download.
- **Check first** — run `ollama --version` on the Spark before installing, and skip installation if it is already present; see the sketch after this list.

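
A minimal sketch of that check-then-install-then-pull flow, run on the Spark. The install one-liner is Ollama's documented installer; the model tag is the playbook default and only an example:

```bash
# Install only if the ollama CLI is missing.
if ! command -v ollama >/dev/null 2>&1; then
  curl -fsSL https://ollama.com/install.sh | sh
fi
ollama --version

# Pull the chosen model; swap in a smaller tag if the user picked one.
ollama pull qwen2.5:32b
```
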
## Non-obvious gotchas
- The SSH tunnel must be re-activated after NVIDIA Sync restarts — `localhost:11434` only works while the "Ollama Server" custom app is active in Sync.
- Uninstall is destructive: `sudo rm -rf /usr/share/ollama` removes all downloaded models (often tens of GB). Confirm with the user before running cleanup.
- Streaming responses (`"stream": true`, the API default) arrive as newline-delimited JSON chunks rather than a single object — use `curl -N` to disable output buffering and watch them arrive; see the sketch after this list.

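
A streaming-call sketch from the laptop, assuming the tunnel is active; the model tag is an example and should match a pulled model:

```bash
# -N turns off curl's output buffering so each JSON chunk prints as it arrives.
curl -N http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:32b",
  "prompt": "Explain SSH tunneling in two sentences.",
  "stream": true
}'
```
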
## Related skills
- **Prerequisite**: `dgx-spark-connect-to-your-spark` — NVIDIA Sync + local network access basics. If the user hasn't set this up yet, do it first.
- **Composes with**: `dgx-spark-open-webui` — web chat UI on top of Ollama. Most common follow-up.
- **Alternative**: `dgx-spark-lm-studio` — GUI-based model management instead of Ollama's CLI.
- **Alternative**: `dgx-spark-llama-cpp` — lower-level control over inference.
- **Upgrade path**: `dgx-spark-vllm` — when the user needs higher throughput or is serving multiple concurrent users.