---
name: dgx-spark-ollama
description: "Install Ollama on an NVIDIA DGX Spark and expose its API to a local laptop via NVIDIA Sync SSH tunnel. Use when a user wants to run LLM inference on DGX Spark hardware and call the API from their laptop on localhost:11434 without exposing ports on their network."
---
<!-- GENERATED:BEGIN from nvidia/ollama/README.md -->
# Ollama

> Install and use Ollama

This playbook demonstrates how to set up remote access to an Ollama server running on your NVIDIA
Spark device using NVIDIA Sync's Custom Apps feature. You'll install Ollama on your Spark device,
configure NVIDIA Sync to create an SSH tunnel, and access the Ollama API from your local machine.
This eliminates the need to expose ports on your network while enabling AI inference from your
laptop through a secure SSH tunnel.

**Outcome**: You will have Ollama running on your NVIDIA Spark with Blackwell architecture and accessible via
API calls from your local laptop. This setup allows you to build applications or use tools on your
local machine that communicate with the Ollama API for large language model inference, leveraging
the powerful GPU capabilities of your Spark device without complex network configuration.

Duration: 10-15 minutes for initial setup, 2-3 minutes for model download (varies by model size) · Risk: Low - No system-level changes, easily reversible by stopping the custom app

**Full playbook**: `/home/runner/work/dgx-spark-playbooks/dgx-spark-playbooks/nvidia/ollama/README.md`
<!-- GENERATED:END -->
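
Once the "Ollama Server" custom app is active in NVIDIA Sync, a quick smoke test from the laptop confirms the whole chain. This is a sketch: it assumes the tunnel forwards the Spark's Ollama API to the default port, and uses Ollama's standard `/api/tags` endpoint.

```shell
# Run on the laptop, not the Spark. Assumes the Sync tunnel is active,
# so the Spark's Ollama API is reachable at localhost:11434.
curl -s http://localhost:11434/api/tags
# A JSON "models" list (possibly empty) means tunnel and server are both up.
```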
## When to use this skill
- User has an NVIDIA DGX Spark with NVIDIA Sync installed on their laptop
- Wants Ollama running on Spark, API accessible from their laptop
- Wants an easy-to-use inference runtime (vs. the complexity of vLLM or TRT-LLM)
## Key decisions to confirm before executing
- **Model choice** — default in the playbook is `qwen2.5:32b` (~18GB, optimized for Blackwell). Ask the user if they want a smaller model (`qwen2.5:7b`, `llama3.1:8b`, `phi3.5:3.8b`) for lower VRAM or faster download.
- **Check first** — run `ollama --version` on the Spark before installing; skip installation if already present.
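
The two decisions above can be folded into one check-then-install sequence on the Spark. A sketch, assuming Ollama's standard Linux install script and the playbook's default model tag:

```shell
# Run on the Spark over SSH.
if command -v ollama >/dev/null 2>&1; then
  ollama --version                                # already installed: skip the installer
else
  curl -fsSL https://ollama.com/install.sh | sh   # Ollama's standard Linux install script
fi

# Playbook default; offer qwen2.5:7b / llama3.1:8b / phi3.5:3.8b instead
# if the user wants lower VRAM use or a faster download.
ollama pull qwen2.5:32b
```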
## Non-obvious gotchas
- The SSH tunnel must be re-activated after NVIDIA Sync restarts — `localhost:11434` only works while the "Ollama Server" custom app is active in Sync.
- Uninstall is destructive: `sudo rm -rf /usr/share/ollama` removes all downloaded models (often tens of GB). Confirm with user before running cleanup.
- Streaming responses (`"stream": true`) arrive as a sequence of JSON objects, one per chunk, rather than a single response body; use `curl -N` so curl's output buffering doesn't hide the incremental chunks.
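
The streaming gotcha above, shown against Ollama's `/api/generate` endpoint (a sketch; the model tag is the playbook default, and the tunnel is assumed active):

```shell
# Non-streaming: one JSON object arrives after generation finishes.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:32b", "prompt": "Why is the sky blue?", "stream": false}'

# Streaming: one JSON object per chunk. -N disables curl's output buffering
# so chunks appear as they are generated rather than all at once.
curl -N http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:32b", "prompt": "Why is the sky blue?", "stream": true}'
```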
## Related skills
- **Prerequisite**: `dgx-spark-connect-to-your-spark` — NVIDIA Sync + local network access basics. If the user hasn't set this up yet, do it first.
- **Composes with**: `dgx-spark-open-webui` — web chat UI on top of Ollama. Most common follow-up.
- **Alternative**: `dgx-spark-lm-studio` — GUI-based model management instead of Ollama's CLI.
- **Alternative**: `dgx-spark-llama-cpp` — lower-level control over inference.
- **Upgrade path**: `dgx-spark-vllm` — when the user needs higher throughput or is serving multiple concurrent users.