dgx-spark-playbooks/nvidia/cli-coding-agent/README.md
2026-06-17 21:19:47 +00:00

15 KiB

CLI Coding Agent

Build local CLI coding agents with Ollama

Table of Contents


Overview

Basic idea

Use Ollama on DGX Spark to run a local coding model and connect a CLI coding agent. This playbook supports three options: Claude Code, OpenCode, and Codex CLI. Each agent is wired up with Ollama's built-in launch method (ollama launch <agent>), so you can work without environment variables, provider config files, or external cloud APIs.

Choose your CLI agent

Pick the tab that matches the CLI agent you want to use:

  • Claude Code: Fastest path to a working CLI agent with a local Ollama model.
  • OpenCode: Open-source CLI launched directly from Ollama.
  • Codex CLI: OpenAI Codex CLI launched directly from Ollama against the local model.

What you'll accomplish

You will run a local coding model (Qwen3.6) on your DGX Spark with Ollama, launch your chosen CLI agent against it with a single command, and complete a small coding task end-to-end.

What to know before starting

  • Comfort with Linux command line basics
  • Experience running terminal-based tools and editors
  • Familiarity with Python for the short coding task

Prerequisites

  • DGX Spark access with NVIDIA DGX OS 7.3.1 (Ubuntu 24.04.3 LTS base)
  • Internet access to download model weights
  • Ollama v0.15 or newer (required for ollama launch)
  • GPU memory depends on the Qwen3.6 variant you choose:
    • qwen3.6:latest (35B-a3b, MoE) — ~24GB, 256K context
    • qwen3.6:35b-a3b-nvfp4 — ~22GB, NVIDIA FP4 build tuned for Blackwell (DGX Spark)
    • qwen3.6:35b-a3b-q8_0 — ~39GB, higher-quality quant
    • qwen3.6:35b-a3b-bf16 — ~71GB, full precision (fits Spark's unified memory)

Time & risk

  • Duration: ~15-25 minutes (mostly model download time)
  • Risk level: Low
    • Large model downloads can fail if network connectivity is unstable
    • Ollama versions older than 0.15 do not support ollama launch
  • Rollback: Stop Ollama and delete the downloaded model from ~/.ollama/models
  • Last Updated: 04/16/2026
    • Switched to ollama launch method and upgraded the default model to Qwen3.6

Claude Code

Step 1. Confirm your environment

Description: Verify the OS version and GPU are visible before installing anything.

cat /etc/os-release | head -n 2
nvidia-smi

Expected output should show Ubuntu 24.04.3 LTS (DGX OS 7.3.1 base) and a detected GPU.

Step 2. Install or update Ollama

Description: Install Ollama or ensure it is recent enough to support ollama launch.

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

If Ollama is already installed, just verify the version:

ollama --version

Expected output should show Ollama v0.15 or newer.

Step 3. Pull Qwen3.6

Description: Download the Qwen3.6 model weights to your Spark node.

ollama pull qwen3.6

Optional variants if you want different memory footprints or precision:

ollama pull qwen3.6:35b-a3b-nvfp4   # NVIDIA FP4 build tuned for Blackwell (~22GB)
ollama pull qwen3.6:35b-a3b-q8_0    # Higher-quality 8-bit quant (~39GB)
ollama pull qwen3.6:35b-a3b-bf16    # Full precision (~71GB)

Expected output should show qwen3.6 (and any optional variants) in ollama list.

Step 4. Test local inference (optional)

Description: Run a quick prompt to confirm the model loads.

ollama run qwen3.6

Try a prompt like:

Write a short README checklist for a Python project.

Expected output should show the model responding in the terminal. When you are done, type /bye or press Ctrl+D to exit the interactive session before continuing.

Step 5. Install and launch Claude Code with Ollama

Description: Install Claude Code, then use Ollama's built-in launch method to start Claude Code against your local model. No environment variables or config files are required.

curl -fsSL https://claude.ai/install.sh | bash
claude --version

If Claude Code is already installed, just verify the version:

claude --version
ollama launch claude --model qwen3.6

Expected output should show Claude Code starting and using the local Qwen3.6 model. Qwen3.6 ships with a 256K context window by default; adjust context length through Ollama's settings if you need to tune it further.

Step 6. Complete a small coding task

Description: Create a tiny repo and let Claude Code implement a function and tests.

mkdir -p ~/cli-agent-demo
cd ~/cli-agent-demo

printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py

If you do not already have pytest installed:

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pytest

In Claude Code:

Please implement add() in math_utils.py and make sure the test passes.

Run the test:

python3 -m pytest -q

Expected output should show the test passing. When you are done, run deactivate to exit the virtual environment.

Step 7. Cleanup and rollback

Description: Remove the model and stop services if you no longer need them.

To stop the service:

sudo systemctl stop ollama

Warning

This will delete the downloaded model files.

ollama rm qwen3.6

Step 8. Next steps

  • Try the qwen3.6:35b-a3b-nvfp4 or bf16 variants for different quality/VRAM tradeoffs
  • Use Claude Code on multi-file refactors or test-generation tasks
  • Explore the full 256K context window on larger codebases

OpenCode

Step 1. Confirm your environment

Description: Verify the OS version and GPU are visible before installing anything.

cat /etc/os-release | head -n 2
nvidia-smi

Expected output should show Ubuntu 24.04.3 LTS (DGX OS 7.3.1 base) and a detected GPU.

Step 2. Install or update Ollama

Description: Install Ollama or ensure it is recent enough to support ollama launch.

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

If Ollama is already installed, just verify the version:

ollama --version

Expected output should show Ollama v0.15 or newer.

Step 3. Pull Qwen3.6

Description: Download the Qwen3.6 model weights to your Spark node.

ollama pull qwen3.6

Optional variants if you want different memory footprints or precision:

ollama pull qwen3.6:35b-a3b-nvfp4   # NVIDIA FP4 build tuned for Blackwell (~22GB)
ollama pull qwen3.6:35b-a3b-q8_0    # Higher-quality 8-bit quant (~39GB)
ollama pull qwen3.6:35b-a3b-bf16    # Full precision (~71GB)

Expected output should show qwen3.6 in ollama list.

Step 4. Test local inference (optional)

Description: Run a quick prompt to confirm the model loads.

ollama run qwen3.6

Try a prompt like:

Write a short README checklist for a Python project.

Expected output should show the model responding. When you are done, type /bye or press Ctrl+D to exit before continuing.

Step 5. Launch OpenCode with Ollama

Description: Use Ollama's built-in launch method to start OpenCode against your local model. No opencode.json provider configuration is required.

ollama launch opencode --model qwen3.6

If you want to pre-configure OpenCode without launching immediately:

ollama launch opencode --config

Expected output should show OpenCode starting with Ollama preselected as the provider and Qwen3.6 as the model. Qwen3.6 ships with a 256K context window by default.

Step 6. Complete a small coding task

Description: Create a tiny repo and let OpenCode implement a function and tests.

mkdir -p ~/cli-agent-demo
cd ~/cli-agent-demo

printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py

If you do not already have pytest installed:

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pytest

In OpenCode:

Please implement add() in math_utils.py and make sure the test passes.

Run the test:

python3 -m pytest -q

Expected output should show the test passing. When you are done, run deactivate to exit the virtual environment.

Step 7. Cleanup and rollback

Description: Remove the model and stop services if you no longer need them.

To stop the service:

sudo systemctl stop ollama

Warning

This will delete the downloaded model files.

ollama rm qwen3.6

Step 8. Next steps

  • Try the qwen3.6:35b-a3b-nvfp4 or bf16 variants for different quality/VRAM tradeoffs
  • Use OpenCode on multi-file changes or test-generation tasks
  • Explore the full 256K context window on larger codebases

Codex CLI

Step 1. Confirm your environment

Description: Verify the OS version and GPU are visible before installing anything.

cat /etc/os-release | head -n 2
nvidia-smi

Expected output should show Ubuntu 24.04.3 LTS (DGX OS 7.3.1 base) and a detected GPU.

Step 2. Install or update Ollama

Description: Install Ollama or ensure it is recent enough to support ollama launch.

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

If Ollama is already installed, just verify the version:

ollama --version

Expected output should show Ollama v0.15 or newer.

Step 3. Pull Qwen3.6

Description: Download the Qwen3.6 model weights to your Spark node.

ollama pull qwen3.6

Optional variants if you want different memory footprints or precision:

ollama pull qwen3.6:35b-a3b-nvfp4   # NVIDIA FP4 build tuned for Blackwell (~22GB)
ollama pull qwen3.6:35b-a3b-q8_0    # Higher-quality 8-bit quant (~39GB)
ollama pull qwen3.6:35b-a3b-bf16    # Full precision (~71GB)

Expected output should show qwen3.6 in ollama list.

Step 4. Test local inference (optional)

Description: Run a quick prompt to confirm the model loads.

ollama run qwen3.6

Try a prompt like:

Write a short README checklist for a Python project.

Expected output should show the model responding. When you are done, type /bye or press Ctrl+D to exit before continuing.

Step 5. Launch Codex CLI with Ollama

Description: Use Ollama's built-in launch method to start Codex CLI against your local model. No ~/.codex/config.toml and no manual npm install -g @openai/codex are required — Ollama handles the Codex integration.

ollama launch codex --model qwen3.6

Expected output should show Codex CLI starting with Ollama as the provider and Qwen3.6 as the model. Qwen3.6 ships with a 256K context window by default, which is well suited to Codex's agentic workflows.

Step 6. Complete a small coding task

Description: Create a tiny repo and let Codex implement a function and tests.

mkdir -p ~/cli-agent-demo
cd ~/cli-agent-demo

printf 'def add(a, b):\n    """Return the sum of a and b."""\n    pass\n' > math_utils.py
printf 'import math_utils\n\n\ndef test_add():\n    assert math_utils.add(1, 2) == 3\n' > test_math_utils.py

If you do not already have pytest installed:

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -U pytest

In Codex:

Please implement add() in math_utils.py and make sure the test passes.

Run the test:

python3 -m pytest -q

Expected output should show the test passing. When you are done, run deactivate to exit the virtual environment.

Step 7. Cleanup and rollback

Description: Remove the model and stop services if you no longer need them.

To stop the service:

sudo systemctl stop ollama

Warning

This will delete the downloaded model files.

ollama rm qwen3.6

Step 8. Next steps

  • Try the qwen3.6:35b-a3b-nvfp4 or bf16 variants for different quality/VRAM tradeoffs
  • Use Codex CLI on multi-file changes or test-generation tasks
  • Explore the full 256K context window on larger codebases

Troubleshooting

Symptom Cause Fix
ollama: command not found Ollama not installed or PATH not updated Rerun curl -fsSL https://ollama.com/install.sh | sh and open a new shell
ollama launch reports unknown command Ollama is older than v0.15 Update Ollama: curl -fsSL https://ollama.com/install.sh | sh
Model load fails with version error or HTTP 412 Ollama version is too old for the model Update Ollama: curl -fsSL https://ollama.com/install.sh | sh
model not found when launching an agent Model was not pulled Run ollama pull qwen3.6 and retry
connection refused to localhost:11434 Ollama service not running Start with ollama serve or sudo systemctl start ollama
ollama launch <agent> exits immediately Agent integration failed to initialize Re-run ollama launch <agent>; if it persists, check journalctl -u ollama
Slow responses or OOM errors Model variant too large for GPU memory Switch to qwen3.6:35b-a3b-nvfp4 or close other GPU workloads
python3 -m pip install -U pytest reports externally-managed-environment Ubuntu 24.04 protects the system Python environment Create and activate a virtual environment first: python3 -m venv .venv && source .venv/bin/activate
ollama pull reports that a model tag is a sharded GGUF The selected model tag is not supported by Ollama Use the Qwen3.6 commands in Step 3 instead of sharded GGUF tags
ollama run fails with CUDA error: context is destroyed on a multi-GPU system Ollama is initializing across a mixed-GPU topology Pin Ollama to one GPU. For a foreground test, run CUDA_VISIBLE_DEVICES=0 ollama serve; for a system service, add Environment="CUDA_VISIBLE_DEVICES=0" to an Ollama systemd drop-in and restart Ollama
A direct Claude Code setup using an Anthropic-compatible Ollama endpoint produces prose but does not edit files Some model/server combinations do not emit tool calls reliably Use ollama launch claude with Qwen3.6 as shown in this playbook

Note

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. If you see memory pressure, flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'