diff --git a/nvidia/lm-studio/README.md b/nvidia/lm-studio/README.md index 9bf52e7..433cbd9 100644 --- a/nvidia/lm-studio/README.md +++ b/nvidia/lm-studio/README.md @@ -1,12 +1,13 @@ # LM Studio on DGX Spark -> Deploy LM Studio and serve LLMs on a Spark device +> Deploy LM Studio and serve LLMs on a Spark device; use LM Link to access models remotely. + ## Table of Contents - [Overview](#overview) - [Instructions](#instructions) - - [Javascript](#javascript) + - [JavaScript](#javascript) - [Python](#python) - [Bash](#bash) - [Troubleshooting](#troubleshooting) @@ -21,6 +22,8 @@ LM Studio is an application for discovering, running, and serving large language This playbook shows you how to deploy LM Studio on an NVIDIA DGX Spark device to run LLMs locally with GPU acceleration. Running LM Studio on DGX Spark enables Spark to act as your own private, high-performance LLM server. +**LM Link** (optional) lets you use your Spark’s models from another machine as if they were local. You can link your DGX Spark and your laptop (or other devices) over an end-to-end encrypted connection, so you can load and run models on the Spark from your laptop without being on the same LAN or opening network access. See [LM Link](https://lmstudio.ai/link) and Step 3b in the Instructions. 
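To make the "remote models appear local" idea above concrete, here is a minimal Python sketch that queries the same `/api/v1/models` endpoint this playbook uses for its connectivity test. Only the endpoint path comes from the playbook; the response handling is an assumption (verify the JSON shape against your server), and with LM Link the `host` stays `localhost` while on a plain LAN setup you would pass your Spark's IP instead.

```python
"""Sketch: list models exposed by an LM Studio API server.

Assumes the server was started on port 1234 (the playbook's default).
The JSON response shape is an assumption; check it on your server.
"""
import json
import urllib.request


def models_url(host: str = "localhost", port: int = 1234) -> str:
    # With LM Link, linked devices' models are reachable via localhost;
    # without it, substitute your Spark's LAN IP for `host`.
    return f"http://{host}:{port}/api/v1/models"


def list_models(host: str = "localhost", port: int = 1234):
    # Fetch and parse the models listing (requires a running server).
    with urllib.request.urlopen(models_url(host, port), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(models_url())  # http://localhost:1234/api/v1/models
```

This is the programmatic equivalent of the `curl` connectivity check in Step 3, not part of the playbook's own assets.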
+ ## What you'll accomplish @@ -29,6 +32,7 @@ You'll deploy LM Studio on an NVIDIA DGX Spark device to run gpt-oss 120B, and u - Install **llmster**, a totally headless, terminal native LM Studio on the Spark - Run LLM inference locally on DGX Spark via API - Interact with models from your laptop using the LM Studio SDK +- Optionally use **LM Link** to connect Spark and laptop over an encrypted link so remote models appear as local (no same-network or bind setup required) ## What to know before starting @@ -50,11 +54,21 @@ You'll deploy LM Studio on an NVIDIA DGX Spark device to run gpt-oss 120B, and u - Laptop and DGX Spark must be on the same local network - Network access to download packages and models +## LM Link (optional) + +[LM Link](https://lmstudio.ai/link) lets you **use your local models remotely**. You link machines (e.g. your DGX Spark and your laptop), then load models on the Spark and use them from the laptop as if they were local. + +- **End-to-end encrypted** — Built on Tailscale mesh VPNs; devices are not exposed to the public internet. +- **Works with the local server** — Any tool that connects to LM Studio’s local API (e.g. `localhost:1234`) can use models from your Link, including Codex, Claude Code, OpenCode, and the LM Studio SDK. +- **Preview** — Free for up to 2 users, 5 devices each (10 devices total). Create your Link at [lmstudio.ai/link](https://lmstudio.ai/link). + +If you use LM Link, you can skip binding the server to `0.0.0.0` and using the Spark’s IP; once devices are linked, point your laptop at `localhost:1234` and remote models appear in the model loader. + ## Ancillary files -All required assets can be found below. These sample scripts can be used in Step 4 of Instructions. +All required assets can be found below. These sample scripts can be used in Step 6 of Instructions. 
-- [run.js](https://github.com/lmstudio-ai/docs/blob/main/_assets/nvidia-spark-playbook/js/run.js) - Javascript script for sending a test prompt to Spark +- [run.js](https://github.com/lmstudio-ai/docs/blob/main/_assets/nvidia-spark-playbook/js/run.js) - JavaScript script for sending a test prompt to Spark - [run.py](https://github.com/lmstudio-ai/docs/blob/main/_assets/nvidia-spark-playbook/py/run.py) - Python script for sending a test prompt to Spark - [run.sh](https://github.com/lmstudio-ai/docs/blob/main/_assets/nvidia-spark-playbook/bash/run.sh) - Bash script for sending a test prompt to Spark @@ -66,8 +80,8 @@ All required assets can be found below. These sample scripts can be used in Step * **Rollback:** * Downloaded models can be removed manually from the models directory. * Uninstall LM Studio or llmster -* **Last Updated:** 02/06/2026 - * First Publication +* **Last Updated:** 03/12/2026 + * Add instructions for LM Link features ## Instructions @@ -96,7 +110,7 @@ Run the following curl commands in your local terminal to download files require ```bash ## JavaScript -curl -L -O https://raw.githubusercontent.com/lmstudio-ai/docs/main/_assets/nvidia-spark-playbook/js/run.js +curl -L -O https://raw.githubusercontent.com/lmstudio-ai/docs/main/_assets/nvidia-spark-playbook/js/run.js ## Python curl -L -O https://raw.githubusercontent.com/lmstudio-ai/docs/main/_assets/nvidia-spark-playbook/py/run.py @@ -107,23 +121,33 @@ curl -L -O https://raw.githubusercontent.com/lmstudio-ai/docs/main/_assets/nvidi ## Step 3. Start the LM Studio API Server -Use `lms`, LM Studio's CLI to start the server from your terminal. Enable local network access, which allows the LM Studio API server running on your machine to be accessed by all other devices on the same local network (make sure they are trusted devices). To do this, run the following command: +Use `lms`, LM Studio's CLI, to start the server from your terminal. 
Enable local network access, which allows the LM Studio API server running on your machine to be accessed by all other devices on the same local network (make sure they are trusted devices). To do this, run the following command: ```bash lms server start --bind 0.0.0.0 --port 1234 ``` -Test the connectivity between your laptop and your Spark, run the following command in your local terminal +To test the connectivity between your laptop and your Spark, run the following command in your local terminal ```bash curl http://<SPARK_IP>:1234/api/v1/models ``` -where `<SPARK_IP>` is your device's IP address." You can find your Spark’s IP address by running this on your Spark: +where `<SPARK_IP>` is your device's IP address. You can find your Spark’s IP address by running this on your Spark: ```bash hostname -I ``` +## Step 3b. (Optional) Connect with LM Link + +**LM Link** lets you use your Spark’s models from your laptop (or other devices) as if they were local, over an end-to-end encrypted connection. You don’t need to be on the same local network or bind the server to `0.0.0.0`. + +1. **Create a Link** — Go to [lmstudio.ai/link](https://lmstudio.ai/link) and follow **Create your Link** to set up your private LM Link network. +2. **Link both devices** — On your DGX Spark (llmster) and on your laptop, sign in and join the same Link. LM Link uses Tailscale mesh VPNs; devices communicate without opening ports to the internet. +3. **Use remote models** — On your laptop, open LM Studio (or use the local server). Remote models from your Spark appear in the model loader. Any tool that connects to `localhost:1234` — including the LM Studio SDK, Codex, Claude Code, OpenCode, and the scripts in Step 6 — can use those models without changing the endpoint. + +LM Link is in **Preview** and is free for up to 2 users, 5 devices each. For details and limits, see [LM Link](https://lmstudio.ai/link). + ## Step 4. 
Download a model to your Spark As an example, let's download and run gpt-oss 120B, one of the best open source models from OpenAI. This model is too large for many laptops due to memory limitations, which makes this a fantastic use case for the Spark. @@ -148,12 +172,12 @@ lms load openai/gpt-oss-120b ## Step 6. Set up a simple program that uses LM Studio SDK on the laptop -Install the LM Studio SDKs and use a simple script to send a prompt to your Spark and validate the response. To get started quickly, we provide simple scripts below for Python, Javascript, and Bash. Download the scripts from the Overview page of this playbook and run the corresponding command from the directory containing it. +Install the LM Studio SDKs and use a simple script to send a prompt to your Spark and validate the response. To get started quickly, we provide simple scripts below for Python, JavaScript, and Bash. Download the scripts from the Overview page of this playbook and run the corresponding command from the directory containing it. > [!NOTE] > Within each script, replace `<SPARK_IP>` with the IP address of your DGX Spark on your local network. -### Javascript +### JavaScript Pre-reqs: User has installed `npm` and `node` @@ -180,12 +204,13 @@ bash run.sh ## Step 7. Next Steps -Try downloading and serving different models from the [LM Studio model catalog](https://lmstudio.ai/models) +- Try downloading and serving different models from the [LM Studio model catalog](https://lmstudio.ai/models). +- Use [LM Link](https://lmstudio.ai/link) to connect more devices and use your Spark’s models from anywhere with end-to-end encryption. ## Step 8. Cleanup and rollback Remove and uninstall LM Studio completely if needed. Note that LM Studio stores models separately from the application. Uninstalling LM Studio will not remove downloaded models unless you explicitly delete them. 
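Before deleting models, it can be useful to see how much disk they actually occupy. Below is a small, hedged Python helper for that; `~/.lmstudio/models/` is the models directory this playbook names, but the helper itself is not part of the playbook's assets and works on any directory.

```python
"""Sketch: report disk usage of the LM Studio models directory
before cleanup. The `~/.lmstudio/models/` path is the one this
playbook names for downloaded models."""
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    # Sum the sizes of all regular files under `root`; returns 0
    # if the directory does not exist.
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    models = Path.home() / ".lmstudio" / "models"
    print(f"{models}: {dir_size_bytes(models) / 1e9:.1f} GB")
```

Large models like gpt-oss 120B occupy tens of gigabytes, so checking first avoids surprises when reclaiming space.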
-If you want to remove the entire LM Studio application, quit LM Studio from the tray first then move the application to trash. +If you want to remove the entire LM Studio application, quit LM Studio from the tray first, then move the application to trash. To uninstall llmster, remove the folder `~/.lmstudio/llmster`. @@ -198,6 +223,7 @@ To remove downloaded models, delete the contents of `~/.lmstudio/models/`. | API returns "model not found" error | Model not downloaded or loaded in LM Studio | Run `lms ls` to verify download status, then load model with `lms load {model-name}` | | `lms` command not found | PATH issue assuming successful installation | Refresh your shell by running `source ~/.bashrc` | | Model load fails - CUDA out of memory | Model too large for available VRAM | Switch to a smaller model or a different quantization | +| LM Link: devices not connecting or remote models not visible | Devices not in same Link, or LM Link not set up on both | Ensure both Spark and laptop are signed in and joined to the same Link at [lmstudio.ai/link](https://lmstudio.ai/link). Restart LM Studio/llmster after joining. See [LM Link](https://lmstudio.ai/link) for how it works. | > [!NOTE] @@ -209,4 +235,4 @@ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches' ``` -For latest known issues, please review the [DGX Spark User Guide](https://docs.nvidia.com/dgx/dgx-spark/known-issues.html). +For the latest known issues, please review the [DGX Spark User Guide](https://docs.nvidia.com/dgx/dgx-spark/known-issues.html). diff --git a/nvidia/speculative-decoding/README.md b/nvidia/speculative-decoding/README.md index e529994..96d61ec 100644 --- a/nvidia/speculative-decoding/README.md +++ b/nvidia/speculative-decoding/README.md @@ -8,6 +8,16 @@ - [Instructions](#instructions) - [Option 1: EAGLE-3](#option-1-eagle-3) - [Option 2: Draft Target](#option-2-draft-target) +- [Run on Two Sparks](#run-on-two-sparks) + - [Step 1. 
Configure Docker Permissions](#step-1-configure-docker-permissions) + - [Step 2. Network Setup](#step-2-network-setup) + - [Step 3. Set Container Name Variable](#step-3-set-container-name-variable) + - [Step 4. Start the TRT-LLM Multi-Node Container](#step-4-start-the-trt-llm-multi-node-container) + - [Step 5. Configure OpenMPI Hostfile](#step-5-configure-openmpi-hostfile) + - [Step 6. Launch Eagle3 Speculative Decoding](#step-6-launch-eagle3-speculative-decoding) + - [Step 7. Validate the API](#step-7-validate-the-api) + - [Step 8. Cleanup](#step-8-cleanup) + - [Step 9. Next Steps](#step-9-next-steps) - [Troubleshooting](#troubleshooting) --- @@ -24,6 +34,16 @@ This way, the big model doesn't need to predict every token step-by-step, reduci You'll explore speculative decoding using TensorRT-LLM on NVIDIA Spark using two approaches: EAGLE-3 and Draft-Target. These examples demonstrate how to accelerate large language model inference while maintaining output quality. +## Why two Sparks? + +A single DGX Spark has 128 GB of unified memory shared between the CPU and GPU. This is sufficient to run models like GPT-OSS-120B with EAGLE-3 or Llama-3.3-70B with Draft-Target, as shown in the **Instructions** tab. + +Larger models like **Qwen3-235B-A22B** exceed what a single Spark can hold in memory — even with FP4 quantization, the model weights, KV cache, and Eagle3 draft head together require more than 128 GB. By connecting two Sparks, you double the available memory to 256 GB, making it possible to serve these larger models. + +The **Run on Two Sparks** tab walks through this setup. The two Sparks are connected via QSFP cable and use **tensor parallelism (TP=2)** to split the model — each Spark holds half of every layer's weight matrices and computes its portion of each forward pass. The nodes communicate intermediate results over the high-bandwidth link using NCCL and OpenMPI, so the model operates as a single logical instance across both devices. 
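The memory argument above can be made concrete with back-of-envelope arithmetic. All numbers in this sketch are rough assumptions (FP4 at 0.5 bytes per parameter; an illustrative ~30 GB allowance for KV cache and the Eagle3 draft head; overhead assumed to split evenly under tensor parallelism), not measurements:

```python
"""Rough sketch: per-Spark memory needed to serve a large model
under tensor parallelism. Numbers are illustrative assumptions."""

SPARK_MEMORY_GB = 128  # unified CPU+GPU memory per DGX Spark


def per_node_gb(total_params_b: float, bytes_per_param: float,
                overhead_gb: float, tp: int) -> float:
    # TP splits each layer's weight matrices across `tp` nodes.
    # Overhead (KV cache, activations, draft head) is assumed to
    # split the same way here, which is a simplification.
    weights_gb = total_params_b * bytes_per_param  # billions of params -> GB
    return (weights_gb + overhead_gb) / tp


if __name__ == "__main__":
    # Qwen3-235B-A22B in FP4: ~117.5 GB of weights alone, before
    # KV cache and the Eagle3 draft head (assumed ~30 GB here).
    for tp in (1, 2):
        need = per_node_gb(235, 0.5, 30.0, tp)
        print(f"TP={tp}: ~{need:.1f} GB per Spark, fits: {need <= SPARK_MEMORY_GB}")
```

Under these assumptions a single Spark comes up short (~147.5 GB needed against 128 GB available), while TP=2 across two Sparks leaves comfortable headroom, which matches the motivation stated above.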
+ +In short: two Sparks let you run models that are too large for one, while speculative decoding (Eagle3) on top further accelerates inference by drafting and verifying multiple tokens in parallel. + ## What to know before starting - Experience with Docker and containerized applications @@ -221,6 +241,263 @@ docker stop - Test with different prompt lengths and generation parameters - Read more on Speculative Decoding [here](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html). +## Run on Two Sparks + +### Step 1. Configure Docker Permissions + +**Run on both Spark A and Spark B:** + +```bash +sudo usermod -aG docker $USER +newgrp docker +``` + +### Step 2. Network Setup + +Follow the network setup instructions from the **[Connect Two Sparks](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks)** playbook. + +> [!NOTE] +> Complete Steps 1-3 from the Connect Two Sparks playbook before proceeding: +> +> - **Step 1**: Ensure same username on both systems +> - **Step 2**: Physical hardware connection (QSFP cable) +> - **Step 3**: Network interface configuration +> - Use **Option 2: Manual IP Assignment with the netplan configure file** +> - Each Spark has two pairs of network ports. When you physically connect a cable between two Sparks, the connected ports will show as **Up**. You can use whichever pair is Up — either **`enp1s0f0np0`** and **`enP2p1s0f0np0`**, or **`enp1s0f1np1`** and **`enP2p1s0f1np1`** +> - This playbook assumes you are using **`enp1s0f1np1`** and **`enP2p1s0f1np1`**. 
If your Up interfaces are different, substitute your interface names in the commands below + +**For this playbook, we will use the following IP addresses:** + +**Spark A (Node 1):** +- `enp1s0f1np1`: 192.168.200.12/24 +- `enP2p1s0f1np1`: 192.168.200.14/24 + +**Spark B (Node 2):** +- `enp1s0f1np1`: 192.168.200.13/24 +- `enP2p1s0f1np1`: 192.168.200.15/24 + +After completing the Connect Two Sparks setup, return here to continue with the TRT-LLM container setup. + +### Step 3. Set Container Name Variable + +**Run on both Spark A and Spark B:** + +```bash +export TRTLLM_MN_CONTAINER=trtllm-multinode +``` + +### Step 4. Start the TRT-LLM Multi-Node Container + +**Run on both Spark A and Spark B:** + +```bash +docker run -d --rm \ + --name $TRTLLM_MN_CONTAINER \ + --gpus '"device=all"' \ + --network host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + --device /dev/infiniband:/dev/infiniband \ + -e UCX_NET_DEVICES="enp1s0f1np1,enP2p1s0f1np1" \ + -e NCCL_SOCKET_IFNAME="enp1s0f1np1,enP2p1s0f1np1" \ + -e OMPI_MCA_btl_tcp_if_include="enp1s0f1np1,enP2p1s0f1np1" \ + -e OMPI_MCA_orte_default_hostfile="/etc/openmpi-hostfile" \ + -e OMPI_MCA_rmaps_ppr_n_pernode="1" \ + -e OMPI_ALLOW_RUN_AS_ROOT="1" \ + -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM="1" \ + -e CPATH="/usr/local/cuda/include" \ + -e TRITON_PTXAS_PATH="/usr/local/cuda/bin/ptxas" \ + -v ~/.cache/huggingface/:/root/.cache/huggingface/ \ + -v ~/.ssh:/tmp/.ssh:ro \ + nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \ + bash -c "curl https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/trt-llm/assets/trtllm-mn-entrypoint.sh | bash" +``` + +Verify: + +```bash +docker logs -f $TRTLLM_MN_CONTAINER +``` + +Expected output at the end: + +``` +total 56K +drwx------ 2 root root 4.0K Jan 13 05:13 . +drwx------ 1 root root 4.0K Jan 13 05:12 .. 
+-rw------- 1 root root 100 Jan 13 05:13 authorized_keys +-rw------- 1 root root 45 Jan 13 05:13 config +-rw------- 1 root root 411 Jan 13 05:13 id_ed25519 +-rw-r--r-- 1 root root 102 Jan 13 05:13 id_ed25519.pub +-rw------- 1 root root 411 Jan 13 05:13 id_ed25519_shared +-rw-r--r-- 1 root root 100 Jan 13 05:13 id_ed25519_shared.pub +-rw------- 1 root root 3.4K Jan 13 05:13 id_rsa +-rw-r--r-- 1 root root 743 Jan 13 05:13 id_rsa.pub +-rw------- 1 root root 5.0K Jan 13 05:13 known_hosts +-rw------- 1 root root 3.2K Jan 13 05:13 known_hosts.old +Starting SSH +``` + +### Step 5. Configure OpenMPI Hostfile + +The hostfile tells MPI which nodes participate in distributed execution. Use the IPs from the `enp1s0f1np1` interface configured in Step 2. + +**On both Spark A and Spark B**, create the hostfile: + +```bash +cat > ~/openmpi-hostfile < /tmp/extra-llm-api-config.yml < | Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity | | Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser | | Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked | +| `mpirun` fails with SSH connection refused | SSH not configured between containers or nodes | Complete SSH setup from Connect Two Sparks playbook; verify `ssh ` works without password from both nodes | +| `mpirun` hangs or times out connecting to remote node | Hostfile IPs don't match actual node IPs | Verify IPs in `/etc/openmpi-hostfile` match the IPs assigned to network interfaces with `ip addr show` | +| NCCL error: "Socket operation on non-socket" | Wrong network interface specified | Check `ibdev2netdev` output and ensure `NCCL_SOCKET_IFNAME` and 
`UCX_NET_DEVICES` match the active interfaces `enp1s0f1np1,enP2p1s0f1np1` | +| `Permission denied (publickey)` during mpirun | SSH keys not exchanged between containers | Re-run SSH setup from Connect Two Sparks playbook or manually verify `/root/.ssh/authorized_keys` contains public keys from both nodes | +| Model download fails silently in multi-node setup | HF_TOKEN not propagated to mpirun | Add `-e HF_TOKEN=$HF_TOKEN` to `docker exec` command and `-x HF_TOKEN` to `mpirun` command | > [!NOTE] -> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. -> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within +> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. +> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within > the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with: ```bash sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'