Speculative decoding speeds up text generation by using a **small, fast model** to draft several tokens ahead, then having the **larger model** quickly verify or adjust them.
A single DGX Spark has 128 GB of unified memory shared between the CPU and GPU. This is sufficient to run models like GPT-OSS-120B with EAGLE-3 or Llama-3.3-70B with Draft-Target, as shown in the **Instructions** tab.
Larger models like **Qwen3-235B-A22B** exceed what a single Spark can hold in memory — even with FP4 quantization, the model weights, KV cache, and Eagle3 draft head together require more than 128 GB. By connecting two Sparks, you double the available memory to 256 GB, making it possible to serve these larger models.
The **Run on Two Sparks** tab walks through this setup. The two Sparks are connected via QSFP cable and use **tensor parallelism (TP=2)** to split the model — each Spark holds half of every layer's weight matrices and computes its portion of each forward pass. The nodes communicate intermediate results over the high-bandwidth link using NCCL and OpenMPI, so the model operates as a single logical instance across both devices.
In short: two Sparks let you run models that are too large for one, while speculative decoding (Eagle3) on top further accelerates inference by drafting and verifying multiple tokens in parallel.
To manage containers without `sudo`, your user must belong to the `docker` group. If you skip this step, you will need to prefix every Docker command with `sudo`.
If you see a permission error such as `permission denied while trying to connect to the Docker daemon socket`, add your user to the `docker` group so that you no longer need `sudo`.
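The group change described above can be applied with the standard Docker post-install commands (assuming a Linux host such as DGX OS):

```shell
# Add the current user to the docker group (takes effect on next login)
sudo usermod -aG docker "$USER"

# Apply the new group membership in the current shell without logging out
newgrp docker

# Verify that Docker now works without sudo
docker ps
```

Note that `newgrp` only affects the current shell; other open terminals still need a fresh login to pick up the new group.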
Once the server is running, test it by making an API call from another terminal:
```bash
# Test completion endpoint
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-120b",
"prompt": "Solve the following problem step by step. If a train travels 180 km in 3 hours, and then slows down by 20% for the next 2 hours, what is the total distance traveled? Show all intermediate calculations and provide a final numeric answer.",
- **Simpler deployment** — Instead of managing a separate draft model, EAGLE-3 uses a built-in drafting head that generates speculative tokens internally.
- **Better accuracy** — By fusing features from multiple layers of the model, draft tokens are more likely to be accepted, reducing wasted computation.
- **Faster generation** — Multiple tokens are verified in parallel per forward pass, cutting down the latency of autoregressive inference.
### Option 2: Draft-Target
Execute the following command to set up and run draft target speculative decoding:
> - Use **Option 2: Manual IP Assignment with the netplan configure file**
> - Each Spark has two pairs of network ports. When you physically connect a cable between two Sparks, the connected ports will show as **Up**. You can use whichever pair is Up — either **`enp1s0f0np0`** and **`enP2p1s0f0np0`**, or **`enp1s0f1np1`** and **`enP2p1s0f1np1`**
> - This playbook assumes you are using **`enp1s0f1np1`** and **`enP2p1s0f1np1`**. If your Up interfaces are different, substitute your interface names in the commands below
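To identify which interface pair is Up on your Sparks, you can list the link states (the interface names below are the Spark port names described in the note above):

```shell
# Brief link-state summary for all interfaces; look for the pair reporting UP
ip -br link show

# Or inspect one candidate interface directly
ip link show enp1s0f1np1
```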
**For this playbook, we will use the following IP addresses:**
**Spark A (Node 1):**
- `enp1s0f1np1`: 192.168.200.12/24
- `enP2p1s0f1np1`: 192.168.200.14/24
**Spark B (Node 2):**
- `enp1s0f1np1`: 192.168.200.13/24
- `enP2p1s0f1np1`: 192.168.200.15/24
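After applying the netplan configuration, you can confirm the addresses were assigned and that the link is reachable (interface names and IPs as in the plan above; run this on Spark A):

```shell
# Confirm the static IPs landed on the QSFP interfaces
ip addr show enp1s0f1np1 | grep 192.168.200.12
ip addr show enP2p1s0f1np1 | grep 192.168.200.14

# Test connectivity to Spark B over the direct link
ping -c 3 192.168.200.13
```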
After completing the Connect Two Sparks setup, return here to continue with the TRT-LLM container setup.
### Step 3. Set Container Name Variable
**Run on both Spark A and Spark B:**
```bash
export TRTLLM_MN_CONTAINER=trtllm-multinode
```
### Step 4. Start the TRT-LLM Multi-Node Container
```
Warning: Permanently added '[192.168.200.13]:2233' (ED25519) to the list of known hosts.
spark-afe0
spark-ae11
nvidia@spark-afe0:~$
```
### Step 6. Launch Eagle3 Speculative Decoding
Eagle3 speculative decoding accelerates inference by predicting multiple tokens ahead, then validating them in parallel. This can provide significant speedup compared to standard autoregressive generation.
#### Set your Hugging Face token
```bash
export HF_TOKEN=your_huggingface_token_here
```
#### Download the Eagle3 speculative model on both nodes
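A sketch of the download step, assuming the draft model is pulled from Hugging Face with `huggingface-cli` — `<eagle3-draft-repo>` is a placeholder for the actual Eagle3 draft model repo id for your base model:

```shell
# Run on both Spark A and Spark B; uses the HF_TOKEN exported above.
# <eagle3-draft-repo> is a placeholder — substitute the real Eagle3 repo id.
huggingface-cli download <eagle3-draft-repo> --token "$HF_TOKEN"
```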
#### Launch the server with Eagle3 speculative decoding
**Run on Spark A only.** This starts the TensorRT-LLM API server using the FP4 base model with Eagle3 speculative decoding enabled. The `mpirun` command coordinates execution across both nodes, so it only needs to be launched from Spark A. The maximum token length is set to 1024 (adjust as needed).
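Once the server reports it is ready, a quick sanity check from Spark A (assuming the server listens on the default port 8000 and exposes the OpenAI-compatible API):

```shell
# List the models the server is serving; the base model id should appear
curl http://localhost:8000/v1/models
```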
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens) and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) in your web browser |
| `mpirun` fails with SSH connection refused | SSH not configured between containers or nodes | Complete SSH setup from Connect Two Sparks playbook; verify `ssh <node_ip>` works without password from both nodes |
| `mpirun` hangs or times out connecting to remote node | Hostfile IPs don't match actual node IPs | Verify IPs in `/etc/openmpi-hostfile` match the IPs assigned to network interfaces with `ip addr show` |
| NCCL error: "Socket operation on non-socket" | Wrong network interface specified | Check `ibdev2netdev` output and ensure `NCCL_SOCKET_IFNAME` and `UCX_NET_DEVICES` match the active interfaces `enp1s0f1np1,enP2p1s0f1np1` |
| `Permission denied (publickey)` during mpirun | SSH keys not exchanged between containers | Re-run SSH setup from Connect Two Sparks playbook or manually verify `/root/.ssh/authorized_keys` contains public keys from both nodes |
| Model download fails silently in multi-node setup | HF_TOKEN not propagated to mpirun | Add `-e HF_TOKEN=$HF_TOKEN` to `docker exec` command and `-x HF_TOKEN` to `mpirun` command |