Merge ce1609e892 into 51615570a7

chore: Regenerate all playbooks
Add Nemotron-3-Nano playbook using llama.cpp with Flash Attention
2026-06-21 05:39:31 +00:00 · 2026-05-22 09:18:02 +08:00 · 2026-05-18 17:50:57 +00:00 · 2026-01-07 14:48:10 +05:00
2 changed files with 5 additions and 3 deletions
--- a/nvidia/nemotron/README.md
+++ b/nvidia/nemotron/README.md
@@ -101,7 +101,7 @@ Build llama.cpp with CUDA enabled and targeting the GB10's sm_121 compute archit

 ```bash
 mkdir build && cd build
-cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
+cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF -DGGML_CUDA_FA_ALL_QUANTS=ON 
 make -j8
 ```

@@ -128,6 +128,7 @@ Launch the inference server with the Nemotron model. The server provides an Open
  --model ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
  --host 0.0.0.0 \
  --port 30000 \
+  --flash-attn 1 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --threads 8
@@ -136,6 +137,7 @@ Launch the inference server with the Nemotron model. The server provides an Open
 **Parameter explanation:**
 - `--host 0.0.0.0`: Listen on all network interfaces
 - `--port 30000`: API server port
+- `--flash-attn 1`: Enables Flash Attention
 - `--n-gpu-layers 99`: Offload all layers to GPU
 - `--ctx-size 8192`: Context window size (can increase up to 1M)
 - `--threads 8`: CPU threads for non-GPU operations
--- a/nvidia/vllm/README.md
+++ b/nvidia/vllm/README.md
@@ -218,7 +218,7 @@ Obtain the vLLM cluster deployment script on both nodes. This script orchestrate

 ```bash
 ## Download on both nodes
-wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/online_serving/run_cluster.sh
+wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/ray_serving/run_cluster.sh
 chmod +x run_cluster.sh
 ```

@@ -445,7 +445,7 @@ Download the vLLM cluster deployment script on all nodes. This script orchestrat

 ```bash
 ## Download on all nodes
-wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/online_serving/run_cluster.sh
+wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/ray_serving/run_cluster.sh
 chmod +x run_cluster.sh
 ```
Author	SHA1	Message	Date
Shakhizat Nurgaliyev	8a2b604e06	Merge `ce1609e892` into `51615570a7`	2026-05-22 09:18:02 +08:00
GitLab CI	51615570a7	chore: Regenerate all playbooks	2026-05-18 17:50:57 +00:00
Shakhizat Nurgaliyev	ce1609e892	Add Nemotron-3-Nano playbook using llama.cpp with Flash Attention	2026-01-07 14:48:10 +05:00
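
Review note: the first hunk adds `-DGGML_CUDA_FA_ALL_QUANTS=ON` to the CMake configure line, and the server is then launched with `--flash-attn 1`. A stale build directory could silently keep the old configuration, so it may be worth checking the CMake cache before launching. The sketch below is one possible check, not part of the playbook. It assumes the build directory is `build/` as in the README, and the helper names `cmake_option_is_on` and `check_build` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of llama.cpp or CMake.

# Return 0 if OPTION is recorded with a true value in a CMake cache file.
# CMakeCache.txt stores entries as NAME:TYPE=VALUE.
cmake_option_is_on() {
  local cache_file="$1" option="$2"
  grep -Eq "^${option}:[A-Z]+=(ON|1|TRUE|YES)\$" "$cache_file"
}

# Report the state of the two options that the configure line in this change sets.
# Assumption: the cache lives at build/CMakeCache.txt unless a path is given.
check_build() {
  local cache_file="${1:-build/CMakeCache.txt}"
  local option
  for option in GGML_CUDA GGML_CUDA_FA_ALL_QUANTS; do
    if cmake_option_is_on "$cache_file" "$option"; then
      echo "$option=ON"
    else
      echo "$option is not enabled"
    fi
  done
}
```

After sourcing the block from the llama.cpp source root, `check_build` prints one line per option. If either option is reported as not enabled, rerunning the `cmake ..` line from the diff in a clean build directory should bring the cache in line.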