mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-26 20:03:52 +00:00

Commit c2519e9cb0: Merge ce1609e892 into 3ba4d58f1e
@@ -101,7 +101,7 @@ Build llama.cpp with CUDA enabled and targeting the GB10's sm_121 compute architecture
 ```bash
 mkdir build && cd build
-cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
+cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF -DGGML_CUDA_FA_ALL_QUANTS=ON
 make -j8
 ```
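The added `-DGGML_CUDA_FA_ALL_QUANTS=ON` option compiles the CUDA flash-attention kernels for all KV-cache quantization types. As a quick sanity check (a sketch, assuming it is run from the llama.cpp source root after the configure step above), the CMake cache should record both the new option and the sm_121 target:

```shell
# Verify the configure step recorded the flash-attention option and the
# GPU architecture target; fall back to a notice if build/ doesn't exist yet.
grep -h 'GGML_CUDA_FA_ALL_QUANTS\|CMAKE_CUDA_ARCHITECTURES' build/CMakeCache.txt 2>/dev/null \
  || echo "build/ not configured yet"
```

With the flag applied, the cache should contain a line like `GGML_CUDA_FA_ALL_QUANTS:BOOL=ON`.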
@@ -128,6 +128,7 @@ Launch the inference server with the Nemotron model. The server provides an Open
 --model ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
 --host 0.0.0.0 \
 --port 30000 \
+--flash-attn 1 \
 --n-gpu-layers 99 \
 --ctx-size 8192 \
 --threads 8
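Once the server is up, it can be exercised through llama.cpp's OpenAI-compatible HTTP API. A minimal sketch (the prompt text and `max_tokens` value are illustrative, and it assumes the launch command above is running on port 30000):

```shell
# Build an OpenAI-style request body for the /v1/chat/completions endpoint.
read -r -d '' BODY <<'EOF' || true
{
  "messages": [{"role": "user", "content": "Say hello in one sentence."}],
  "max_tokens": 64
}
EOF
printf '%s\n' "$BODY"

# Send it to the server started above (uncomment once the server is running):
# curl -s http://localhost:30000/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$BODY"
```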
@@ -136,6 +137,7 @@ Launch the inference server with the Nemotron model. The server provides an Open
 **Parameter explanation:**
 - `--host 0.0.0.0`: Listen on all network interfaces
 - `--port 30000`: API server port
+- `--flash-attn 1`: Enables Flash Attention
 - `--n-gpu-layers 99`: Offload all layers to GPU
 - `--ctx-size 8192`: Context window size (can increase up to 1M)
 - `--threads 8`: CPU threads for non-GPU operations
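The `--ctx-size` note ("can increase up to 1M") trades context length for KV-cache memory. A back-of-envelope sketch of that cost, assuming an FP16 KV cache; the layer/head numbers below are placeholders, not the real Nemotron-3-Nano configuration (read the actual values from the GGUF metadata):

```shell
# Rough FP16 KV-cache size: 2 (K and V) * layers * ctx * kv_heads * head_dim * 2 bytes.
ctx=8192
n_layers=32     # placeholder value
n_kv_heads=8    # placeholder value
head_dim=128    # placeholder value
kv_bytes=$(( 2 * n_layers * ctx * n_kv_heads * head_dim * 2 ))
echo "ctx=${ctx}: $(( kv_bytes / 1024 / 1024 )) MiB of KV cache"
```

The KV cache grows linearly with `--ctx-size`, which is why pushing toward very large contexts is primarily a memory question.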