From 756ec60b0a656491ca67b0f6385f5123c50fdfaf Mon Sep 17 00:00:00 2001
From: GitLab CI
Date: Thu, 12 Mar 2026 04:22:56 +0000
Subject: [PATCH] chore: Regenerate all playbooks

---
 nvidia/trt-llm/README.md |  6 +++---
 nvidia/vllm/README.md    | 13 +++++-------
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/nvidia/trt-llm/README.md b/nvidia/trt-llm/README.md
index 76038c6..4641577 100644
--- a/nvidia/trt-llm/README.md
+++ b/nvidia/trt-llm/README.md
@@ -75,6 +75,7 @@ The following models are supported with TensorRT-LLM on Spark. All listed models
 
 | Model | Quantization | Support Status | HF Handle |
 |-------|-------------|----------------|-----------|
+| **Nemotron-3-Super-120B** | FP8 | ✅ | `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8` |
 | **GPT-OSS-20B** | MXFP4 | ✅ | `openai/gpt-oss-20b` |
 | **GPT-OSS-120B** | MXFP4 | ✅ | `openai/gpt-oss-120b` |
 | **Llama-3.1-8B-Instruct** | FP8 | ✅ | `nvidia/Llama-3.1-8B-Instruct-FP8` |
@@ -103,9 +104,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
 * **Duration**: 45-60 minutes for setup and API server deployment
 * **Risk level**: Medium - container pulls and model downloads may fail due to network issues
 * **Rollback**: Stop inference servers and remove downloaded models to free resources.
-* **Last Updated:** 01/02/2026
-  * Improve TRT-LLM Run on Two Sparks workflow
-  * Upgrade to the latest TRT-LLM container v1.2.0rc6
+* **Last Updated:** 03/12/2026
+  * Introduce Nemotron-3-Super-120B support on TRT-LLM
 
 ## Single Spark
 
diff --git a/nvidia/vllm/README.md b/nvidia/vllm/README.md
index 04cf5d2..cd2a5ad 100644
--- a/nvidia/vllm/README.md
+++ b/nvidia/vllm/README.md
@@ -53,6 +53,7 @@ The following models are supported with vLLM on Spark. All listed models are ava
 
 | Model | Quantization | Support Status | HF Handle |
 |-------|-------------|----------------|-----------|
+| **Nemotron-3-Super-120B** | FP8 | ✅ | [`nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8`](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8) |
 | **GPT-OSS-20B** | MXFP4 | ✅ | [`openai/gpt-oss-20b`](https://huggingface.co/openai/gpt-oss-20b) |
 | **GPT-OSS-120B** | MXFP4 | ✅ | [`openai/gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) |
 | **Llama-3.1-8B-Instruct** | FP8 | ✅ | [`nvidia/Llama-3.1-8B-Instruct-FP8`](https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8) |
@@ -87,9 +88,9 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
 * **Duration:** 30 minutes for Docker approach
 * **Risks:** Container registry access requires internal credentials
 * **Rollback:** Container approach is non-destructive.
-* **Last Updated:** 01/22/2026
-  * Added support for Qwen3-VL-Reranker-2B, Qwen3-VL-Reranker-8B, and Qwen3-VL-Embedding-2B models
-  * Updated container to January 2026 release (26.01-py3)
+* **Last Updated:** 03/12/2026
+  * Added support for the Nemotron-3-Super-120B model
+  * Updated container to February 2026 release (26.02-py3)
 
 ## Instructions
 
@@ -117,15 +118,11 @@ Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/
 export LATEST_VLLM_VERSION=
 
 ## example
-## export LATEST_VLLM_VERSION=26.01-py3
+## export LATEST_VLLM_VERSION=26.02-py3
 
 docker pull nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}
 ```
 
-```bash
-docker pull nvcr.io/nvidia/vllm:26.01-py3
-```
-
 ## Step 3. Test vLLM in container
 
 Launch the container and start vLLM server with a test model to verify basic functionality.