mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-04-25 11:23:52 +00:00

github-actions[bot] 4d0d20d39f chore: regenerate skills/ from upstream playbooks [skip ci]

2026-04-19 09:25:00 +00:00

1.3 KiB

Raw Blame History

name	description
dgx-spark-trt-llm	Install and use TensorRT-LLM on DGX Spark — on NVIDIA DGX Spark. Use when setting up trt-llm on Spark hardware.

TRT LLM for Inference

Install and use TensorRT-LLM on DGX Spark

NVIDIA TensorRT-LLM (TRT-LLM) is an open-source library for optimizing and accelerating large language model (LLM) inference on NVIDIA GPUs.

It provides highly efficient kernels, memory management, and parallelism strategies—like tensor, pipeline, and sequence parallelism—so developers can serve LLMs with lower latency and higher throughput.

TRT-LLM integrates with frameworks like Hugging Face and PyTorch, making it easier to deploy state-of-the-art models at scale.

Outcome: You'll set up TensorRT-LLM to optimize and deploy large language models on your DGX Spark, achieving significantly higher throughput and lower latency than standard PyTorch inference through kernel-level optimizations, efficient memory layouts, and advanced quantization.

Duration: 45-60 minutes for setup and API server deployment · Risk: Medium - container pulls and model downloads may fail due to network issues

Full playbook: /home/runner/work/dgx-spark-playbooks/dgx-spark-playbooks/nvidia/trt-llm/README.md

1.3 KiB Raw Blame History

TRT LLM for Inference

1.3 KiB

Raw Blame History