mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git (synced 2026-04-23 02:23:53 +00:00)
chore: Regenerate all playbooks
parent 8499e486ff
commit 5757a85b1e
@@ -222,7 +222,7 @@ Now that you have DGX Dashboard configured, you can:

 | Symptom | Cause | Fix |
 |---------|-------|-----|
-| User can't run updates | User not in sudo group | Add user to sudo group: `sudo usermod -aG sudo <USERNAME>` |
+| User can't run updates | User not in sudo group | Add user to sudo group: `sudo usermod -aG sudo <USERNAME>`; then run `newgrp docker`|
 | JupyterLab won't start | Issue with current virtual environment | Change the working directory in the JupyterLab panel and start a new instance |
 | SSH tunnel connection refused | Incorrect IP or port | Verify Spark device IP and ensure SSH service is running |
 | GPU not visible in monitoring | Driver issues | Check GPU status with `nvidia-smi` |
@@ -83,12 +83,11 @@ uname -m
 docker run --gpus all --rm nvcr.io/nvidia/cuda:13.0.1-runtime-ubuntu24.04 nvidia-smi
 ```

-If the `docker` command fails with a permission error, you can either run the command with `sudo`, or add yourself to the `docker` group to use `docker` without `sudo`.
+If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group, so that you don't need to use the command with sudo .

 ```bash
 sudo usermod -aG docker $USER
 newgrp docker
-sudo systemctl restart docker
 ```

 ## Step 2. Clone the playbook repository
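
The playbook text in this hunk keys its advice on one specific error message. A minimal sketch of how that check could be scripted follows. The function names `classify_docker_error` and `docker_access_status` are hypothetical and not part of the playbooks. The two message fragments matched are the one quoted in the playbook and the usual Docker CLI message for an unreachable daemon.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify the stderr text of a failed docker command.
classify_docker_error() {
  case "$1" in
    *"permission denied while trying to connect to the Docker daemon socket"*)
      echo "permission" ;;   # user lacks access to the socket (docker group)
    *"Cannot connect to the Docker daemon"*)
      echo "daemon-down" ;;  # daemon not running or socket missing
    *)
      echo "other" ;;
  esac
}

# Hypothetical wrapper: run `docker ps`, capture only stderr, report a status word.
docker_access_status() {
  local err
  if err="$(docker ps 2>&1 >/dev/null)"; then
    echo "ok"
  else
    classify_docker_error "$err"
  fi
}
```

Calling `docker_access_status` prints `permission` in the case the playbook describes. Only then is `sudo usermod -aG docker $USER` the relevant fix.
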
@@ -61,7 +61,7 @@ Open a new terminal and test Docker access. In the terminal, run:
 docker ps
 ```

-If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group, so that you don't need to use the command with sudo .
+If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

 ```bash
 sudo usermod -aG docker $USER
@@ -80,15 +80,13 @@ Open a new terminal and test Docker access. In the terminal, run:
 docker ps
 ```

-If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:
+If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

 ```bash
 sudo usermod -aG docker $USER
+newgrp docker
 ```

-> **Warning**: After running usermod, you must log out and log back in to start a new
-> session with updated group permissions.
-
 ## Step 2. Prepare the environment

 Create a local output directory where the quantized model files will be stored. This directory will be mounted into the container to persist results after the container exits.
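
The hunk above swaps the log-out warning for `newgrp docker`. Both address the same issue: `usermod` updates the account's group list, but a shell that is already running keeps the groups it started with. The sketch below shows one way to tell the two states apart. All function names are hypothetical. It assumes `id -nG <user>` reports the account's recorded groups and `id -nG` with no argument reports the current session's groups, as GNU coreutils does.

```bash
#!/usr/bin/env bash
# Return 0 if the space-separated group list in $1 contains exactly the name in $2.
group_list_contains() {
  local list="$1" wanted="$2" g
  for g in $list; do            # intentional word splitting on the group list
    [ "$g" = "$wanted" ] && return 0
  done
  return 1
}

# Groups recorded for the account (what a fresh login would receive).
account_groups() { id -nG "$1"; }

# Groups active in the current shell session.
session_groups() { id -nG; }

# Print "missing", "stale-session", or "ok" for the given user and group.
check_group() {
  local user="$1" group="$2"
  if ! group_list_contains "$(account_groups "$user")" "$group"; then
    echo "missing"          # usermod has not been applied for this group
  elif ! group_list_contains "$(session_groups)" "$group"; then
    echo "stale-session"    # membership exists, but this shell predates it
  else
    echo "ok"
  fi
}
```

For example, `check_group "$USER" docker` printing `stale-session` would indicate that `newgrp docker` or a fresh login is still needed.
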
@@ -56,7 +56,7 @@ Open a new terminal and test Docker access. In the terminal, run:
 docker ps
 ```

-If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group, so that you don't need to use the command with sudo .
+If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

 ```bash
 sudo usermod -aG docker $USER
@@ -177,7 +177,7 @@ Open the Terminal app from NVIDIA Sync to start an interactive SSH session and t
 docker ps
 ```

-If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:
+If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

 ```bash
 sudo usermod -aG docker $USER
@@ -53,15 +53,13 @@ Open a new terminal and test Docker access. In the terminal, run:
 docker ps
 ```

-If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:
+If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

 ```bash
 sudo usermod -aG docker $USER
+newgrp docker
 ```

-> **Warning**: After running usermod, you must log out and log back in to start a new
-> session with updated group permissions.
-
 ## Step 2. Pull the latest Pytorch container

 ```bash
@@ -68,7 +68,7 @@ Open a new terminal and test Docker access. In the terminal, run:
 docker ps
 ```

-If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group, so that you don't need to use the command with sudo .
+If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

 ```bash
 sudo usermod -aG docker $USER
@@ -130,7 +130,7 @@ Open a new terminal and test Docker access. In the terminal, run:
 docker ps
 ```

-If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:
+If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

 ```bash
 sudo usermod -aG docker $USER
@@ -429,13 +429,12 @@ Open a new terminal and test Docker access. In the terminal, run:
 docker ps
 ```

-If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:
+If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

 ```bash
 sudo usermod -aG docker $USER
 newgrp docker
 ```
-
 ### Step 3. Install NVIDIA Container Toolkit & setup Docker environment

 Ensure the NVIDIA drivers and the NVIDIA Container Toolkit are installed on each node (both manager and workers) that will provide GPU resources. This package enables Docker containers to access the host's GPU hardware. Ensure you complete the [installation steps](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), including the [Docker configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) for NVIDIA Container Toolkit.
@@ -43,16 +43,16 @@ The setup includes:

 ## Time & risk

-⏱️ **Duration**:
+**Duration**:
 - 2-3 minutes for initial setup and container deployment
 - 5-10 minutes for Ollama model download (depending on model size)
 - Immediate document processing and knowledge graph generation

-⚠️ **Risks**:
+**Risks**:
 - GPU memory requirements depend on chosen Ollama model size
 - Document processing time scales with document size and complexity

-↩️ **Rollback**: Stop and remove Docker containers, delete downloaded models if needed
+**Rollback**: Stop and remove Docker containers, delete downloaded models if needed

 ## Instructions
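
The rollback line in this hunk names the actions but not the commands. As a rough sketch only, the stop-and-remove step could look like the function below. `rollback_containers` and the `DRY_RUN` variable are hypothetical, and the container names are whatever the playbook's deployment actually created (not shown in this diff). The function defaults to printing the commands rather than running them.

```bash
#!/usr/bin/env bash
# Hypothetical helper: stop and remove each named container.
# With DRY_RUN=1 (the default here) it only prints the docker commands.
rollback_containers() {
  local name
  for name in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      printf 'docker stop %s\ndocker rm %s\n' "$name" "$name"
    else
      docker stop "$name" && docker rm "$name"
    fi
  done
}
```

Running `rollback_containers <name>...` first in dry-run mode gives a chance to review the list before repeating the call with `DRY_RUN=0`.
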
@@ -67,15 +67,13 @@ Open a new terminal and test Docker access. In the terminal, run:
 docker ps
 ```

-If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:
+If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

 ```bash
 sudo usermod -aG docker $USER
+newgrp docker
 ```

-> **Warning**: After running usermod, you must log out and log back in to start a new
-> session with updated group permissions.
-
 ## Step 2. Clone the repository

 In a terminal, clone the repository and navigate to the VLM fine-tuning directory.
@@ -81,17 +81,13 @@ Open a new terminal and test Docker access. In the terminal, run:
 docker ps
 ```

-If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:
+If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group so that you don't need to run the command with sudo .

 ```bash
 sudo usermod -aG docker $USER
+newgrp docker
 ```

-> **Warning**: After running usermod, you must log out and log back in to start a new
-> session with updated group permissions, or in rare cases restart their spark for the
-> changes to take effect.
-
-
 Additionally, configure Docker so that it can use the NVIDIA Container Runtime.