mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-25 19:33:53 +00:00
chore: Regenerate all playbooks
This commit is contained in:
parent
302c15b6cf
commit
f4c0014bf5
@ -91,16 +91,7 @@ curl http://localhost:8000/v1/chat/completions \
|
|||||||
|
|
||||||
Expected response should contain `"content": "204"` or similar mathematical calculation.
|
Expected response should contain `"content": "204"` or similar mathematical calculation.
|
||||||
|
|
||||||
## Step 3. Troubleshooting
|
## Step 3. Cleanup and rollback
|
||||||
|
|
||||||
| Symptom | Cause | Fix |
|
|
||||||
|---------|--------|-----|
|
|
||||||
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
|
|
||||||
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
|
|
||||||
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
|
|
||||||
|
|
||||||
|
|
||||||
## Step 4. Cleanup and rollback
|
|
||||||
|
|
||||||
For container approach (non-destructive):
|
For container approach (non-destructive):
|
||||||
|
|
||||||
@ -116,7 +107,7 @@ To remove CUDA 12.9:
|
|||||||
sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
|
sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
|
||||||
```
|
```
|
||||||
|
|
||||||
## Step 5. Next steps
|
## Step 4. Next steps
|
||||||
|
|
||||||
- **Production deployment:** Configure vLLM with your specific model requirements
|
- **Production deployment:** Configure vLLM with your specific model requirements
|
||||||
- **Performance tuning:** Adjust batch sizes and memory settings for your workload
|
- **Performance tuning:** Adjust batch sizes and memory settings for your workload
|
||||||
@ -127,7 +118,7 @@ sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
|
|||||||
|
|
||||||
## Step 1. Configure network connectivity
|
## Step 1. Configure network connectivity
|
||||||
|
|
||||||
Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
|
Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/stack-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.
|
||||||
|
|
||||||
This includes:
|
This includes:
|
||||||
- Physical QSFP cable connection
|
- Physical QSFP cable connection
|
||||||
@ -339,6 +330,15 @@ http://192.168.100.10:8265
|
|||||||
|
|
||||||
## Troubleshooting
|
## Troubleshooting
|
||||||
|
|
||||||
|
## Common issues for running on a single Spark
|
||||||
|
|
||||||
|
| Symptom | Cause | Fix |
|
||||||
|
|---------|--------|-----|
|
||||||
|
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
|
||||||
|
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
|
||||||
|
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
|
||||||
|
|
||||||
|
## Common Issues for running on two Starks
|
||||||
| Symptom | Cause | Fix |
|
| Symptom | Cause | Fix |
|
||||||
|---------|--------|-----|
|
|---------|--------|-----|
|
||||||
| Node 2 not visible in Ray cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |
|
| Node 2 not visible in Ray cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |
|
||||||
|
|||||||
Loading…
Reference in New Issue
Block a user