Fine-Tune a Recommender System on DGX Station
Train and serve an HLLM product recommender with LoRA, FAISS, and an interactive UI
Overview
Basic idea
DGX Station is an all-in-one platform for training and serving enterprise-scale recommender systems. This playbook packages an end-to-end fashion recommendation pipeline on Amazon Dresses: a two-stage retriever-and-ranker, a PPO dynamic-pricing agent, a live FastAPI UI, and a high-concurrency benchmark.
The retriever uses HLLM (Hierarchical Large Language Model) in a two-tower setup: an item tower encodes product titles + descriptions, a user tower encodes interaction histories (ByteDance 2024). A nearest neighbor similarity search (FAISS) then retrieves the top 100 candidates per user via inner-product search over the 16k-item catalog (293k interactions). Then, the re-ranker, LightGBM with the lambdarank objective, orders the top 3–5 recommendations from those 100 using ~20 handcrafted features (popularity windows, user history, price ratios, embedding similarity). Together: HLLM learns what each item/user is like; LightGBM learns which of the top 100 are most likely to be the next purchase.
Once the recommender pipeline is trained, a PPO reinforcement-learning agent decides what price to charge: it picks per-item daily price multipliers based on inventory state, time-on-shelf, and item popularity to increase margin and revenue while reducing stockouts.
Finally, a FastAPI web UI demonstrates live recommendations end-to-end, and a high-concurrency benchmarking script measures throughput with up to 1M concurrent users.
What you'll accomplish
Train and serve an enterprise recommender system:
- Pre-process the Amazon Reviews 2023 dataset (dresses category subset).
- Train a two-stage recommender (HLLM retriever + LightGBM re-ranker).
- Train a PPO dynamic-pricing agent to increase margin and revenue.
- Launch a web UI showing live per-user recommendations.
- Benchmark serving throughput up to 1M concurrent users.
What to know before starting
- Comfortable using a Linux terminal.
- Basic Python environment familiarity, especially uv.
- Basic machine learning and reinforcement learning familiarity.
- Basic recommender-system concepts: users, items, interactions, retrieval, and re-ranking.
Prerequisites
Hardware:
- NVIDIA DGX Station with GB300 GPU.
- At least 80 GB available storage for the dataset, HLLM checkout, TinyLlama model, virtual environment, and checkpoints.
- Network access for GitHub, Hugging Face, Ollama, and Python package downloads.
Software:
- NVIDIA driver and CUDA toolkit visible on the host:
nvidia-smi
nvcc --version
- git, wget, and curl.
- sudo access if Ollama is not already installed.
- Optional W&B account for experiment tracking.
Ancillary files
All user-facing assets are in assets/.
- assets/setup.sh - Installs tools, syncs the Python environment, builds flash-attn, clones and patches HLLM, downloads TinyLlama/Nemotron/Amazon data, and prepares the dataset.
- assets/train_retriever.sh - Launches HLLM LoRA retriever training on Amazon Dresses.
- assets/extract_embeddings.py - Extracts HLLM item embeddings to .npy for FAISS retrieval; --regression-eval optionally runs held-out validation metrics.
- assets/train_reranker.sh and assets/train_reranker_lightgbm.py - Train the LightGBM lambdarank re-ranker on cached HLLM embeddings + handcrafted features.
- assets/pricing_agent.py and assets/pricing_agent.sh - Train and evaluate a PPO dynamic pricing agent against simulator baselines.
- assets/app.py - FastAPI recommendation UI.
- assets/launch_web_ui.sh - Starts Ollama if needed and launches the UI.
- assets/benchmark_retrieval.py - In-process throughput + latency benchmark for the retrieval engine (optionally with the LightGBM re-ranker).
- assets/teardown.sh - Stops playbook processes and can optionally remove downloaded assets while preserving checkpoints.
- assets/patches/HLLM/ - HLLM patch files applied during setup.
Time & risk
- Estimated time: About 2 hours with default settings. Setup and downloads can take about 1 hour. The flash-attn source build can take up to 30 minutes on first run. Retriever training is about 20 minutes on a GB300 with the default 1-epoch recipe at bs=512 (set PLAYBOOK_EPOCHS to a value from 3 to 5 for the production-quality recipe, which takes ~60–100 minutes). Pricing-agent training takes about 5–7 minutes. Inference, re-ranking, UI launch, and benchmarking take minutes after training.
- Risk level: Medium
- Large downloads can fail or be interrupted.
- flash-attn builds from source for the target GPU and can take time.
- HLLM training is GPU-memory intensive; avoid running other large GPU jobs during training.
- assets/teardown.sh --purge-downloads removes downloaded data/models after an explicit confirmation, but preserves checkpoints.
- Rollback: Use bash assets/teardown.sh to stop running processes. Use bash assets/teardown.sh --purge-downloads --dry-run to preview removable assets, then run without --dry-run only if you want to reclaim disk.
- Last Updated: 05/11/2026
- First standalone playbook publication.
Instructions
Step 1. Set up the development environment
Clone the playbook repository and navigate to the playbook directory:
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-rec-sys
From the playbook directory, verify the development environment is suitable:
bash assets/setup.sh --check
Expected output:
Checking pre-requisites...
[OK] GPU: NVIDIA GB300 (...)
[OK] CUDA: 13.1
[OK] Disk: ... GB available ...
[OK] git: /usr/bin/git
[OK] wget: /usr/bin/wget
[OK] curl: /usr/bin/curl
Result: ... passed, ... warnings, ... failed
[!Note] If you are low on disk space (~80 GB needed), point PLAYBOOK_WORKSPACE at a location with more room and re-run the check, for example:
export PLAYBOOK_WORKSPACE=/raid/recsys-playbook
mkdir -p "$PLAYBOOK_WORKSPACE"
bash assets/setup.sh --check
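Before re-running the check, it can help to confirm how much room a candidate workspace actually has. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the fallback to the current directory when PLAYBOOK_WORKSPACE is unset is an assumption, not necessarily what setup.sh does.

```python
import os
import shutil

REQUIRED_GB = 80  # storage figure quoted in the prerequisites


def free_gb(path: str) -> float:
    """Return free space in GB (10**9 bytes) on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """True when `path` has at least `required_gb` of free space."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    # Assumption: fall back to the current directory when the variable is
    # unset; the packaged setup script may choose a different default.
    workspace = os.environ.get("PLAYBOOK_WORKSPACE", ".")
    print(f"{workspace}: {free_gb(workspace):.1f} GB free, enough={has_room(workspace)}")
```

This only reports free space; bash assets/setup.sh --check remains the authoritative prerequisite check.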
Run setup. This installs uv and Ollama if needed, creates the virtual environment, builds flash-attn, clones and patches HLLM, downloads the LLM backbone (TinyLlama) and dataset (Amazon Reviews 2023, dresses category), processes the data, and starts Ollama. Compiling flash-attn from source can take up to an hour.
bash assets/setup.sh
Expected output includes these sections:
============================================================
Step 1: System tools (uv, Ollama)
============================================================
...
============================================================
Step 4: Clone and patch HLLM
============================================================
Applying LoRA patches from .../assets/patches/HLLM ...
LoRA patches applied.
...
============================================================
Setup complete!
============================================================
Step 2. Train the HLLM retriever
The retriever is the first stage of the two-stage pipeline: given a user's history, it returns the top-N (default 100) most similar items from the 16k-item catalog. By default we train for 1 epoch (~20 min at the bs=512 default). Train for 3-5 epochs to further refine embeddings.
The architecture is a two-tower setup (item and user towers) built on TinyLlama-1.1B fine-tuned with LoRA. Training uses 4096 sampled negatives per positive, so each gradient step contrasts the positive against roughly 25% of the catalog (4096 of ~16k items), which encourages embeddings that generalize across the catalog.
See assets/train_retriever.sh for the full training config. Checkpoints are written to $PLAYBOOK_WORKSPACE/checkpoints/dresses_lora_r16/.
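To make the sampled-negatives arithmetic concrete, here is a minimal sketch. It assumes an InfoNCE-style contrastive loss (cross-entropy of the positive against the sampled negatives); the playbook does not state the exact loss, so treat the loss function, the temperature parameter, and the helper names as illustrative assumptions rather than the packaged training code.

```python
import math

CATALOG_SIZE = 16_460  # item count reported by the embedding-extraction log
NUM_NEGATIVES = 4_096  # sampled negatives per positive


def negative_coverage(num_negatives: int, catalog_size: int) -> float:
    """Fraction of the catalog used as negatives in one step (assumes no duplicates)."""
    return num_negatives / catalog_size


def info_nce_loss(pos_score: float, neg_scores: list[float], temperature: float = 1.0) -> float:
    """Cross-entropy of the positive against sampled negatives (InfoNCE-style).

    Assumption: scores are user-item similarities; the real trainer may use a
    different loss or temperature.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


print(f"{negative_coverage(NUM_NEGATIVES, CATALOG_SIZE):.1%}")  # 24.9%
```

The loss approaches zero as the positive score pulls away from every negative, which is why more negatives per step give a harder, more informative comparison.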
Launch training:
bash assets/train_retriever.sh
Expected startup output:
============================================================
HLLM Retriever Training (LoRA, TinyLlama-1.1B)
============================================================
Model: TinyLlama-1.1B + LoRA r16
Dataset: Amazon Dresses (293K interactions)
GPU: 0
Checkpoints: .../checkpoints/dresses_lora_r16
Monitor GPU usage in another terminal with watch nvidia-smi and view the training curves in W&B (if logged in).
Step 3. Extract retriever embeddings
After training the retriever, we extract embeddings from the current checkpoint (~20s) and save them to a file for efficient access during live inference. In the full pipeline, a nearest-neighbor search over this pool of saved embeddings retrieves the 100 best recommendation candidates. The ranking model trained in the next step then orders those 100 candidates to produce the top 3-5 recommendations.
uv run python assets/extract_embeddings.py
Expected output:
Checkpoint: .../checkpoints/dresses_lora_r16/HLLM-0.pth
Output dir: .../data/processed
Mode: embeddings only
Items: 16,460, Users: 39,247
Loading checkpoint and computing item embeddings...
Item embeddings shape: (16461, 2048)
Saved to .../data/processed/hllm_item_embeddings.npy
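The retrieval step that consumes these embeddings is a top-k inner-product search. The sketch below shows the idea with a brute-force scan in plain Python; it is a stand-in for illustration, not the FAISS index the playbook builds, and the toy vectors and function names are hypothetical.

```python
import heapq


def inner_product(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_items(user_vec: list[float], item_vecs: list[list[float]], k: int = 100) -> list[tuple[int, float]]:
    """Return (item_index, score) pairs for the k highest inner products, best first."""
    scored = ((inner_product(user_vec, vec), idx) for idx, vec in enumerate(item_vecs))
    best = heapq.nlargest(k, scored)  # keeps only k candidates in memory
    return [(idx, score) for score, idx in best]


# Toy 2-dimensional embeddings; real item embeddings here are 2048-dimensional.
items = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
user = [1.0, 0.2]
print([idx for idx, _ in top_k_items(user, items, k=2)])  # [0, 2]
```

A brute-force scan is exact but scales linearly with catalog size per query; a vector index or a batched GPU matrix multiply performs the same ranking far faster at serving time.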
Step 4. Train the re-ranker
For our ranking model, we use LightGBM with the lambdarank objective trained on ~20 handcrafted features. LightGBM is a widely used choice for sparse retail datasets; its gradient-boosted decision trees extract fine-grained signals from tabular features. Because the GB300 provides high parallelism across 72 CPU cores, this CPU-heavy training job completes quickly, training ~50 trees in under 30 seconds.
bash assets/train_reranker.sh
Expected output:
============================================================
HLLM Re-ranker Training (LightGBM lambdarank)
============================================================
...
--- Building feature matrix ---
Kept .../... users (positive in top-100: ...%); ... rows × 23 features
--- Training LightGBM lambdarank ---
[50] train's ndcg@10: ... valid's ndcg@10: ...
Trained ... rounds in ...s
Top 10 features by gain:
hllm_max_hist_sim gain=... splits=...
user_unique_items gain=... splits=...
is_repurchase gain=... splits=...
...
Saved model to .../models/reranker_lightgbm/reranker_lightgbm.txt
Saved metrics to .../models/reranker_lightgbm/metrics.json
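The ndcg@10 values in the training log measure how close to the top of the list the re-ranker places the true next purchase. The following is a minimal sketch of the metric, assuming the common exponential gain (2^label - 1) and log2 rank discount; the helper names are hypothetical and this is not the playbook's evaluation code.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (rank 0 is the top)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalized by the best achievable DCG for the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# One positive among the candidates: labels listed in the model's ranked order.
print(ndcg_at_k([1, 0, 0]))            # 1.0, positive ranked first
print(round(ndcg_at_k([0, 1, 0]), 4))  # 0.6309, positive ranked second
```

With a single positive per user, NDCG reduces to 1 / log2(rank + 1), so moving the purchased item from second to first place raises the score from about 0.63 to 1.0.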
Step 5. Train the dynamic pricing agent
Once the recommender knows what products to recommend, the next question is at what price? Static pricing leaves money on the table: aging stock languishes at full price, popular items underprice their demand, and budget items hit unnecessary stockouts.
To improve revenue and margin, we train a PPO reinforcement-learning agent. The agent learns daily price multipliers based on inventory state, time-on-shelf, and item popularity. Over 200 PPO iterations, the agent delivers +5.3% revenue, +17.8% margin, and a 3.6× reduction in stockouts vs static fixed pricing on the Amazon Dresses catalog.
Training Defaults: 200 PPO iterations × 16 parallel simulator envs × 14-day horizon × 1000 dresses sampled from the catalog.
Train the agent (~5–7 minutes):
bash assets/pricing_agent.sh
Expected output:
Catalog [Amazon Dresses]: 1000 items | price median $29.99 | tiers: {'luxury': 354, 'midrange': 334, 'budget': 312}
Multipliers: [0.6, 0.7, 0.8, 0.9, 1.0, 1.05, 1.1, 1.15, 1.25], horizon: 14d
Training PPO: 200 iters × 16 envs × 14 days × 1000 items on cuda
iter 10/200 | rev/ep: 3791437 | margin/ep: 1933387 | pi_loss: -0.002 | v_loss: 27.5 | H: 1.83 | ent_c: 0.048
...
iter 200/200 | rev/ep: 3970279 | margin/ep: 2142018 | pi_loss: +0.000 | v_loss: 9.0 | H: 0.43 | ent_c: 0.005
Training complete in 369.4s
Saved checkpoint → .../models/pricing_ppo/policy.pt
Saved training curve → .../data/processed/pricing_training_curve.png
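The defaults in the log translate into a large number of pricing decisions. The sketch below works through that arithmetic and shows how a discrete action maps to a price. It assumes each environment runs one 14-day episode per iteration and each item receives one multiplier per day, which the playbook does not state explicitly; the function names are hypothetical.

```python
# Multiplier grid copied from the training log above.
MULTIPLIERS = [0.6, 0.7, 0.8, 0.9, 1.0, 1.05, 1.1, 1.15, 1.25]


def decisions_per_iteration(n_envs: int = 16, horizon_days: int = 14, n_items: int = 1000) -> int:
    """Per-item daily pricing decisions collected in one PPO iteration (assumed layout)."""
    return n_envs * horizon_days * n_items


def apply_action(base_price: float, action: int) -> float:
    """Price for one item-day given the index of the chosen multiplier."""
    return round(base_price * MULTIPLIERS[action], 2)


print(decisions_per_iteration())        # 224000
print(200 * decisions_per_iteration())  # 44800000 across 200 iterations
print(apply_action(29.99, 0))           # 17.99, deepest markdown on the median price
```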
Note
The simulator's price→demand response is parametric and calibrated, not learned from real price experiments, since our source dataset has no per-item price variation (e.g. A/B tests, markdown events, seasonal repricing). To deploy in a commercial setting, either (a) re-fit the demand model from your own price-variation logs, or (b) utilize historical sales data with offline RL.
Step 6. Launch the web UI
Launch the FastAPI app to interact with the full pipeline.
bash assets/launch_web_ui.sh
Expected output:
Loading data...
... items, ... interactions
Loading HLLM embeddings...
Building FAISS index...
Open the UI from a browser:
http://localhost:7860
[!Note] Run ssh -L 7860:localhost:7860 <Station IP Address> to enable local port forwarding if connected to DGX Station over SSH.
Step 7. Benchmark the retrieval engine
Measure the throughput and latency of the product recommendation pipeline with up to 1M concurrent requests.
uv run python assets/benchmark_retrieval.py --with-reranker
Expected output:
================================================================================
HLLM Retrieval Engine Benchmark
================================================================================
Loaded: ... items × 2048 dims, ... user embeddings. (...s)
Search backend: torch.mm + topk on cuda:0 (NVIDIA GB300)
top_k retrieval depth: 100
Running 1 users... done (...)
Running 1,000 users... done (...)
Running 10,000 users... done (...)
Running 100,000 users... done (...)
Running 1,000,000 users... done (...)
================================================================================
Summary
================================================================================
Users | Per-user | Throughput
-------------------------------------------
1 | ...ms | ... /s
1,000 | ...ms | ... /s
10,000 | ...ms | ... /s
100,000 | ...ms | ... /s
1,000,000 | ...ms | ... /s
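The two summary columns are related by simple arithmetic. The sketch below assumes the per-user figure is the batch's elapsed time divided by the number of users (amortized latency); the benchmark script may define it differently, and both helper names are hypothetical.

```python
import time
from typing import Callable


def summarize(n_users: int, elapsed_s: float) -> tuple[float, float]:
    """Return (amortized per-user latency in ms, throughput in users/s)."""
    if n_users <= 0 or elapsed_s <= 0:
        raise ValueError("n_users and elapsed_s must be positive")
    return elapsed_s / n_users * 1000.0, n_users / elapsed_s


def time_batch(fn: Callable[[int], object], n_users: int) -> tuple[float, float]:
    """Time one call of fn(n_users) and summarize it."""
    start = time.perf_counter()
    fn(n_users)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return summarize(n_users, elapsed)


per_user_ms, throughput = summarize(1_000, 0.5)
print(f"{per_user_ms:.3f} ms per user, {throughput:,.0f} users/s")  # 0.500 ms per user, 2,000 users/s
```

Under this definition, per-user latency falls as batch size grows because fixed launch overhead is spread across more users, while a single-user run reflects unamortized latency.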
Step 8. Next Steps
Next steps can include:
- Swap in a custom dataset
- Train the price optimization agent on real price variation logs
- Incorporate live user feedback to continuously refine the retrieval and ranking models. For example, you could train the ranker daily on yesterday's interaction data in ~30s and fine-tune the retriever weekly from the previous checkpoint using --resume. Continual retraining keeps recommendations responsive to catalog churn and shifting user preferences.
Step 9. Clean up
Stop running playbook processes:
bash assets/teardown.sh
To preview the downloaded assets that can be removed, run bash assets/teardown.sh --purge-downloads --dry-run. To remove all downloaded data, base models, HLLM code, and local environments while preserving checkpoints, run bash assets/teardown.sh --purge-downloads.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| bash assets/setup.sh --check reports low disk space | The Amazon raw data, processed data, model, environment, and checkpoints need a large workspace | Use a larger workspace: export PLAYBOOK_WORKSPACE=/raid/recsys-playbook && mkdir -p "$PLAYBOOK_WORKSPACE", then rerun bash assets/setup.sh --workspace "$PLAYBOOK_WORKSPACE". |
| flash-attn build takes 20-30 minutes | flash-attn is built from source for the local GPU architecture | Let the first build finish. Re-runs should skip once the editable install exists. |
| uv sync removes flash-attn | Plain uv sync reconciles the environment and can remove editable packages not represented in the lockfile | Run bash assets/setup.sh again, then use uv sync --inexact for future dependency syncs. |
| torch.OutOfMemoryError: CUDA out of memory during retriever training | Another GPU job is using memory, or a larger experimental batch/negative setting is being used | Stop unrelated GPU jobs with nvidia-smi/kill, or return to the packaged assets/train_retriever.sh settings. The script already sets PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. |
| First training step appears stalled for 60-120 seconds | torch.compile graph capture and compile happen on the first step | Wait for compilation to complete. Watch nvidia-smi; later steps should progress normally. |
| Repeated torch._dynamo hit config.recompile_limit warnings | Dynamic module attributes can cause excessive recompilation | Use the packaged scripts and patches. If experimenting, disable --torch_compile or revert to the packaged training settings. |
| ValueError: Training loss is nan | Unstable experimental training settings, usually from unsupported batch/negative/model changes | Re-run with the packaged TinyLlama LoRA settings. If modifying training, reduce learning rate or negatives and inspect the W&B loss curve. |
| RuntimeError: basic_ios::clear: iostream error or checkpoint save errors | Disk is full or the checkpoint write was interrupted | Check df -h "$PLAYBOOK_WORKSPACE", clear space, and rerun training. Checkpoints are under $PLAYBOOK_WORKSPACE/checkpoints/dresses_lora_r16/. |
| Training was interrupted (Ctrl-C, OOM, reboot) and you want to continue | Step checkpoints are saved every PLAYBOOK_SAVE_STEPS micro-batches (default 200) under $PLAYBOOK_WORKSPACE/checkpoints/dresses_lora_r16/ | Re-run with bash assets/train_retriever.sh --resume. The script picks up the latest checkpoint and restores model + optimizer + RNG state. The startup banner shows Resume: on and the log prints auto_resume: Resuming from <ckpt>. Without --resume the script trains from scratch and ignores existing checkpoints. |
| RuntimeError: PytorchStreamReader failed reading zip archive during auto-resume | The latest checkpoint file is corrupted (interrupted write, partial transfer, manual edit) | Re-run without --resume (the default) to ignore the broken checkpoint and start fresh, or manually remove the corrupted file under $PLAYBOOK_WORKSPACE/checkpoints/dresses_lora_r16/ and re-run with --resume so the trainer picks up the previous valid checkpoint. |
| W&B link does not appear | W&B is not logged in or networking is unavailable | Run uv run wandb login, then rerun bash assets/train_retriever.sh. The script prints the project URL and W&B emits the run URL after initialization. |
| ModuleNotFoundError: No module named 'faiss' when running Python directly | The ambient Python interpreter is outside the playbook uv environment | Run commands through uv run, for example uv run python assets/extract_embeddings.py. |
| Embeddings not found or similar when running Step 4 | Processed parquet files or checkpoint missing | Run setup/data prep first (bash assets/setup.sh), then training (bash assets/train_retriever.sh) to produce a checkpoint. Then re-run uv run python assets/extract_embeddings.py. |
| Ollama warning when the web UI requests an explanation | The local Ollama service is not running or nemotron-mini has not been pulled | Run ollama serve & and ollama pull nemotron-mini. |
| Web UI exits while loading data | Required processed parquet files or HLLM embeddings are missing | Run bash assets/setup.sh, train or provide a checkpoint, then run uv run python assets/extract_embeddings.py before launching the UI. |
| Benchmark requests fail with connection errors | The FastAPI UI is not running on the benchmark URL | Start bash assets/launch_web_ui.sh first, then run bash assets/benchmark_inference.sh --url http://localhost:7860. |
| FileNotFoundError: Amazon Dresses metadata not found from pricing_agent.sh | Setup/data prep has not been run for the playbook workspace | Run bash assets/setup.sh to download and prepare the Dresses dataset, or pass --synthetic to train against a generated catalog (bash assets/pricing_agent.sh train --synthetic). |
| Pricing-agent entropy (H) collapses to 0 in the first ~50 iterations | The policy committed to a single action before learning per-category specialization (rare with the default --entropy-coef-start 0.05, but can happen with very small --n-items or non-default learning rates) | Re-run with the default flags, or pass a different --seed. If it persists, lower --lr to 1e-4. |
| Pricing-agent v_loss grows by orders of magnitude during training | Value function is diverging, typically caused by overly large --lr for the catalog size | Lower --lr to 1e-4. The clipped value loss should normally contain this; if not, also lower --n-items to reduce per-iteration batch noise. |
| CUDA out of memory during pricing-agent training | Too many parallel envs × items for available GPU memory | Lower --n-envs (default 16) or --n-items (default 1000). On Station this usually only triggers when other large GPU jobs are competing. |
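For the corrupted-checkpoint case, a quick pre-check can identify which file to remove before resuming. The sketch below assumes checkpoints are zip-based archives with a .pth extension, as the PytorchStreamReader error and the HLLM-0.pth filename suggest. It validates only the archive container, not the tensors inside, and the helper names are hypothetical.

```python
import zipfile
from pathlib import Path
from typing import Optional


def is_readable_checkpoint(path: Path) -> bool:
    """True if `path` is a complete zip archive whose members pass CRC checks."""
    if not zipfile.is_zipfile(path):
        return False  # truncated or non-zip file
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.testzip() is None  # None means no corrupt member
    except Exception:
        # Any read failure is treated as an unreadable checkpoint.
        return False


def latest_valid_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Newest .pth file (by modification time) that passes the archive check."""
    candidates = sorted(ckpt_dir.glob("*.pth"), key=lambda p: p.stat().st_mtime, reverse=True)
    for path in candidates:
        if is_readable_checkpoint(path):
            return path
    return None
```

Calling latest_valid_checkpoint on the checkpoint directory and comparing the result with the newest file shows whether the most recent save is the damaged one.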