mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-06-18 20:42:20 +00:00

Topic Modeling

Extract insights from massive text datasets using cuML's GPU-accelerated BERTopic

Overview

Basic idea

Topic modeling helps you discover hidden themes in large document collections—but traditional methods crawl when datasets grow to millions of records. This playbook shows how to process 40 million Amazon product reviews in minutes using GPU-accelerated BERTopic.

BERTopic combines transformer embeddings with clustering to extract human-readable topics from text. By swapping CPU-based UMAP and HDBSCAN for the GPU-accelerated versions in RAPIDS cuML, you get comparable topics dramatically faster, with no changes to your modeling code.

Drop-in GPU acceleration: Load cuml.accel and your existing UMAP/HDBSCAN code runs on GPU automatically
Scale to millions: Process datasets that would take hours on CPU in minutes on GPU
Interactive visualizations: Explore topic distributions, relationships, and document clusters
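
In a notebook the accelerator is loaded with `%load_ext cuml.accel`. In a plain Python script, a rough equivalent is to call `cuml.accel.install()` before `umap` or `hdbscan` are imported. The sketch below assumes that entry point is available in the installed cuML release and falls back to CPU when cuML is missing; the helper name `enable_gpu_acceleration` is made up for illustration.

```python
def enable_gpu_acceleration() -> bool:
    """Try to switch on cuML's accelerator; return True if it was installed."""
    try:
        import cuml.accel  # assumption: provided by the RAPIDS cuML install
    except ImportError:
        return False  # cuML not installed: UMAP/HDBSCAN stay on CPU
    cuml.accel.install()
    return True


# Call this before importing umap / hdbscan so those imports can be intercepted.
gpu_enabled = enable_gpu_acceleration()
print("cuML accelerator active:", gpu_enabled)
```

The rest of the script then imports `umap` and `hdbscan` as usual, which is what makes the acceleration "drop-in".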

What you'll accomplish

You'll run a complete topic modeling pipeline on 40 million product reviews and generate interactive visualizations of discovered topics.

By the end, you'll be able to:

Use cuML's drop-in accelerators for UMAP and HDBSCAN
Generate sentence embeddings at scale with SentenceTransformers
Create topic visualizations including heatmaps, barcharts, and document datamaps

What to know before starting

Experience with Python and Jupyter notebooks
Basic understanding of machine learning concepts (embeddings, clustering)
Familiarity with pandas DataFrames

Prerequisites

Hardware Requirements:

NVIDIA DGX Station with GB300 GPU
Minimum 64GB GPU memory for processing 40M documents
At least 50GB available storage for dataset and embeddings

Software Requirements:

Conda (Miniconda or Anaconda); verify with: conda --version
CUDA 13.0-compatible drivers; verify with: nvidia-smi
Network access to download the Amazon Reviews dataset (~14GB compressed)
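
A quick preflight check can catch missing prerequisites before the long download starts. The sketch below uses only the Python standard library; the function name `preflight` and the returned keys are hypothetical, and it does not verify GPU memory (use `nvidia-smi` for that).

```python
import shutil
from pathlib import Path


def preflight(min_free_gb: float = 50.0, path: str = ".") -> dict:
    """Collect simple pass/fail checks for the prerequisites listed above."""
    free_gb = shutil.disk_usage(Path(path)).free / 1e9
    return {
        "conda_on_path": shutil.which("conda") is not None,
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "enough_disk": free_gb >= min_free_gb,
        "free_disk_gb": round(free_gb, 1),
    }


for name, value in preflight().items():
    print(f"{name}: {value}")
```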

Ancillary files

All required assets are in the playbook directory nvidia/station-topic-modeling/assets (see Instructions, Step 7). Key file:

video_notebook_for_GPU_Accelerated_Machine_Learning_BERTopic_RTX6000_40M.ipynb - Complete Jupyter notebook with GPU-accelerated topic modeling pipeline (filename reflects original demo hardware; the notebook runs on GB300 and other NVIDIA GPUs)

Time & risk

Estimated time: 45 minutes (includes environment setup, dataset download, and embedding generation)
Risk level: Low
- Large dataset download (~14GB) may take time depending on network speed
- Embedding generation requires significant GPU memory
Rollback: Delete the downloaded dataset and any generated embedding files to restore state
Last Updated: 03/02/2026
- First Publication

Instructions

Step 1. (DGX Station) Hugging Face cache permissions

On DGX Station, ensure the Hugging Face cache is writable so model downloads succeed:

mkdir -p "$HOME/.cache/huggingface"
sudo chown -R "$USER:$USER" "$HOME/.cache/huggingface" 2>/dev/null || true
sudo chmod -R u+rwX "$HOME/.cache/huggingface" 2>/dev/null || true

If you see "Permission denied" when downloading models later, run the chown/chmod lines with your username (e.g. nvidia).
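
To confirm the result from Python, a small check like the one below may help. It assumes the default cache location (`~/.cache/huggingface`, which the `HF_HOME` environment variable can override) and only tests the top-level directory, so root-owned files deeper in the tree can still cause errors. The helper name `cache_is_writable` is illustrative.

```python
import os
from pathlib import Path


def cache_is_writable(cache_dir: Path) -> bool:
    """Return True if cache_dir exists (or can be created) and is writable."""
    try:
        cache_dir.mkdir(parents=True, exist_ok=True)
    except OSError:
        return False
    return os.access(cache_dir, os.W_OK | os.X_OK)


hf_cache = Path.home() / ".cache" / "huggingface"  # assumed default location
print(f"{hf_cache} writable: {cache_is_writable(hf_cache)}")
```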

Step 2. Install RAPIDS cuDF and cuML

Create a new conda environment with RAPIDS libraries for GPU-accelerated data processing.

conda create -n rapids-25.10 \
  -c rapidsai -c conda-forge \
  cudf=25.10 cuml=25.10 python=3.11 'cuda-version=13.0'

This installs cuDF (GPU DataFrame library) and cuML (GPU machine learning library) that provide drop-in acceleration for pandas and scikit-learn operations.

Step 3. Activate the conda environment

conda activate rapids-25.10

Step 4. Install machine learning packages

Install UMAP, HDBSCAN, BERTopic, and supporting libraries for topic modeling. Note: installing datamapplot upgrades dask and distributed; the second command below pins them back.

pip install \
  transformers datasets sentence-transformers \
  umap-learn hdbscan==0.8.40 bertopic matplotlib \
  scikit-learn==1.4.2 datamapplot

Pin dask/distributed back to RAPIDS-compatible versions:

pip install "dask==2025.9.1" "distributed==2025.9.1"

These packages provide:

dask: Parallel computing library
distributed: Distributed task scheduler for dask
sentence-transformers: Generate text embeddings
umap-learn / hdbscan: Dimensionality reduction and clustering (GPU-accelerated via cuML)
bertopic: Topic modeling framework
datamapplot: Document visualization

Note

Pip may report dependency conflicts (e.g. dask/distributed downgraded, cuml/rapids-dask-dependency). BERTopic and the notebook can still run. If you need cuML and RAPIDS dask together, consider keeping the conda default dask versions and installing only the BERTopic stack via pip in a separate env; see Troubleshooting.
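
After both pip commands finish, it can be worth confirming that the pins actually took effect. The following sketch reads installed versions with the standard library's `importlib.metadata`; the `PINS` table simply copies the versions from the commands in this step, and `check_pins` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands in this step.
PINS = {
    "dask": "2025.9.1",
    "distributed": "2025.9.1",
    "hdbscan": "0.8.40",
    "scikit-learn": "1.4.2",
}


def check_pins(pins: dict) -> dict:
    """Map each package to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, ok) in check_pins(PINS).items():
    print(f"{name:14s} installed={installed} pinned={PINS[name]} ok={ok}")
```

Run it inside the activated `rapids-25.10` environment; any `ok=False` line points at a package that pip changed after the pin.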

Step 5. Install visualization packages

Install JupyterLab and visualization libraries for interactive topic exploration.

conda install -c conda-forge \
    notebook=7.5.0 \
    jupyterlab=4.5.0 \
    ipywidgets=8.1.8 \
    jupyterlab-widgets=3.0.16 \
    bokeh=3.8.1 \
    colorcet=3.1.0 \
    datashader=0.18.2 \
    plotly=6.5.0

If conda reports PackagesNotFoundError for jupyterlab-widgets (e.g. on some platforms), install it with pip:

pip install jupyterlab-widgets

Step 6. Install compatible PyTorch

Install PyTorch with CUDA 13.0 support for GPU-accelerated embedding generation.

pip install torch==2.9.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
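
To check that the installed wheel can see the GPU, a short script along these lines should be enough. It relies only on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`; the function name and the dictionary keys are illustrative.

```python
def torch_cuda_summary() -> dict:
    """Report whether PyTorch is importable and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "device": None}
    available = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": available,
        "device": torch.cuda.get_device_name(0) if available else None,
    }


print(torch_cuda_summary())
```

If `cuda_available` is `False`, embedding generation will likely fall back to CPU and run far slower than the time estimate in this playbook.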

Step 7. Clone the repository and download the dataset

Clone the playbook repository and download the Amazon Electronics Reviews dataset.

git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-topic-modeling/assets

Download the dataset (~14GB compressed):

wget https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/review_categories/Electronics.jsonl.gz
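
Before launching the notebook, you may want to peek at a few records to confirm the download is intact. This sketch streams the first lines of the gzip-compressed JSONL file without decompressing it to disk. The field names `rating` and `text` are assumptions about the dataset schema; inspect `review.keys()` if your copy differs.

```python
import gzip
import json
from itertools import islice
from pathlib import Path


def peek_reviews(path: str, n: int = 3) -> list:
    """Return up to n JSON records from the start of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in islice(handle, n) if line.strip()]


dataset = Path("Electronics.jsonl.gz")
if dataset.exists():
    for review in peek_reviews(str(dataset)):
        # "rating" and "text" are assumed field names, not confirmed by this playbook.
        print(review.get("rating"), (review.get("text") or "")[:80])
else:
    print(f"{dataset} not found; run the wget command first")
```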

Step 8. Pull Git LFS files (notebooks)

The notebook files are stored in Git LFS — without this step, JupyterLab will throw a NotJSONError when trying to open them.

conda install -c conda-forge git-lfs
git lfs install
git lfs pull
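
If you are unsure whether the pull worked, note that an unfetched LFS file is a small text pointer whose first line begins with `version https://git-lfs.github.com/spec/v1`. A minimal check, with a hypothetical helper name, could look like this when run from the assets directory:

```python
from pathlib import Path

LFS_SIGNATURE = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """True if the file is still a Git LFS pointer rather than real content."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_SIGNATURE)) == LFS_SIGNATURE


for notebook in sorted(Path(".").glob("*.ipynb")):
    status = "LFS pointer (run git lfs pull)" if is_lfs_pointer(notebook) else "ok"
    print(f"{notebook.name}: {status}")
```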

Step 9. Launch JupyterLab

Start JupyterLab from the assets directory:

jupyter lab

Step 10. Select the rapids-25.10 kernel

In JupyterLab, open the notebook video_notebook_for_GPU_Accelerated_Machine_Learning_BERTopic_RTX6000_40M.ipynb (the file listed under Ancillary files).

Select the rapids-25.10 kernel from the kernel selector in the top right corner of the notebook interface.

Step 11. Execute all cells

Run all cells in the notebook sequentially. The notebook will:

Load data with cuDF: GPU-accelerated pandas via %load_ext cudf.pandas
Preprocess text: Clean and normalize review text
Generate embeddings: Create sentence embeddings
Enable GPU acceleration: Load cuML accelerators via %load_ext cuml.accel
Run BERTopic: Cluster documents into topics using GPU-accelerated UMAP and HDBSCAN
Visualize results: Generate interactive topic visualizations
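
The stages above can be sketched as a plain Python script. This is not the notebook's code: the cleaning rules, batch size, and UMAP/HDBSCAN settings below are illustrative assumptions, and `clean_text` and `run_topic_pipeline` are hypothetical names. The heavy imports sit inside the function so that `cuml.accel.install()` (the assumed script equivalent of `%load_ext cuml.accel`) runs before `umap` and `hdbscan` are imported.

```python
import re


def clean_text(text: str) -> str:
    """Illustrative cleanup: strip HTML tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()


def run_topic_pipeline(docs: list, min_cluster_size: int = 50):
    """Embed docs, then fit BERTopic with UMAP and HDBSCAN."""
    try:
        import cuml.accel  # assumption: present in the RAPIDS environment
        cuml.accel.install()
    except ImportError:
        pass  # no cuML: the same code runs on CPU, only slower

    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    docs = [clean_text(d) for d in docs]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = embedder.encode(docs, batch_size=256, show_progress_bar=True)

    topic_model = BERTopic(
        embedding_model=embedder,
        umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
        hdbscan_model=HDBSCAN(min_cluster_size=min_cluster_size),
    )
    # Passing precomputed embeddings avoids encoding the documents twice.
    topics, _ = topic_model.fit_transform(docs, embeddings)
    return topic_model, topics, embeddings


print(clean_text("Great  <br/>battery   life!"))  # prints: Great battery life!
```

Computing embeddings once and passing them to `fit_transform` keeps the expensive encoding step separate from clustering, so you can rerun BERTopic with different settings without re-embedding.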

Step 12. Explore the results

After the notebook completes, you'll have:

Topic information table: Discovered topics with keywords and document counts
Topic visualization: Interactive 2D map of topic relationships
Barchart: Top keywords for each topic
Heatmap: Topic similarity matrix
Document datamap: Visual clustering of documents by topic
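
Outside the notebook, the same outputs can be requested from a fitted model through BERTopic's own methods. The helper below is a sketch that assumes `topic_model` is a fitted `BERTopic` instance and that `docs` and `embeddings` are the inputs it was trained on; the name `collect_outputs` and the dictionary keys are made up for illustration.

```python
def collect_outputs(topic_model, docs, embeddings, top_n_topics: int = 10) -> dict:
    """Gather the table and figures listed above from a fitted BERTopic model."""
    return {
        "topic_info": topic_model.get_topic_info(),
        "topic_map": topic_model.visualize_topics(),
        "barchart": topic_model.visualize_barchart(top_n_topics=top_n_topics),
        "heatmap": topic_model.visualize_heatmap(),
        "datamap": topic_model.visualize_document_datamap(docs, embeddings=embeddings),
    }
```

The topic map, barchart, and heatmap are Plotly figures, so each can be saved for sharing with `write_html("name.html")`.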

Step 13. Cleanup (optional)

Remove the conda environment when finished:

conda deactivate
conda env remove -n rapids-25.10

Remove the downloaded dataset:

rm Electronics.jsonl.gz

Remove the cloned playbook directory, along with any generated files saved inside it, and optionally the Hugging Face model cache:

# Optional: remove the Hugging Face cache (models downloaded by the notebook; other projects may share it)
rm -rf ~/.cache/huggingface

# From the parent of dgx-spark-playbooks/, remove the cloned repo
rm -rf dgx-spark-playbooks/

Next steps

Apply this workflow to your own datasets:

Adjust data size: Modify the nrows parameter when loading data to process smaller subsets
Tune clustering: Experiment with min_cluster_size and min_samples in HDBSCAN
Try different embedding models: Swap all-MiniLM-L6-v2 for domain-specific models
Export topics: Save the topic model using topic_model.save() for later analysis
Monitor GPU usage: Run nvidia-smi -l 1 to watch GPU utilization during processing
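
For logging GPU usage from Python rather than watching a terminal, `nvidia-smi` can be queried in CSV form. The sketch below is one way to do it; the function names and dictionary keys are hypothetical, and fields that the driver reports as non-numeric (for example `[N/A]`) are returned as `None`.

```python
import shutil
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def _to_int(field: str):
    """Convert a CSV field to int, or None when the driver reports no number."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_gpu_stats(csv_text: str) -> list:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = line.split(",")
        stats.append({
            "util_pct": _to_int(util),
            "mem_used_mib": _to_int(used),
            "mem_total_mib": _to_int(total),
        })
    return stats


def gpu_stats() -> list:
    """Query nvidia-smi once; return an empty list if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_stats(result.stdout) if result.returncode == 0 else []


print(gpu_stats())
```

Calling `gpu_stats()` between pipeline stages gives a rough record of how much memory each stage needed, which helps when choosing a smaller nrows value for GPUs with less memory.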
