Topic modeling helps you discover hidden themes in large document collections—but traditional methods crawl when datasets grow to millions of records. This playbook shows how to process **40 million Amazon product reviews in minutes** using GPU-accelerated BERTopic.
BERTopic combines transformer embeddings with clustering to extract human-readable topics from text. By replacing the CPU-based UMAP and HDBSCAN steps with GPU-accelerated versions from **RAPIDS cuML**, you get equivalent results dramatically faster, with no changes to your existing code.
- **Drop-in GPU acceleration**: Load `cuml.accel` and your existing UMAP/HDBSCAN code runs on the GPU automatically
- **Scale to millions**: Process datasets that would take hours on CPU in minutes on GPU
- **Interactive visualizations**: Explore topic distributions, relationships, and document clusters
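
As a minimal sketch of the drop-in pattern: in a notebook, the accelerator is usually loaded with `%load_ext cuml.accel` before any other imports. The script-style equivalent below calls `cuml.accel.install()` instead. The helper name `enable_gpu_acceleration` is hypothetical, and the CPU fallback when cuML is not installed is an assumption added for illustration.

```python
def enable_gpu_acceleration() -> bool:
    """Try to switch on cuML's accelerator.

    Returns True if cuml.accel was installed, False if cuML is not
    available (in which case UMAP/HDBSCAN simply stay on the CPU).
    """
    try:
        import cuml.accel
    except ImportError:
        return False
    # Must run BEFORE umap, hdbscan, or sklearn are imported, so that
    # those imports are intercepted and routed to the GPU versions.
    cuml.accel.install()
    return True


gpu_enabled = enable_gpu_acceleration()
print("cuml.accel active:", gpu_enabled)

# After this point, ordinary imports such as
#   from umap import UMAP
#   from hdbscan import HDBSCAN
# are written exactly as in CPU-only code.
```

The key design point is ordering: the accelerator has to be loaded first, which is why it sits at the very top of a notebook or script.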
# What you'll accomplish
You'll run a complete topic modeling pipeline on 40 million product reviews and generate interactive visualizations of discovered topics.
By the end, you'll be able to:
- Use cuML's drop-in accelerators for UMAP and HDBSCAN
- Generate sentence embeddings at scale with SentenceTransformers
- Create topic visualizations including heatmaps, bar charts, and document datamaps
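
The goals above map onto a fairly short pipeline. The sketch below is illustrative only, not the notebook's actual code: the function name `build_topic_model`, the embedding model `all-MiniLM-L6-v2`, the batch size, and the UMAP/HDBSCAN hyperparameters are all assumptions chosen for the example.

```python
def build_topic_model(docs, model_name="all-MiniLM-L6-v2"):
    """Embed `docs`, then fit BERTopic on the precomputed embeddings.

    Imports are deferred so this module can be loaded even where the
    heavy dependencies are not installed yet.
    """
    from sentence_transformers import SentenceTransformer
    from bertopic import BERTopic
    from umap import UMAP          # served by cuML when cuml.accel is loaded
    from hdbscan import HDBSCAN    # likewise

    # 1. Sentence embeddings (assumed model and batch size).
    embedder = SentenceTransformer(model_name)
    embeddings = embedder.encode(docs, batch_size=256, show_progress_bar=True)

    # 2. Dimensionality reduction + clustering (assumed hyperparameters).
    topic_model = BERTopic(
        embedding_model=embedder,
        umap_model=UMAP(n_components=5, metric="cosine"),
        hdbscan_model=HDBSCAN(min_cluster_size=100, prediction_data=True),
    )

    # 3. Passing embeddings avoids recomputing them inside BERTopic.
    topics, _ = topic_model.fit_transform(docs, embeddings)
    return topic_model, topics
```

Given a list of review strings, `topic_model, topics = build_topic_model(reviews)` returns a fitted model whose `visualize_barchart()` and `visualize_heatmap()` methods produce the interactive figures.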
# What to know before starting
- Experience with Python and Jupyter notebooks
- Basic understanding of machine learning concepts (embeddings, clustering)
- Familiarity with pandas DataFrames
# Prerequisites
**Hardware Requirements:**
- NVIDIA DGX Station with GB300 GPU
- Minimum 64GB GPU memory for processing 40M documents
- At least 50GB available storage for dataset and embeddings
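
For a rough sense of why the requirements are in the tens of gigabytes, the back-of-the-envelope estimate below computes the size of the raw embedding matrix alone. The embedding width (384) and the dtypes are assumptions, not values stated in this playbook, so treat the results as order-of-magnitude only. Working memory for UMAP and HDBSCAN is not included.

```python
def embedding_bytes(n_docs: int, dim: int, bytes_per_value: int) -> int:
    """Size in bytes of a dense n_docs x dim embedding matrix."""
    return n_docs * dim * bytes_per_value


N_DOCS = 40_000_000   # from the playbook: 40 million reviews
DIM = 384             # ASSUMPTION: width of a small sentence-embedding model

fp32_bytes = embedding_bytes(N_DOCS, DIM, 4)  # 4 bytes per float32
fp16_bytes = embedding_bytes(N_DOCS, DIM, 2)  # 2 bytes per float16

print(f"float32 embeddings: {fp32_bytes / 1e9:.2f} GB")  # 61.44 GB
print(f"float16 embeddings: {fp16_bytes / 1e9:.2f} GB")  # 30.72 GB
```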
**Software Requirements:**
- Conda (Miniconda or Anaconda): `conda --version`
- CUDA 13.0 compatible drivers: `nvidia-smi`
- Network access to download the Amazon Reviews dataset (~14GB compressed)
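
If you prefer to run both software checks from Python (for example, in a notebook cell), something along these lines should work. The helper name `check_tool` and the 30-second timeout are assumptions for illustration. The commands themselves are the ones listed above.

```python
import shutil
import subprocess


def check_tool(command: list[str]) -> bool:
    """Return True if the command is on PATH and exits with status 0."""
    if shutil.which(command[0]) is None:
        return False
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


checks = {
    "conda": ["conda", "--version"],
    "nvidia-smi": ["nvidia-smi"],
}
results = {name: check_tool(cmd) for name, cmd in checks.items()}

for name, ok in results.items():
    print(f"{name}: {'found' if ok else 'MISSING or failing'}")
```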
# Ancillary files
All required assets are in the playbook directory `nvidia/station-topic-modeling/assets` (see [Instructions](https://build.nvidia.com/station/topic-modeling/instructions), Step 7). Key file:
- `video_notebook_for_GPU_Accelerated_Machine_Learning_BERTopic_RTX6000_40M.ipynb` - Complete Jupyter notebook with GPU-accelerated topic modeling pipeline (filename reflects original demo hardware; the notebook runs on GB300 and other NVIDIA GPUs)
This installs cuDF (GPU DataFrame library) and cuML (GPU machine learning library), which provide drop-in acceleration for pandas and scikit-learn operations.
# Step 3. Activate the conda environment
```bash
conda activate rapids-25.10
```
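
To confirm the activation took effect from inside Python or a notebook kernel, a small check like the following may help. It reads `CONDA_DEFAULT_ENV`, which conda sets when an environment is activated. The helper names are hypothetical.

```python
import os

EXPECTED_ENV = "rapids-25.10"  # environment name used in this playbook


def active_conda_env(environ=os.environ):
    """Name of the active conda environment, or None if none is active."""
    return environ.get("CONDA_DEFAULT_ENV")


def is_expected_env(environ=os.environ) -> bool:
    """True only when the active environment matches EXPECTED_ENV."""
    return active_conda_env(environ) == EXPECTED_ENV


if is_expected_env():
    print(f"OK: running inside {EXPECTED_ENV}")
else:
    print(f"Active env is {active_conda_env()!r}; expected {EXPECTED_ENV!r}")
```

A mismatch here often means the Jupyter kernel was started from a different environment than the one you activated in the terminal.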
# Step 4. Install machine learning packages
Install UMAP, HDBSCAN, BERTopic, and supporting libraries for topic modeling.
Note: `datamapplot` will upgrade dask/distributed; the next step pins them back.
```bash
pip install \
transformers datasets sentence-transformers \
umap-learn hdbscan==0.8.40 bertopic matplotlib \
scikit-learn==1.4.2 datamapplot
```
Pin dask/distributed back to RAPIDS-compatible versions:
- **distributed**: Distributed task scheduler for dask
- **sentence-transformers**: Generate text embeddings
- **umap-learn / hdbscan**: Dimensionality reduction and clustering (GPU-accelerated via cuML)
- **bertopic**: Topic modeling framework
- **datamapplot**: Document visualization
> [!NOTE]
> Pip may report dependency conflicts (e.g. dask/distributed downgraded, cuml/rapids-dask-dependency). BERTopic and the notebook can still run. If you need cuML and RAPIDS dask together, consider keeping the conda default dask versions and installing only the BERTopic stack via pip in a separate env; see **Troubleshooting**.
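
Because pip can silently move versions around, it may be worth verifying the two pins from the install command afterwards. The sketch below uses only the standard library. The helper names `installed_version` and `check_pins` are hypothetical, and only the versions pinned in this step (`hdbscan==0.8.40`, `scikit-learn==1.4.2`) are checked.

```python
from importlib import metadata

# Pins taken from the pip install command in this step.
PINNED = {"hdbscan": "0.8.40", "scikit-learn": "1.4.2"}


def installed_version(dist_name: str):
    """Installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {name: found_version} for every pin that does not match."""
    problems = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            problems[name] = found
    return problems


mismatches = check_pins(PINNED)
if mismatches:
    for name, found in mismatches.items():
        print(f"{name}: expected {PINNED[name]}, found {found}")
else:
    print("All pinned packages match.")
```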
# Step 5. Install visualization packages
Install JupyterLab and visualization libraries for interactive topic exploration.
```bash
conda install -c conda-forge \
notebook=7.5.0 \
jupyterlab=4.5.0 \
ipywidgets=8.1.8 \
jupyterlab-widgets=3.0.16 \
bokeh=3.8.1 \
colorcet=3.1.0 \
datashader=0.18.2 \
plotly=6.5.0
```
If conda reports `PackagesNotFoundError` for `jupyterlab-widgets` (e.g. on some platforms), install it with pip:
```bash
pip install jupyterlab-widgets
```
# Step 6. Install compatible PyTorch
Install PyTorch with CUDA 13.0 support for GPU-accelerated embedding generation.