mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-22 01:53:53 +00:00

chore: Regenerate all playbooks

parent 3455359d65
commit 509b04c407
## Overview

## Basic Idea

CUDA-X Data Science (formerly known as RAPIDS) is an open-source library collection that accelerates the data science and data processing ecosystem. Accelerate popular Python tools such as scikit-learn and pandas with zero code changes on DGX Spark to maximize performance at your desk. This playbook orients you with example workflows that demonstrate the acceleration of key machine learning algorithms, such as UMAP and HDBSCAN, and of core pandas operations, without changing your code.
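As an illustrative sketch (not taken from the playbook's notebooks), the zero-code-change pattern looks like this: in Jupyter you run `%load_ext cudf.pandas` before importing pandas (or launch a script with `python -m cudf.pandas script.py`), and your existing pandas code then runs on the GPU where possible. The example data below is hypothetical.

```python
# Zero-code-change sketch: with `%load_ext cudf.pandas` run first in a
# notebook, this unmodified pandas code is GPU-accelerated; without it,
# the identical code runs on the CPU.
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({"skill": ["python", "sql", "python", "cuda"]})
counts = df["skill"].value_counts()
print(counts.to_dict())
```

The point of the design is that no import or call site changes between the CPU and GPU paths; only the extension load differs.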
## What to know before starting

- Familiarity with pandas, scikit-learn, and machine learning algorithms such as support vector machines, clustering, and dimensionality reduction
## Prerequisites

- Install conda
- Generate a Kaggle API key
## Ancillary files
## Time & risk

**Time estimate:** 20-30 minutes setup time and 2-3 minutes to run each notebook.
## Instructions

## Step 1. Verify system requirements

- Verify the system has CUDA 13 installed
- Verify the Python version is greater than 3.10
- Install conda using [these instructions](https://docs.anaconda.com/miniconda/install/)
- Create a Kaggle API key using [these instructions](https://www.kaggle.com/discussions/general/74235) and place the **kaggle.json** file in the same folder as the notebook
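The checks above can be sketched as a short guarded script. This is an assumption-laden sketch, not part of the playbook: tool locations vary by system, so each command is guarded and reports rather than aborts when a tool is absent.

```shell
# Hedged verification sketch for Step 1 (adjust binary names/paths as needed).
if command -v nvcc >/dev/null 2>&1; then
  nvcc --version | grep -i release        # expect a 13.x release for CUDA 13
else
  echo "nvcc not found on PATH"
fi
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
else
  echo "nvidia-smi not found on PATH"
fi
# Exit non-zero if Python is not newer than 3.10
python3 -c 'import sys; assert sys.version_info > (3, 10), sys.version; print("Python OK")'
```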
## Step 2. Installing CUDA-X libraries

- Use the following command to install the CUDA-X libraries (this creates a new conda environment):

```bash
conda create -n rapids-test -c rapidsai-nightly -c conda-forge -c nvidia \
    rapids=25.10 python=3.12 'cuda-version=13.0' \
    jupyterlab hdbscan umap-learn
```
## Step 3. Activate the conda environment

- Activate the conda environment:

```bash
conda activate rapids-test
```
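An optional smoke test can confirm the install before moving on. This is a sketch assuming the `rapids-test` environment from the previous step is active; it is guarded so it prints a notice instead of failing on machines without the RAPIDS stack.

```shell
# Optional smoke test (assumes the rapids-test env is active): confirm
# cuDF and cuML import cleanly, otherwise print a hint.
if python3 -c "import cudf, cuml" >/dev/null 2>&1; then
  python3 -c "import cudf, cuml; print('cuDF', cudf.__version__, '/ cuML', cuml.__version__)"
else
  echo "RAPIDS libraries not importable; activate the rapids-test environment first"
fi
```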
## Step 4. Cloning the notebooks

- Clone the repository and go to the cuda-x-data-science/assets folder:

```bash
git clone ssh://git@******:12051/spark-playbooks/dgx-spark-playbook-assets.git
# Path below assumes the repository layout described in this playbook
cd dgx-spark-playbook-assets/cuda-x-data-science/assets
```

- Place the **kaggle.json** created in Step 1 in the assets folder
## Step 5. Run the notebooks

- Both notebooks are self-explanatory
- To experience the acceleration achieved using cudf.pandas, run the cudf_pandas_demo.ipynb notebook:

```bash
jupyter notebook cudf_pandas_demo.ipynb
```

- To experience the acceleration achieved using cuml, run the cuml_sklearn_demo.ipynb notebook:

```bash
jupyter notebook cuml_sklearn_demo.ipynb
```
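To illustrate the kind of code the cuml notebook accelerates, here is a hedged sketch (not the notebook's actual contents): with `%load_ext cuml.accel` run first in Jupyter, unmodified scikit-learn estimators are transparently dispatched to the GPU; without it, the same code runs on the CPU. The toy data is hypothetical.

```python
# Zero-code-change sketch for cuML: the scikit-learn code below is identical
# with or without `%load_ext cuml.accel`; only where it executes differs.
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated hypothetical groups of points.
X = np.array([[0.0, 0.0], [0.2, 0.1], [10.0, 10.0], [10.2, 9.9]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Points 0 and 1 share one cluster label; points 2 and 3 share the other.
print(labels)
```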
"\n",
|
||||
"cuDF now provides a <a href=\"https://rapids.ai/cudf-pandas/\">pandas accelerator mode</a> (`cudf.pandas`), allowing you to bring accelerated computing to your pandas workflows without requiring any code change.\n",
|
||||
"\n",
|
||||
"This notebook demonstrates how cuDF pandas accelerator mode can help accelerate processing of datasets with large string fields (4 GB+) processing by simply adding a `%load_ext` command. We have introduced this feature as part of our Rapids 24.08 release."
|
||||
"This notebook demonstrates how cuDF pandas accelerator mode can help accelerate processing of datasets with large string fields (4 GB+) processing by simply adding a `%load_ext` command. We have introduced this feature as part of our Rapids 24.08 release.\n",
|
||||
"\n",
|
||||
"**Author:** Allison Ding, Mitesh Patel <br>\n",
|
||||
"**Date:** October 3, 2025"
|
||||
]
|
||||
},
|
||||
{
|
||||
"## Overview\n",
|
||||
"The data we'll be working with summarizes job postings data that a developer working at a job listing firm might analyze to understand posting trends.\n",
|
||||
"\n",
|
||||
"We'll need to download a curated copy of this Kaggle dataset [https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024/data?select=job_summary.csv] directly from the kaggle API. \n",
|
||||
"We'll need to download a curated copy of this [Kaggle dataset](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024/data?select=job_summary.csv) directly from the kaggle API. \n",
|
||||
"\n",
|
||||
"**Data License and Terms** <br>\n",
|
||||
"As this dataset originates from a Kaggle dataset, it's governed by that dataset's license and terms of use, which is the Open Data Commons license. Review here: https://opendatacommons.org/licenses/by/1-0/index.html . For each dataset an user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose.\n",
|
||||
"As this dataset originates from a Kaggle dataset, it's governed by that dataset's license and terms of use, which is the Open Data Commons license. Review here:https://opendatacommons.org/licenses/by/1-0/index.html. For each dataset an user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose.\n",
|
||||
"\n",
|
||||
"**Are there restrictions on how I can use this data? </br>**\n",
|
||||
"For each dataset an user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose.\n",
|
||||
"\n",
|
||||
"## Get the Data\n",
|
||||
"First, [please follow these instructions from Kaggle to download and/or updating your Kaggle API token to get acces the dataset](https://www.kaggle.com/discussions/general/74235). \n",
|
||||
"- If you're using Colab, you can skip Step #1\n",
|
||||
"- If you're working on your local system, you can skip the Step #2.\n",
|
||||
"\n",
|
||||
"This should take about 1-2 minutes.\n",
|
||||
"Once generated, make sure to have the **kaggle.json** file in the same folder as the notebook\n",
|
||||
"\n",
|
||||
"Next, run this code below, which should also take 1-2 minutes:"
|
||||
]
|
||||
"\n",
|
||||
"With cudf.pandas, you can keep using pandas as your primary dataframe library. When things start to get a little slow, just load the `cudf.pandas` extension and enjoy the incredible speedups.\n",
|
||||
"\n",
|
||||
"If you like Google Colab and want to get peak `cudf.pandas` performance to process even larger datasets, Google Colab's paid tier includes both L4 and A100 GPUs (in addition to the T4 GPU this demo notebook is using).\n",
|
||||
"\n",
|
||||
"To learn more about cudf.pandas, we encourage you to visit https://rapids.ai/cudf-pandas."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "76bfe45d-dee5-435a-86ce-e3c945692a40",
|
||||
"metadata": {
|
||||
"id": "76bfe45d-dee5-435a-86ce-e3c945692a40"
|
||||
},
|
||||
"source": [
|
||||
"# Do you have any feedback for us?"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0d5ad763-961a-453b-ab65-f95c4b8c78df",
|
||||
"metadata": {
|
||||
"id": "0d5ad763-961a-453b-ab65-f95c4b8c78df"
|
||||
},
|
||||
"source": [
|
||||
"Fill this quick survey <a href=\"https://www.surveymonkey.com/r/TX3QQQR\">HERE</a>\n",
|
||||
"\n",
|
||||
"Raise an issue on our github repo <a href=\"https://github.com/rapidsai/cudf/issues\">HERE</a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "1cbf6c1f-497c-4077-b3b9-e21514c859aa",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|
||||