chore: Regenerate all playbooks

This commit is contained in:
GitLab CI 2025-10-16 14:13:04 +00:00
parent 99c6530528
commit 6dd7697210
4 changed files with 3333 additions and 0 deletions


@ -24,6 +24,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
- [Comfy UI](nvidia/comfy-ui/)
- [Set Up Local Network Access](nvidia/connect-to-your-spark/)
- [Connect Two Sparks](nvidia/connect-two-sparks/)
- [CUDA-X Data Science](nvidia/cuda-x-data-science/)
- [DGX Dashboard](nvidia/dgx-dashboard/)
- [FLUX.1 Dreambooth LoRA Fine-tuning](nvidia/flux-finetuning/)
- [Optimized JAX](nvidia/jax/)


@ -0,0 +1,82 @@
# CUDA-X Data Science
> Install and use NVIDIA cuML and NVIDIA cuDF to accelerate UMAP, HDBSCAN, pandas and more with zero code changes
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
---
## Overview
## Basic idea
This playbook includes two example notebooks that demonstrate the acceleration of key machine learning algorithms and core pandas operations using CUDA-X Data Science libraries:
- **NVIDIA cuDF:** Accelerates data preparation and core data processing operations on 8 GB of string data, with no code changes.
- **NVIDIA cuML:** Accelerates popular, compute-intensive machine learning algorithms in scikit-learn (LinearSVC), UMAP, and HDBSCAN, with no code changes.
CUDA-X Data Science (formerly RAPIDS) is an open-source collection of libraries that accelerates the data science and data processing ecosystem. These libraries accelerate popular Python tools like scikit-learn and pandas with zero code changes. On DGX Spark, these libraries maximize performance at your desk with your existing code.
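
As a rough sketch of what "zero code changes" can look like in a plain Python script, the accelerator is switched on before pandas is imported. This assumes the `cudf` package installed later in this playbook is available; the helper name `enable_gpu_pandas` is hypothetical, and it simply reports `False` when cuDF cannot be imported so the same script still runs on CPU.

```python
def enable_gpu_pandas() -> bool:
    """Try to turn on the cudf.pandas accelerator; return True on success."""
    try:
        import cudf.pandas  # only importable inside the RAPIDS environment
    except ImportError:
        return False
    # install() should run before pandas is imported anywhere in the process.
    cudf.pandas.install()
    return True


accelerated = enable_gpu_pandas()
print("cudf.pandas enabled:", accelerated)
```

After a successful call, a regular `import pandas as pd` picks up the accelerated module. In a notebook, the equivalent is the `%load_ext cudf.pandas` magic used in the demo notebook.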
## What you'll accomplish
You will accelerate popular machine learning algorithms and data analytics operations on the GPU. You will learn how to accelerate popular Python tools and see the value of running data science workflows on your DGX Spark.
## Prerequisites
- Familiarity with pandas, scikit-learn, and machine learning algorithms such as support vector machines, clustering, and dimensionality reduction.
- Install conda
- Generate a Kaggle API key
## Time & risk
* **Duration:** 20-30 minutes of setup time and 2-3 minutes to run each notebook.
* **Risk level:**
  * Data download slowness or failure due to network issues
  * Kaggle API key generation failure requiring retries
* **Rollback:** No permanent system changes are made during normal usage.
## Instructions
## Step 1. Verify system requirements
- Verify the system has CUDA 13 installed using `nvcc --version` or `nvidia-smi`
- Install conda using [these instructions](https://docs.anaconda.com/miniconda/install/)
- Create a Kaggle API key using [these instructions](https://www.kaggle.com/discussions/general/74235) and place the **kaggle.json** file in the same folder as the notebook
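
The checks above can be sketched as a small script. This is only an illustration: the function name `check_prerequisites` is hypothetical, and it assumes the tools are discoverable on your `PATH` and that it runs from the folder that will hold the notebook.

```python
import shutil
from pathlib import Path


def check_prerequisites(notebook_dir: str = ".") -> dict:
    """Report which prerequisites from Step 1 appear to be in place."""
    return {
        # nvidia-smi and nvcc are the tools named above for checking CUDA.
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
        "nvcc": shutil.which("nvcc") is not None,
        "conda": shutil.which("conda") is not None,
        # The notebook expects kaggle.json next to it.
        "kaggle.json": (Path(notebook_dir) / "kaggle.json").is_file(),
    }


for name, found in check_prerequisites().items():
    print(f"{name:12s} {'found' if found else 'missing'}")
```

A "missing" entry only means the item was not found where this sketch looks; it does not confirm the CUDA version, which still needs `nvcc --version` or `nvidia-smi`.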
## Step 2. Installing Data Science libraries
- Use the following command to install the CUDA-X libraries (this will create a new conda environment)
```bash
conda create -n rapids-test -c rapidsai-nightly -c conda-forge -c nvidia \
rapids=25.10 python=3.12 'cuda-version=13.0' \
jupyter hdbscan umap-learn
```
## Step 3. Activate the conda environment
- Activate the conda environment
```bash
conda activate rapids-test
```
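
With the environment active, a quick check that the libraries are visible to Python can save a failed notebook run. The sketch below assumes the usual import names for the packages installed in Step 2 (`umap-learn` is imported as `umap`); the helper `missing_modules` is hypothetical.

```python
import importlib.util

# Import names assumed for the packages installed in Step 2.
EXPECTED_MODULES = ["cudf", "cuml", "umap", "hdbscan"]


def missing_modules(names=EXPECTED_MODULES) -> list:
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
print("All libraries found" if not missing else f"Missing: {missing}")
```

This only looks the modules up without importing them, so it does not verify that a GPU is usable.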
## Step 4. Cloning the playbook repository
- Clone the GitHub repository and go to the assets folder located in the cuda-x-data-science folder
```bash
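git clone https://github.com/NVIDIA/dgx-spark-playbooks
# Path assumed from the playbook layout (nvidia/cuda-x-data-science/assets)
cd dgx-spark-playbooks/nvidia/cuda-x-data-science/assets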
```
- Place the **kaggle.json** created in Step 1 in the assets folder
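
Before running the notebooks, it may help to confirm that the key file is readable. A minimal sketch, assuming the standard Kaggle token layout with `username` and `key` fields; the function name `kaggle_key_looks_valid` is hypothetical and it does not contact Kaggle.

```python
import json
from pathlib import Path


def kaggle_key_looks_valid(path: str = "kaggle.json") -> bool:
    """Return True if the file exists and holds non-empty 'username' and 'key'."""
    file = Path(path)
    if not file.is_file():
        return False
    try:
        data = json.loads(file.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(
        data.get(field) for field in ("username", "key")
    )


print("kaggle.json OK:", kaggle_key_looks_valid())
```

Run it from the assets folder; a `False` result usually means the file is in the wrong place or was not downloaded completely.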
## Step 5. Run the notebooks
There are two notebooks in the GitHub repository.
One runs an example of a large strings data processing workflow with pandas code on GPU.
- Run the cudf_pandas_demo.ipynb notebook and use `localhost:8888` in your browser to access the notebook
```bash
jupyter notebook cudf_pandas_demo.ipynb
```
The other goes over an example of machine learning algorithms including UMAP and HDBSCAN.
- Run the cuml_sklearn_demo.ipynb notebook and use `localhost:8888` in your browser to access the notebook
```bash
jupyter notebook cuml_sklearn_demo.ipynb
```
If you are accessing your DGX Spark remotely, forward the necessary port so you can open the notebook in your local browser. Use the following command for port forwarding:
```bash
ssh -N -L YYYY:localhost:XXXX username@remote_host
```
- `YYYY`: The local port you want to use (e.g. 8888)
- `XXXX`: The port you specified when starting Jupyter Notebook on the remote machine (e.g. 8888)
- `-N`: Prevents SSH from executing a remote command
- `-L`: Specifies local port forwarding
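
To make the placeholders concrete, the sketch below assembles the same command from its parts. The helper `port_forward_command` is hypothetical, and the default of 8888 for both ports is an assumption taken from the examples above.

```python
import shlex


def port_forward_command(user: str, host: str,
                         local_port: int = 8888, remote_port: int = 8888) -> str:
    """Build the ssh command for the local port forwarding described above."""
    forward = f"{local_port}:localhost:{remote_port}"
    return shlex.join(["ssh", "-N", "-L", forward, f"{user}@{host}"])


print(port_forward_command("username", "remote_host"))
# ssh -N -L 8888:localhost:8888 username@remote_host
```

If port 8888 is already in use on your local machine, pass a different `local_port` and open that port in the browser instead.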


@ -0,0 +1,939 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "84635d55-68a2-468b-ac09-9029ebdab55f",
"metadata": {
"id": "84635d55-68a2-468b-ac09-9029ebdab55f"
},
"source": [
"# Accelerating large string data processing with cudf pandas accelerator mode (cudf.pandas)\n",
"<a href=\"https://github.com/rapidsai/cudf\">cuDF</a> is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a DataFrame style API in the style of pandas.\n",
"\n",
"cuDF now provides a <a href=\"https://rapids.ai/cudf-pandas/\">pandas accelerator mode</a> (`cudf.pandas`), allowing you to bring accelerated computing to your pandas workflows without requiring any code change.\n",
"\n",
"This notebook demonstrates how cuDF pandas accelerator mode can help accelerate processing of datasets with large string fields (4 GB+) processing by simply adding a `%load_ext` command. We have introduced this feature as part of our Rapids 24.08 release.\n",
"\n",
"**Author:** Allison Ding, Mitesh Patel <br>\n",
"**Date:** October 3, 2025"
]
},
{
"cell_type": "markdown",
"id": "bb8fe7ab-c055-40e9-897d-c62c72f28a16",
"metadata": {
"id": "bb8fe7ab-c055-40e9-897d-c62c72f28a16"
},
"source": [
"# ⚠️ Verify your setup\n",
"\n",
"First, we'll verify that you are running with an NVIDIA GPU."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a88b8586-cfdd-4d31-9b4d-9be8508f7ba0",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "a88b8586-cfdd-4d31-9b4d-9be8508f7ba0",
"outputId": "18525b64-b34b-40e3-ed3a-1ad56ae794b5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fri Oct 3 23:16:52 2025 \n",
"+-----------------------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 580.82.09 Driver Version: 580.82.09 CUDA Version: 13.0 |\n",
"+-----------------------------------------+------------------------+----------------------+\n",
"| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|=========================================+========================+======================|\n",
"| 0 NVIDIA GB10 Off | 0000000F:01:00.0 Off | N/A |\n",
"| N/A 44C P0 10W / N/A | Not Supported | 0% Default |\n",
"| | | N/A |\n",
"+-----------------------------------------+------------------------+----------------------+\n",
"\n",
"+-----------------------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=========================================================================================|\n",
"| 0 N/A N/A 3405 G /usr/lib/xorg/Xorg 242MiB |\n",
"| 0 N/A N/A 3562 G /usr/bin/gnome-shell 53MiB |\n",
"| 0 N/A N/A 214921 C .../envs/rapids-25.10/bin/python 196MiB |\n",
"+-----------------------------------------------------------------------------------------+\n"
]
}
],
"source": [
"!nvidia-smi # this should display information about available GPUs"
]
},
{
"cell_type": "markdown",
"id": "5cd58071-4371-428b-8a02-9cd66e6cb91f",
"metadata": {
"id": "5cd58071-4371-428b-8a02-9cd66e6cb91f"
},
"source": [
"# Download the data"
]
},
{
"cell_type": "markdown",
"id": "9eb67713-7cf4-415a-bce7-ff4695862faa",
"metadata": {
"id": "9eb67713-7cf4-415a-bce7-ff4695862faa"
},
"source": [
"## Overview\n",
"The data we'll be working with summarizes job postings data that a developer working at a job listing firm might analyze to understand posting trends.\n",
"\n",
"We'll need to download a curated copy of this [Kaggle dataset](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024/data?select=job_summary.csv) directly from the kaggle API. \n",
"\n",
"**Data License and Terms** <br>\n",
"As this dataset originates from a Kaggle dataset, it's governed by that dataset's license and terms of use, which is the Open Data Commons license. Review here:https://opendatacommons.org/licenses/by/1-0/index.html. For each dataset an user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose.\n",
"\n",
"**Are there restrictions on how I can use this data? </br>**\n",
"For each dataset an user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose.\n",
"\n",
"## Get the Data\n",
"First, [please follow these instructions from Kaggle to download and/or updating your Kaggle API token to get acces the dataset](https://www.kaggle.com/discussions/general/74235). \n",
"\n",
"Once generated, make sure to have the **kaggle.json** file in the same folder as the notebook\n",
"\n",
"Next, run this code below, which should also take 1-2 minutes:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "406838c6-267c-423e-82ab-ea13d5fa9c90",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: kaggle in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (1.7.4.5)\n",
"Requirement already satisfied: bleach in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (6.2.0)\n",
"Requirement already satisfied: certifi>=14.05.14 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2025.8.3)\n",
"Requirement already satisfied: charset-normalizer in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (3.4.3)\n",
"Requirement already satisfied: idna in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (3.10)\n",
"Requirement already satisfied: protobuf in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (6.32.1)\n",
"Requirement already satisfied: python-dateutil>=2.5.3 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2.9.0.post0)\n",
"Requirement already satisfied: python-slugify in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (8.0.4)\n",
"Requirement already satisfied: requests in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2.32.5)\n",
"Requirement already satisfied: setuptools>=21.0.0 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (80.9.0)\n",
"Requirement already satisfied: six>=1.10 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (1.17.0)\n",
"Requirement already satisfied: text-unidecode in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (1.3)\n",
"Requirement already satisfied: tqdm in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (4.67.1)\n",
"Requirement already satisfied: urllib3>=1.15.1 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2.5.0)\n",
"Requirement already satisfied: webencodings in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (0.5.1)\n"
]
}
],
"source": [
"!pip install kaggle\n",
"!mkdir -p ~/.kaggle\n",
"!cp kaggle.json ~/.kaggle/\n",
"!chmod 600 ~/.kaggle/kaggle.json"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "3efacb3c-5f3d-4ff0-b32a-76bbb80b5f74",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3efacb3c-5f3d-4ff0-b32a-76bbb80b5f74",
"outputId": "5fe4a878-cf57-44f9-e40e-ed413035b150"
},
"outputs": [],
"source": [
"# Download the dataset through kaggle API-\n",
"!kaggle datasets download -d asaniczka/1-3m-linkedin-jobs-and-skills-2024\n",
"#unzip the file to access contents\n",
"!unzip 1-3m-linkedin-jobs-and-skills-2024.zip"
]
},
{
"cell_type": "markdown",
"id": "2__ZMVe6LaBJ",
"metadata": {
"id": "2__ZMVe6LaBJ"
},
"source": [
"# Analysis with cuDF Pandas"
]
},
{
"cell_type": "markdown",
"id": "df47f304-2b30-4380-afd5-0613b63d103d",
"metadata": {},
"source": [
"The magic command `%load_ext cudf.pandas` enables GPU acceleration for pandas data processing in a Jupyter notebook, allowing most pandas operations to automatically execute on NVIDIA GPUs for improved performance. \n",
"\n",
"With this extension loaded before importing pandas, your code can use standard pandas syntax while gaining the benefits of GPU speedup, automatically falling back to CPU execution for operations not supported on the GPU. This provides a seamless way to accelerate existing pandas workflows with zero code changes, especially for large data analytics tasks or machine learning preprocessing."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "e5cd2520-30a6-41c1-b7c5-5abe0eb90d82",
"metadata": {},
"outputs": [],
"source": [
"%load_ext cudf.pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "eadb8d77-cb45-4c7c-ae9f-77e47a4f29b3",
"metadata": {
"id": "eadb8d77-cb45-4c7c-ae9f-77e47a4f29b3"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"id": "196268f2-6169-4ed7-a9e6-db9078caa6ab",
"metadata": {
"id": "196268f2-6169-4ed7-a9e6-db9078caa6ab"
},
"source": [
"We'll run a piece of code to get a feel what GPU-acceleration brings to pandas workflows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae3b6a16-ff72-4421-b43c-06c33f57ec12",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ae3b6a16-ff72-4421-b43c-06c33f57ec12",
"outputId": "656acbf7-078f-42b3-832d-ad4e84e01c70"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 185 ms, sys: 2.08 s, total: 2.27 s\n",
"Wall time: 2.95 s\n",
"Dataset Size (in GB): 4.76\n"
]
}
],
"source": [
"%%time \n",
"job_summary_df = pd.read_csv(\"job_summary.csv\", dtype=('str'))\n",
"print(\"Dataset Size (in GB):\",round(job_summary_df.memory_usage(\n",
" deep=True).sum()/(1024**3),2))"
]
},
{
"cell_type": "markdown",
"id": "01c506e1-f135-4afb-8fc7-23e72c05d73c",
"metadata": {
"id": "01c506e1-f135-4afb-8fc7-23e72c05d73c"
},
"source": [
"The same dataset takes about around 1.5 minutes to load with pandas. That's around **5x speedup** with no changes to the code!"
]
},
{
"cell_type": "markdown",
"id": "d9d0a0e1-1d74-494d-bd12-b829f11eeede",
"metadata": {
"id": "d9d0a0e1-1d74-494d-bd12-b829f11eeede"
},
"source": [
"Let's load the remaining two datasets as well:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "12e4cf7e-8824-4822-9d30-46b81ba2acd7",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "12e4cf7e-8824-4822-9d30-46b81ba2acd7",
"outputId": "5ca1be17-09e3-40ab-928b-82176bf597bf"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 45.3 ms, sys: 199 ms, total: 244 ms\n",
"Wall time: 354 ms\n"
]
}
],
"source": [
"%%time\n",
"job_skills_df = pd.read_csv(\"job_skills.csv\", dtype=('str'))\n",
"job_postings_df = pd.read_csv(\"linkedin_job_postings.csv\", dtype=('str'))"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "13c8f9da-121f-4311-8a79-274425363e5e",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 276
},
"id": "13c8f9da-121f-4311-8a79-274425363e5e",
"outputId": "a73599c1-05b2-4f56-a190-c69c017bb330"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 4.46 ms, sys: 3.1 ms, total: 7.56 ms\n",
"Wall time: 46.3 ms\n"
]
},
{
"data": {
"text/plain": [
"0 957\n",
"1 3816\n",
"2 5314\n",
"3 2774\n",
"4 2749\n",
"Name: summary_length, dtype: int32"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"job_summary_df['summary_length'] = job_summary_df['job_summary'].str.len()\n",
"job_summary_df['summary_length'].head()"
]
},
{
"cell_type": "markdown",
"id": "67b68792-5c64-4ebd-9d80-cf6ff55baeef",
"metadata": {
"id": "67b68792-5c64-4ebd-9d80-cf6ff55baeef"
},
"source": [
"That was lightning fast! We went from around 10+ (with pandas) to a few milliseconds."
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "31e1cc84-debb-4da7-bc20-5c7139f786f7",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 504
},
"id": "31e1cc84-debb-4da7-bc20-5c7139f786f7",
"outputId": "2d89fc49-7e5b-41db-c25b-441d54480711"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 39.8 ms, sys: 30 ms, total: 69.8 ms\n",
"Wall time: 211 ms\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>job_link</th>\n",
" <th>last_processed_time</th>\n",
" <th>got_summary</th>\n",
" <th>got_ner</th>\n",
" <th>is_being_worked</th>\n",
" <th>job_title</th>\n",
" <th>company</th>\n",
" <th>job_location</th>\n",
" <th>first_seen</th>\n",
" <th>search_city</th>\n",
" <th>search_country</th>\n",
" <th>search_position</th>\n",
" <th>job_level</th>\n",
" <th>job_type</th>\n",
" <th>job_summary</th>\n",
" <th>summary_length</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>https://www.linkedin.com/jobs/view/account-exe...</td>\n",
" <td>2024-01-21 07:12:29.00256+00</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>Account Executive - Dispensing (NorCal/Norther...</td>\n",
" <td>BD</td>\n",
" <td>San Diego, CA</td>\n",
" <td>2024-01-15</td>\n",
" <td>Coronado</td>\n",
" <td>United States</td>\n",
" <td>Color Maker</td>\n",
" <td>Mid senior</td>\n",
" <td>Onsite</td>\n",
" <td>Responsibilities\\nJob Description Summary\\nJob...</td>\n",
" <td>4602</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>https://www.linkedin.com/jobs/view/registered-...</td>\n",
" <td>2024-01-21 07:39:58.88137+00</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>Registered Nurse - RN Care Manager</td>\n",
" <td>Trinity Health MI</td>\n",
" <td>Norton Shores, MI</td>\n",
" <td>2024-01-14</td>\n",
" <td>Grand Haven</td>\n",
" <td>United States</td>\n",
" <td>Director Nursing Service</td>\n",
" <td>Mid senior</td>\n",
" <td>Onsite</td>\n",
" <td>Employment Type:\\nFull time\\nShift:\\nDescripti...</td>\n",
" <td>2950</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>https://www.linkedin.com/jobs/view/restaurant-...</td>\n",
" <td>2024-01-21 07:40:00.251126+00</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>RESTAURANT SUPERVISOR - THE FORKLIFT</td>\n",
" <td>Wasatch Adaptive Sports</td>\n",
" <td>Sandy, UT</td>\n",
" <td>2024-01-14</td>\n",
" <td>Tooele</td>\n",
" <td>United States</td>\n",
" <td>Stand-In</td>\n",
" <td>Mid senior</td>\n",
" <td>Onsite</td>\n",
" <td>Job Details\\nDescription\\nWhat You'll Do\\nAs a...</td>\n",
" <td>4571</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>https://www.linkedin.com/jobs/view/independent...</td>\n",
" <td>2024-01-21 07:40:00.308133+00</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>Independent Real Estate Agent</td>\n",
" <td>Howard Hanna | Rand Realty</td>\n",
" <td>Englewood Cliffs, NJ</td>\n",
" <td>2024-01-16</td>\n",
" <td>Pinehurst</td>\n",
" <td>United States</td>\n",
" <td>Real-Estate Clerk</td>\n",
" <td>Mid senior</td>\n",
" <td>Onsite</td>\n",
" <td>Who We Are\\nRand Realty is a family-owned brok...</td>\n",
" <td>3944</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>https://www.linkedin.com/jobs/view/group-unit-...</td>\n",
" <td>2024-01-19 09:45:09.215838+00</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>Group/Unit Supervisor (Systems Support Manager...</td>\n",
" <td>IRS, Office of Chief Counsel</td>\n",
" <td>Chamblee, GA</td>\n",
" <td>2024-01-17</td>\n",
" <td>Gadsden</td>\n",
" <td>United States</td>\n",
" <td>Supervisor Travel-Information Center</td>\n",
" <td>Mid senior</td>\n",
" <td>Onsite</td>\n",
" <td>None</td>\n",
" <td>&lt;NA&gt;</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" job_link \\\n",
"0 https://www.linkedin.com/jobs/view/account-exe... \n",
"1 https://www.linkedin.com/jobs/view/registered-... \n",
"2 https://www.linkedin.com/jobs/view/restaurant-... \n",
"3 https://www.linkedin.com/jobs/view/independent... \n",
"4 https://www.linkedin.com/jobs/view/group-unit-... \n",
"\n",
" last_processed_time got_summary got_ner is_being_worked \\\n",
"0 2024-01-21 07:12:29.00256+00 t t f \n",
"1 2024-01-21 07:39:58.88137+00 t t f \n",
"2 2024-01-21 07:40:00.251126+00 t t f \n",
"3 2024-01-21 07:40:00.308133+00 t t f \n",
"4 2024-01-19 09:45:09.215838+00 f f f \n",
"\n",
" job_title \\\n",
"0 Account Executive - Dispensing (NorCal/Norther... \n",
"1 Registered Nurse - RN Care Manager \n",
"2 RESTAURANT SUPERVISOR - THE FORKLIFT \n",
"3 Independent Real Estate Agent \n",
"4 Group/Unit Supervisor (Systems Support Manager... \n",
"\n",
" company job_location first_seen \\\n",
"0 BD San Diego, CA 2024-01-15 \n",
"1 Trinity Health MI Norton Shores, MI 2024-01-14 \n",
"2 Wasatch Adaptive Sports Sandy, UT 2024-01-14 \n",
"3 Howard Hanna | Rand Realty Englewood Cliffs, NJ 2024-01-16 \n",
"4 IRS, Office of Chief Counsel Chamblee, GA 2024-01-17 \n",
"\n",
" search_city search_country search_position \\\n",
"0 Coronado United States Color Maker \n",
"1 Grand Haven United States Director Nursing Service \n",
"2 Tooele United States Stand-In \n",
"3 Pinehurst United States Real-Estate Clerk \n",
"4 Gadsden United States Supervisor Travel-Information Center \n",
"\n",
" job_level job_type job_summary \\\n",
"0 Mid senior Onsite Responsibilities\\nJob Description Summary\\nJob... \n",
"1 Mid senior Onsite Employment Type:\\nFull time\\nShift:\\nDescripti... \n",
"2 Mid senior Onsite Job Details\\nDescription\\nWhat You'll Do\\nAs a... \n",
"3 Mid senior Onsite Who We Are\\nRand Realty is a family-owned brok... \n",
"4 Mid senior Onsite None \n",
"\n",
" summary_length \n",
"0 4602 \n",
"1 2950 \n",
"2 4571 \n",
"3 3944 \n",
"4 <NA> "
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"df_merged=pd.merge(job_postings_df, job_summary_df, how=\"left\", on=\"job_link\")\n",
"df_merged.head()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "0160a559-2b17-40a6-ad9d-34ce746236d0",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 490
},
"id": "0160a559-2b17-40a6-ad9d-34ce746236d0",
"outputId": "e397c28b-a90d-42d2-8a9a-4c6260c45b38"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 33.2 ms, sys: 17.3 ms, total: 50.6 ms\n",
"Wall time: 120 ms\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>summary_length</th>\n",
" </tr>\n",
" <tr>\n",
" <th>company</th>\n",
" <th>job_title</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>ClickJobs.io</th>\n",
" <th>Adolescent Behavioral Health Therapist - Substance Use Specialty (Entry Senior Level) Psychiatry</th>\n",
" <td>23748.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mt. San Antonio College</th>\n",
" <th>Chief, Police and Campus Safety</th>\n",
" <td>22998.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>CareerBeacon</th>\n",
" <th>Airside/Groundside Project Manager [Halifax International Airport Authority]</th>\n",
" <td>22938.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Tacoma Community College</th>\n",
" <th>Anthropology Professor - Part-time</th>\n",
" <td>22790.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>IRS, Office of Chief Counsel</th>\n",
" <th>Program Analyst (12-Month Roster)</th>\n",
" <td>22774.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"4\" valign=\"top\">鴻海精密工業股份有限公司</th>\n",
" <th>HR Specialist - Payroll &amp; Benefit</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Material Planner</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>RFQ Specialist</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Supply Chain Program Manager</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>🌟Daniel-Scott Recruitment Ltd🌟</th>\n",
" <th>IT Manager</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>801276 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" summary_length\n",
"company job_title \n",
"ClickJobs.io Adolescent Behavioral Health Therapist - Substa... 23748.0\n",
"Mt. San Antonio College Chief, Police and Campus Safety 22998.0\n",
"CareerBeacon Airside/Groundside Project Manager [Halifax Int... 22938.0\n",
"Tacoma Community College Anthropology Professor - Part-time 22790.0\n",
"IRS, Office of Chief Counsel Program Analyst (12-Month Roster) 22774.0\n",
"... ...\n",
"鴻海精密工業股份有限公司 HR Specialist - Payroll & Benefit 0.0\n",
" Material Planner 0.0\n",
" RFQ Specialist 0.0\n",
" Supply Chain Program Manager 0.0\n",
"🌟Daniel-Scott Recruitment Ltd🌟 IT Manager 0.0\n",
"\n",
"[801276 rows x 1 columns]"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"df_merged.groupby(['company',\"job_title\"]).agg({\n",
" \"summary_length\":\"mean\"}).sort_values(by='summary_length', ascending = False).fillna(0)"
]
},
{
"cell_type": "markdown",
"id": "IME4urGYQ3qS",
"metadata": {
"id": "IME4urGYQ3qS"
},
"source": [
"We went down from around 5 seconds to less than a second here. This is in line with our speedups on other operations!"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "adc00726-f151-41f4-8731-a1ce1f83eea2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 458
},
"id": "adc00726-f151-41f4-8731-a1ce1f83eea2",
"outputId": "46423696-b167-4ffe-bb3b-9de7f3e6d668"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 13.7 ms, sys: 20.3 ms, total: 34 ms\n",
"Wall time: 156 ms\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>job_title</th>\n",
" <th>job_location</th>\n",
" <th>summary_length</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>🔥Nurse Manager, Patient Services - Operating Room</td>\n",
" <td>Lake George, NY</td>\n",
" <td>7342.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>🔥Behavioral Health RN 3 12s</td>\n",
" <td>Glens Falls, NY</td>\n",
" <td>2787.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>🔥 Surgical Technologist - Evenings</td>\n",
" <td>Lake George, NY</td>\n",
" <td>2920.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>🔥 Physician Practice Clinical Lead RN</td>\n",
" <td>Saratoga Springs, NY</td>\n",
" <td>2945.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>🔥 Physican Practice LPN - Green</td>\n",
" <td>Lake George, NY</td>\n",
" <td>2969.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1104106</th>\n",
" <td>\"Attorney\" (Gov Appt/Non-Merit) Jobs</td>\n",
" <td>Kentucky, United States</td>\n",
" <td>2427.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1104107</th>\n",
" <td>\"Accountant\"</td>\n",
" <td>Shavano Park, TX</td>\n",
" <td>1497.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1104108</th>\n",
" <td>\"Accountant\"</td>\n",
" <td>Basking Ridge, NJ</td>\n",
" <td>1073.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1104109</th>\n",
" <td>\"Accountant\"</td>\n",
" <td>Austin, TX</td>\n",
" <td>1993.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1104110</th>\n",
" <td>\"A\" Softball Coach - Central Middle School</td>\n",
" <td>East Corinth, ME</td>\n",
" <td>718.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1104111 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" job_title \\\n",
"0 🔥Nurse Manager, Patient Services - Operating Room \n",
"1 🔥Behavioral Health RN 3 12s \n",
"2 🔥 Surgical Technologist - Evenings \n",
"3 🔥 Physician Practice Clinical Lead RN \n",
"4 🔥 Physican Practice LPN - Green \n",
"... ... \n",
"1104106 \"Attorney\" (Gov Appt/Non-Merit) Jobs \n",
"1104107 \"Accountant\" \n",
"1104108 \"Accountant\" \n",
"1104109 \"Accountant\" \n",
"1104110 \"A\" Softball Coach - Central Middle School \n",
"\n",
" job_location summary_length \n",
"0 Lake George, NY 7342.0 \n",
"1 Glens Falls, NY 2787.0 \n",
"2 Lake George, NY 2920.0 \n",
"3 Saratoga Springs, NY 2945.0 \n",
"4 Lake George, NY 2969.0 \n",
"... ... ... \n",
"1104106 Kentucky, United States 2427.0 \n",
"1104107 Shavano Park, TX 1497.0 \n",
"1104108 Basking Ridge, NJ 1073.0 \n",
"1104109 Austin, TX 1993.0 \n",
"1104110 East Corinth, ME 718.0 \n",
"\n",
"[1104111 rows x 3 columns]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"# Group by company, job_title, and month, and calculate the mean of summary_length\n",
"grouped_df = df_merged.groupby(['job_title', 'job_location']).agg({'summary_length': 'mean'})\n",
"\n",
"# Reset index to sort by job_title and month\n",
"grouped_df = grouped_df.reset_index()\n",
"\n",
"# Sort by job_title and month\n",
"sorted_df = grouped_df.sort_values(by=['job_title', 'job_location','summary_length'],\n",
" ascending=False).reset_index(drop=True).fillna(0)\n",
"sorted_df"
]
},
{
"cell_type": "markdown",
"id": "08c97b81-64c5-48fb-8fe0-d36789cf3deb",
"metadata": {
"id": "08c97b81-64c5-48fb-8fe0-d36789cf3deb"
},
"source": [
"The acceleration is consistently 10x+ for complex aggregations and sorting that involve multiple columns."
]
},
{
"cell_type": "markdown",
"id": "9bcc719b-666a-4bc9-97d6-16f448b5c707",
"metadata": {
"id": "9bcc719b-666a-4bc9-97d6-16f448b5c707"
},
"source": [
"# Summary\n",
"\n",
"With cudf.pandas, you can keep using pandas as your primary dataframe library. When things start to get a little slow, just load the `cudf.pandas` extension and enjoy the incredible speedups.\n",
"\n",
"To learn more about cudf.pandas, we encourage you to visit https://rapids.ai/cudf-pandas."
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because one or more lines are too long