{
"cells": [
{
"cell_type": "markdown",
"id": "84635d55-68a2-468b-ac09-9029ebdab55f",
"metadata": {
"id": "84635d55-68a2-468b-ac09-9029ebdab55f"
},
"source": [
"# Accelerating large string data processing with cuDF pandas accelerator mode (cudf.pandas)\n",
"cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data through a pandas-style DataFrame API.\n",
"\n",
"cuDF now provides a pandas accelerator mode (`cudf.pandas`), which brings accelerated computing to your pandas workflows without any code changes.\n",
"\n",
"This notebook demonstrates how cuDF pandas accelerator mode can speed up processing of datasets with large string fields (4 GB+) by simply adding a `%load_ext` command. We introduced this feature as part of our RAPIDS 24.08 release.\n",
"\n",
"**Authors:** Allison Ding, Mitesh Patel<br>\n",
"**Date:** October 3, 2025"
]
},
{
"cell_type": "markdown",
"id": "bb8fe7ab-c055-40e9-897d-c62c72f28a16",
"metadata": {
"id": "bb8fe7ab-c055-40e9-897d-c62c72f28a16"
},
"source": [
"# ⚠️ Verify your setup\n",
"\n",
"First, we'll verify that you are running with an NVIDIA GPU."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a88b8586-cfdd-4d31-9b4d-9be8508f7ba0",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "a88b8586-cfdd-4d31-9b4d-9be8508f7ba0",
"outputId": "18525b64-b34b-40e3-ed3a-1ad56ae794b5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fri Oct 3 23:16:52 2025 \n",
"+-----------------------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 580.82.09 Driver Version: 580.82.09 CUDA Version: 13.0 |\n",
"+-----------------------------------------+------------------------+----------------------+\n",
"| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|=========================================+========================+======================|\n",
"| 0 NVIDIA GB10 Off | 0000000F:01:00.0 Off | N/A |\n",
"| N/A 44C P0 10W / N/A | Not Supported | 0% Default |\n",
"| | | N/A |\n",
"+-----------------------------------------+------------------------+----------------------+\n",
"\n",
"+-----------------------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=========================================================================================|\n",
"| 0 N/A N/A 3405 G /usr/lib/xorg/Xorg 242MiB |\n",
"| 0 N/A N/A 3562 G /usr/bin/gnome-shell 53MiB |\n",
"| 0 N/A N/A 214921 C .../envs/rapids-25.10/bin/python 196MiB |\n",
"+-----------------------------------------------------------------------------------------+\n"
]
}
],
"source": [
"!nvidia-smi # this should display information about available GPUs"
]
},
{
"cell_type": "markdown",
"id": "5cd58071-4371-428b-8a02-9cd66e6cb91f",
"metadata": {
"id": "5cd58071-4371-428b-8a02-9cd66e6cb91f"
},
"source": [
"# Download the data"
]
},
{
"cell_type": "markdown",
"id": "9eb67713-7cf4-415a-bce7-ff4695862faa",
"metadata": {
"id": "9eb67713-7cf4-415a-bce7-ff4695862faa"
},
"source": [
"## Overview\n",
"The dataset we'll be working with summarizes job postings, the kind of data a developer at a job listing firm might analyze to understand posting trends.\n",
"\n",
"We'll need to download a curated copy of this [Kaggle dataset](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024/data?select=job_summary.csv) using the Kaggle API.\n",
"\n",
"**Data License and Terms**<br>\n",
"As this dataset originates from Kaggle, it is governed by that dataset's license and terms of use, which is the Open Data Commons Attribution License (ODC-By) v1.0. Review it here: https://opendatacommons.org/licenses/by/1-0/index.html.\n",
"\n",
"**Are there restrictions on how I can use this data?**<br>\n",
"For each dataset a user elects to use, the user is responsible for checking whether the dataset license is fit for the intended purpose.\n",
"\n",
"## Get the Data\n",
"First, [follow these instructions from Kaggle to download or update your Kaggle API token](https://www.kaggle.com/discussions/general/74235) so that you can access the dataset.\n",
"\n",
"Once the token is generated, make sure the **kaggle.json** file is in the same folder as this notebook.\n",
"\n",
"Next, run the code below, which should take 1-2 minutes:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "406838c6-267c-423e-82ab-ea13d5fa9c90",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: kaggle in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (1.7.4.5)\n",
"Requirement already satisfied: bleach in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (6.2.0)\n",
"Requirement already satisfied: certifi>=14.05.14 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2025.8.3)\n",
"Requirement already satisfied: charset-normalizer in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (3.4.3)\n",
"Requirement already satisfied: idna in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (3.10)\n",
"Requirement already satisfied: protobuf in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (6.32.1)\n",
"Requirement already satisfied: python-dateutil>=2.5.3 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2.9.0.post0)\n",
"Requirement already satisfied: python-slugify in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (8.0.4)\n",
"Requirement already satisfied: requests in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2.32.5)\n",
"Requirement already satisfied: setuptools>=21.0.0 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (80.9.0)\n",
"Requirement already satisfied: six>=1.10 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (1.17.0)\n",
"Requirement already satisfied: text-unidecode in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (1.3)\n",
"Requirement already satisfied: tqdm in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (4.67.1)\n",
"Requirement already satisfied: urllib3>=1.15.1 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2.5.0)\n",
"Requirement already satisfied: webencodings in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (0.5.1)\n"
]
}
],
"source": [
"!pip install kaggle\n",
"!mkdir -p ~/.kaggle\n",
"!cp kaggle.json ~/.kaggle/\n",
"!chmod 600 ~/.kaggle/kaggle.json"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "3efacb3c-5f3d-4ff0-b32a-76bbb80b5f74",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3efacb3c-5f3d-4ff0-b32a-76bbb80b5f74",
"outputId": "5fe4a878-cf57-44f9-e40e-ed413035b150"
},
"outputs": [],
"source": [
"# Download the dataset through the Kaggle API\n",
"!kaggle datasets download -d asaniczka/1-3m-linkedin-jobs-and-skills-2024\n",
"# Unzip the archive to access its contents\n",
"!unzip 1-3m-linkedin-jobs-and-skills-2024.zip"
]
},
{
"cell_type": "markdown",
"id": "2__ZMVe6LaBJ",
"metadata": {
"id": "2__ZMVe6LaBJ"
},
"source": [
"# Analysis with cuDF Pandas"
]
},
{
"cell_type": "markdown",
"id": "df47f304-2b30-4380-afd5-0613b63d103d",
"metadata": {},
"source": [
"The magic command `%load_ext cudf.pandas` enables GPU acceleration for pandas data processing in a Jupyter notebook, allowing most pandas operations to automatically execute on NVIDIA GPUs for improved performance. \n",
"\n",
"With this extension loaded before importing pandas, your code can use standard pandas syntax while gaining the benefits of GPU speedup, automatically falling back to CPU execution for operations not supported on the GPU. This provides a seamless way to accelerate existing pandas workflows with zero code changes, especially for large data analytics tasks or machine learning preprocessing."
]
},
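{
"cell_type": "markdown",
"id": "c7a1e2f4-script-mode-sketch",
"metadata": {},
"source": [
"The `%load_ext` magic only exists inside Jupyter/IPython. Below is a minimal sketch of the same idea for a plain Python script. It assumes cuDF may be missing or unusable on the machine and falls back to CPU pandas in that case; the sample strings are made up for illustration. A script can also be run unmodified with `python -m cudf.pandas script.py`.\n",
"\n",
"```python\n",
"# Sketch: enable the accelerator from a plain Python script.\n",
"# Assumption: cuDF may be absent or unusable here, so fall back to CPU pandas.\n",
"try:\n",
"    import cudf.pandas\n",
"    cudf.pandas.install()  # must run before pandas is imported\n",
"    accelerated = True\n",
"except Exception:\n",
"    accelerated = False\n",
"\n",
"import pandas as pd\n",
"\n",
"# Ordinary pandas code; it runs on the GPU when the accelerator is active.\n",
"lengths = pd.Series(['GPU', 'DataFrame', None]).str.len()\n",
"print('accelerated:', accelerated)\n",
"print(lengths.tolist())\n",
"```\n",
"\n",
"As with the magic command, `cudf.pandas.install()` has to run before pandas is imported for the acceleration to take effect."
]
},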
{
"cell_type": "code",
"execution_count": 1,
"id": "e5cd2520-30a6-41c1-b7c5-5abe0eb90d82",
"metadata": {},
"outputs": [],
"source": [
"%load_ext cudf.pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "eadb8d77-cb45-4c7c-ae9f-77e47a4f29b3",
"metadata": {
"id": "eadb8d77-cb45-4c7c-ae9f-77e47a4f29b3"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"id": "196268f2-6169-4ed7-a9e6-db9078caa6ab",
"metadata": {
"id": "196268f2-6169-4ed7-a9e6-db9078caa6ab"
},
"source": [
"We'll run a piece of code to get a feel for what GPU acceleration brings to pandas workflows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae3b6a16-ff72-4421-b43c-06c33f57ec12",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ae3b6a16-ff72-4421-b43c-06c33f57ec12",
"outputId": "656acbf7-078f-42b3-832d-ad4e84e01c70"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 185 ms, sys: 2.08 s, total: 2.27 s\n",
"Wall time: 2.95 s\n",
"Dataset Size (in GB): 4.76\n"
]
}
],
"source": [
"%%time\n",
"job_summary_df = pd.read_csv(\"job_summary.csv\", dtype=\"str\")\n",
"# Report the in-memory size of the DataFrame, including string contents\n",
"size_gb = job_summary_df.memory_usage(deep=True).sum() / (1024**3)\n",
"print(\"Dataset Size (in GB):\", round(size_gb, 2))"
]
},
{
"cell_type": "markdown",
"id": "01c506e1-f135-4afb-8fc7-23e72c05d73c",
"metadata": {
"id": "01c506e1-f135-4afb-8fc7-23e72c05d73c"
},
"source": [
"The same dataset takes around 1.5 minutes to load with standard pandas on the CPU. That's around a **5x speedup** with no changes to the code!"
]
},
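{
"cell_type": "markdown",
"id": "d3b8f0a6-speedup-timer-sketch",
"metadata": {},
"source": [
"Speedups depend on the hardware, so it is worth measuring them yourself. The following is a minimal sketch for doing that outside of `%%time`, using only the standard library. `wall_timer` and `speedup` are hypothetical helper names chosen for this example; they are not part of pandas or cuDF, and the timed workload is a stand-in.\n",
"\n",
"```python\n",
"import time\n",
"from contextlib import contextmanager\n",
"\n",
"\n",
"@contextmanager\n",
"def wall_timer(label, results):\n",
"    # Store the elapsed wall-clock seconds for the block under results[label].\n",
"    start = time.perf_counter()\n",
"    try:\n",
"        yield\n",
"    finally:\n",
"        results[label] = time.perf_counter() - start\n",
"\n",
"\n",
"def speedup(baseline_seconds, accelerated_seconds):\n",
"    # Ratio of the baseline time to the accelerated time.\n",
"    return baseline_seconds / accelerated_seconds\n",
"\n",
"\n",
"results = {}\n",
"with wall_timer('demo', results):\n",
"    # Stand-in workload; replace with the pd.read_csv call being measured.\n",
"    total = sum(len(s) for s in ['a' * 1000] * 1000)\n",
"print(results)\n",
"```\n",
"\n",
"One way to use it: time the load in a fresh kernel without `%load_ext cudf.pandas`, time it again with the extension loaded, then pass the two measurements to `speedup`."
]
},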
{
"cell_type": "markdown",
"id": "d9d0a0e1-1d74-494d-bd12-b829f11eeede",
"metadata": {
"id": "d9d0a0e1-1d74-494d-bd12-b829f11eeede"
},
"source": [
"Let's load the remaining two datasets as well:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "12e4cf7e-8824-4822-9d30-46b81ba2acd7",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "12e4cf7e-8824-4822-9d30-46b81ba2acd7",
"outputId": "5ca1be17-09e3-40ab-928b-82176bf597bf"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 45.3 ms, sys: 199 ms, total: 244 ms\n",
"Wall time: 354 ms\n"
]
}
],
"source": [
"%%time\n",
"job_skills_df = pd.read_csv(\"job_skills.csv\", dtype=\"str\")\n",
"job_postings_df = pd.read_csv(\"linkedin_job_postings.csv\", dtype=\"str\")"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "13c8f9da-121f-4311-8a79-274425363e5e",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 276
},
"id": "13c8f9da-121f-4311-8a79-274425363e5e",
"outputId": "a73599c1-05b2-4f56-a190-c69c017bb330"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 4.46 ms, sys: 3.1 ms, total: 7.56 ms\n",
"Wall time: 46.3 ms\n"
]
},
{
"data": {
"text/plain": [
"0 957\n",
"1 3816\n",
"2 5314\n",
"3 2774\n",
"4 2749\n",
"Name: summary_length, dtype: int32"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"job_summary_df['summary_length'] = job_summary_df['job_summary'].str.len()\n",
"job_summary_df['summary_length'].head()"
]
},
{
"cell_type": "markdown",
"id": "67b68792-5c64-4ebd-9d80-cf6ff55baeef",
"metadata": {
"id": "67b68792-5c64-4ebd-9d80-cf6ff55baeef"
},
"source": [
"That was lightning fast! We went from 10+ seconds (with pandas) to tens of milliseconds."
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "31e1cc84-debb-4da7-bc20-5c7139f786f7",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 504
},
"id": "31e1cc84-debb-4da7-bc20-5c7139f786f7",
"outputId": "2d89fc49-7e5b-41db-c25b-441d54480711"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 39.8 ms, sys: 30 ms, total: 69.8 ms\n",
"Wall time: 211 ms\n"
]
},
{
"data": {
"text/html": [
"
| \n", " | job_link | \n", "last_processed_time | \n", "got_summary | \n", "got_ner | \n", "is_being_worked | \n", "job_title | \n", "company | \n", "job_location | \n", "first_seen | \n", "search_city | \n", "search_country | \n", "search_position | \n", "job_level | \n", "job_type | \n", "job_summary | \n", "summary_length | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "https://www.linkedin.com/jobs/view/account-exe... | \n", "2024-01-21 07:12:29.00256+00 | \n", "t | \n", "t | \n", "f | \n", "Account Executive - Dispensing (NorCal/Norther... | \n", "BD | \n", "San Diego, CA | \n", "2024-01-15 | \n", "Coronado | \n", "United States | \n", "Color Maker | \n", "Mid senior | \n", "Onsite | \n", "Responsibilities\\nJob Description Summary\\nJob... | \n", "4602 | \n", "
| 1 | \n", "https://www.linkedin.com/jobs/view/registered-... | \n", "2024-01-21 07:39:58.88137+00 | \n", "t | \n", "t | \n", "f | \n", "Registered Nurse - RN Care Manager | \n", "Trinity Health MI | \n", "Norton Shores, MI | \n", "2024-01-14 | \n", "Grand Haven | \n", "United States | \n", "Director Nursing Service | \n", "Mid senior | \n", "Onsite | \n", "Employment Type:\\nFull time\\nShift:\\nDescripti... | \n", "2950 | \n", "
| 2 | \n", "https://www.linkedin.com/jobs/view/restaurant-... | \n", "2024-01-21 07:40:00.251126+00 | \n", "t | \n", "t | \n", "f | \n", "RESTAURANT SUPERVISOR - THE FORKLIFT | \n", "Wasatch Adaptive Sports | \n", "Sandy, UT | \n", "2024-01-14 | \n", "Tooele | \n", "United States | \n", "Stand-In | \n", "Mid senior | \n", "Onsite | \n", "Job Details\\nDescription\\nWhat You'll Do\\nAs a... | \n", "4571 | \n", "
| 3 | \n", "https://www.linkedin.com/jobs/view/independent... | \n", "2024-01-21 07:40:00.308133+00 | \n", "t | \n", "t | \n", "f | \n", "Independent Real Estate Agent | \n", "Howard Hanna | Rand Realty | \n", "Englewood Cliffs, NJ | \n", "2024-01-16 | \n", "Pinehurst | \n", "United States | \n", "Real-Estate Clerk | \n", "Mid senior | \n", "Onsite | \n", "Who We Are\\nRand Realty is a family-owned brok... | \n", "3944 | \n", "
| 4 | \n", "https://www.linkedin.com/jobs/view/group-unit-... | \n", "2024-01-19 09:45:09.215838+00 | \n", "f | \n", "f | \n", "f | \n", "Group/Unit Supervisor (Systems Support Manager... | \n", "IRS, Office of Chief Counsel | \n", "Chamblee, GA | \n", "2024-01-17 | \n", "Gadsden | \n", "United States | \n", "Supervisor Travel-Information Center | \n", "Mid senior | \n", "Onsite | \n", "None | \n", "<NA> | \n", "
| \n", " | \n", " | summary_length | \n", "
|---|---|---|
| company | \n", "job_title | \n", "\n", " |
| ClickJobs.io | \n", "Adolescent Behavioral Health Therapist - Substance Use Specialty (Entry Senior Level) Psychiatry | \n", "23748.0 | \n", "
| Mt. San Antonio College | \n", "Chief, Police and Campus Safety | \n", "22998.0 | \n", "
| CareerBeacon | \n", "Airside/Groundside Project Manager [Halifax International Airport Authority] | \n", "22938.0 | \n", "
| Tacoma Community College | \n", "Anthropology Professor - Part-time | \n", "22790.0 | \n", "
| IRS, Office of Chief Counsel | \n", "Program Analyst (12-Month Roster) | \n", "22774.0 | \n", "
| ... | \n", "... | \n", "... | \n", "
| 鴻海精密工業股份有限公司 | \n", "HR Specialist - Payroll & Benefit | \n", "0.0 | \n", "
| Material Planner | \n", "0.0 | \n", "|
| RFQ Specialist | \n", "0.0 | \n", "|
| Supply Chain Program Manager | \n", "0.0 | \n", "|
| 🌟Daniel-Scott Recruitment Ltd🌟 | \n", "IT Manager | \n", "0.0 | \n", "
801276 rows × 1 columns
\n", "| \n", " | job_title | \n", "job_location | \n", "summary_length | \n", "
|---|---|---|---|
| 0 | \n", "🔥Nurse Manager, Patient Services - Operating Room | \n", "Lake George, NY | \n", "7342.0 | \n", "
| 1 | \n", "🔥Behavioral Health RN 3 12s | \n", "Glens Falls, NY | \n", "2787.0 | \n", "
| 2 | \n", "🔥 Surgical Technologist - Evenings | \n", "Lake George, NY | \n", "2920.0 | \n", "
| 3 | \n", "🔥 Physician Practice Clinical Lead RN | \n", "Saratoga Springs, NY | \n", "2945.0 | \n", "
| 4 | \n", "🔥 Physican Practice LPN - Green | \n", "Lake George, NY | \n", "2969.0 | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "
| 1104106 | \n", "\"Attorney\" (Gov Appt/Non-Merit) Jobs | \n", "Kentucky, United States | \n", "2427.0 | \n", "
| 1104107 | \n", "\"Accountant\" | \n", "Shavano Park, TX | \n", "1497.0 | \n", "
| 1104108 | \n", "\"Accountant\" | \n", "Basking Ridge, NJ | \n", "1073.0 | \n", "
| 1104109 | \n", "\"Accountant\" | \n", "Austin, TX | \n", "1993.0 | \n", "
| 1104110 | \n", "\"A\" Softball Coach - Central Middle School | \n", "East Corinth, ME | \n", "718.0 | \n", "
1104111 rows × 3 columns
\n", "