{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "84635d55-68a2-468b-ac09-9029ebdab55f",
   "metadata": {
    "id": "84635d55-68a2-468b-ac09-9029ebdab55f"
   },
   "source": [
    "# Accelerating large string data processing with cuDF pandas accelerator mode (cudf.pandas)\n",
    "<a href=\"https://github.com/rapidsai/cudf\">cuDF</a> is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a pandas-style DataFrame API.\n",
    "\n",
    "cuDF now provides a <a href=\"https://rapids.ai/cudf-pandas/\">pandas accelerator mode</a> (`cudf.pandas`), allowing you to bring accelerated computing to your pandas workflows without requiring any code change.\n",
    "\n",
    "This notebook demonstrates how cuDF pandas accelerator mode can speed up processing of datasets with large string fields (4 GB+) by simply adding a `%load_ext` command. We introduced this feature as part of our RAPIDS 24.08 release.\n",
    "\n",
    "**Author:** Allison Ding, Mitesh Patel <br>\n",
    "**Date:** October 3, 2025"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb8fe7ab-c055-40e9-897d-c62c72f28a16",
   "metadata": {
    "id": "bb8fe7ab-c055-40e9-897d-c62c72f28a16"
   },
   "source": [
    "# ⚠️ Verify your setup\n",
    "\n",
    "First, we'll verify that you are running with an NVIDIA GPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "a88b8586-cfdd-4d31-9b4d-9be8508f7ba0",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "a88b8586-cfdd-4d31-9b4d-9be8508f7ba0",
    "outputId": "18525b64-b34b-40e3-ed3a-1ad56ae794b5"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fri Oct  3 23:16:52 2025       \n",
      "+-----------------------------------------------------------------------------------------+\n",
      "| NVIDIA-SMI 580.82.09              Driver Version: 580.82.09      CUDA Version: 13.0     |\n",
      "+-----------------------------------------+------------------------+----------------------+\n",
      "| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |\n",
      "| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |\n",
      "|                                         |                        |               MIG M. |\n",
      "|=========================================+========================+======================|\n",
      "|   0  NVIDIA GB10                    Off |   0000000F:01:00.0 Off |                  N/A |\n",
      "| N/A   44C    P0             10W /  N/A  | Not Supported          |      0%      Default |\n",
      "|                                         |                        |                  N/A |\n",
      "+-----------------------------------------+------------------------+----------------------+\n",
      "\n",
      "+-----------------------------------------------------------------------------------------+\n",
      "| Processes:                                                                              |\n",
      "|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |\n",
      "|        ID   ID                                                               Usage      |\n",
      "|=========================================================================================|\n",
      "|    0   N/A  N/A            3405      G   /usr/lib/xorg/Xorg                      242MiB |\n",
      "|    0   N/A  N/A            3562      G   /usr/bin/gnome-shell                     53MiB |\n",
      "|    0   N/A  N/A          214921      C   .../envs/rapids-25.10/bin/python        196MiB |\n",
      "+-----------------------------------------------------------------------------------------+\n"
     ]
    }
   ],
   "source": [
    "!nvidia-smi  # this should display information about available GPUs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5cd58071-4371-428b-8a02-9cd66e6cb91f",
   "metadata": {
    "id": "5cd58071-4371-428b-8a02-9cd66e6cb91f"
   },
   "source": [
    "# Download the data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9eb67713-7cf4-415a-bce7-ff4695862faa",
   "metadata": {
    "id": "9eb67713-7cf4-415a-bce7-ff4695862faa"
   },
   "source": [
    "## Overview\n",
    "The dataset we'll be working with contains job postings that a developer at a job listing firm might analyze to understand posting trends.\n",
    "\n",
    "We'll need to download a curated copy of this [Kaggle dataset](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024/data?select=job_summary.csv) directly through the Kaggle API.  \n",
    "\n",
    "**Data License and Terms** <br>\n",
    "As this dataset originates from Kaggle, it is governed by that dataset's license and terms of use, which is the [Open Data Commons Attribution License (ODC-By) v1.0](https://opendatacommons.org/licenses/by/1-0/index.html).\n",
    "\n",
    "**Are there restrictions on how I can use this data?** <br>\n",
    "For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose.\n",
    "\n",
    "## Get the Data\n",
    "First, [follow these instructions from Kaggle to download or update your Kaggle API token](https://www.kaggle.com/discussions/general/74235) so that you can access the dataset.  \n",
    "\n",
    "Once generated, make sure the **kaggle.json** file is in the same folder as this notebook.\n",
    "\n",
    "Next, run the code below, which should take 1-2 minutes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "406838c6-267c-423e-82ab-ea13d5fa9c90",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: kaggle in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (1.7.4.5)\n",
      "Requirement already satisfied: bleach in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (6.2.0)\n",
      "Requirement already satisfied: certifi>=14.05.14 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2025.8.3)\n",
      "Requirement already satisfied: charset-normalizer in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (3.4.3)\n",
      "Requirement already satisfied: idna in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (3.10)\n",
      "Requirement already satisfied: protobuf in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (6.32.1)\n",
      "Requirement already satisfied: python-dateutil>=2.5.3 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2.9.0.post0)\n",
      "Requirement already satisfied: python-slugify in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (8.0.4)\n",
      "Requirement already satisfied: requests in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2.32.5)\n",
      "Requirement already satisfied: setuptools>=21.0.0 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (80.9.0)\n",
      "Requirement already satisfied: six>=1.10 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (1.17.0)\n",
      "Requirement already satisfied: text-unidecode in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (1.3)\n",
      "Requirement already satisfied: tqdm in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (4.67.1)\n",
      "Requirement already satisfied: urllib3>=1.15.1 in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (2.5.0)\n",
      "Requirement already satisfied: webencodings in /home/nvidia/miniconda3/envs/rapids-25.10/lib/python3.12/site-packages (from kaggle) (0.5.1)\n"
     ]
    }
   ],
   "source": [
    "!pip install kaggle\n",
    "!mkdir -p ~/.kaggle\n",
    "!cp kaggle.json ~/.kaggle/\n",
    "!chmod 600 ~/.kaggle/kaggle.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "3efacb3c-5f3d-4ff0-b32a-76bbb80b5f74",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "3efacb3c-5f3d-4ff0-b32a-76bbb80b5f74",
    "outputId": "5fe4a878-cf57-44f9-e40e-ed413035b150"
   },
   "outputs": [],
   "source": [
    "# Download the dataset through the Kaggle API\n",
    "!kaggle datasets download -d asaniczka/1-3m-linkedin-jobs-and-skills-2024\n",
    "# Unzip the file to access its contents\n",
    "!unzip 1-3m-linkedin-jobs-and-skills-2024.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2__ZMVe6LaBJ",
   "metadata": {
    "id": "2__ZMVe6LaBJ"
   },
   "source": [
    "# Analysis with cuDF Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df47f304-2b30-4380-afd5-0613b63d103d",
   "metadata": {},
   "source": [
    "The magic command `%load_ext cudf.pandas` enables GPU acceleration for pandas data processing in a Jupyter notebook, allowing most pandas operations to automatically execute on NVIDIA GPUs for improved performance. \n",
    "\n",
    "With this extension loaded before importing pandas, your code can use standard pandas syntax while gaining the benefits of GPU speedup, automatically falling back to CPU execution for operations not supported on the GPU. This provides a seamless way to accelerate existing pandas workflows with zero code changes, especially for large data analytics tasks or machine learning preprocessing."
   ]
  },
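  {
   "cell_type": "markdown",
   "id": "sketch-enable-outside-jupyter",
   "metadata": {},
   "source": [
    "The accelerator is not limited to notebooks. As a minimal sketch (an addition for illustration, not part of the original workflow), a plain Python script can call `cudf.pandas.install()` before its first pandas import. Alternatively, an unmodified script can be launched with `python -m cudf.pandas script.py`, where `script.py` is a placeholder name.\n",
    "\n",
    "```python\n",
    "# Enable the accelerator programmatically; this must run before pandas is imported.\n",
    "import cudf.pandas\n",
    "cudf.pandas.install()\n",
    "\n",
    "import pandas as pd  # proxied to cuDF where supported, with CPU fallback otherwise\n",
    "\n",
    "s = pd.Series(['gpu', 'accelerated', 'strings'])  # hypothetical sample data\n",
    "print(s.str.upper())\n",
    "```"
   ]
  },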
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e5cd2520-30a6-41c1-b7c5-5abe0eb90d82",
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext cudf.pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "eadb8d77-cb45-4c7c-ae9f-77e47a4f29b3",
   "metadata": {
    "id": "eadb8d77-cb45-4c7c-ae9f-77e47a4f29b3"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "196268f2-6169-4ed7-a9e6-db9078caa6ab",
   "metadata": {
    "id": "196268f2-6169-4ed7-a9e6-db9078caa6ab"
   },
   "source": [
    "We'll run a piece of code to get a feel for what GPU acceleration brings to pandas workflows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae3b6a16-ff72-4421-b43c-06c33f57ec12",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ae3b6a16-ff72-4421-b43c-06c33f57ec12",
    "outputId": "656acbf7-078f-42b3-832d-ad4e84e01c70"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 185 ms, sys: 2.08 s, total: 2.27 s\n",
      "Wall time: 2.95 s\n",
      "Dataset Size (in GB): 4.76\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "job_summary_df = pd.read_csv(\"job_summary.csv\", dtype=\"str\")\n",
    "size_gb = job_summary_df.memory_usage(deep=True).sum() / (1024**3)\n",
    "print(\"Dataset Size (in GB):\", round(size_gb, 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01c506e1-f135-4afb-8fc7-23e72c05d73c",
   "metadata": {
    "id": "01c506e1-f135-4afb-8fc7-23e72c05d73c"
   },
   "source": [
    "The same dataset takes around 1.5 minutes to load with pandas. Against the wall time recorded above, that's well over a **5x speedup** with no changes to the code!"
   ]
  },
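  {
   "cell_type": "markdown",
   "id": "sketch-pandas-baseline-timing",
   "metadata": {},
   "source": [
    "The pandas baseline quoted above is not produced by this notebook. One way to measure it yourself, sketched here under the assumption that `job_summary.csv` is in the working directory, is to restart the kernel, skip `%load_ext cudf.pandas`, and time the same call. The timing you get will depend on your CPU and storage.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "import pandas as pd  # plain pandas: run in a fresh kernel without %load_ext cudf.pandas\n",
    "\n",
    "start = time.perf_counter()\n",
    "job_summary_df = pd.read_csv('job_summary.csv', dtype='str')\n",
    "elapsed = time.perf_counter() - start\n",
    "\n",
    "print(f'pandas read_csv wall time: {elapsed:.1f} s')\n",
    "```"
   ]
  },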
  {
   "cell_type": "markdown",
   "id": "d9d0a0e1-1d74-494d-bd12-b829f11eeede",
   "metadata": {
    "id": "d9d0a0e1-1d74-494d-bd12-b829f11eeede"
   },
   "source": [
    "Let's load the remaining two datasets as well:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "12e4cf7e-8824-4822-9d30-46b81ba2acd7",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "12e4cf7e-8824-4822-9d30-46b81ba2acd7",
    "outputId": "5ca1be17-09e3-40ab-928b-82176bf597bf"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 45.3 ms, sys: 199 ms, total: 244 ms\n",
      "Wall time: 354 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "job_skills_df = pd.read_csv(\"job_skills.csv\", dtype=('str'))\n",
    "job_postings_df = pd.read_csv(\"linkedin_job_postings.csv\", dtype=('str'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "13c8f9da-121f-4311-8a79-274425363e5e",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 276
    },
    "id": "13c8f9da-121f-4311-8a79-274425363e5e",
    "outputId": "a73599c1-05b2-4f56-a190-c69c017bb330"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 4.46 ms, sys: 3.1 ms, total: 7.56 ms\n",
      "Wall time: 46.3 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0     957\n",
       "1    3816\n",
       "2    5314\n",
       "3    2774\n",
       "4    2749\n",
       "Name: summary_length, dtype: int32"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "job_summary_df['summary_length'] = job_summary_df['job_summary'].str.len()\n",
    "job_summary_df['summary_length'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67b68792-5c64-4ebd-9d80-cf6ff55baeef",
   "metadata": {
    "id": "67b68792-5c64-4ebd-9d80-cf6ff55baeef"
   },
   "source": [
    "That was lightning fast! We went from around 10+ seconds (with pandas) to tens of milliseconds."
   ]
  },
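  {
   "cell_type": "markdown",
   "id": "sketch-str-len-missing-values",
   "metadata": {},
   "source": [
    "`.str.len()` counts the characters in each row and leaves missing values missing rather than treating them as empty strings. The small frame below is hypothetical data, used only to illustrate that behavior:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data: one normal summary, one missing value, one empty string.\n",
    "toy = pd.DataFrame({'job_summary': ['Senior engineer', None, '']})\n",
    "toy['summary_length'] = toy['job_summary'].str.len()\n",
    "\n",
    "# The missing summary keeps a missing length; the empty string gets length 0.\n",
    "print(toy)\n",
    "```"
   ]
  },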
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "31e1cc84-debb-4da7-bc20-5c7139f786f7",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 504
    },
    "id": "31e1cc84-debb-4da7-bc20-5c7139f786f7",
    "outputId": "2d89fc49-7e5b-41db-c25b-441d54480711"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 39.8 ms, sys: 30 ms, total: 69.8 ms\n",
      "Wall time: 211 ms\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>job_link</th>\n",
       "      <th>last_processed_time</th>\n",
       "      <th>got_summary</th>\n",
       "      <th>got_ner</th>\n",
       "      <th>is_being_worked</th>\n",
       "      <th>job_title</th>\n",
       "      <th>company</th>\n",
       "      <th>job_location</th>\n",
       "      <th>first_seen</th>\n",
       "      <th>search_city</th>\n",
       "      <th>search_country</th>\n",
       "      <th>search_position</th>\n",
       "      <th>job_level</th>\n",
       "      <th>job_type</th>\n",
       "      <th>job_summary</th>\n",
       "      <th>summary_length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>https://www.linkedin.com/jobs/view/account-exe...</td>\n",
       "      <td>2024-01-21 07:12:29.00256+00</td>\n",
       "      <td>t</td>\n",
       "      <td>t</td>\n",
       "      <td>f</td>\n",
       "      <td>Account Executive - Dispensing (NorCal/Norther...</td>\n",
       "      <td>BD</td>\n",
       "      <td>San Diego, CA</td>\n",
       "      <td>2024-01-15</td>\n",
       "      <td>Coronado</td>\n",
       "      <td>United States</td>\n",
       "      <td>Color Maker</td>\n",
       "      <td>Mid senior</td>\n",
       "      <td>Onsite</td>\n",
       "      <td>Responsibilities\\nJob Description Summary\\nJob...</td>\n",
       "      <td>4602</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>https://www.linkedin.com/jobs/view/registered-...</td>\n",
       "      <td>2024-01-21 07:39:58.88137+00</td>\n",
       "      <td>t</td>\n",
       "      <td>t</td>\n",
       "      <td>f</td>\n",
       "      <td>Registered Nurse - RN Care Manager</td>\n",
       "      <td>Trinity Health MI</td>\n",
       "      <td>Norton Shores, MI</td>\n",
       "      <td>2024-01-14</td>\n",
       "      <td>Grand Haven</td>\n",
       "      <td>United States</td>\n",
       "      <td>Director Nursing Service</td>\n",
       "      <td>Mid senior</td>\n",
       "      <td>Onsite</td>\n",
       "      <td>Employment Type:\\nFull time\\nShift:\\nDescripti...</td>\n",
       "      <td>2950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>https://www.linkedin.com/jobs/view/restaurant-...</td>\n",
       "      <td>2024-01-21 07:40:00.251126+00</td>\n",
       "      <td>t</td>\n",
       "      <td>t</td>\n",
       "      <td>f</td>\n",
       "      <td>RESTAURANT SUPERVISOR - THE FORKLIFT</td>\n",
       "      <td>Wasatch Adaptive Sports</td>\n",
       "      <td>Sandy, UT</td>\n",
       "      <td>2024-01-14</td>\n",
       "      <td>Tooele</td>\n",
       "      <td>United States</td>\n",
       "      <td>Stand-In</td>\n",
       "      <td>Mid senior</td>\n",
       "      <td>Onsite</td>\n",
       "      <td>Job Details\\nDescription\\nWhat You'll Do\\nAs a...</td>\n",
       "      <td>4571</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>https://www.linkedin.com/jobs/view/independent...</td>\n",
       "      <td>2024-01-21 07:40:00.308133+00</td>\n",
       "      <td>t</td>\n",
       "      <td>t</td>\n",
       "      <td>f</td>\n",
       "      <td>Independent Real Estate Agent</td>\n",
       "      <td>Howard Hanna | Rand Realty</td>\n",
       "      <td>Englewood Cliffs, NJ</td>\n",
       "      <td>2024-01-16</td>\n",
       "      <td>Pinehurst</td>\n",
       "      <td>United States</td>\n",
       "      <td>Real-Estate Clerk</td>\n",
       "      <td>Mid senior</td>\n",
       "      <td>Onsite</td>\n",
       "      <td>Who We Are\\nRand Realty is a family-owned brok...</td>\n",
       "      <td>3944</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>https://www.linkedin.com/jobs/view/group-unit-...</td>\n",
       "      <td>2024-01-19 09:45:09.215838+00</td>\n",
       "      <td>f</td>\n",
       "      <td>f</td>\n",
       "      <td>f</td>\n",
       "      <td>Group/Unit Supervisor (Systems Support Manager...</td>\n",
       "      <td>IRS, Office of Chief Counsel</td>\n",
       "      <td>Chamblee, GA</td>\n",
       "      <td>2024-01-17</td>\n",
       "      <td>Gadsden</td>\n",
       "      <td>United States</td>\n",
       "      <td>Supervisor Travel-Information Center</td>\n",
       "      <td>Mid senior</td>\n",
       "      <td>Onsite</td>\n",
       "      <td>None</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            job_link  \\\n",
       "0  https://www.linkedin.com/jobs/view/account-exe...   \n",
       "1  https://www.linkedin.com/jobs/view/registered-...   \n",
       "2  https://www.linkedin.com/jobs/view/restaurant-...   \n",
       "3  https://www.linkedin.com/jobs/view/independent...   \n",
       "4  https://www.linkedin.com/jobs/view/group-unit-...   \n",
       "\n",
       "             last_processed_time got_summary got_ner is_being_worked  \\\n",
       "0   2024-01-21 07:12:29.00256+00           t       t               f   \n",
       "1   2024-01-21 07:39:58.88137+00           t       t               f   \n",
       "2  2024-01-21 07:40:00.251126+00           t       t               f   \n",
       "3  2024-01-21 07:40:00.308133+00           t       t               f   \n",
       "4  2024-01-19 09:45:09.215838+00           f       f               f   \n",
       "\n",
       "                                           job_title  \\\n",
       "0  Account Executive - Dispensing (NorCal/Norther...   \n",
       "1                 Registered Nurse - RN Care Manager   \n",
       "2               RESTAURANT SUPERVISOR - THE FORKLIFT   \n",
       "3                      Independent Real Estate Agent   \n",
       "4  Group/Unit Supervisor (Systems Support Manager...   \n",
       "\n",
       "                        company          job_location  first_seen  \\\n",
       "0                            BD         San Diego, CA  2024-01-15   \n",
       "1             Trinity Health MI     Norton Shores, MI  2024-01-14   \n",
       "2       Wasatch Adaptive Sports             Sandy, UT  2024-01-14   \n",
       "3    Howard Hanna | Rand Realty  Englewood Cliffs, NJ  2024-01-16   \n",
       "4  IRS, Office of Chief Counsel          Chamblee, GA  2024-01-17   \n",
       "\n",
       "   search_city search_country                       search_position  \\\n",
       "0     Coronado  United States                           Color Maker   \n",
       "1  Grand Haven  United States              Director Nursing Service   \n",
       "2       Tooele  United States                              Stand-In   \n",
       "3    Pinehurst  United States                     Real-Estate Clerk   \n",
       "4      Gadsden  United States  Supervisor Travel-Information Center   \n",
       "\n",
       "    job_level job_type                                        job_summary  \\\n",
       "0  Mid senior   Onsite  Responsibilities\\nJob Description Summary\\nJob...   \n",
       "1  Mid senior   Onsite  Employment Type:\\nFull time\\nShift:\\nDescripti...   \n",
       "2  Mid senior   Onsite  Job Details\\nDescription\\nWhat You'll Do\\nAs a...   \n",
       "3  Mid senior   Onsite  Who We Are\\nRand Realty is a family-owned brok...   \n",
       "4  Mid senior   Onsite                                               None   \n",
       "\n",
       "  summary_length  \n",
       "0           4602  \n",
       "1           2950  \n",
       "2           4571  \n",
       "3           3944  \n",
       "4           <NA>  "
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "df_merged = pd.merge(job_postings_df, job_summary_df, how=\"left\", on=\"job_link\")\n",
    "df_merged.head()"
   ]
  },
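  {
   "cell_type": "markdown",
   "id": "sketch-left-merge-toy-example",
   "metadata": {},
   "source": [
    "A left merge keeps every row of the left frame and fills columns from the right frame with missing values where no key matches, which is why a posting without a summary shows `None` and `<NA>` in the output above. A minimal sketch with hypothetical toy frames:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-ins for the postings and summary tables.\n",
    "postings = pd.DataFrame({'job_link': ['a', 'b'], 'job_title': ['Nurse', 'Analyst']})\n",
    "summaries = pd.DataFrame({'job_link': ['a'], 'summary_length': [120]})\n",
    "\n",
    "merged = pd.merge(postings, summaries, how='left', on='job_link')\n",
    "\n",
    "# 'b' has no match in summaries, so its summary_length is missing.\n",
    "print(merged)\n",
    "```"
   ]
  },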
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "0160a559-2b17-40a6-ad9d-34ce746236d0",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 490
    },
    "id": "0160a559-2b17-40a6-ad9d-34ce746236d0",
    "outputId": "e397c28b-a90d-42d2-8a9a-4c6260c45b38"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 33.2 ms, sys: 17.3 ms, total: 50.6 ms\n",
      "Wall time: 120 ms\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>summary_length</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>company</th>\n",
       "      <th>job_title</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ClickJobs.io</th>\n",
       "      <th>Adolescent Behavioral Health Therapist - Substance Use Specialty (Entry Senior Level) Psychiatry</th>\n",
       "      <td>23748.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mt. San Antonio College</th>\n",
       "      <th>Chief, Police and Campus Safety</th>\n",
       "      <td>22998.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CareerBeacon</th>\n",
       "      <th>Airside/Groundside Project Manager [Halifax International Airport Authority]</th>\n",
       "      <td>22938.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tacoma Community College</th>\n",
       "      <th>Anthropology Professor - Part-time</th>\n",
       "      <td>22790.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IRS, Office of Chief Counsel</th>\n",
       "      <th>Program Analyst (12-Month Roster)</th>\n",
       "      <td>22774.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">鴻海精密工業股份有限公司</th>\n",
       "      <th>HR Specialist - Payroll &amp; Benefit</th>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Material Planner</th>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>RFQ Specialist</th>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Supply Chain Program Manager</th>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>🌟Daniel-Scott Recruitment Ltd🌟</th>\n",
       "      <th>IT Manager</th>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>801276 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                                   summary_length\n",
       "company                        job_title                                                         \n",
       "ClickJobs.io                   Adolescent Behavioral Health Therapist - Substa...         23748.0\n",
       "Mt. San Antonio College        Chief, Police and Campus Safety                            22998.0\n",
       "CareerBeacon                   Airside/Groundside Project Manager [Halifax Int...         22938.0\n",
       "Tacoma Community College       Anthropology Professor - Part-time                         22790.0\n",
       "IRS, Office of Chief Counsel   Program Analyst (12-Month Roster)                          22774.0\n",
       "...                                                                                           ...\n",
       "鴻海精密工業股份有限公司                   HR Specialist - Payroll & Benefit                              0.0\n",
       "                               Material Planner                                               0.0\n",
       "                               RFQ Specialist                                                 0.0\n",
       "                               Supply Chain Program Manager                                   0.0\n",
       "🌟Daniel-Scott Recruitment Ltd🌟 IT Manager                                                     0.0\n",
       "\n",
       "[801276 rows x 1 columns]"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "df_merged.groupby([\"company\", \"job_title\"]).agg({\n",
    "    \"summary_length\": \"mean\"}).sort_values(by=\"summary_length\", ascending=False).fillna(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "IME4urGYQ3qS",
   "metadata": {
    "id": "IME4urGYQ3qS"
   },
   "source": [
    "We went down from around 5 seconds to less than a second here. This is in line with our speedups on other operations!"
   ]
  },
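  {
   "cell_type": "markdown",
   "id": "sketch-cudf-pandas-profile",
   "metadata": {},
   "source": [
    "To check whether a given operation ran on the GPU or fell back to the CPU, `cudf.pandas` provides a profiling cell magic. A minimal sketch, assuming the extension is loaded and `df_merged` exists from the merge above (the variable name `result` is only illustrative):\n",
    "\n",
    "```python\n",
    "%%cudf.pandas.profile\n",
    "# Profile the aggregation; the report lists GPU and CPU calls per function.\n",
    "result = df_merged.groupby(['company', 'job_title']).agg({'summary_length': 'mean'})\n",
    "```"
   ]
  },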
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "adc00726-f151-41f4-8731-a1ce1f83eea2",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 458
    },
    "id": "adc00726-f151-41f4-8731-a1ce1f83eea2",
    "outputId": "46423696-b167-4ffe-bb3b-9de7f3e6d668"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 13.7 ms, sys: 20.3 ms, total: 34 ms\n",
      "Wall time: 156 ms\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>job_title</th>\n",
       "      <th>job_location</th>\n",
       "      <th>summary_length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>🔥Nurse Manager, Patient Services - Operating Room</td>\n",
       "      <td>Lake George, NY</td>\n",
       "      <td>7342.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>🔥Behavioral Health RN 3 12s</td>\n",
       "      <td>Glens Falls, NY</td>\n",
       "      <td>2787.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>🔥 Surgical Technologist - Evenings</td>\n",
       "      <td>Lake George, NY</td>\n",
       "      <td>2920.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>🔥 Physician Practice Clinical Lead RN</td>\n",
       "      <td>Saratoga Springs, NY</td>\n",
       "      <td>2945.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>🔥 Physican Practice LPN - Green</td>\n",
       "      <td>Lake George, NY</td>\n",
       "      <td>2969.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104106</th>\n",
       "      <td>\"Attorney\" (Gov Appt/Non-Merit) Jobs</td>\n",
       "      <td>Kentucky, United States</td>\n",
       "      <td>2427.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104107</th>\n",
       "      <td>\"Accountant\"</td>\n",
       "      <td>Shavano Park, TX</td>\n",
       "      <td>1497.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104108</th>\n",
       "      <td>\"Accountant\"</td>\n",
       "      <td>Basking Ridge, NJ</td>\n",
       "      <td>1073.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104109</th>\n",
       "      <td>\"Accountant\"</td>\n",
       "      <td>Austin, TX</td>\n",
       "      <td>1993.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1104110</th>\n",
       "      <td>\"A\" Softball Coach - Central Middle School</td>\n",
       "      <td>East Corinth, ME</td>\n",
       "      <td>718.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1104111 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 job_title  \\\n",
       "0        🔥Nurse Manager, Patient Services - Operating Room   \n",
       "1                              🔥Behavioral Health RN 3 12s   \n",
       "2                       🔥 Surgical Technologist - Evenings   \n",
       "3                    🔥 Physician Practice Clinical Lead RN   \n",
       "4                          🔥 Physican Practice LPN - Green   \n",
       "...                                                    ...   \n",
       "1104106               \"Attorney\" (Gov Appt/Non-Merit) Jobs   \n",
       "1104107                                       \"Accountant\"   \n",
       "1104108                                       \"Accountant\"   \n",
       "1104109                                       \"Accountant\"   \n",
       "1104110         \"A\" Softball Coach - Central Middle School   \n",
       "\n",
       "                    job_location  summary_length  \n",
       "0                Lake George, NY          7342.0  \n",
       "1                Glens Falls, NY          2787.0  \n",
       "2                Lake George, NY          2920.0  \n",
       "3           Saratoga Springs, NY          2945.0  \n",
       "4                Lake George, NY          2969.0  \n",
       "...                          ...             ...  \n",
       "1104106  Kentucky, United States          2427.0  \n",
       "1104107         Shavano Park, TX          1497.0  \n",
       "1104108        Basking Ridge, NJ          1073.0  \n",
       "1104109               Austin, TX          1993.0  \n",
       "1104110         East Corinth, ME           718.0  \n",
       "\n",
       "[1104111 rows x 3 columns]"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "# Group by company, job_title, and month, and calculate the mean of summary_length\n",
    "grouped_df = df_merged.groupby(['job_title', 'job_location']).agg({'summary_length': 'mean'})\n",
    "\n",
    "# Reset index to sort by job_title and month\n",
    "grouped_df = grouped_df.reset_index()\n",
    "\n",
    "# Sort by job_title and month\n",
    "sorted_df = grouped_df.sort_values(by=['job_title', 'job_location','summary_length'],\n",
    "                                   ascending=False).reset_index(drop=True).fillna(0)\n",
    "sorted_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08c97b81-64c5-48fb-8fe0-d36789cf3deb",
   "metadata": {
    "id": "08c97b81-64c5-48fb-8fe0-d36789cf3deb"
   },
   "source": [
    "The acceleration is consistently 10x+ for complex aggregations and sorting that involve multiple columns."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bcc719b-666a-4bc9-97d6-16f448b5c707",
   "metadata": {
    "id": "9bcc719b-666a-4bc9-97d6-16f448b5c707"
   },
   "source": [
    "# Summary\n",
    "\n",
    "With cudf.pandas, you can keep using pandas as your primary dataframe library. When things start to get a little slow, just load the `cudf.pandas` extension and enjoy the incredible speedups.\n",
    "\n",
    "To learn more about cudf.pandas, we encourage you to visit https://rapids.ai/cudf-pandas."
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}