{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ablation Studies\n",
"\n",
"The gold standard in building complex machine learning models is proving that each constituent part of the model contributes something to the proposed solution. \n",
"\n",
"Ablation studies play a pivotal role in this process by systematically dissecting machine learning models and evaluating the impact of individual components. By selectively removing or disabling specific features, layers, or modules within the model and observing the resulting changes in performance, we can assess the significance of each component in achieving the desired outcome. \n",
"\n",
"Ablation studies offer invaluable insights into the inner workings of complex models, shedding light on which elements are essential for model performance and which may be redundant or less influential. This rigorous approach not only validates the effectiveness of the model architecture but also provides guidance for model refinement and optimization, ultimately advancing the field of machine learning and enhancing the reproducibility and reliability of research findings.\n",
"\n",
"In this section, we’ll finally discuss how to present complex machine learning models in publications and ensure the viability of each part we engineered to solve our particular problem set. \n",
"\n",
"(Granted on a toy problem, so there's not too much we can ablate...)\n",
"\n",
"First we'll build a quick model, as we did in [the Data notebook](/notebooks/0-basic-data-prep-and-model.html)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2022-12-13T01:42:25.558462Z",
"iopub.status.busy": "2022-12-13T01:42:25.558462Z",
"iopub.status.idle": "2022-12-13T01:42:25.570463Z",
"shell.execute_reply": "2022-12-13T01:42:25.569963Z"
}
},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "54158e1d",
"metadata": {
"execution": {
"iopub.execute_input": "2022-12-13T01:42:25.572964Z",
"iopub.status.busy": "2022-12-13T01:42:25.572964Z",
"iopub.status.idle": "2022-12-13T01:42:25.585465Z",
"shell.execute_reply": "2022-12-13T01:42:25.585465Z"
}
},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"DATA_FOLDER = Path(\"..\", \"..\") / \"data\"\n",
"DATA_FILEPATH = DATA_FOLDER / \"penguins_clean.csv\""
]
},
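{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"Full\" baseline model scored below is not defined anywhere in this notebook, so here is a minimal sketch following [the Data notebook](/notebooks/0-basic-data-prep-and-model.html). The exact column names, split parameters, and SVC settings are assumptions reconstructed from the ablation cells further down."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split, cross_val_score\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.svm import SVC\n",
"\n",
"penguins = pd.read_csv(DATA_FILEPATH)\n",
"\n",
"# Assumed feature columns of the cleaned penguin data\n",
"num_features = ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)']\n",
"cat_features = ['Sex']\n",
"features = num_features + cat_features\n",
"target = 'Species'\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    penguins[features], penguins[target], stratify=penguins[target], train_size=0.7, random_state=42\n",
")\n",
"\n",
"# Full model: standardise numeric features, one-hot encode categorical ones\n",
"num_transformer = StandardScaler()\n",
"cat_transformer = OneHotEncoder(handle_unknown='ignore')\n",
"\n",
"preprocessor = ColumnTransformer(transformers=[\n",
"    ('num', num_transformer, num_features),\n",
"    ('cat', cat_transformer, cat_features)\n",
"])\n",
"\n",
"model = Pipeline(steps=[\n",
"    ('preprocessor', preprocessor),\n",
"    ('classifier', SVC(random_state=42)),\n",
"])\n",
"\n",
"scores = cross_val_score(model, X_test, y_test, cv=10)"
]
},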
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2022-12-13T01:42:25.588466Z",
"iopub.status.busy": "2022-12-13T01:42:25.587966Z",
"iopub.status.idle": "2022-12-13T01:42:25.973559Z",
"shell.execute_reply": "2022-12-13T01:42:25.973059Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Average Deviation\n",
"Full 1.0 0.0"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scoring = pd.DataFrame(columns=[\"Average\", \"Deviation\"])\n",
"scoring.loc[\"Full\", :] = [scores.mean(), scores.std()]\n",
"scoring"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compare this to a model that doesn't scale the numeric inputs. \n",
"\n",
"Here using the pipelines comes in handy, because switching off certain components just changes those into a `noop`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"execution": {
"iopub.execute_input": "2022-12-13T01:42:26.238801Z",
"iopub.status.busy": "2022-12-13T01:42:26.238801Z",
"iopub.status.idle": "2022-12-13T01:42:26.267291Z",
"shell.execute_reply": "2022-12-13T01:42:26.266791Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Average Deviation\n",
"Full 1.0 0.0\n",
"No Standardisation 0.435455 0.045172\n",
"Single Column Sex 1.0 0.0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# num_transformer = StandardScaler()\n",
"cat_transformer = OneHotEncoder(handle_unknown='ignore')\n",
"\n",
"preprocessor = ColumnTransformer(transformers=[\n",
" # ('num', num_transformer, num_features),\n",
" ('cat', cat_transformer, cat_features)\n",
"])\n",
"\n",
"\n",
"model2 = Pipeline(\n",
" steps=[\n",
" (\"preprocessor\", preprocessor),\n",
" (\"classifier\", SVC(random_state=42)),\n",
" ]\n",
")\n",
"\n",
"scores = cross_val_score(model2, X_test, y_test, cv=10)\n",
"\n",
"scoring.loc[\"No Standardisation\",:] = [scores.mean(), scores.std()]\n",
"scoring"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Those scores decrease significantly when we remove the standardisation step. This is because the SVM algorithm is sensitive to the scale of the features. The standardisation step is crucial to ensure that the model can learn from the data.\n",
"\n",
"Now we can evaluate if the model should use a singular column for `Sex`, since it is basically a binary column in our research data."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"execution": {
"iopub.execute_input": "2022-12-13T01:42:26.285110Z",
"iopub.status.busy": "2022-12-13T01:42:26.284610Z",
"iopub.status.idle": "2022-12-13T01:42:26.313615Z",
"shell.execute_reply": "2022-12-13T01:42:26.313115Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Average Deviation\n",
"Full 1.0 0.0\n",
"No Standardisation 0.435455 0.045172\n",
"Single Column Sex 1.0 0.0"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"num_transformer = StandardScaler()\n",
"cat_transformer = OneHotEncoder(handle_unknown='ignore', drop='if_binary')\n",
"\n",
"preprocessor = ColumnTransformer(transformers=[\n",
" ('num', num_transformer, num_features),\n",
" ('cat', cat_transformer, cat_features)\n",
"])\n",
"\n",
"\n",
"model2 = Pipeline(steps=[\n",
" ('preprocessor', preprocessor),\n",
" ('classifier', SVC()),\n",
"])\n",
"\n",
"scores = cross_val_score(model2, X_test, y_test, cv=10)\n",
"\n",
"scoring.loc[\"Single Column Sex\",:] = [scores.mean(), scores.std()]\n",
"scoring"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This seems to not affect the model, so we can actually encode the catergorical information as binary in this case, instead of adding a feature column for `Male` and `Female`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly this is a toy example and we knew that switching off standardisation would have this effect. In the real world, we would, however, switch off entire components of neural networks to evaluate the impact they have on our model performance. This strengthens the claims we make in our research significantly and usually leads to easier reviews. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.8 ('pydata-global-2022-ml-repro')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"vscode": {
"interpreter": {
"hash": "d7369b48cea8bb1af6d88d25f2646d14ea11b68d7457d74f06fbf0d68480668d"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}