{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Software Testing of Machine Learning Projects\n",
    "\n",
    "Machine learning code is very hard to test. \n",
    "\n",
    "Due to the nature of the our models, we often have soft failures in the model that are difficult to test against. That basically means, they look like they're doing what they're supposed to, but secretly they're not because of some bug.\n",
    "\n",
    "Writing software tests in science, is already incredibly hard, so in this section we’ll touch on \n",
    "\n",
    "- some fairly simple tests we can implement to ensure consistency of our input data\n",
    "- avoid bad bugs in data loading procedures\n",
    "- some strategies to probe our models\n",
    "\n",
    "First we'll split the data from [the Data notebook](/notebooks/0-basic-data-prep-and-model.html) and load the model from [the Sharing notebook](https://ml.recipes/notebooks/3-model-sharing.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "54158e1d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:19.516786Z",
     "iopub.status.busy": "2022-12-13T01:42:19.516286Z",
     "iopub.status.idle": "2022-12-13T01:42:19.528286Z",
     "shell.execute_reply": "2022-12-13T01:42:19.527786Z"
    }
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "DATA_FOLDER = Path(\"..\", \"..\") / \"data\"\n",
    "DATA_FILEPATH = DATA_FOLDER / \"penguins_clean.csv\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:19.531286Z",
     "iopub.status.busy": "2022-12-13T01:42:19.530786Z",
     "iopub.status.idle": "2022-12-13T01:42:19.946859Z",
     "shell.execute_reply": "2022-12-13T01:42:19.946359Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Culmen Length (mm)</th>\n",
       "      <th>Culmen Depth (mm)</th>\n",
       "      <th>Flipper Length (mm)</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>39.1</td>\n",
       "      <td>18.7</td>\n",
       "      <td>181.0</td>\n",
       "      <td>MALE</td>\n",
       "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>39.5</td>\n",
       "      <td>17.4</td>\n",
       "      <td>186.0</td>\n",
       "      <td>FEMALE</td>\n",
       "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>40.3</td>\n",
       "      <td>18.0</td>\n",
       "      <td>195.0</td>\n",
       "      <td>FEMALE</td>\n",
       "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>36.7</td>\n",
       "      <td>19.3</td>\n",
       "      <td>193.0</td>\n",
       "      <td>FEMALE</td>\n",
       "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>39.3</td>\n",
       "      <td>20.6</td>\n",
       "      <td>190.0</td>\n",
       "      <td>MALE</td>\n",
       "      <td>Adelie Penguin (Pygoscelis adeliae)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Culmen Length (mm)  Culmen Depth (mm)  Flipper Length (mm)     Sex  \\\n",
       "0                39.1               18.7                181.0    MALE   \n",
       "1                39.5               17.4                186.0  FEMALE   \n",
       "2                40.3               18.0                195.0  FEMALE   \n",
       "3                36.7               19.3                193.0  FEMALE   \n",
       "4                39.3               20.6                190.0    MALE   \n",
       "\n",
       "                               Species  \n",
       "0  Adelie Penguin (Pygoscelis adeliae)  \n",
       "1  Adelie Penguin (Pygoscelis adeliae)  \n",
       "2  Adelie Penguin (Pygoscelis adeliae)  \n",
       "3  Adelie Penguin (Pygoscelis adeliae)  \n",
       "4  Adelie Penguin (Pygoscelis adeliae)  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "penguins = pd.read_csv(DATA_FILEPATH)\n",
    "penguins.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:19.949361Z",
     "iopub.status.busy": "2022-12-13T01:42:19.948860Z",
     "iopub.status.idle": "2022-12-13T01:42:20.132754Z",
     "shell.execute_reply": "2022-12-13T01:42:20.132252Z"
    }
   },
   "outputs": [
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'sklearn'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
      "Cell \u001b[1;32mIn [3], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmodel_selection\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m train_test_split\n\u001b[0;32m      2\u001b[0m num_features \u001b[38;5;241m=\u001b[39m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCulmen Length (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCulmen Depth (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mFlipper Length (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[0;32m      3\u001b[0m cat_features \u001b[38;5;241m=\u001b[39m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSex\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n",
      "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'sklearn'"
     ]
    }
   ],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "num_features = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\", \"Flipper Length (mm)\"]\n",
    "cat_features = [\"Sex\"]\n",
    "features = num_features + cat_features\n",
    "target = [\"Species\"]\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(penguins[features], penguins[target], stratify=penguins[target[0]], train_size=.7, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:20.135753Z",
     "iopub.status.busy": "2022-12-13T01:42:20.135253Z",
     "iopub.status.idle": "2022-12-13T01:42:20.163765Z",
     "shell.execute_reply": "2022-12-13T01:42:20.163264Z"
    },
    "lines_to_next_cell": 1
   },
   "outputs": [
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'sklearn'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
      "Cell \u001b[1;32mIn [4], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01msvm\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m SVC\n\u001b[0;32m      2\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mcompose\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m ColumnTransformer\n\u001b[0;32m      3\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpipeline\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m Pipeline\n",
      "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'sklearn'"
     ]
    }
   ],
   "source": [
    "from sklearn.svm import SVC\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
    "import joblib\n",
    "from joblib import load\n",
    "\n",
    "MODEL_FOLDER = Path(\"..\", \"..\") / \"model\"\n",
    "MODEL_EXPORT_FILE = MODEL_FOLDER / \"svc.joblib\"\n",
    "\n",
    "clf = load(MODEL_EXPORT_FILE)\n",
    "clf.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deterministic Tests\n",
    "When I work with neural networks, implementing a new layer, method, or fancy thing, I try to write a test for that layer. The `Conv2D` layer in Keras and Pytorch for example should always do the same exact thing, when they convole a kernel with an image.\n",
    "\n",
    "Consider writing a small `pytest` test that takes a simple numpy array and tests against a known output.\n",
    "\n",
    "You can check out the `keras` test suite [here](https://github.com/keras-team/keras/tree/master/keras/tests) and an example how they validate the [input and output shapes](https://github.com/keras-team/keras/blob/18248b084f932e294402f0b772b49ed162c25208/keras/testing_infra/test_utils.py#L217).\n",
    "\n",
    "Admittedly this isn't always easy to do and can go beyond the need for research scripts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Tests for Models\n",
    "\n",
    "An even easier test is by essentially reusing the notebook from the [Model Evaluation](https://ml.recipes/notebooks/1-model-evaluation.html) and writing a test function for it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:20.166266Z",
     "iopub.status.busy": "2022-12-13T01:42:20.165765Z",
     "iopub.status.idle": "2022-12-13T01:42:20.179267Z",
     "shell.execute_reply": "2022-12-13T01:42:20.178768Z"
    }
   },
   "outputs": [],
   "source": [
    "def test_penguins(clf):\n",
    "    # Define data you definitely know the answer to\n",
    "    test_data = pd.DataFrame([[34.6, 21.1, 198.0, \"MALE\"],\n",
    "                              [46.1, 18.2, 178.0, \"FEMALE\"],\n",
    "                              [52.5, 15.6, 221.0, \"MALE\"]],\n",
    "             columns=[\"Culmen Length (mm)\", \"Culmen Depth (mm)\", \"Flipper Length (mm)\", \"Sex\"])\n",
    "    # Define target to the data\n",
    "    test_target = ['Adelie Penguin (Pygoscelis adeliae)',\n",
    "                   'Chinstrap penguin (Pygoscelis antarctica)',\n",
    "                   'Gentoo penguin (Pygoscelis papua)']\n",
    "    # Assert the model should get these right.\n",
    "    assert clf.score(test_data, test_target) == 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:20.181768Z",
     "iopub.status.busy": "2022-12-13T01:42:20.181768Z",
     "iopub.status.idle": "2022-12-13T01:42:20.210252Z",
     "shell.execute_reply": "2022-12-13T01:42:20.209751Z"
    }
   },
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'clf' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "Cell \u001b[1;32mIn [6], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m test_penguins(\u001b[43mclf\u001b[49m)\n",
      "\u001b[1;31mNameError\u001b[0m: name 'clf' is not defined"
     ]
    }
   ],
   "source": [
    "test_penguins(clf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This means we have some samples in the data, where we clearly know they should be part of one class and we can use these to test the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Automated Testing of Docstring Examples\n",
    "\n",
    "There is an even easier way to run simple tests. This can be useful when we write specific functions to pre-process our data.\n",
    "In the Model Sharing notebook, we looked into auto-generating docstrings.\n",
    "\n",
    "We can upgrade our docstring and get free software tests out of it!\n",
    "\n",
    "This is called doctest and usually useful to keep docstring examples up to date and write quick unit tests for a function.\n",
    "\n",
    "This makes future users (including yourself from the future) quite happy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:20.213751Z",
     "iopub.status.busy": "2022-12-13T01:42:20.213252Z",
     "iopub.status.idle": "2022-12-13T01:42:20.241058Z",
     "shell.execute_reply": "2022-12-13T01:42:20.240557Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TestResults(failed=0, attempted=1)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def shorten_class_name(df: pd.DataFrame) -> pd.DataFrame:\n",
    "    \"\"\"Shorten the class names of the penguins to the shortest version\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    df : pd.DataFrame\n",
    "        Dataframe containing the Species column with penguins\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    pd.DataFrame\n",
    "        Normalised dataframe with shortened names\n",
    "\n",
    "    Examples\n",
    "    --------\n",
    "    >>> shorten_class_name(pd.DataFrame([[1,2,3,\"Adelie Penguin (Pygoscelis adeliae)\"]], columns=[\"1\",\"2\",\"3\",\"Species\"]))\n",
    "       1  2  3 Species\n",
    "    0  1  2  3  Adelie\n",
    "    \"\"\"\n",
    "    df[\"Species\"] = df.Species.str.split(r\" [Pp]enguin\", n=1, expand=True)[0]\n",
    "\n",
    "    return df\n",
    "\n",
    "import doctest\n",
    "doctest.testmod()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:20.243558Z",
     "iopub.status.busy": "2022-12-13T01:42:20.243558Z",
     "iopub.status.idle": "2022-12-13T01:42:20.256561Z",
     "shell.execute_reply": "2022-12-13T01:42:20.256060Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Culmen Length (mm)</th>\n",
       "      <th>Culmen Depth (mm)</th>\n",
       "      <th>Flipper Length (mm)</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Species</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>39.1</td>\n",
       "      <td>18.7</td>\n",
       "      <td>181.0</td>\n",
       "      <td>MALE</td>\n",
       "      <td>Adelie</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>39.5</td>\n",
       "      <td>17.4</td>\n",
       "      <td>186.0</td>\n",
       "      <td>FEMALE</td>\n",
       "      <td>Adelie</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>40.3</td>\n",
       "      <td>18.0</td>\n",
       "      <td>195.0</td>\n",
       "      <td>FEMALE</td>\n",
       "      <td>Adelie</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>36.7</td>\n",
       "      <td>19.3</td>\n",
       "      <td>193.0</td>\n",
       "      <td>FEMALE</td>\n",
       "      <td>Adelie</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>39.3</td>\n",
       "      <td>20.6</td>\n",
       "      <td>190.0</td>\n",
       "      <td>MALE</td>\n",
       "      <td>Adelie</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Culmen Length (mm)  Culmen Depth (mm)  Flipper Length (mm)     Sex Species\n",
       "0                39.1               18.7                181.0    MALE  Adelie\n",
       "1                39.5               17.4                186.0  FEMALE  Adelie\n",
       "2                40.3               18.0                195.0  FEMALE  Adelie\n",
       "3                36.7               19.3                193.0  FEMALE  Adelie\n",
       "4                39.3               20.6                190.0    MALE  Adelie"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "shorten_class_name(penguins).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So these give a nice example of usage in the docstring, an expected output and a first test case that is validated by our test suite."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Input Data Validation\n",
    "You validate that the data that users are providing matches what your model is expecting.\n",
    "\n",
    "These tools are often used in production systems to determine whether APIs usage and user inputs are formatted correctly.\n",
    "\n",
    "Example tools are:\n",
    "- [Great Expectations](https://greatexpectations.io/)\n",
    "- [Pandera](https://pandera.readthedocs.io/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:20.259561Z",
     "iopub.status.busy": "2022-12-13T01:42:20.259062Z",
     "iopub.status.idle": "2022-12-13T01:42:20.287267Z",
     "shell.execute_reply": "2022-12-13T01:42:20.286767Z"
    }
   },
   "outputs": [
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'pandera'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
      "Cell \u001b[1;32mIn [9], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mpandera\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mpa\u001b[39;00m\n",
      "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'pandera'"
     ]
    }
   ],
   "source": [
    "import pandera as pa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:20.289767Z",
     "iopub.status.busy": "2022-12-13T01:42:20.289267Z",
     "iopub.status.idle": "2022-12-13T01:42:20.318259Z",
     "shell.execute_reply": "2022-12-13T01:42:20.317759Z"
    }
   },
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'X_train' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "Cell \u001b[1;32mIn [10], line 2\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[38;5;66;03m# data to validate\u001b[39;00m\n\u001b[1;32m----> 2\u001b[0m \u001b[43mX_train\u001b[49m\u001b[38;5;241m.\u001b[39mdescribe()\n",
      "\u001b[1;31mNameError\u001b[0m: name 'X_train' is not defined"
     ]
    }
   ],
   "source": [
    "# data to validate\n",
    "X_train.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code is supposed to fail to see what happens if the schema doesn't match!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:20.321260Z",
     "iopub.status.busy": "2022-12-13T01:42:20.320760Z",
     "iopub.status.idle": "2022-12-13T01:42:20.349063Z",
     "shell.execute_reply": "2022-12-13T01:42:20.348563Z"
    }
   },
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'pa' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "Cell \u001b[1;32mIn [11], line 2\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[38;5;66;03m# define schema\u001b[39;00m\n\u001b[1;32m----> 2\u001b[0m schema \u001b[38;5;241m=\u001b[39m \u001b[43mpa\u001b[49m\u001b[38;5;241m.\u001b[39mDataFrameSchema({\n\u001b[0;32m      3\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCulmen Length (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m: pa\u001b[38;5;241m.\u001b[39mColumn(\u001b[38;5;28mfloat\u001b[39m, checks\u001b[38;5;241m=\u001b[39m[pa\u001b[38;5;241m.\u001b[39mCheck\u001b[38;5;241m.\u001b[39mge(\u001b[38;5;241m30\u001b[39m),\n\u001b[0;32m      4\u001b[0m                                                    pa\u001b[38;5;241m.\u001b[39mCheck\u001b[38;5;241m.\u001b[39mle(\u001b[38;5;241m60\u001b[39m)]),\n\u001b[0;32m      5\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCulmen Depth (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m: pa\u001b[38;5;241m.\u001b[39mColumn(\u001b[38;5;28mfloat\u001b[39m, checks\u001b[38;5;241m=\u001b[39m[pa\u001b[38;5;241m.\u001b[39mCheck\u001b[38;5;241m.\u001b[39mge(\u001b[38;5;241m13\u001b[39m),\n\u001b[0;32m      6\u001b[0m                                                   pa\u001b[38;5;241m.\u001b[39mCheck\u001b[38;5;241m.\u001b[39mle(\u001b[38;5;241m22\u001b[39m)]),\n\u001b[0;32m      7\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mFlipper Length (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m: pa\u001b[38;5;241m.\u001b[39mColumn(\u001b[38;5;28mfloat\u001b[39m, checks\u001b[38;5;241m=\u001b[39m[pa\u001b[38;5;241m.\u001b[39mCheck\u001b[38;5;241m.\u001b[39mge(\u001b[38;5;241m170\u001b[39m),\n\u001b[0;32m      8\u001b[0m                                                     pa\u001b[38;5;241m.\u001b[39mCheck\u001b[38;5;241m.\u001b[39mle(\u001b[38;5;241m235\u001b[39m)]),\n\u001b[0;32m      9\u001b[0m     \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSex\u001b[39m\u001b[38;5;124m\"\u001b[39m: pa\u001b[38;5;241m.\u001b[39mColumn(\u001b[38;5;28mstr\u001b[39m, checks\u001b[38;5;241m=\u001b[39mpa\u001b[38;5;241m.\u001b[39mCheck\u001b[38;5;241m.\u001b[39misin([\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMALE\u001b[39m\u001b[38;5;124m\"\u001b[39m,\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mFEMALE\u001b[39m\u001b[38;5;124m\"\u001b[39m])),\n\u001b[0;32m     10\u001b[0m })\n\u001b[0;32m     12\u001b[0m validated_test \u001b[38;5;241m=\u001b[39m schema(X_train)\n",
      "\u001b[1;31mNameError\u001b[0m: name 'pa' is not defined"
     ]
    }
   ],
   "source": [
    "# define schema\n",
    "schema = pa.DataFrameSchema({\n",
    "    \"Culmen Length (mm)\": pa.Column(float, checks=[pa.Check.ge(30),\n",
    "                                                   pa.Check.le(60)]),\n",
    "    \"Culmen Depth (mm)\": pa.Column(float, checks=[pa.Check.ge(13),\n",
    "                                                  pa.Check.le(22)]),\n",
    "    \"Flipper Length (mm)\": pa.Column(float, checks=[pa.Check.ge(170),\n",
    "                                                    pa.Check.le(235)]),\n",
    "    \"Sex\": pa.Column(str, checks=pa.Check.isin([\"MALE\",\"FEMALE\"])),\n",
    "})\n",
    "\n",
    "validated_test = schema(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:20.352063Z",
     "iopub.status.busy": "2022-12-13T01:42:20.352063Z",
     "iopub.status.idle": "2022-12-13T01:42:20.379758Z",
     "shell.execute_reply": "2022-12-13T01:42:20.379257Z"
    }
   },
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'X_test' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "Cell \u001b[1;32mIn [12], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43mX_test\u001b[49m\u001b[38;5;241m.\u001b[39mSex\u001b[38;5;241m.\u001b[39munique()\n",
      "\u001b[1;31mNameError\u001b[0m: name 'X_test' is not defined"
     ]
    }
   ],
   "source": [
    "X_test.Sex.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-13T01:42:20.382758Z",
     "iopub.status.busy": "2022-12-13T01:42:20.382259Z",
     "iopub.status.idle": "2022-12-13T01:42:20.410766Z",
     "shell.execute_reply": "2022-12-13T01:42:20.410265Z"
    }
   },
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'X_test' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "Cell \u001b[1;32mIn [13], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43mX_test\u001b[49m\u001b[38;5;241m.\u001b[39mloc[\u001b[38;5;241m259\u001b[39m]\n",
      "\u001b[1;31mNameError\u001b[0m: name 'X_test' is not defined"
     ]
    }
   ],
   "source": [
    "X_test.loc[259]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can you fix the data to conform to the schema?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.10.8 ('pydata-global-2022-ml-repro')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  },
  "vscode": {
   "interpreter": {
    "hash": "d7369b48cea8bb1af6d88d25f2646d14ea11b68d7457d74f06fbf0d68480668d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}