{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Evaluation\n", "\n", "Applying machine learning in an applied science context is often method work. We build a prototype model and want to show that this method can be applied to our specific problem. This means that we have to guarantee that the insights we glean from this application generalize to new data from the same problem set.\n", "\n", "This is why we usually import `train_test_split()` from scikit-learn to get a validation set and a test set. But in my experience, in real-world applications, this isn’t always enough. In science, we usually deal with data that is correlated along some dimension. Sometimes we have geospatial data and have to account for Tobler’s Law, i.e. things that are closer to each other matter more to each other than data points at a larger distance. Sometimes we have temporal correlations, dealing with time series, where data points closer in time may influence each other.\n", "\n", "Neglecting proper validation will often lead to additional review cycles in a paper submission. It might lead to a rejection of the manuscript, which is bad enough. In the worst-case scenario, our research might report incorrect conclusions and have to be retracted. No one wants rejections or even retractions.\n", "\n", "So we’ll go into some methods to properly evaluate machine learning models even when our data is not “independent and identically distributed”." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:10.608399Z", "iopub.status.busy": "2022-12-13T01:42:10.607899Z", "iopub.status.idle": "2022-12-13T01:42:10.619565Z", "shell.execute_reply": "2022-12-13T01:42:10.619064Z" }, "tags": [] }, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "DATA_FOLDER = Path(\"..\", \"..\") / \"data\"\n", "DATA_FILEPATH = DATA_FOLDER / \"penguins_clean.csv\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:10.622066Z", "iopub.status.busy": "2022-12-13T01:42:10.621565Z", "iopub.status.idle": "2022-12-13T01:42:11.022636Z", "shell.execute_reply": "2022-12-13T01:42:11.022136Z" }, "lines_to_next_cell": 2, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Culmen Length (mm)Culmen Depth (mm)Flipper Length (mm)SexSpecies
039.118.7181.0MALEAdelie Penguin (Pygoscelis adeliae)
139.517.4186.0FEMALEAdelie Penguin (Pygoscelis adeliae)
240.318.0195.0FEMALEAdelie Penguin (Pygoscelis adeliae)
336.719.3193.0FEMALEAdelie Penguin (Pygoscelis adeliae)
439.320.6190.0MALEAdelie Penguin (Pygoscelis adeliae)
\n", "
" ], "text/plain": [ " Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Sex \\\n", "0 39.1 18.7 181.0 MALE \n", "1 39.5 17.4 186.0 FEMALE \n", "2 40.3 18.0 195.0 FEMALE \n", "3 36.7 19.3 193.0 FEMALE \n", "4 39.3 20.6 190.0 MALE \n", "\n", " Species \n", "0 Adelie Penguin (Pygoscelis adeliae) \n", "1 Adelie Penguin (Pygoscelis adeliae) \n", "2 Adelie Penguin (Pygoscelis adeliae) \n", "3 Adelie Penguin (Pygoscelis adeliae) \n", "4 Adelie Penguin (Pygoscelis adeliae) " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "penguins = pd.read_csv(DATA_FILEPATH)\n", "penguins.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Splitting \n", "The simplest method of splitting data into a training and test data set is `train_test_split()`, which randomly selects samples from our dataframe. \n", "\n", "This method essentially makes a very big assumption. That assumption being that our data is \"independent and identically distributed\" or i.i.d..\n", "\n", "That simply means that each measurement for a penguin does not depend on another measurement. Luckily for penguins that is mostly true. For other data? Not so much.\n", "And it means that we expect that we have a similar distribution of measurements of our penguins to the unseen data or future measurements.\n", "\n", "
\n", "Tip: The i.i.d. assumption lies at the core of most machine learning and is an important concept to dive into and understand.
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:11.025637Z", "iopub.status.busy": "2022-12-13T01:42:11.025136Z", "iopub.status.idle": "2022-12-13T01:42:11.193166Z", "shell.execute_reply": "2022-12-13T01:42:11.192666Z" }, "tags": [] }, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'sklearn'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [3], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmodel_selection\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m train_test_split\n\u001b[0;32m 2\u001b[0m num_features \u001b[38;5;241m=\u001b[39m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCulmen Length (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCulmen Depth (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mFlipper Length (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[0;32m 3\u001b[0m cat_features \u001b[38;5;241m=\u001b[39m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSex\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n", "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'sklearn'" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "num_features = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\", \"Flipper Length (mm)\"]\n", "cat_features = [\"Sex\"]\n", "features = num_features + cat_features\n", "target = [\"Species\"]\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(penguins[features], penguins[target], train_size=.7, random_state=42)\n", "X_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stratification \n", 
"Usually, our target class or another feature we use isn't distributed equally." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:11.195666Z", "iopub.status.busy": "2022-12-13T01:42:11.195666Z", "iopub.status.idle": "2022-12-13T01:42:11.596221Z", "shell.execute_reply": "2022-12-13T01:42:11.595720Z" } }, "outputs": [], "source": [ "from matplotlib import pyplot as plt" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:11.599721Z", "iopub.status.busy": "2022-12-13T01:42:11.599221Z", "iopub.status.idle": "2022-12-13T01:42:11.735745Z", "shell.execute_reply": "2022-12-13T01:42:11.735245Z" }, "tags": [] }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "penguins.groupby(\"Species\").Sex.count().plot(kind=\"bar\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case it's not very extreme. We have around twice as many Adelie than Chinstrap penguins.\n", "\n", "However, this can mean that we accidentally have almost no Chinstrap penguins in our training data, as it randomly overselects Adelie penguins." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:11.738746Z", "iopub.status.busy": "2022-12-13T01:42:11.738246Z", "iopub.status.idle": "2022-12-13T01:42:11.766733Z", "shell.execute_reply": "2022-12-13T01:42:11.766231Z" }, "tags": [] }, "outputs": [ { "ename": "NameError", "evalue": "name 'y_train' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [6], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43my_train\u001b[49m\u001b[38;5;241m.\u001b[39mreset_index()\u001b[38;5;241m.\u001b[39mgroupby([\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSpecies\u001b[39m\u001b[38;5;124m\"\u001b[39m])\u001b[38;5;241m.\u001b[39mcount()\n", "\u001b[1;31mNameError\u001b[0m: name 'y_train' is not defined" ] } ], "source": [ "y_train.reset_index().groupby([\"Species\"]).count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can address this by applying **stratification**.\n", "\n", "That is simply achieved by randomly sampling *within a class** (or strata) rather than randomly sampling from the entire dataframe." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:11.769232Z", "iopub.status.busy": "2022-12-13T01:42:11.769232Z", "iopub.status.idle": "2022-12-13T01:42:11.797561Z", "shell.execute_reply": "2022-12-13T01:42:11.797060Z" }, "tags": [] }, "outputs": [ { "ename": "NameError", "evalue": "name 'features' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [7], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m X, y \u001b[38;5;241m=\u001b[39m penguins[\u001b[43mfeatures\u001b[49m], penguins[target[\u001b[38;5;241m0\u001b[39m]]\n\u001b[0;32m 2\u001b[0m X_train, X_test, y_train, y_test \u001b[38;5;241m=\u001b[39m train_test_split(X, y, train_size\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m.7\u001b[39m, random_state\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m42\u001b[39m, stratify\u001b[38;5;241m=\u001b[39my)\n", "\u001b[1;31mNameError\u001b[0m: name 'features' is not defined" ] } ], "source": [ "X, y = penguins[features], penguins[target[0]]\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7, random_state=42, stratify=y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To qualitatively assess the effect of stratification, let's plot class distribution in both _training_ and _test_ sets:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:11.800061Z", "iopub.status.busy": "2022-12-13T01:42:11.800061Z", "iopub.status.idle": "2022-12-13T01:42:11.999103Z", "shell.execute_reply": "2022-12-13T01:42:11.998601Z" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'y_train' is not defined", "output_type": "error", "traceback": [ 
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [8], line 3\u001b[0m\n\u001b[0;32m 1\u001b[0m fig, (ax1, ax2) \u001b[38;5;241m=\u001b[39m plt\u001b[38;5;241m.\u001b[39msubplots(\u001b[38;5;241m1\u001b[39m, \u001b[38;5;241m2\u001b[39m, figsize\u001b[38;5;241m=\u001b[39m(\u001b[38;5;241m12\u001b[39m, \u001b[38;5;241m8\u001b[39m))\n\u001b[1;32m----> 3\u001b[0m \u001b[43my_train\u001b[49m\u001b[38;5;241m.\u001b[39mreset_index()\u001b[38;5;241m.\u001b[39mgroupby(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSpecies\u001b[39m\u001b[38;5;124m\"\u001b[39m)\u001b[38;5;241m.\u001b[39mcount()\u001b[38;5;241m.\u001b[39mplot(kind\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mbar\u001b[39m\u001b[38;5;124m\"\u001b[39m, ax\u001b[38;5;241m=\u001b[39max1, ylim\u001b[38;5;241m=\u001b[39m(\u001b[38;5;241m0\u001b[39m, \u001b[38;5;28mlen\u001b[39m(y)), title\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTraining\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 4\u001b[0m y_test\u001b[38;5;241m.\u001b[39mreset_index()\u001b[38;5;241m.\u001b[39mgroupby(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSpecies\u001b[39m\u001b[38;5;124m\"\u001b[39m)\u001b[38;5;241m.\u001b[39mcount()\u001b[38;5;241m.\u001b[39mplot(kind\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mbar\u001b[39m\u001b[38;5;124m\"\u001b[39m, ax\u001b[38;5;241m=\u001b[39max2, ylim\u001b[38;5;241m=\u001b[39m(\u001b[38;5;241m0\u001b[39m, \u001b[38;5;28mlen\u001b[39m(y)), title\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTest\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 5\u001b[0m plt\u001b[38;5;241m.\u001b[39mshow()\n", "\u001b[1;31mNameError\u001b[0m: name 'y_train' is not defined" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))\n", "\n", "y_train.reset_index().groupby(\"Species\").count().plot(kind=\"bar\", ax=ax1, ylim=(0, len(y)), title=\"Training\")\n", "y_test.reset_index().groupby(\"Species\").count().plot(kind=\"bar\", ax=ax2, ylim=(0, len(y)), title=\"Test\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's quickly train a model to evaluate" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.002104Z", "iopub.status.busy": "2022-12-13T01:42:12.001603Z", "iopub.status.idle": "2022-12-13T01:42:12.030607Z", "shell.execute_reply": "2022-12-13T01:42:12.029608Z" }, "tags": [] }, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'sklearn'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [9], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01msvm\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m SVC\n\u001b[0;32m 2\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mcompose\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m ColumnTransformer\n\u001b[0;32m 3\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpipeline\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m Pipeline\n", "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'sklearn'" ] } ], "source": [ "from sklearn.svm import SVC\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.pipeline 
import Pipeline\n", "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", "\n", "num_transformer = StandardScaler()\n", "cat_transformer = OneHotEncoder(handle_unknown='ignore')\n", "\n", "preprocessor = ColumnTransformer(transformers=[\n", " ('num', num_transformer, num_features),\n", " ('cat', cat_transformer, cat_features)\n", "])\n", "\n", "model = Pipeline(steps=[\n", " ('preprocessor', preprocessor),\n", " ('classifier', SVC()),\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This difference is not drastic, as we can see in the plot above.\n", "That changes, however, when we have minority classes with far fewer samples than the majority class.\n", "\n", "Either way, it's worth keeping in mind that stratification exists. The `stratify=` keyword takes any array-like of labels as long as its length matches the number of rows in the dataframe.\n", "\n", "## Cross-Validation\n", "Cross-validation is often considered the gold standard in statistical applications and machine learning.\n", "\n", "Cross-validation splits the data into folds, of which one is held out as the validation set and the rest is used to train.\n", "Subsequently, further models are trained with a different fold held out each time, in a round-robin style. That way, every sample of the dataset is used for both training and validation.\n", "![Scikit-learn cross validation](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)\n", "*Scikit-learn cross-validation schema. [[Source](https://scikit-learn.org/stable/modules/cross_validation.html)]*\n", "\n", "Cross-validation is particularly useful when we don't have a lot of data or the data is highly heterogeneous." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.033107Z", "iopub.status.busy": "2022-12-13T01:42:12.033107Z", "iopub.status.idle": "2022-12-13T01:42:12.061074Z", "shell.execute_reply": "2022-12-13T01:42:12.060572Z" }, "tags": [] }, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'sklearn'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [10], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmodel_selection\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m cross_val_score\n\u001b[0;32m 3\u001b[0m scores \u001b[38;5;241m=\u001b[39m cross_val_score(model, X_train, y_train, cv\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m5\u001b[39m)\n\u001b[0;32m 4\u001b[0m scores\n", "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'sklearn'" ] } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", "scores = cross_val_score(model, X_train, y_train, cv=5)\n", "scores" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.063573Z", "iopub.status.busy": "2022-12-13T01:42:12.063573Z", "iopub.status.idle": "2022-12-13T01:42:12.092079Z", "shell.execute_reply": "2022-12-13T01:42:12.091579Z" }, "tags": [] }, "outputs": [ { "ename": "NameError", "evalue": "name 'scores' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [11], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m 
\u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[43mscores\u001b[49m\u001b[38;5;241m.\u001b[39mmean()\u001b[38;5;132;01m:\u001b[39;00m\u001b[38;5;124m0.2f\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m accuracy with a standard deviation of \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mscores\u001b[38;5;241m.\u001b[39mstd()\u001b[38;5;132;01m:\u001b[39;00m\u001b[38;5;124m0.2f\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n", "\u001b[1;31mNameError\u001b[0m: name 'scores' is not defined" ] } ], "source": [ "print(f\"{scores.mean():0.2f} accuracy with a standard deviation of {scores.std():0.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we know that this support-vector machine does exceptionally well on some folds and quite well on others, getting only a few samples wrong." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Brilliant! So let's recap for a moment what we have done so far, in preparation for our (final) **Model evaluation**.\n", "\n", "We have:\n", "\n", "- prepared the model pipeline: `sklearn.pipeline.Pipeline` with `preprocessor + model`\n", "- generated **train** and **test** data partitions (with stratification): `(X_train, y_train)` and `(X_test, y_test)`, respectively\n", " - stratification guaranteed that those partitions retain the class distributions\n", "- assessed model performance via **cross validation** (i.e. 
`cross_val_score`) on `X_train`(!!)\n", " - the objective was to verify model consistency across multiple data partitions\n", "\n", "Now we need to complete our last step, namely to assess how the model we chose in CV (we only had one model, so that was an easy choice :D ) will perform on _future data_!\n", "And we have a _candidate_ as representative for these data: `X_test`.\n", "\n", "Please note that `X_test` has never been used so far (as it should be!). The take-away message here is: _generate the test partition, and forget about it until the last step!_\n", "\n", "\n", "Thanks to `CV`, we have an indication of how the `SVC` classifier behaves on multiple \"versions\" of the training set. We calculated an average score of `0.99` accuracy, therefore we decided this model is to be trusted for predictions on _unseen data_.\n", "\n", "Now all we need to do is verify this assertion.\n", "\n", "To do so we need to: \n", " - train a new model on the entire **training set**\n", " - evaluate its performance on the **test set** (using the metric of choice - presumably the same metric we chose in CV!)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.095579Z", "iopub.status.busy": "2022-12-13T01:42:12.095078Z", "iopub.status.idle": "2022-12-13T01:42:12.123083Z", "shell.execute_reply": "2022-12-13T01:42:12.122583Z" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'Pipeline' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [12], line 2\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;66;03m# training\u001b[39;00m\n\u001b[1;32m----> 2\u001b[0m model \u001b[38;5;241m=\u001b[39m \u001b[43mPipeline\u001b[49m(steps\u001b[38;5;241m=\u001b[39m[\n\u001b[0;32m 3\u001b[0m 
(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mpreprocessor\u001b[39m\u001b[38;5;124m'\u001b[39m, preprocessor),\n\u001b[0;32m 4\u001b[0m (\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mclassifier\u001b[39m\u001b[38;5;124m'\u001b[39m, SVC()),\n\u001b[0;32m 5\u001b[0m ])\n\u001b[0;32m 6\u001b[0m classifier \u001b[38;5;241m=\u001b[39m model\u001b[38;5;241m.\u001b[39mfit(X_train, y_train)\n", "\u001b[1;31mNameError\u001b[0m: name 'Pipeline' is not defined" ] } ], "source": [ "# training\n", "model = Pipeline(steps=[\n", " ('preprocessor', preprocessor),\n", " ('classifier', SVC()),\n", "])\n", "classifier = model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.126084Z", "iopub.status.busy": "2022-12-13T01:42:12.126084Z", "iopub.status.idle": "2022-12-13T01:42:12.154089Z", "shell.execute_reply": "2022-12-13T01:42:12.153589Z" } }, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'sklearn'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [13], line 2\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;66;03m# Model evaluation\u001b[39;00m\n\u001b[1;32m----> 2\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmetrics\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m accuracy_score\n\u001b[0;32m 4\u001b[0m y_pred \u001b[38;5;241m=\u001b[39m classifier\u001b[38;5;241m.\u001b[39mpredict(X_test)\n\u001b[0;32m 5\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mTEST ACC: \u001b[39m\u001b[38;5;124m\"\u001b[39m, accuracy_score(y_true\u001b[38;5;241m=\u001b[39my_test, y_pred\u001b[38;5;241m=\u001b[39my_pred))\n", "\u001b[1;31mModuleNotFoundError\u001b[0m: No module 
named 'sklearn'" ] } ], "source": [ "# Model evaluation\n", "from sklearn.metrics import accuracy_score\n", "\n", "y_pred = classifier.predict(X_test)\n", "print(\"TEST ACC: \", accuracy_score(y_true=y_test, y_pred=y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can finally say that we have concluded our model evaluation - with a fantastic score of `0.96` accuracy on the test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Choosing the appropriate Evaluation Metric" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, now for the mere sake of considering a more realistic data scenario, let's pretend our reference dataset is composed only of samples from two (out of the three) classes we have. In particular, we will craft our dataset by choosing the most and the least represented classes, respectively. \n", "\n", "The idea is to explore whether the choice of an appropriate metric can make a difference in the evaluation of our machine learning models." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's recall class distributions in our dataset:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.157591Z", "iopub.status.busy": "2022-12-13T01:42:12.157090Z", "iopub.status.idle": "2022-12-13T01:42:12.185068Z", "shell.execute_reply": "2022-12-13T01:42:12.184567Z" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'y' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [14], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43my\u001b[49m\u001b[38;5;241m.\u001b[39mreset_index()\u001b[38;5;241m.\u001b[39mgroupby([\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSpecies\u001b[39m\u001b[38;5;124m\"\u001b[39m])\u001b[38;5;241m.\u001b[39mcount()\n", "\u001b[1;31mNameError\u001b[0m: name 'y' is not defined" ] } ], "source": [ "y.reset_index().groupby([\"Species\"]).count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So let's select samples from the first two classes, `Adelie Penguin` and `Chinstrap penguin`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.188068Z", "iopub.status.busy": "2022-12-13T01:42:12.187569Z", "iopub.status.idle": "2022-12-13T01:42:12.200571Z", "shell.execute_reply": "2022-12-13T01:42:12.200070Z" } }, "outputs": [], "source": [ "samples = penguins[((penguins[\"Species\"].str.startswith(\"Adelie\")) | (penguins[\"Species\"].str.startswith(\"Chinstrap\")))]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.203070Z", "iopub.status.busy": "2022-12-13T01:42:12.203070Z", "iopub.status.idle": "2022-12-13T01:42:12.216074Z", "shell.execute_reply": 
"2022-12-13T01:42:12.215573Z" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "samples.shape[0] == 146 + 68 # quick verification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make things even harder for our machine learning model, let's also see if we could get rid of _clearly_ separating features in this toy dataset" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.219073Z", "iopub.status.busy": "2022-12-13T01:42:12.218573Z", "iopub.status.idle": "2022-12-13T01:42:12.247078Z", "shell.execute_reply": "2022-12-13T01:42:12.246578Z" } }, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'seaborn'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [17], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mseaborn\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01msns\u001b[39;00m\n\u001b[0;32m 3\u001b[0m pairplot_figure \u001b[38;5;241m=\u001b[39m sns\u001b[38;5;241m.\u001b[39mpairplot(samples, hue\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSpecies\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'seaborn'" ] } ], "source": [ "import seaborn as sns\n", "\n", "pairplot_figure = sns.pairplot(samples, hue=\"Species\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK so if we get to choose, we could definitely say that in this dataset, the `Flipper Length` in combination with the `Culmen Depth` leads to the hardest classification task for our machine learning model.\n", "\n", "Therefore, here is the plan:\n", "- we 
select only those two numerical features (_iow_ we will get rid of the `Culmen Length` feature)\n", "- we will apply the same _Model evaluation_ pipeline as in our previous example\n", " - Cross Validation + Evaluation on Test set\n", "\n", "The key difference this time is that we will use multiple metrics to evaluate our model to prove our point on _carefully selecting evaluation metrics_." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.250078Z", "iopub.status.busy": "2022-12-13T01:42:12.249578Z", "iopub.status.idle": "2022-12-13T01:42:12.262580Z", "shell.execute_reply": "2022-12-13T01:42:12.262080Z" } }, "outputs": [], "source": [ "num_features = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\", \"Flipper Length (mm)\"]\n", "selected_num_features = num_features[1:]\n", "cat_features = [\"Sex\"]\n", "features = selected_num_features + cat_features" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.264580Z", "iopub.status.busy": "2022-12-13T01:42:12.264580Z", "iopub.status.idle": "2022-12-13T01:42:12.293586Z", "shell.execute_reply": "2022-12-13T01:42:12.293087Z" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'StandardScaler' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [19], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m num_transformer \u001b[38;5;241m=\u001b[39m \u001b[43mStandardScaler\u001b[49m()\n\u001b[0;32m 2\u001b[0m cat_transformer \u001b[38;5;241m=\u001b[39m OneHotEncoder(handle_unknown\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mignore\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[0;32m 4\u001b[0m preprocessor \u001b[38;5;241m=\u001b[39m 
ColumnTransformer(transformers\u001b[38;5;241m=\u001b[39m[\n\u001b[0;32m 5\u001b[0m (\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mnum\u001b[39m\u001b[38;5;124m'\u001b[39m, num_transformer, selected_num_features), \u001b[38;5;66;03m# note here, we will only preprocess selected numerical features\u001b[39;00m\n\u001b[0;32m 6\u001b[0m (\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcat\u001b[39m\u001b[38;5;124m'\u001b[39m, cat_transformer, cat_features)\n\u001b[0;32m 7\u001b[0m ])\n", "\u001b[1;31mNameError\u001b[0m: name 'StandardScaler' is not defined" ] } ], "source": [ "num_transformer = StandardScaler()\n", "cat_transformer = OneHotEncoder(handle_unknown='ignore')\n", "\n", "preprocessor = ColumnTransformer(transformers=[\n", " ('num', num_transformer, selected_num_features), # note here, we will only preprocess selected numerical features\n", " ('cat', cat_transformer, cat_features)\n", "])\n", "\n", "model = Pipeline(steps=[\n", " ('preprocessor', preprocessor),\n", " ('classifier', SVC()),\n", "])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.296586Z", "iopub.status.busy": "2022-12-13T01:42:12.296087Z", "iopub.status.idle": "2022-12-13T01:42:12.324591Z", "shell.execute_reply": "2022-12-13T01:42:12.324091Z" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'target' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [20], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m X, y \u001b[38;5;241m=\u001b[39m samples[features], samples[\u001b[43mtarget\u001b[49m[\u001b[38;5;241m0\u001b[39m]]\n\u001b[0;32m 2\u001b[0m X_train, X_test, y_train, y_test \u001b[38;5;241m=\u001b[39m train_test_split(X, y, train_size\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m.7\u001b[39m, 
random_state\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m42\u001b[39m, stratify\u001b[38;5;241m=\u001b[39my) \u001b[38;5;66;03m# we also stratify on classes\u001b[39;00m\n", "\u001b[1;31mNameError\u001b[0m: name 'target' is not defined" ] } ], "source": [ "X, y = samples[features], samples[target[0]]\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7, random_state=42, stratify=y) # we also stratify on classes" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.327092Z", "iopub.status.busy": "2022-12-13T01:42:12.327092Z", "iopub.status.idle": "2022-12-13T01:42:12.355597Z", "shell.execute_reply": "2022-12-13T01:42:12.355096Z" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'y_train' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [21], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43my_train\u001b[49m\u001b[38;5;241m.\u001b[39mreset_index()\u001b[38;5;241m.\u001b[39mgroupby(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSpecies\u001b[39m\u001b[38;5;124m\"\u001b[39m)\u001b[38;5;241m.\u001b[39mcount()\n", "\u001b[1;31mNameError\u001b[0m: name 'y_train' is not defined" ] } ], "source": [ "y_train.reset_index().groupby(\"Species\").count()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.358597Z", "iopub.status.busy": "2022-12-13T01:42:12.358097Z", "iopub.status.idle": "2022-12-13T01:42:12.386314Z", "shell.execute_reply": "2022-12-13T01:42:12.385813Z" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'y_test' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 
"\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [22], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43my_test\u001b[49m\u001b[38;5;241m.\u001b[39mreset_index()\u001b[38;5;241m.\u001b[39mgroupby(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSpecies\u001b[39m\u001b[38;5;124m\"\u001b[39m)\u001b[38;5;241m.\u001b[39mcount()\n", "\u001b[1;31mNameError\u001b[0m: name 'y_test' is not defined" ] } ], "source": [ "y_test.reset_index().groupby(\"Species\").count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our evaluation pipeline we will be using keep record both **accuracy** (`ACC`) and **matthew correlation coefficient** (`MCC`)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.389314Z", "iopub.status.busy": "2022-12-13T01:42:12.389314Z", "iopub.status.idle": "2022-12-13T01:42:12.417304Z", "shell.execute_reply": "2022-12-13T01:42:12.416804Z" } }, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'sklearn'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [23], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmodel_selection\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m cross_validate\n\u001b[0;32m 2\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmetrics\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m make_scorer\n\u001b[0;32m 3\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmetrics\u001b[39;00m 
\u001b[38;5;28;01mimport\u001b[39;00m matthews_corrcoef \u001b[38;5;28;01mas\u001b[39;00m mcc\n", "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'sklearn'" ] } ], "source": [ "from sklearn.model_selection import cross_validate\n", "from sklearn.metrics import make_scorer\n", "from sklearn.metrics import matthews_corrcoef as mcc\n", "from sklearn.metrics import accuracy_score as acc\n", "\n", "mcc_scorer = make_scorer(mcc)\n", "acc_scorer = make_scorer(acc)\n", "scores = cross_validate(model, X_train, y_train, cv=5,\n", " scoring={\"MCC\": mcc_scorer, \"ACC\": acc_scorer})\n", "scores" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.420305Z", "iopub.status.busy": "2022-12-13T01:42:12.419805Z", "iopub.status.idle": "2022-12-13T01:42:12.448324Z", "shell.execute_reply": "2022-12-13T01:42:12.447823Z" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'scores' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [24], line 3\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mnumpy\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mnp\u001b[39;00m\n\u001b[1;32m----> 3\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAvg ACC in CV: \u001b[39m\u001b[38;5;124m\"\u001b[39m, np\u001b[38;5;241m.\u001b[39maverage(\u001b[43mscores\u001b[49m[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtest_ACC\u001b[39m\u001b[38;5;124m\"\u001b[39m]))\n\u001b[0;32m 4\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAvg MCC in CV: \u001b[39m\u001b[38;5;124m\"\u001b[39m, 
np\u001b[38;5;241m.\u001b[39maverage(scores[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtest_MCC\u001b[39m\u001b[38;5;124m\"\u001b[39m]))\n", "\u001b[1;31mNameError\u001b[0m: name 'scores' is not defined" ] } ], "source": [ "import numpy as np\n", "\n", "print(\"Avg ACC in CV: \", np.average(scores[\"test_ACC\"]))\n", "print(\"Avg MCC in CV: \", np.average(scores[\"test_MCC\"]))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.451324Z", "iopub.status.busy": "2022-12-13T01:42:12.450824Z", "iopub.status.idle": "2022-12-13T01:42:12.479064Z", "shell.execute_reply": "2022-12-13T01:42:12.478563Z" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'model' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [25], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m model \u001b[38;5;241m=\u001b[39m \u001b[43mmodel\u001b[49m\u001b[38;5;241m.\u001b[39mfit(X_train, y_train)\n\u001b[0;32m 3\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mACC: \u001b[39m\u001b[38;5;124m\"\u001b[39m, acc_scorer(model, X_test, y_test))\n\u001b[0;32m 4\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mMCC: \u001b[39m\u001b[38;5;124m\"\u001b[39m, mcc_scorer(model, X_test, y_test))\n", "\u001b[1;31mNameError\u001b[0m: name 'model' is not defined" ] } ], "source": [ "model = model.fit(X_train, y_train)\n", "\n", "print(\"ACC: \", acc_scorer(model, X_test, y_test))\n", "print(\"MCC: \", mcc_scorer(model, X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see exactly what happened, let's have a look at the **Confusion matrix**" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": 
"2022-12-13T01:42:12.482064Z", "iopub.status.busy": "2022-12-13T01:42:12.482064Z", "iopub.status.idle": "2022-12-13T01:42:12.510068Z", "shell.execute_reply": "2022-12-13T01:42:12.509568Z" } }, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'sklearn'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [26], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmetrics\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m ConfusionMatrixDisplay\n\u001b[0;32m 2\u001b[0m fig, ax \u001b[38;5;241m=\u001b[39m plt\u001b[38;5;241m.\u001b[39msubplots(figsize\u001b[38;5;241m=\u001b[39m(\u001b[38;5;241m15\u001b[39m, \u001b[38;5;241m10\u001b[39m))\n\u001b[0;32m 3\u001b[0m ConfusionMatrixDisplay\u001b[38;5;241m.\u001b[39mfrom_estimator(model, X_test, y_test, ax\u001b[38;5;241m=\u001b[39max)\n", "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'sklearn'" ] } ], "source": [ "from sklearn.metrics import ConfusionMatrixDisplay\n", "fig, ax = plt.subplots(figsize=(15, 10))\n", "ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, ax=ax)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, the model did a pretty bad job in classifying *Chinstrap Penguins* and the `MCC` was able to catch that, whilst `ACC` could not as it only considers correctly classified samples!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Time-series Validation\n", "\n", "But validation can get tricky if time gets involved.\n", "\n", "Imagine we measured the growth of baby penguin Hank over time and wanted to us machine learning to project the development of Hank. Then our data suddenly isn't i.i.d. 
anymore, since it is dependent in the time dimension.\n", "\n", "Were we to split our data randomly into a training and a test set, we would test on data points that lie in between training points, where even a simple linear interpolation can do a fairly decent job.\n", "\n", "Therefore, we need to split our measurements along the time axis, so that each test fold lies after the data used for training.\n", "![Scikit-learn time series validation](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_013.png)\n", "*Scikit-learn Time Series CV [[Source]](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split).*" ] },
{ "cell_type": "code", "execution_count": 27, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.513069Z", "iopub.status.busy": "2022-12-13T01:42:12.512568Z", "iopub.status.idle": "2022-12-13T01:42:12.541078Z", "shell.execute_reply": "2022-12-13T01:42:12.540577Z" }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.model_selection import TimeSeriesSplit\n", "\n", "X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])\n", "y = np.array([1, 2, 3, 4, 5, 6])\n", "tscv = TimeSeriesSplit(n_splits=3)\n", "print(tscv)\n", "\n", "# each split trains on an initial segment and tests on the block that follows it\n", "for train, test in tscv.split(X):\n", "    print(\"%s %s\" % (train, test))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Spatial Validation\n", "\n", "Spatial data, like maps and satellite data, has a similar problem.\n", "\n", "Here the data is correlated in the spatial dimension. However, we can mitigate the effect by assigning each sample to a group, so that all samples from one group land in the same fold. In this simple example I used continents, but it's possible to group by bins on a lat-lon grid as well.\n", "\n", "Here especially, a cross-validation scheme is very important, as it validates against every area on your map at least once." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:12.544078Z", "iopub.status.busy": "2022-12-13T01:42:12.543578Z", "iopub.status.idle": "2022-12-13T01:42:12.572069Z", "shell.execute_reply": "2022-12-13T01:42:12.571568Z" } }, "outputs": [], "source": [ "from sklearn.model_selection import GroupKFold\n", "\n", "X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]\n", "y = [1, 2, 4, 2, 2, 3, 4, 5]\n", "groups = [\"Europe\", \"Africa\", \"Africa\", \"Africa\", \"America\", \"Asia\", \"Asia\", \"Europe\"]\n", "cv = GroupKFold(n_splits=4)\n", "for train, test in cv.split(X, y, groups=groups):\n", "    print(\"%s %s\" % (train, test))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "A simple random split of the data works on toy problems, but real-world data is rarely i.i.d.\n", "\n", "We looked at different ways to evaluate models when the data violates the i.i.d. assumption, and at how to estimate their performance on unseen data without obtaining artificially high scores.\n", "\n", "Tip: Artificially high scores from leakage and cheating mean that our scientific findings hold no merit. This is often caught in review and prolongs the review process (which no one wants). In the worst case, it can divert research funds in the wrong direction and lead to paper retractions or corrections." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "vscode": { "interpreter": { "hash": "d7369b48cea8bb1af6d88d25f2646d14ea11b68d7457d74f06fbf0d68480668d" } } }, "nbformat": 4, "nbformat_minor": 4 }