{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Benchmarking\n", "\n", "Another common reason for rejections of machine learning papers in applied science is the lack of proper benchmarks. This section will be fairly short, as it differs from discipline to discipline.\n", "\n", "However, any time we apply a superfancy deep neural network, we need to supply a benchmark to compare the relative performance of our model to. These models should be established methods in the field and simpler machine learning methods like a linear model, support-vector machine or a random forest." ] }, { "cell_type": "code", "execution_count": 1, "id": "54158e1d", "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:14.528566Z", "iopub.status.busy": "2022-12-13T01:42:14.528066Z", "iopub.status.idle": "2022-12-13T01:42:14.539840Z", "shell.execute_reply": "2022-12-13T01:42:14.539340Z" } }, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "DATA_FOLDER = Path(\"..\", \"..\") / \"data\"\n", "DATA_FILEPATH = DATA_FOLDER / \"penguins_clean.csv\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:14.542341Z", "iopub.status.busy": "2022-12-13T01:42:14.542341Z", "iopub.status.idle": "2022-12-13T01:42:14.958415Z", "shell.execute_reply": "2022-12-13T01:42:14.957914Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Culmen Length (mm)Culmen Depth (mm)Flipper Length (mm)SexSpecies
039.118.7181.0MALEAdelie Penguin (Pygoscelis adeliae)
139.517.4186.0FEMALEAdelie Penguin (Pygoscelis adeliae)
240.318.0195.0FEMALEAdelie Penguin (Pygoscelis adeliae)
336.719.3193.0FEMALEAdelie Penguin (Pygoscelis adeliae)
439.320.6190.0MALEAdelie Penguin (Pygoscelis adeliae)
\n", "
" ], "text/plain": [ " Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Sex \\\n", "0 39.1 18.7 181.0 MALE \n", "1 39.5 17.4 186.0 FEMALE \n", "2 40.3 18.0 195.0 FEMALE \n", "3 36.7 19.3 193.0 FEMALE \n", "4 39.3 20.6 190.0 MALE \n", "\n", " Species \n", "0 Adelie Penguin (Pygoscelis adeliae) \n", "1 Adelie Penguin (Pygoscelis adeliae) \n", "2 Adelie Penguin (Pygoscelis adeliae) \n", "3 Adelie Penguin (Pygoscelis adeliae) \n", "4 Adelie Penguin (Pygoscelis adeliae) " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "penguins = pd.read_csv(DATA_FILEPATH)\n", "penguins.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:14.960915Z", "iopub.status.busy": "2022-12-13T01:42:14.960915Z", "iopub.status.idle": "2022-12-13T01:42:15.128128Z", "shell.execute_reply": "2022-12-13T01:42:15.128128Z" } }, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'sklearn'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [3], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmodel_selection\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m train_test_split\n\u001b[0;32m 2\u001b[0m num_features \u001b[38;5;241m=\u001b[39m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCulmen Length (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCulmen Depth (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mFlipper Length (mm)\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[0;32m 3\u001b[0m cat_features \u001b[38;5;241m=\u001b[39m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSex\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n", "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'sklearn'" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "num_features = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\", \"Flipper Length (mm)\"]\n", "cat_features = [\"Sex\"]\n", "features = num_features + cat_features\n", "target = [\"Species\"]\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(penguins[features], penguins[target], stratify=penguins[target[0]], train_size=.7, random_state=42)\n", "X_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dummy Classifiers\n", "One of the easiest way to build a benchmark is ensuring that our model performs better than random.\n", "\n", "
\n", "Tip: If our model is effectively as good as a coin flip, it's a bad model.
\n", "However, sometimes it isn't obvious how good or bad a model is. Take our penguin data. What counts as \"random classification\" on 3 classes that aren't equally distributed?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:15.131129Z", "iopub.status.busy": "2022-12-13T01:42:15.131129Z", "iopub.status.idle": "2022-12-13T01:42:15.159448Z", "shell.execute_reply": "2022-12-13T01:42:15.158948Z" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'y_train' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [4], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43my_train\u001b[49m\u001b[38;5;241m.\u001b[39mreset_index()\u001b[38;5;241m.\u001b[39mgroupby([\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mSpecies\u001b[39m\u001b[38;5;124m\"\u001b[39m])\u001b[38;5;241m.\u001b[39mcount()\n", "\u001b[1;31mNameError\u001b[0m: name 'y_train' is not defined" ] } ], "source": [ "y_train.reset_index().groupby([\"Species\"]).count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `DummyClassifier` and `DummyRegressor` to show what a random model would predict." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:15.161948Z", "iopub.status.busy": "2022-12-13T01:42:15.161948Z", "iopub.status.idle": "2022-12-13T01:42:15.190454Z", "shell.execute_reply": "2022-12-13T01:42:15.189954Z" } }, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'sklearn'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [5], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mdummy\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m DummyClassifier\n\u001b[0;32m 2\u001b[0m clf \u001b[38;5;241m=\u001b[39m DummyClassifier()\n\u001b[0;32m 3\u001b[0m clf\u001b[38;5;241m.\u001b[39mfit(X_train, y_train)\n", "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'sklearn'" ] } ], "source": [ "from sklearn.dummy import DummyClassifier\n", "clf = DummyClassifier()\n", "clf.fit(X_train, y_train)\n", "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmark Datasets\n", "Another great tool to use is benchmark datasets.\n", "\n", "Most fields have started creating benchmark datasets to test new methods in a controlled environment.\n", "Unfortunately, it still happens that results are over-reported because models weren't adequately evaluated as seen in [notebook 1](1-model-evaluation.ipynb).\n", "Nevertheless, it's easy to reproduce the results as both the code and data are available, so we can quickly see how legitimate reported scores are.\n", "\n", "Examples:\n", "- [Imagenet](https://www.image-net.org/) in computer vision\n", "- [WeatherBench](https://github.com/pangeo-data/WeatherBench) in meteorology\n", "- [ChestX-ray8](https://paperswithcode.com/dataset/chestx-ray8) in medical imaging" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Domain Methods\n", "Any method is stronger if it is 
], "metadata": { "kernelspec": { "display_name": "Python 3.10.8 ('pydata-global-2022-ml-repro')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "vscode": { "interpreter": { "hash": "d7369b48cea8bb1af6d88d25f2646d14ea11b68d7457d74f06fbf0d68480668d" } } }, "nbformat": 4, "nbformat_minor": 2 }