2.3. Benchmarking

Another common reason for the rejection of machine learning papers in applied science is the lack of proper benchmarks. This section will be fairly short, as suitable benchmarks differ from discipline to discipline.

However, any time we apply a superfancy deep neural network, we need to supply a benchmark to compare our model's performance against. These benchmarks should include established methods in the field as well as simpler machine learning methods such as a linear model, a support-vector machine, or a random forest.

from pathlib import Path

DATA_FOLDER = Path("..", "..") / "data"
DATA_FILEPATH = DATA_FOLDER / "penguins_clean.csv"

import pandas as pd
penguins = pd.read_csv(DATA_FILEPATH)
penguins.head()

	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Sex	Species
0	39.1	18.7	181.0	MALE	Adelie Penguin (Pygoscelis adeliae)
1	39.5	17.4	186.0	FEMALE	Adelie Penguin (Pygoscelis adeliae)
2	40.3	18.0	195.0	FEMALE	Adelie Penguin (Pygoscelis adeliae)
3	36.7	19.3	193.0	FEMALE	Adelie Penguin (Pygoscelis adeliae)
4	39.3	20.6	190.0	MALE	Adelie Penguin (Pygoscelis adeliae)

from sklearn.model_selection import train_test_split
num_features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)"]
cat_features = ["Sex"]
features = num_features + cat_features
target = ["Species"]

X_train, X_test, y_train, y_test = train_test_split(
    penguins[features],
    penguins[target],
    stratify=penguins[target[0]],
    train_size=.7,
    random_state=42,
)
X_train

	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Sex
221	51.1	16.3	220.0	MALE
315	49.8	17.3	198.0	FEMALE
262	46.8	14.3	215.0	FEMALE
191	45.5	13.9	210.0	FEMALE
8	38.6	21.2	191.0	MALE
...	...	...	...	...
9	34.6	21.1	198.0	MALE
96	37.7	16.0	183.0	FEMALE
184	48.7	15.7	208.0	MALE
212	43.5	14.2	220.0	FEMALE
64	33.5	19.0	190.0	FEMALE

233 rows × 4 columns

2.3.1. Dummy Classifiers

One of the easiest ways to build a benchmark is to check that our model performs better than random.

Tip: If our model is effectively as good as a coin flip, it's a bad model.

However, sometimes it isn't obvious how good or bad a model is. Take our penguin data. What counts as "random classification" on 3 classes that aren't equally distributed?

y_train.reset_index().groupby(["Species"]).count()

	index
Species
Adelie Penguin (Pygoscelis adeliae)	102
Chinstrap penguin (Pygoscelis antarctica)	47
Gentoo penguin (Pygoscelis papua)	84
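To make the question concrete, the arithmetic can be sketched in a few lines of plain Python. The class counts are copied from the table above (with shortened labels), and the three "guessing" schemes are illustrative choices, not the only possible definitions of chance. The numbers are expected accuracies, assuming the evaluation data has the same class proportions as the training set.

```python
# Training-set class counts copied from the table above (labels shortened).
counts = {
    "Adelie": 102,
    "Chinstrap": 47,
    "Gentoo": 84,
}
total = sum(counts.values())
proportions = {name: n / total for name, n in counts.items()}

# Always guess the most frequent class.
majority_accuracy = max(proportions.values())
# Guess each class with a probability equal to its frequency.
stratified_accuracy = sum(p ** 2 for p in proportions.values())
# Guess each class with equal probability.
uniform_accuracy = 1 / len(counts)

print(f"majority:   {majority_accuracy:.3f}")    # 0.438
print(f"stratified: {stratified_accuracy:.3f}")  # 0.362
print(f"uniform:    {uniform_accuracy:.3f}")     # 0.333
```

On imbalanced data, "better than chance" therefore depends on which naive strategy we compare against, and the majority-class guess is the hardest of the three to beat here.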

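We can use DummyClassifier and DummyRegressor from scikit-learn to show what a naive model would predict. Note that in recent scikit-learn versions the default strategy of DummyClassifier, "prior", always predicts the most frequent class rather than guessing at random; strategy="stratified" or strategy="uniform" produce random guesses instead.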

from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.43564356435643564
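The regression counterpart, DummyRegressor, works the same way. Since the penguin task is a classification problem, the sketch below uses synthetic data that is purely illustrative (the variable names and the linear relationship are assumptions, not part of the penguin dataset). It compares a constant mean prediction with a plain linear regression.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only): y is linear in x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

# For regressors, .score() returns the coefficient of determination R^2.
baseline_r2 = baseline.score(X_test, y_test)
model_r2 = model.score(X_test, y_test)
print(f"DummyRegressor R^2:   {baseline_r2:.3f}")
print(f"LinearRegression R^2: {model_r2:.3f}")
```

A constant prediction cannot score above zero in R^2 on held-out data, so a useful regression model should land clearly above that.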

2.3.2. Benchmark Datasets

Another great tool to use is benchmark datasets.

Most fields have started creating benchmark datasets to test new methods in a controlled environment. Unfortunately, results are still sometimes over-reported because models weren't adequately evaluated, as seen in notebook 1. Nevertheless, the results are easy to reproduce when both the code and the data are available, so we can quickly check how legitimate the reported scores are.

Examples:

ImageNet in computer vision
WeatherBench in meteorology
ChestX-ray8 in medical imaging

2.3.3. Domain Methods

Any method is stronger if it is verified against standard methods in the field.

For example, a weather forecast post-processing method should be evaluated against an established standard method for forecast post-processing.

This is where domain expertise is important.

2.3.4. Linear and Standard Models

In addition to the dummy estimators, we also want to evaluate our fancy solutions against very simple models.

Personally, I like using:

As an exercise, try implementing baseline models to compare against the support-vector machine with preprocessing.
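One possible starting point for this exercise is sketched below. To stay self-contained it uses a synthetic stand-in table rather than the penguin data, and the column names, the preprocessing steps (scaling plus one-hot encoding), and the choice of baselines are all assumptions; the preprocessing from the earlier notebook may differ. The helper make_pipeline_for is hypothetical and only exists to give every model identical preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the penguin table (NOT real measurements):
# three classes whose numeric features are shifted relative to each other.
rng = np.random.default_rng(42)
n_per_class = 60
frames = []
for label, shift in [("A", 0.0), ("B", 2.0), ("C", 4.0)]:
    frames.append(pd.DataFrame({
        "num_1": rng.normal(40 + shift, 2.0, n_per_class),
        "num_2": rng.normal(18 - shift, 1.0, n_per_class),
        "num_3": rng.normal(190 + 5 * shift, 6.0, n_per_class),
        "cat_1": rng.choice(["MALE", "FEMALE"], n_per_class),
        "label": label,
    }))
data = pd.concat(frames, ignore_index=True)

num_features = ["num_1", "num_2", "num_3"]
cat_features = ["cat_1"]
X_train, X_test, y_train, y_test = train_test_split(
    data[num_features + cat_features],
    data["label"],
    stratify=data["label"],
    train_size=0.7,
    random_state=42,
)


def make_pipeline_for(model):
    """Wrap a model in the same preprocessing so the comparison is fair."""
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features),
    ])
    return Pipeline([("preprocessor", preprocessor), ("model", model)])


models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "support-vector machine": SVC(),
}

# Fit every model on the same split and collect test accuracies.
scores = {}
for name, model in models.items():
    pipeline = make_pipeline_for(model)
    pipeline.fit(X_train, y_train)
    scores[name] = pipeline.score(X_test, y_test)
    print(f"{name:>22}: {scores[name]:.3f}")
```

With the real data, swapping in penguins, the feature lists, and the target defined earlier should be enough. If the support-vector machine barely beats the logistic regression, that is worth reporting alongside the headline score.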