6.2. Ablation Studies

The gold standard in building complex machine learning models is proving that each constituent part of the model contributes something to the proposed solution.

Ablation studies play a pivotal role in this process by systematically dissecting machine learning models and evaluating the impact of individual components. By selectively removing or disabling specific features, layers, or modules within the model and observing the resulting changes in performance, we can assess the significance of each component in achieving the desired outcome.

Ablation studies offer invaluable insights into the inner workings of complex models, shedding light on which elements are essential for model performance and which may be redundant or less influential. This rigorous approach not only validates the effectiveness of the model architecture but also provides guidance for model refinement and optimization, ultimately advancing the field of machine learning and enhancing the reproducibility and reliability of research findings.
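
In code, the pattern behind every ablation study is the same, and the rest of this section walks through it on a concrete model: score the full model once, then score variants with one component removed and compare. Below is a minimal sketch of the idea, where build_and_score is a placeholder for fitting a model variant and returning a validation score:

def ablation_report(components, build_and_score):
    # Score the full model once as the baseline
    full_score = build_and_score(components)
    for name in components:
        # Remove one component at a time and measure the change in score
        reduced = [c for c in components if c != name]
        delta = build_and_score(reduced) - full_score
        print(f"without {name}: {delta:+.3f}")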

In this section, we’ll finally discuss how to present complex machine learning models in publications and validate the contribution of each part we engineered to solve our particular problem.

(Granted, we’re on a toy problem, so there’s not too much we can ablate…)

First we’ll build a quick model, as we did in the Data notebook.

import warnings
warnings.filterwarnings('ignore')
from pathlib import Path

# Load the cleaned penguin data we prepared in the Data notebook
DATA_FOLDER = Path("..", "..") / "data"
DATA_FILEPATH = DATA_FOLDER / "penguins_clean.csv"
import pandas as pd
penguins = pd.read_csv(DATA_FILEPATH)
penguins.head()
   Culmen Length (mm)  Culmen Depth (mm)  Flipper Length (mm)     Sex                              Species
0                39.1               18.7                181.0    MALE  Adelie Penguin (Pygoscelis adeliae)
1                39.5               17.4                186.0  FEMALE  Adelie Penguin (Pygoscelis adeliae)
2                40.3               18.0                195.0  FEMALE  Adelie Penguin (Pygoscelis adeliae)
3                36.7               19.3                193.0  FEMALE  Adelie Penguin (Pygoscelis adeliae)
4                39.3               20.6                190.0    MALE  Adelie Penguin (Pygoscelis adeliae)
from sklearn.model_selection import train_test_split
num_features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)"]
cat_features = ["Sex"]
features = num_features + cat_features
target = ["Species"]

X_train, X_test, y_train, y_test = train_test_split(
    penguins[features],
    penguins[target[0]],
    stratify=penguins[target[0]],
    train_size=.7,
    random_state=42,
)
import numpy as np
from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import cross_val_score

# Standardise the numeric features and one-hot encode the categorical feature
num_transformer = StandardScaler()
cat_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])


model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42)),
])
model
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['Culmen Length (mm)',
                                                   'Culmen Depth (mm)',
                                                   'Flipper Length (mm)']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Sex'])])),
                ('classifier', SVC(random_state=42))])
scores = cross_val_score(model, X_test, y_test, cv=10)
scores
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

We can save these initial scores under “Full” in a scoring dataframe.

scoring = pd.DataFrame(columns=["Average", "Deviation"])
scoring.loc["Full", :] = [scores.mean(), scores.std()]
scoring
     Average Deviation
Full     1.0       0.0

Let’s compare this to a model that skips the numeric preprocessing.

Here the pipelines come in handy, because switching off a component is as simple as commenting out its entry in the ColumnTransformer. (One caveat: a ColumnTransformer drops columns that aren’t listed, so commenting out the 'num' entry removes the numeric features from the model entirely, rather than merely passing them through unscaled.)

# num_transformer = StandardScaler()
cat_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    # ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])


model2 = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", SVC(random_state=42)),
    ]
)

scores = cross_val_score(model2, X_test, y_test, cv=10)

scoring.loc["No Standardisation",:] = [scores.mean(), scores.std()]
scoring
                     Average Deviation
Full                     1.0       0.0
No Standardisation  0.435455  0.045172

Those scores decrease significantly when we remove the numeric pipeline. Part of the drop comes from losing the numeric features altogether; to ablate only the scaling, we could keep the columns unscaled with ('num', 'passthrough', num_features). Either way, the SVM algorithm is sensitive to the scale of its features, so the standardisation step is crucial for the model to learn from this data.
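
To see the scale differences at play, we can inspect the raw numeric features. This quick check is our own addition rather than part of the original pipeline:

# Flipper lengths sit in the hundreds of millimetres while culmen depths
# stay below twenty, so unscaled distances are dominated by one feature.
X_train[num_features].describe().loc[["mean", "std"]]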

Now we can evaluate whether the model should use a single column for Sex, since it is effectively a binary feature in our research data.

num_transformer = StandardScaler()
# drop='if_binary' keeps a single 0/1 column for two-category features
cat_transformer = OneHotEncoder(handle_unknown='ignore', drop='if_binary')

preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])


model2 = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42)),
])

scores = cross_val_score(model2, X_test, y_test, cv=10)

scoring.loc["Single Column Sex",:] = [scores.mean(), scores.std()]
scoring
                     Average Deviation
Full                     1.0       0.0
No Standardisation  0.435455  0.045172
Single Column Sex        1.0       0.0

This does not seem to affect the model, so in this case we can encode the categorical information as a single binary column instead of adding one feature column each for male and female.
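
To double-check that the encoding really collapsed Sex into a single column, we can inspect the transformed feature names. This assumes a scikit-learn version (1.0 or newer) in which ColumnTransformer provides get_feature_names_out:

preprocessor.fit(X_train)
# With drop='if_binary', Sex becomes one 0/1 column instead of two
preprocessor.get_feature_names_out()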

Clearly this is a toy example, and we knew that switching off standardisation would have this effect. In the real world, however, we would switch off entire components of a neural network to evaluate the impact each one has on model performance. This significantly strengthens the claims we make in our research and usually leads to smoother reviews.
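
As a closing sketch, here is one way the ablations above could be run systematically instead of cell by cell: collect the preprocessor variants in a dictionary and score every variant under identical conditions in a single loop. The variant dictionary and its labels are our own construction, mirroring the cells above:

variants = {
    "Full": [('num', StandardScaler(), num_features),
             ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)],
    # Omitting the 'num' entry drops the numeric columns entirely
    "No Standardisation": [('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)],
    "Single Column Sex": [('num', StandardScaler(), num_features),
                          ('cat', OneHotEncoder(handle_unknown='ignore', drop='if_binary'), cat_features)],
}

ablation = pd.DataFrame(columns=["Average", "Deviation"])
for name, transformers in variants.items():
    pipeline = Pipeline(steps=[
        ('preprocessor', ColumnTransformer(transformers=transformers)),
        ('classifier', SVC(random_state=42)),
    ])
    scores = cross_val_score(pipeline, X_test, y_test, cv=10)
    ablation.loc[name, :] = [scores.mean(), scores.std()]
ablation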