3.3. Model Sharing#

Some journals will require the sharing of code or models, but even if they don't, we can benefit from it.

Anytime we share a model, we give other researchers the opportunity to replicate our studies and iterate upon them. Altruistically, this advances science, which in and of itself is a noble pursuit. However, this also increases the citations of our original research, a core metric for most researchers in academia.

In this section, we explore how we can export models and make our training code reproducible. Saving a model from scikit-learn is easy enough. But what tools can we use to make our training code easy for others to adapt, so they can import and try out that model? Specifically, I want to talk about:

  • Automatic Linters

  • Automatic Formatting

  • Automatic Docstrings and Documentation

  • Docker and containerization for ultimate reproducibility

3.3.1. Model Export#

Scikit-learn uses the Python pickle module (or rather its drop-in replacement joblib) to persist models in storage. More information is available in the scikit-learn documentation on model persistence.

from pathlib import Path

DATA_FOLDER = Path("..", "..") / "data"
DATA_FILEPATH = DATA_FOLDER / "penguins_clean.csv"
import pandas as pd
penguins = pd.read_csv(DATA_FILEPATH)
penguins.head()
   Culmen Length (mm)  Culmen Depth (mm)  Flipper Length (mm)     Sex                              Species
0                39.1               18.7               181.0    MALE  Adelie Penguin (Pygoscelis adeliae)
1                39.5               17.4               186.0  FEMALE  Adelie Penguin (Pygoscelis adeliae)
2                40.3               18.0               195.0  FEMALE  Adelie Penguin (Pygoscelis adeliae)
3                36.7               19.3               193.0  FEMALE  Adelie Penguin (Pygoscelis adeliae)
4                39.3               20.6               190.0    MALE  Adelie Penguin (Pygoscelis adeliae)
from sklearn.model_selection import train_test_split
num_features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)"]
cat_features = ["Sex"]
features = num_features + cat_features
target = ["Species"]

X_train, X_test, y_train, y_test = train_test_split(
    penguins[features],
    penguins[target[0]],
    stratify=penguins[target[0]],
    train_size=.7,
    random_state=42,
)

First we’ll build a quick model, as we did in the Data notebook.

from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_transformer = StandardScaler()
cat_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42)),
])

model.fit(X_train, y_train)
model.score(X_test, y_test)
1.0

Saving a trained scikit-learn model with joblib is a straightforward and efficient way to preserve the model state for future use.

The joblib.dump() function serializes the trained model object to disk in a binary format, storing its parameters, coefficients, and other essential attributes. This serialized representation allows the trained model to be loaded and reused, so it can be deployed to production, shared with collaborators, or analyzed further without retraining from scratch.

Joblib also keeps file sizes and loading times small, making it a good choice for saving scikit-learn models efficiently. Overall, it facilitates reproducibility, scalability, and ease of deployment in real-world settings.

MODEL_FOLDER = Path("..", "..") / "model"
MODEL_FOLDER.mkdir(exist_ok=True)

MODEL_EXPORT_FILE = MODEL_FOLDER / "svc.joblib"
from joblib import dump, load

dump(model, MODEL_EXPORT_FILE)

clf = load(MODEL_EXPORT_FILE)
clf.score(X_test, y_test)
1.0

3.3.2. Sources of Randomness#

You may have noticed that I used random_state in some arguments.

This fixes all sources of random initialization in a model or method to this particular random seed.

Tip: Always Google the latest way to fix randomness in machine learning code. It differs from library to library and version to version.
It's easy to forget one, which defeats the entire purpose.

This helps in making models reproducible. Try changing the random seed below!

from sklearn.model_selection import cross_val_score

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(random_state=42)),
])
scores = cross_val_score(clf, X_train, y_train, cv=5)
scores
array([1.        , 1.        , 0.9787234 , 0.97826087, 0.97826087])
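Beyond random_state, your script may draw from Python's and NumPy's global random number generators. A minimal sketch of pinning them in one place (set_seeds is a hypothetical helper, not a library function):

```python
import random

import numpy as np

# Hypothetical helper: pin the usual global sources of randomness.
def set_seeds(seed: int = 42) -> None:
    random.seed(seed)      # Python's built-in RNG
    np.random.seed(seed)   # NumPy's legacy global RNG
    # scikit-learn has no global seed: pass random_state=seed to every
    # estimator and splitter instead, as done with SVC above.

set_seeds(42)
```

Calling set_seeds() at the top of a training script means two runs of the same script draw the same random numbers.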

3.3.3. Good code practices#

3.3.3.1. Linting#

Linters are amazing at cleaning up code. flake8, and editors like VSCode, can find unused variables, trailing whitespace, and lines that are way too long.

They immediately surface typos that you would otherwise have to painstakingly search for.

Flake8 stays as close to the PEP 8 style guide as possible and finds bugs before the code is even run.
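On the command line this looks roughly as follows (file names are illustrative); flake8 reports each violation with a rule code:

```shell
pip install flake8

# Lint a single file; E501 (line too long) and F401 (unused import)
# are typical findings.
flake8 train.py --max-line-length 88

# Or lint everything in the current directory.
flake8 .
```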

3.3.3.2. Formatters#

There are automatic formatters like black that will take your code and change the formatting to comply with formatting rules.

Formatters like black don’t check the code itself for bugs, but they’re great at presenting a common code style.

They’re my shortcut to good-looking code, and they make collaboration 100 times easier because the formatting is handled by an automated tool.
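For instance, running black over a file rewrites only the layout, never the behavior. A small sketch (the function is hypothetical):

```python
# Valid but inconsistently formatted code:
def add(a,b,   c):
    return a+b+c

# After running `black`, the same function would read roughly:
#
# def add(a, b, c):
#     return a + b + c

print(add(1, 2, 3))  # behavior is unchanged either way
```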

3.3.3.3. Docstrings#

Python has documentation built into its core. For example, below is the SVC model; if you place the cursor between the parentheses and hit Shift + Tab in Jupyter, you can read the documentation that is written in the docstring.

SVC()

In VSCode, for example, there are tools that auto-generate a docstring based on a function signature:

Tip: Docstrings are an essential part in telling users what a function does and what input and outputs are expected.
I see docstrings as the minimum documentation one should provide when sharing code, and the auto-generators make it extremely easy to do so.

This docstring was automatically generated. Just fill in the summary and description and people are happy!

def hello_world(name: str) -> None:
    """_summary_

    Parameters
    ----------
    name : str
        _description_
    """
    print(f"Hello {name}")
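Once the placeholders are filled in, the docstring travels with the function object and is what Jupyter shows on Shift + Tab. A minimal sketch, with the placeholders completed:

```python
def hello_world(name: str) -> None:
    """Print a friendly greeting.

    Parameters
    ----------
    name : str
        The name to greet.
    """
    print(f"Hello {name}")

# The docstring is available at runtime:
print(hello_world.__doc__)
help(hello_world)  # renders the same text
```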

3.3.4. Dependencies#

This repository comes with a requirements.txt and an environment.yml for pip and conda.

A requirements.txt can consist of simply the package names. But ideally you want to add the version number, so people can automatically install the exact version you used in your code. This looks like pandas==1.0

The conda environment.yml can be auto-exported from your conda environment:

conda env export --from-history > environment.yml

The --from-history flag makes it cross-platform but eliminates the version numbers.

For this project, we created a pip requirements file and a conda environment YAML file. They don’t necessarily contain versions, because exact pins can unfortunately be tricky across platforms like macOS, Windows, and Linux.

One way to address this issue is to define ranges like pandas>=1.0, which installs the latest pandas starting from version 1.0, or to set an upper bound like pandas<2.0 to exclude a version where breaking changes were introduced.
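Putting this together, a requirements.txt mixing exact pins and ranges might look like this (package versions here are purely illustrative):

```
pandas>=1.0,<2.0
scikit-learn==1.2.2
joblib>=1.2
```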

3.3.5. Docker for ultimate reproducibility#

Docker is a containerization tool that can ship an entire operating system with installed packages and data.

It makes the environment you used for your computation almost entirely reproducible.

(It’s also great practice for the business world)

Docker containers are built with the docker build command from a Dockerfile. An example Dockerfile for Python looks like this:

# syntax=docker/dockerfile:1

FROM python:3.8-slim-buster

WORKDIR /

COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

COPY . .

CMD python train.py

This built image can then be shared with other researchers, giving them the exact same compute environment for these experiments.
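Assuming Docker is installed, building, running, and sharing the image looks roughly like this (the penguin-train tag is a hypothetical name):

```shell
docker build -t penguin-train .   # build the image from the Dockerfile
docker run --rm penguin-train     # run train.py inside the container

# Share the image as a single file, or push it to a registry instead.
docker save penguin-train | gzip > penguin-train.tar.gz
```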