{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Sharing\n", "\n", "Some journals will require the sharing of code or models, but even if they don’t we might benefit from it.\n", "\n", "Anytime we share a model, we give other researchers the opportunity to replicate our studies and iterate upon them. Altruistically, this advances science, which in and of itself is a noble pursuit. However, this also increases the citations of our original research, a core metric for most researchers in academia.\n", "\n", "In this section, we explore how we can export models and make our training codes reproducible. Saving a model from scikit-learn is easy enough. But what tools can we use to easily make our training code adaptable for others to import and try out that model? Specifically, I want to talk about:\n", "\n", "- Automatic Linters\n", "- Automatic Formatting\n", "- Automatic Docstrings and Documentation\n", "- Docker and containerization for ultimate reproducibility\n" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "## Model Export\n", "Scikit learn uses the Python `pickle` (or rather `joblib`) module to persist models in storage.\n", "More information [here](https://scikit-learn.org/stable/model_persistence.html)" ] }, { "cell_type": "code", "execution_count": 1, "id": "54158e1d", "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:16.975494Z", "iopub.status.busy": "2022-12-13T01:42:16.975494Z", "iopub.status.idle": "2022-12-13T01:42:16.987167Z", "shell.execute_reply": "2022-12-13T01:42:16.986667Z" } }, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "DATA_FOLDER = Path(\"..\", \"..\") / \"data\"\n", "DATA_FILEPATH = DATA_FOLDER / \"penguins_clean.csv\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:16.989668Z", "iopub.status.busy": "2022-12-13T01:42:16.989167Z", "iopub.status.idle": "2022-12-13T01:42:17.390237Z", "shell.execute_reply": "2022-12-13T01:42:17.389738Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Culmen Length (mm)Culmen Depth (mm)Flipper Length (mm)SexSpecies
039.118.7181.0MALEAdelie Penguin (Pygoscelis adeliae)
139.517.4186.0FEMALEAdelie Penguin (Pygoscelis adeliae)
240.318.0195.0FEMALEAdelie Penguin (Pygoscelis adeliae)
336.719.3193.0FEMALEAdelie Penguin (Pygoscelis adeliae)
439.320.6190.0MALEAdelie Penguin (Pygoscelis adeliae)
\n", "
" ], "text/plain": [ " Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Sex \\\n", "0 39.1 18.7 181.0 MALE \n", "1 39.5 17.4 186.0 FEMALE \n", "2 40.3 18.0 195.0 FEMALE \n", "3 36.7 19.3 193.0 FEMALE \n", "4 39.3 20.6 190.0 MALE \n", "\n", " Species \n", "0 Adelie Penguin (Pygoscelis adeliae) \n", "1 Adelie Penguin (Pygoscelis adeliae) \n", "2 Adelie Penguin (Pygoscelis adeliae) \n", "3 Adelie Penguin (Pygoscelis adeliae) \n", "4 Adelie Penguin (Pygoscelis adeliae) " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "penguins = pd.read_csv(DATA_FILEPATH)\n", "penguins.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:17.393238Z", "iopub.status.busy": "2022-12-13T01:42:17.392738Z", "iopub.status.idle": "2022-12-13T01:42:17.576270Z", "shell.execute_reply": "2022-12-13T01:42:17.575770Z" } }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "num_features = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\", \"Flipper Length (mm)\"]\n", "cat_features = [\"Sex\"]\n", "features = num_features + cat_features\n", "target = [\"Species\"]\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(penguins[features], penguins[target[0]], stratify=penguins[target[0]], train_size=.7, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we'll build a quick model, as we did in [the Data notebook](/notebooks/0-basic-data-prep-and-model.html)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:17.579270Z", "iopub.status.busy": "2022-12-13T01:42:17.578770Z", "iopub.status.idle": "2022-12-13T01:42:17.607270Z", "shell.execute_reply": "2022-12-13T01:42:17.606769Z" } }, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.svm import SVC\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", "\n", "num_transformer = StandardScaler()\n", "cat_transformer = OneHotEncoder(handle_unknown='ignore')\n", "\n", "preprocessor = ColumnTransformer(transformers=[\n", " ('num', num_transformer, num_features),\n", " ('cat', cat_transformer, cat_features)\n", "])\n", "\n", "model = Pipeline(steps=[\n", " ('preprocessor', preprocessor),\n", " ('classifier', SVC(random_state=42)),\n", "])\n", "\n", "model.fit(X_train, y_train)\n", "model.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Saving a trained scikit-learn model with Joblib is a straightforward and efficient process that ensures the [preservation of model state](https://scikit-learn.org/stable/model_persistence.html) for future use. \n", "\n", "By employing the `joblib.dump()` function, users can serialize the trained model object to disk in a binary format, effectively storing its parameters, coefficients, and other essential attributes. This serialized representation allows for seamless retrieval and reuse of the trained model, enabling users to deploy it in production environments, share it with collaborators, or perform further analysis without the need to retrain the model from scratch. 
\n", "\n", "Additionally, Joblib's serialization mechanism ensures minimal overhead in terms of file size and loading time, making it an ideal choice for saving scikit-learn models efficiently. Overall, leveraging Joblib to save scikit-learn models facilitates reproducibility, scalability, and ease of deployment in real-world." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:17.609770Z", "iopub.status.busy": "2022-12-13T01:42:17.609270Z", "iopub.status.idle": "2022-12-13T01:42:17.622748Z", "shell.execute_reply": "2022-12-13T01:42:17.622248Z" } }, "outputs": [], "source": [ "MODEL_FOLDER = Path(\"..\", \"..\") / \"model\"\n", "MODEL_FOLDER.mkdir(exist_ok=True)\n", "\n", "MODEL_EXPORT_FILE = MODEL_FOLDER / \"svc.joblib\"" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:17.625248Z", "iopub.status.busy": "2022-12-13T01:42:17.624748Z", "iopub.status.idle": "2022-12-13T01:42:17.653756Z", "shell.execute_reply": "2022-12-13T01:42:17.653255Z" } }, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from joblib import dump, load\n", "\n", "dump(model, MODEL_EXPORT_FILE)\n", "\n", "clf = load(MODEL_EXPORT_FILE)\n", "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sources of Randomness\n", "You may have noticed that I used `random_state` in some arguments.\n", "\n", "This fixes all sources of random initialization in a model or method to this particular random seed.\n", "\n", "
\n", "Tip: Always Google the latest way to fix randomness in machine learning code. It differs from library to library and version to version.
It's easy to forget one, which defeats the entire purpose.
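\n", "\n", "For a typical scikit-learn workflow, a minimal sketch of fixing the common sources of randomness looks like this (if you use libraries such as `torch` or `tensorflow`, they bring their own seeding functions on top):\n", "\n", "```python\n", "import random\n", "\n", "import numpy as np\n", "\n", "# Python's built-in RNG and NumPy's global RNG are separate sources\n", "random.seed(42)\n", "np.random.seed(42)\n", "\n", "# scikit-learn draws its randomness from NumPy, but passing\n", "# random_state=42 explicitly to estimators and splitters, as in the\n", "# cells above, is the most reliable way to pin it down\n", "```\n", "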
\n", "\n", "This works in making models reproducible. Try changing the random seed below!" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:17.656756Z", "iopub.status.busy": "2022-12-13T01:42:17.656257Z", "iopub.status.idle": "2022-12-13T01:42:17.684761Z", "shell.execute_reply": "2022-12-13T01:42:17.684261Z" } }, "outputs": [ { "ename": "ModuleNotFoundError", "evalue": "No module named 'sklearn'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [7], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01msklearn\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmodel_selection\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m cross_val_score\n\u001b[0;32m 3\u001b[0m clf \u001b[38;5;241m=\u001b[39m Pipeline(steps\u001b[38;5;241m=\u001b[39m[\n\u001b[0;32m 4\u001b[0m (\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mpreprocessor\u001b[39m\u001b[38;5;124m'\u001b[39m, preprocessor),\n\u001b[0;32m 5\u001b[0m (\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mclassifier\u001b[39m\u001b[38;5;124m'\u001b[39m, SVC(random_state\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m42\u001b[39m)),\n\u001b[0;32m 6\u001b[0m ])\n\u001b[0;32m 7\u001b[0m scores \u001b[38;5;241m=\u001b[39m cross_val_score(clf, X_train, y_train, cv\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m5\u001b[39m)\n", "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'sklearn'" ] } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "\n", "clf = Pipeline(steps=[\n", " ('preprocessor', preprocessor),\n", " ('classifier', SVC(random_state=42)),\n", "])\n", "scores = cross_val_score(clf, X_train, y_train, cv=5)\n", "scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Good code practices\n", "### Linting\n", "Tools like linters are amazing at cleaning up code. `flake8` and editors like VSCode can find unused variables, trailing white-space and lines that are way too long.\n", "\n", "They immediately show some typos that you would otherwise have to pain-stakingly search.\n", "\n", "[Flake8](https://flake8.pycqa.org/en/latest/) tries to be as close to the PEP8 style-guide as possible and find bugs before even running the code.\n", "\n", "### Formatters\n", "There are automatic formatters like `black` that will take your code and change the formatting to comply with formatting rules.\n", "\n", "Formatters like [black](https://github.com/psf/black) don't check the code itself for bugs, but they're great at presenting a common code style.\n", "\n", "They're my shortcut to make good-looking code and make collaboration 100 times easier as the formatting is done by an automated tool.\n", "\n", "### Docstrings\n", "Python has documentation built into its core.\n", "For example, below is the SVC model, if you put the cursor in the brackets, you can hit `Shift + Tab` in Jupyter to read the documentation that is written in the docstring." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:17.687762Z", "iopub.status.busy": "2022-12-13T01:42:17.687262Z", "iopub.status.idle": "2022-12-13T01:42:17.715767Z", "shell.execute_reply": "2022-12-13T01:42:17.715266Z" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'SVC' is not defined", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [8], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43mSVC\u001b[49m()\n", "\u001b[1;31mNameError\u001b[0m: name 'SVC' is not defined" ] } ], "source": [ "SVC()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In VSCode for example there are tools that [autogenerate a docstring](https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring) based on a function footprint:\n", "\n", "
\n", "Tip: Docstrings are an essential part in telling users what a function does and what input and outputs are expected.
\n", "I see docstrings as the minimum documentation one should provide when sharing code, and the auto-generators make it extremely easy to do so.\n", "\n", "This docstring was automatically generated. Just fill in the summary and description and people are happy!\n", "```python\n", "def hello_world(name: str) -> None:\n", " \"\"\"_summary_\n", "\n", " Parameters\n", " ----------\n", " name : str\n", " _description_\n", " \"\"\"\n", " print(f\"Hello {name}\")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dependencies\n", "This repository comes with a `requirements.txt` and a `environment.yml` for `pip` and `conda`.\n", "\n", "A `requirements.txt` can consist of simply the package names. But ideally you want to add the version number, so people can automatically install the exact version you used in your code.\n", "This looks like `pandas==1.0`\n", "\n", "The conda `environment.yml` can be auto-exported from your conda environment:\n", "\n", "`conda env export --from-history > environment.yml`\n", "The `--from-history` makes it cross-platform but eliminates the version numbers.\n", "\n", "For this project, we created a [pip requirements file](https://github.com/JesperDramsch/ml-for-science-reproducibility-tutorial/blob/main/requirements/tutorial.txt) and an [conda environment yaml file](https://github.com/JesperDramsch/ml-for-science-reproducibility-tutorial/blob/main/requirements/tutorial.yml). They don't necessarily contain versions, because this can be tricky across platforms like `MacOS`, `Windows`, and `Linux` unfortunately.\n", "\n", "One way to address this issue is to define ranges like `pandas>=1.0`, which would install the latest pandas starting from pandas version 1.0, or limit to `pandas<=1.0` below a certain version, where breaking changes were introduced." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Docker for ultimate reproducibility\n", "Docker is container that can ship an entire operating system with installed packages and data.\n", "\n", "It makes the environment you used for your computation almost entirely reproducible.\n", "\n", "(It's also great practice for the business world)\n", "\n", "Docker containers are built using the `docker build` command using a `Dockerfile`, an example docker file for Python looks like this:\n", "\n", "```docker\n", "# syntax=docker/dockerfile:1\n", "\n", "FROM python:3.8-slim-buster\n", "\n", "WORKDIR /\n", "\n", "COPY requirements.txt requirements.txt\n", "RUN pip3 install -r requirements.txt\n", "\n", "COPY . .\n", "\n", "CMD python train.py\n", "```\n", "\n", "Then this built image can be shared with other researchers that have the exact same compute environment for this experiments." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.8 ('pydata-global-2022-ml-repro')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" }, "vscode": { "interpreter": { "hash": "d7369b48cea8bb1af6d88d25f2646d14ea11b68d7457d74f06fbf0d68480668d" } } }, "nbformat": 4, "nbformat_minor": 2 }