Getting to know the data#
This tutorial uses the Palmer Penguins dataset.
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
Let’s dive into some quick exploration of the data!
from pathlib import Path
DATA_FOLDER = Path("..", "..") / "data"  # data directory, relative to this notebook
DATA_FILEPATH = DATA_FOLDER / "penguins.csv"
import pandas as pd
penguins_raw = pd.read_csv(DATA_FILEPATH)
penguins_raw.head()
 | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 2007-11-11 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 2007-11-11 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 2007-11-16 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 2007-11-16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. |
4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 2007-11-16 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN |
That looks like a lot of columns. Let’s reduce the data to three numerical measurements, the penguins’ sex as a categorical feature, and the species as our target column.
num_features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)"]
cat_features = ["Sex"]
features = num_features + cat_features
target = ["Species"]
penguins = penguins_raw[features+target]
penguins
 | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Sex | Species
---|---|---|---|---|---|
0 | 39.1 | 18.7 | 181.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
1 | 39.5 | 17.4 | 186.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
2 | 40.3 | 18.0 | 195.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
3 | NaN | NaN | NaN | NaN | Adelie Penguin (Pygoscelis adeliae) |
4 | 36.7 | 19.3 | 193.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
... | ... | ... | ... | ... | ... |
339 | 55.8 | 19.8 | 207.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
340 | 43.5 | 18.1 | 202.0 | FEMALE | Chinstrap penguin (Pygoscelis antarctica) |
341 | 49.6 | 18.2 | 193.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
342 | 50.8 | 19.0 | 210.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
343 | 50.2 | 18.7 | 198.0 | FEMALE | Chinstrap penguin (Pygoscelis antarctica) |
344 rows × 5 columns
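Before plotting, it can also help to check the class balance and where values are missing; pandas makes both quick. This is an optional aside:

penguins["Species"].value_counts()  # how many penguins of each species?
penguins.isna().sum()               # count of missing values per column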
Data Visualization#
That’s much better. Now we can look at the data in more detail.
import seaborn as sns
pairplot_figure = sns.pairplot(penguins, hue="Species")
(Pair plot of the three numerical features, colored by species.)
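If you want to keep the figure, say for a report, the PairGrid object returned by sns.pairplot can save itself directly; the filename here is just an example:

pairplot_figure.savefig("penguin_pairplot.png", dpi=300)  # example filename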
Data cleaning#
Looks like we’re getting good separation between the species clusters.
That means we can do some cleaning and get ready to build a solid machine learning model.
penguins = penguins.dropna(axis='rows')  # drop every row that contains at least one missing value
penguins
 | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Sex | Species
---|---|---|---|---|---|
0 | 39.1 | 18.7 | 181.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
1 | 39.5 | 17.4 | 186.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
2 | 40.3 | 18.0 | 195.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
4 | 36.7 | 19.3 | 193.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
5 | 39.3 | 20.6 | 190.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
... | ... | ... | ... | ... | ... |
339 | 55.8 | 19.8 | 207.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
340 | 43.5 | 18.1 | 202.0 | FEMALE | Chinstrap penguin (Pygoscelis antarctica) |
341 | 49.6 | 18.2 | 193.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
342 | 50.8 | 19.0 | 210.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
343 | 50.2 | 18.7 | 198.0 | FEMALE | Chinstrap penguin (Pygoscelis antarctica) |
334 rows × 5 columns
DATA_CLEAN_FILEPATH = DATA_FOLDER / "penguins_clean.csv"
penguins.to_csv(DATA_CLEAN_FILEPATH, index=False)
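As an optional sanity check, we can read the file straight back and confirm that the round trip preserved the data’s shape:

assert pd.read_csv(DATA_CLEAN_FILEPATH).shape == penguins.shape  # same rows and columns after the round trip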
Not too bad: it looks like we lost ten rows (344 down to 334). That’s manageable; it’s a toy dataset after all.
So let’s build a small model to classify penguins!
Machine Learning#
First we need to split the data.
This way we can test whether our model learned general rules about our data or merely memorized the training examples. When a model fails to generalize beyond its training data, this is known as overfitting.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(penguins[features], penguins[target], train_size=.7)
X_train
 | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Sex
---|---|---|---|---|
252 | 48.5 | 15.0 | 219.0 | FEMALE |
51 | 40.1 | 18.9 | 188.0 | MALE |
87 | 36.9 | 18.6 | 189.0 | FEMALE |
165 | 48.4 | 14.6 | 213.0 | MALE |
315 | 53.5 | 19.9 | 205.0 | MALE |
... | ... | ... | ... | ... |
326 | 48.1 | 16.4 | 199.0 | FEMALE |
213 | 46.2 | 14.9 | 221.0 | MALE |
65 | 41.6 | 18.0 | 192.0 | MALE |
206 | 46.5 | 14.4 | 217.0 | FEMALE |
300 | 46.7 | 17.9 | 195.0 | FEMALE |
233 rows × 4 columns
y_train
 | Species
---|---|
252 | Gentoo penguin (Pygoscelis papua) |
51 | Adelie Penguin (Pygoscelis adeliae) |
87 | Adelie Penguin (Pygoscelis adeliae) |
165 | Gentoo penguin (Pygoscelis papua) |
315 | Chinstrap penguin (Pygoscelis antarctica) |
... | ... |
326 | Chinstrap penguin (Pygoscelis antarctica) |
213 | Gentoo penguin (Pygoscelis papua) |
65 | Adelie Penguin (Pygoscelis adeliae) |
206 | Gentoo penguin (Pygoscelis papua) |
300 | Chinstrap penguin (Pygoscelis antarctica) |
233 rows × 1 columns
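Note that train_test_split shuffles the data randomly, so the rows you get will differ from the ones shown above. If you want reproducible, class-balanced splits, a common variation is to pass random_state and stratify:

X_train, X_test, y_train, y_test = train_test_split(
    penguins[features], penguins[target],
    train_size=.7,
    random_state=42,               # fix the shuffle for reproducibility; 42 is an arbitrary choice
    stratify=penguins[target[0]],  # keep the species proportions equal in both splits
)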
Now we can build a machine learning model.
Here we’ll use the scikit-learn Pipeline. It makes it really easy for us to fit preprocessors and models on the training data alone, and then apply them cleanly to the test data set without leakage.
Pre-processing#
Statistically and scientifically valid results require proper treatment of our data. In particular, preprocessors must only ever be fit on the training data: if we fit them on the full dataset before splitting out a test set, information from the test set leaks into the model, and we can overfit without noticing.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
num_transformer = StandardScaler()
cat_transformer = OneHotEncoder(handle_unknown='ignore')
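To make “without leakage” concrete: the transformers learn their statistics (such as mean and standard deviation) from the training data only and then reuse them on the test data. Done by hand, that looks roughly like the sketch below; the pipeline we build in a moment automates this for us.

scaler = StandardScaler()
scaled_train = scaler.fit_transform(X_train[num_features])  # learn mean and std from the training data
scaled_test = scaler.transform(X_test[num_features])        # reuse them; never refit on the test data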
The ColumnTransformer is a neat tool that applies each preprocessing step to the right columns in your dataset. In fact, you could also use a Pipeline instead of “just” a StandardScaler to build more sophisticated preprocessing workflows that go beyond this toy project.
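As a sketch of what that could look like, a numeric transformer might chain an imputer with the scaler. We don’t need the imputer here, since we already dropped rows with missing values, so this snippet (with an illustrative variable name) is purely optional:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

num_transformer_with_imputer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),  # fill missing values with the column median
    ('scale', StandardScaler()),
])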
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features),
])
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC()),
])
model
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('num', StandardScaler(), ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)']), ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex'])])), ('classifier', SVC())])
We can see a nice model representation here.
In a notebook, you can click on the different modules to see which arguments were passed into the pipeline; in our case, for example, how unknown values are handled in the OneHotEncoder.
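The same information is also available programmatically. Nested parameters are addressed with double-underscore paths, a standard scikit-learn convention:

model.get_params()['classifier']                         # the SVC step
model.get_params()['preprocessor__cat__handle_unknown']  # returns 'ignore'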
Model Training#
Now it’s time to fit our support-vector machine to the training data.
model.fit(X_train, y_train[target[0]])
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('num', StandardScaler(), ['Culmen Length (mm)', 'Culmen Depth (mm)', 'Flipper Length (mm)']), ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex'])])), ('classifier', SVC())])
Below we can see that we get a decent score on the training data.
This metric only tells us how well the model performs on the data it has seen; it says nothing about generalization and actual “learning” yet.
model.score(X_train, y_train)
0.9957081545064378
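If you want a more robust performance estimate without touching the test set, cross-validation on the training data is a common option; this is an optional extra here:

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model, X_train, y_train[target[0]], cv=5)  # 5-fold cross-validation on the training split
print(cv_scores.mean(), cv_scores.std())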
To evaluate how well our model actually learned, we check it against the test data one final time.
It is important to do this only once: repeatedly evaluating on the test set and tuning the model to it amounts to fitting the test data, which invalidates scientific results. Only evaluate on the test set once!
model.score(X_test, y_test)
0.9900990099009901
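Accuracy is a single number. For a per-species breakdown of precision and recall, scikit-learn’s classification_report is a handy follow-up:

from sklearn.metrics import classification_report

print(classification_report(y_test[target[0]], model.predict(X_test)))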