{ "cells": [ { "cell_type": "markdown", "id": "2700d882", "metadata": {}, "source": [ "# Getting to know the data\n", "\n", "Understanding the data is a foundational step in any machine learning project and is crucial for achieving reproducibility in research.\n", "\n", "Getting to know the data matters chiefly because it directly affects the reliability and generalizability of our machine learning models. By exploring the data, we can identify potential biases, anomalies, or inconsistencies that may affect model performance.\n", "\n", "This understanding enables us to apply appropriate preprocessing techniques, such as [handling missing values](#data-cleaning), [addressing class imbalances](/notebooks/1-model-evaluation.html#stratification), or [feature normalisation](#pre-processing), to ensure that the model learns meaningful patterns from the data.\n", "\n", "Moreover, insight into the data can guide us as researchers in selecting the most appropriate algorithms and architectures for the task. Understanding the distributions and relationships within the data allows us to choose models that are well suited to capturing complex patterns and making accurate predictions.\n", "\n", "This tutorial uses the [Palmer Penguins dataset](https://allisonhorst.github.io/palmerpenguins/). Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pallter.marine.rutgers.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/).\n", "\n", "Let's dive into some quick exploration of the data!" 
] }, { "cell_type": "code", "execution_count": 2, "id": "54158e1d", "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:07.155662Z", "iopub.status.busy": "2022-12-13T01:42:07.155163Z", "iopub.status.idle": "2022-12-13T01:42:07.166664Z", "shell.execute_reply": "2022-12-13T01:42:07.166664Z" } }, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "DATA_FOLDER = Path(\"..\", \"..\") / \"data\"\n", "DATA_FILEPATH = DATA_FOLDER / \"penguins.csv\"" ] }, { "cell_type": "markdown", "id": "5c8163eb", "metadata": {}, "source": [ "We'll use the `pandas` library to load and pre-process the data. It provides many convenience functions, for example for loading CSVs or dropping columns." ] }, { "cell_type": "code", "execution_count": 3, "id": "36b24fd4", "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:07.169664Z", "iopub.status.busy": "2022-12-13T01:42:07.169165Z", "iopub.status.idle": "2022-12-13T01:42:07.570293Z", "shell.execute_reply": "2022-12-13T01:42:07.569793Z" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 4, "id": "01e133b7", "metadata": { "execution": { "iopub.execute_input": "2022-12-13T01:42:07.573794Z", "iopub.status.busy": "2022-12-13T01:42:07.573293Z", "iopub.status.idle": "2022-12-13T01:42:07.601299Z", "shell.execute_reply": "2022-12-13T01:42:07.600798Z" }, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 2007-11-11 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
| 1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 2007-11-11 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
| 2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 2007-11-16 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
| 3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 2007-11-16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. |
| 4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 2007-11-16 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN |
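The narrower table that follows keeps only five of the columns from the preview above. A minimal sketch of how such a subset might be taken with `pandas`, assuming the loaded frame is called `penguins` and that the columns are split into `features` and `target` lists (all three names are illustrative, and the small frame below merely stands in for the CSV):

```python
import pandas as pd

# Stand-in for the loaded CSV: three rows copied from the preview above,
# restricted to a handful of its columns. The variable name is illustrative.
penguins = pd.DataFrame(
    {
        "studyName": ["PAL0708", "PAL0708", "PAL0708"],
        "Species": ["Adelie Penguin (Pygoscelis adeliae)"] * 3,
        "Culmen Length (mm)": [39.1, 39.5, 40.3],
        "Culmen Depth (mm)": [18.7, 17.4, 18.0],
        "Flipper Length (mm)": [181.0, 186.0, 195.0],
        "Body Mass (g)": [3750.0, 3800.0, 3250.0],
        "Sex": ["MALE", "FEMALE", "FEMALE"],
    }
)

# Hypothetical split into model inputs and the label to predict.
features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)", "Sex"]
target = ["Species"]

# Indexing with a list of labels returns a new frame containing only
# those columns, in the order given.
subset = penguins[features + target]
print(subset.columns.tolist())
```

Selecting by an explicit list of column names keeps the choice of inputs visible in the notebook, which helps when someone later tries to reproduce the analysis.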
| | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Sex | Species |
|---|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| 1 | 39.5 | 17.4 | 186.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 2 | 40.3 | 18.0 | 195.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 3 | NaN | NaN | NaN | NaN | Adelie Penguin (Pygoscelis adeliae) |
| 4 | 36.7 | 19.3 | 193.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| ... | ... | ... | ... | ... | ... |
| 339 | 55.8 | 19.8 | 207.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
| 340 | 43.5 | 18.1 | 202.0 | FEMALE | Chinstrap penguin (Pygoscelis antarctica) |
| 341 | 49.6 | 18.2 | 193.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
| 342 | 50.8 | 19.0 | 210.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
| 343 | 50.2 | 18.7 | 198.0 | FEMALE | Chinstrap penguin (Pygoscelis antarctica) |

344 rows × 5 columns
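Row 3 above contains only `NaN` measurements, and the next table has 334 rows with that row missing. This is consistent with dropping every row that has a missing value. A minimal sketch of that step, assuming `DataFrame.dropna` is used (the notebook's own cleaning cell is not shown here, and the names `subset` and `cleaned` are illustrative):

```python
import pandas as pd

# The first five rows of the table above; row 3 has no measurements.
subset = pd.DataFrame(
    {
        "Culmen Length (mm)": [39.1, 39.5, 40.3, float("nan"), 36.7],
        "Culmen Depth (mm)": [18.7, 17.4, 18.0, float("nan"), 19.3],
        "Flipper Length (mm)": [181.0, 186.0, 195.0, float("nan"), 193.0],
        "Sex": ["MALE", "FEMALE", "FEMALE", None, "FEMALE"],
        "Species": ["Adelie Penguin (Pygoscelis adeliae)"] * 5,
    }
)

# Drop every row that contains at least one missing value.
# The surviving rows keep their original index labels.
cleaned = subset.dropna(axis="index")

print(cleaned.index.tolist())
print(len(subset), "->", len(cleaned))
```

Because `dropna` keeps the original index labels, the cleaned table jumps from row 2 straight to row 4. Calling `reset_index(drop=True)` afterwards would renumber the rows if a contiguous index is preferred.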
| | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Sex | Species |
|---|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| 1 | 39.5 | 17.4 | 186.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 2 | 40.3 | 18.0 | 195.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 4 | 36.7 | 19.3 | 193.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae) |
| 5 | 39.3 | 20.6 | 190.0 | MALE | Adelie Penguin (Pygoscelis adeliae) |
| ... | ... | ... | ... | ... | ... |
| 339 | 55.8 | 19.8 | 207.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
| 340 | 43.5 | 18.1 | 202.0 | FEMALE | Chinstrap penguin (Pygoscelis antarctica) |
| 341 | 49.6 | 18.2 | 193.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
| 342 | 50.8 | 19.0 | 210.0 | MALE | Chinstrap penguin (Pygoscelis antarctica) |
| 343 | 50.2 | 18.7 | 198.0 | FEMALE | Chinstrap penguin (Pygoscelis antarctica) |

334 rows × 5 columns
\n", "