Train Your First Model


Last time you learned the big idea: show a computer labeled examples and it finds the pattern. Today you actually do it. Using the penguins from Phase III, you’ll train a real model to guess a penguin’s species from its measurements, in about three lines of code. The tool is scikit-learn, one of the most popular machine-learning libraries, and it comes preinstalled in Colab.

💡 In Colab. (scikit-learn is already installed — no pip needed.)

Set up the examples

Remember features (the clues) and label (the answer)? For penguins, the features are measurements and the label is the species:

import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier

penguins = sns.load_dataset("penguins").dropna()

X = penguins[["bill_length_mm", "flipper_length_mm", "body_mass_g"]]   # features
y = penguins["species"]                                                # label

By tradition, the features are called X (a table of clues) and the label is called y (the answers). Each row of X is one penguin’s measurements; the matching y is that penguin’s species.
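To see how X and y line up, here is a small sketch with a tiny hand-typed table. The numbers are made up for illustration (they are not rows from the real dataset), and the names toy, toy_X, and toy_y are just placeholders:

```python
import pandas as pd

# Made-up measurements for illustration only (not real dataset rows).
toy = pd.DataFrame({
    "bill_length_mm": [39.0, 46.5, 50.0],
    "flipper_length_mm": [181, 217, 196],
    "body_mass_g": [3750, 5200, 3700],
    "species": ["Adelie", "Gentoo", "Chinstrap"],
})

toy_X = toy[["bill_length_mm", "flipper_length_mm", "body_mass_g"]]  # features: a table
toy_y = toy["species"]                                               # label: one column

print(toy_X.shape)  # (3, 3): three penguins, three clues each
print(len(toy_y))   # 3: one answer per penguin
```

The row count of X always matches the length of y, because every penguin needs both its clues and its answer.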

Train it: .fit()

This is the moment of learning. Create a model and call .fit(X, y) — “learn the pattern from these examples”:

model = KNeighborsClassifier()
model.fit(X, y)

That’s it. The model just studied hundreds of penguins and learned how measurements relate to species. fit is the learning step — the same “learn from examples” idea, now real code.

Use it: .predict()

Now give it a new penguin’s measurements and ask what species it is:

guess = model.predict([[45, 210, 4500]])
print("I think this penguin is a:", guess[0])

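You handed it a bill length of 45 mm, a flipper length of 210 mm, and a body mass of 4500 g. These are measurements for a penguin it had never seen, and it predicted a species from the pattern it learned. You trained an AI! If Colab shows a warning about feature names, that’s okay: it appears because you trained on a table with column names and predicted from a plain list. The prediction still works.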
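predict can also take several penguins at once, and it accepts a table with the same column names used in training. The sketch below is self-contained, so it trains on six made-up penguins rather than the real dataset. The names toy_model, train, and new_penguins are placeholders, and n_neighbors=3 is chosen only because the toy table is so small:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

columns = ["bill_length_mm", "flipper_length_mm", "body_mass_g"]

# Made-up training penguins, for illustration only.
train = pd.DataFrame(
    [
        [38.0, 185, 3600, "Adelie"],
        [39.5, 190, 3800, "Adelie"],
        [37.0, 182, 3500, "Adelie"],
        [47.0, 215, 5100, "Gentoo"],
        [48.5, 220, 5400, "Gentoo"],
        [46.0, 212, 4900, "Gentoo"],
    ],
    columns=columns + ["species"],
)

toy_model = KNeighborsClassifier(n_neighbors=3)  # small k for a tiny table
toy_model.fit(train[columns], train["species"])

# Two new penguins in one table, with the same column names as in training.
new_penguins = pd.DataFrame(
    [[38.5, 186, 3650], [47.5, 218, 5200]],
    columns=columns,
)

guesses = toy_model.predict(new_penguins)  # one guess per row
for mass, guess in zip(new_penguins["body_mass_g"], guesses):
    print(mass, "g ->", guess)
```

You get back one guess per row, in the same order as the rows you passed in.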

Try it 🎯

  1. Change the three numbers and predict again. Tiny penguin? Big one?
  2. Look at a real row (penguins.head()), copy its three measurements into predict, and check if the model gets that species right.

How does KNeighbors decide?

The model you used, KNeighborsClassifier, has a wonderfully simple idea: to label a new penguin, it finds the most similar penguins it already knows (its “neighbors”, 5 of them by default) and goes with the majority. “This new one is closest to a bunch of Gentoos, so… Gentoo.” Similarity, not magic.
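The neighbor idea is small enough to sketch by hand in plain Python. This is a simplified illustration, not scikit-learn’s actual code: the function name nearest_neighbor_vote and the six known penguins are made up, and “similar” here means straight-line distance between the three numbers:

```python
import math
from collections import Counter

# Made-up penguins the "model" already knows: (measurements, species).
known = [
    ((38.0, 185, 3600), "Adelie"),
    ((39.5, 190, 3800), "Adelie"),
    ((37.0, 182, 3500), "Adelie"),
    ((47.0, 215, 5100), "Gentoo"),
    ((48.5, 220, 5400), "Gentoo"),
    ((46.0, 212, 4900), "Gentoo"),
]

def nearest_neighbor_vote(new_penguin, known_penguins, k=3):
    """Return the most common species among the k closest known penguins."""
    # Sort known penguins from most similar (smallest distance) to least.
    by_distance = sorted(
        known_penguins,
        key=lambda item: math.dist(new_penguin, item[0]),
    )
    # Keep the labels of the k nearest and take a majority vote.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

print(nearest_neighbor_vote((47.5, 218, 5200), known))  # Gentoo
```

Notice there is no complicated formula to learn: the “training” is just remembering the examples, and the work happens when you ask for a guess.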

Think about it 🔮

You trained on bill_length_mm, flipper_length_mm, and body_mass_g. If you gave the model a penguin’s island instead of measurements, could it predict species? (No — it only learned from those three number features. A model can only use the kinds of clues it was trained on.)

Fix the bug 🐞

This call crashes with a ValueError because the new penguin is given as a flat list instead of a list of lists (the model expects a table of penguins, even if the table holds just one):

model.predict([45, 210, 4500])

(Wrap it in another set of brackets — one row inside a table: model.predict([[45, 210, 4500]]).)
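One way to see the difference between the two versions is to look at their shapes with NumPy. This is only an illustration; the names flat and table are placeholders:

```python
import numpy as np

flat = np.array([45, 210, 4500])      # a flat list: just three numbers
table = np.array([[45, 210, 4500]])   # a table with one row and three columns

print(flat.shape)                 # (3,)
print(table.shape)                # (1, 3)

# reshape(1, -1) turns the flat list into a one-row table.
print(flat.reshape(1, -1).shape)  # (1, 3)
```

The model wants the (rows, columns) shape, so a single penguin is a table with 1 row and 3 columns.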

Your mission 🚀

Train the penguin classifier, then test it on three made-up penguins (three different sets of measurements). Use print to show each prediction. Then grab a real penguin from penguins.head() and check whether the model labels it correctly.

What you learned today

  • scikit-learn trains machine-learning models; it comes preinstalled in Colab.
  • Features go in X, labels go in y.
  • model.fit(X, y) is the learning step; model.predict([[...]]) makes a guess on new data.
  • KNeighborsClassifier decides by finding the most similar examples it knows.

You trained a model — but is it any good? Next time we test it honestly and measure its accuracy. 🎯