Train Your First Model
Last time you learned the big idea: show a computer labeled examples and it finds the pattern. Today you actually do it. Using the penguins from Phase III, you’ll train a real model to guess a penguin’s species from its measurements, and the core of it takes about three lines. The tool is scikit-learn, one of the most popular machine-learning libraries, and it comes pre-installed in Colab.
💡 In Colab. (scikit-learn is already installed, so no pip needed.)
Set up the examples
Remember features (the clues) and label (the answer)? For penguins, the features are measurements and the label is the species:
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
penguins = sns.load_dataset("penguins").dropna()
X = penguins[["bill_length_mm", "flipper_length_mm", "body_mass_g"]] # features
y = penguins["species"] # label
By tradition, the features are called X (a table of clues) and the label is called y (the answers). Each row of X is one penguin’s measurements; the matching y is that penguin’s species.
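To see what "a table of clues" and "a column of answers" look like, here is a small sketch. The three penguins below are made-up numbers for illustration (not rows from the real dataset), and the names toy, X_toy, and y_toy are invented for this example.

```python
import pandas as pd

# Made-up measurements for illustration only (not real dataset rows).
toy = pd.DataFrame({
    "bill_length_mm": [39.0, 46.5, 49.0],
    "flipper_length_mm": [181, 217, 195],
    "body_mass_g": [3750, 5200, 3800],
    "species": ["Adelie", "Gentoo", "Chinstrap"],
})

# Features: a table with one row per penguin and one column per clue.
X_toy = toy[["bill_length_mm", "flipper_length_mm", "body_mass_g"]]
# Label: a single column with one answer per penguin.
y_toy = toy["species"]

print(X_toy.shape)  # (rows, columns) of the feature table
print(y_toy.shape)  # one entry per penguin
```

Row 0 of X_toy lines up with entry 0 of y_toy, and so on. That row-by-row matching is how the model knows which answer goes with which measurements.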
Train it: .fit()
This is the moment of learning. Create a model and call .fit(X, y) — “learn the pattern from these examples”:
model = KNeighborsClassifier()
model.fit(X, y)
That’s it. The model just studied hundreds of penguins and learned how measurements relate to species. fit is the learning step — the same “learn from examples” idea, now real code.
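If you are curious what the model holds onto after fit, you can peek at a few of its attributes. This is a sketch on six made-up penguins (invented numbers, two species only), and n_neighbors=3 is chosen here just because the toy table is so small.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up rows: [bill_length_mm, flipper_length_mm, body_mass_g]
X_toy = [
    [38.0, 185, 3600], [39.5, 190, 3800], [40.0, 188, 3700],  # labeled Adelie
    [46.0, 215, 5000], [47.5, 220, 5300], [48.0, 218, 5200],  # labeled Gentoo
]
y_toy = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"]

toy_model = KNeighborsClassifier(n_neighbors=3)  # assumption: 3 neighbors for a tiny table
toy_model.fit(X_toy, y_toy)

print(toy_model.classes_)         # the species names it saw in y
print(toy_model.n_features_in_)   # how many clues each example had
print(toy_model.n_samples_fit_)   # how many examples it studied
```

For this kind of model, "learning" is roughly "remember the examples in an organized way", which is why the number of examples shows up as an attribute.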
Use it: .predict()
Now give it a new penguin’s measurements and ask what species it is:
guess = model.predict([[45, 210, 4500]])
print("I think this penguin is a:", guess[0])
You handed it a bill length of 45mm, a flipper of 210mm, and a mass of 4500g — measurements it had never seen — and it predicted a species from the pattern it learned. You trained an AI!
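predict can also take several penguins at once: one row per penguin in, one guess per row out. The sketch below uses made-up training rows and invented names (cols, toy_model, new_penguins). It passes the new penguins as a table with the same column names used for training; in recent scikit-learn versions, giving a plain list to a model that was trained on a named table may print a warning about feature names, though the prediction still works.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

cols = ["bill_length_mm", "flipper_length_mm", "body_mass_g"]

# Made-up training rows for illustration only.
train = pd.DataFrame(
    [[38.0, 185, 3600], [39.5, 190, 3800], [40.0, 188, 3700],
     [46.0, 215, 5000], [47.5, 220, 5300], [48.0, 218, 5200]],
    columns=cols,
)
labels = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"]

toy_model = KNeighborsClassifier(n_neighbors=3).fit(train, labels)

# Two new penguins, given as a table with matching column names.
new_penguins = pd.DataFrame([[39.0, 187, 3650], [47.0, 217, 5100]], columns=cols)
guesses = toy_model.predict(new_penguins)

for guess in guesses:
    print("I think this penguin is a:", guess)
```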
Try it 🎯
- Change the three numbers and predict again. Tiny penguin? Big one?
- Look at a real row (penguins.head()), copy its three measurements into predict, and check if the model gets that species right.
How does KNeighbors decide?
The model you used, KNeighborsClassifier, has a wonderfully simple idea: to label a new penguin, it finds the most similar penguins it already knows (its “neighbors”) and goes with the majority. “This new one is closest to a bunch of Gentoos, so… Gentoo.” Similarity, not magic.
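The neighbor idea is simple enough to sketch by hand in plain Python. This is a simplified illustration, not scikit-learn's actual code: the function knn_guess and the made-up penguins are invented here, it uses straight-line distance, and it looks at 3 neighbors (KNeighborsClassifier looks at 5 unless you tell it otherwise).

```python
from collections import Counter
import math

def knn_guess(known, new_point, k=3):
    """Guess a label by majority vote among the k closest known examples."""
    # Sort the known penguins from most similar (smallest distance) to least.
    by_distance = sorted(known, key=lambda item: math.dist(item[0], new_point))
    # Keep only the labels of the k nearest ones.
    nearest_labels = [species for _, species in by_distance[:k]]
    # The most common label among those neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up examples: ([bill_length_mm, flipper_length_mm, body_mass_g], species)
known = [
    ([38.0, 185, 3600], "Adelie"), ([39.5, 190, 3800], "Adelie"), ([40.0, 188, 3700], "Adelie"),
    ([46.0, 215, 5000], "Gentoo"), ([47.5, 220, 5300], "Gentoo"), ([48.0, 218, 5200], "Gentoo"),
]

print(knn_guess(known, [47.0, 217, 5100]))
print(knn_guess(known, [39.0, 187, 3650]))
```

The first new penguin sits closest to the three made-up Gentoos and the second to the three made-up Adelies, so those are the labels that win the vote.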
Think about it 🔮
You trained on bill_length_mm, flipper_length_mm, and body_mass_g. If you gave the model a penguin’s island instead of measurements, could it predict species? (No — it only learned from those three number features. A model can only use the kinds of clues it was trained on.)
Fix the bug 🐞
This trains a model but crashes on predict, because the new penguin is given as a flat list instead of a list-of-lists (the model expects a table of penguins, even if it’s just one):
model.predict([45, 210, 4500])
(Wrap it in another set of brackets — one row inside a table: model.predict([[45, 210, 4500]]).)
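To see the difference between the two shapes, you can turn each one into a NumPy array and look at its shape. The sketch below uses made-up training rows and invented names (flat, table, toy_model); it also catches the crash so you can see that the flat list triggers a ValueError while the wrapped version works.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training rows for illustration only.
X_toy = [
    [38.0, 185, 3600], [39.5, 190, 3800], [40.0, 188, 3700],
    [46.0, 215, 5000], [47.5, 220, 5300], [48.0, 218, 5200],
]
y_toy = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"]
toy_model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

flat = [45, 210, 4500]   # just three numbers in a row
table = [flat]           # a table that happens to have one row

print(np.array(flat).shape)   # one dimension: a plain list
print(np.array(table).shape)  # two dimensions: rows and columns

try:
    toy_model.predict(flat)   # the buggy call
    crashed = False
except ValueError as err:
    crashed = True
    print("Crashed with:", type(err).__name__)

print("Fixed call predicts:", toy_model.predict(table)[0])
```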
Your mission 🚀
Train the penguin classifier, then test it on three made-up penguins (three different sets of measurements). print each prediction. Then grab a real penguin from penguins.head() and confirm the model labels it correctly.
What you learned today
- scikit-learn trains machine-learning models; it comes pre-installed in Colab.
- Features go in X, labels go in y.
- model.fit(X, y) is the learning step; model.predict([[...]]) makes a guess on new data.
- KNeighborsClassifier decides by finding the most similar examples it knows.
You trained a model — but is it any good? Next time we test it honestly and measure its accuracy. 🎯