Did It Actually Learn?


You trained a model last time — but is it any good? Here’s a trap: if you test it on the exact penguins it learned from, of course it does well. That’s like giving a student the test answers to study, then giving them the same test. Real testing uses questions the model hasn’t seen. Today you learn the honest way to measure a model: split your data, and check its accuracy.

💡 Work in Colab: scikit-learn and seaborn come preinstalled, so there is nothing to set up.

Split: study set and test set

The trick is to hold some examples back. The model learns from a training set and is tested on a separate test set it never saw. train_test_split does this for you:

import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins").dropna()
X = penguins[["bill_length_mm", "flipper_length_mm", "body_mass_g"]]
y = penguins["species"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

test_size=0.2 sets 20% of the penguins aside for testing and trains on the other 80%. random_state=0 fixes the shuffle, so you get the same split every time you run the cell. Now the test is fair: the model has never seen those penguins.
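To see what the split actually does, here is a minimal sketch on ten made-up items instead of the penguins. The names items and labels are invented for this example. It shows that the two piles have the sizes you asked for and share no items.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-in data (an assumption, not the penguins): the numbers 0 to 9
items = list(range(10))
labels = ["even" if n % 2 == 0 else "odd" for n in items]

# Same call as in the lesson: hold back 20%, use a fixed shuffle
X_train, X_test, y_train, y_test = train_test_split(
    items, labels, test_size=0.2, random_state=0
)

print("Training items:", len(X_train))  # 80% of 10
print("Test items:", len(X_test))       # 20% of 10

# No item appears in both piles
shared = set(X_train) & set(X_test)
print("Items in both sets:", shared)
```

Each item lands in exactly one pile, which is why the test set counts as unseen.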

Train, then score

Train on the training set, then ask for the accuracy on the test set:

model = KNeighborsClassifier()
model.fit(X_train, y_train)

print("Accuracy:", model.score(X_test, y_test))

.score(...) returns the fraction of the unseen penguins the model got right, a number between 0 and 1. A score of 0.95 means it correctly identified 95% of the test penguins, or 19 out of every 20. Run it and see your number.
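If you are curious what .score(...) is doing, the sketch below works out the same fraction by hand: predict, count the matches, divide by the number of test examples. It uses stand-in data from scikit-learn's make_blobs rather than the penguins, which is an assumption made so the example runs without downloading anything.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (an assumption, not the penguins): 150 points in 3 groups
X, y = make_blobs(n_samples=150, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = KNeighborsClassifier()
model.fit(X_train, y_train)

# Accuracy by hand: how many guesses match the true answers?
predictions = model.predict(X_test)
num_right = (predictions == y_test).sum()
by_hand = num_right / len(y_test)

print("By hand: ", by_hand)
print(".score():", model.score(X_test, y_test))
```

The two printed numbers should match, because for a classifier like this one .score(...) is just the fraction of correct guesses.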

What accuracy means

  • 1.0 = perfect (got every test example right).
  • 0.5 on a two-way choice = no better than a coin flip. With three penguin species, blind guessing lands around 0.33.
  • Higher is better, but perfect is rare and a little suspicious with real data.

The honest score comes from the test set. A model that aces its training examples but flops on new ones hasn’t really learned the pattern — it just memorized. Testing on unseen data catches that.
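Here is a rough sketch of memorizing in action. It assumes noisy stand-in data from scikit-learn's make_classification (not the penguins) and sets n_neighbors=1, so the model simply copies the label of the single closest training example. That setting is chosen here only to make the effect easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy stand-in data (an assumption): flip_y=0.2 scrambles about 20% of the labels
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# With 1 neighbor, each training point's closest match is itself
memorizer = KNeighborsClassifier(n_neighbors=1)
memorizer.fit(X_train, y_train)

train_score = memorizer.score(X_train, y_train)  # scored on data it studied
test_score = memorizer.score(X_test, y_test)     # scored on unseen data

print("Training accuracy:", train_score)
print("Test accuracy:    ", test_score)
```

The training accuracy comes out perfect because the model is looking up answers it already stored. The test accuracy is the honest number, and on noisy data like this it will usually be noticeably lower.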

Try it 🎯

  1. Change test_size=0.2 to 0.3 (hold back 30%). Does the accuracy change much?
  2. Add a fourth feature — include "bill_depth_mm" in X. Retrain and check the accuracy.

Think about it 🔮

Why is it unfair to test the model on the same penguins it trained on? (Because it could just memorize them and look perfect, without learning a pattern that works on new penguins. The test set checks real understanding, not memory.)

Fix the bug 🐞

This trains and tests, but on the same data — so the score is misleadingly high. Fix it to test on the held-out set:

model.fit(X_train, y_train)
print("Accuracy:", model.score(X_train, y_train))

(It’s scoring on X_train — the data it studied. Score on the unseen test set instead: model.score(X_test, y_test).)

Your mission 🚀

Train the penguin model with a train/test split and print its accuracy. Then try to improve the score: experiment with different features in X (add or remove measurements) and a different test_size. Note which combination gives the best honest accuracy.

What you learned today

  • Never test a model on the data it trained on — that’s cheating.
  • train_test_split holds back a test set the model never sees.
  • model.score(X_test, y_test) gives accuracy — the fraction right on unseen data.
  • Higher is better; suspiciously perfect usually means something’s off.

Next time, you'll build a complete “guesser” from scratch: train, test, and try it on your own examples. 🔮