Counting and Grouping
Now for the questions data is really good at: How many of each kind are there? Which group is biggest, heaviest, fastest? Today you learn the two tools that answer them: value_counts and groupby. This is the heart of what data scientists do.
💡 Load the data first:
import pandas as pd
import seaborn as sns
penguins = sns.load_dataset("penguins").dropna()
How many of each? value_counts
.value_counts() counts how many times each value appears in a column:
penguins["species"].value_counts()
In one line, you learn how many Adelie, Gentoo, and Chinstrap penguins are in the data. Try it on another category:
penguins["island"].value_counts()
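To see exactly what value_counts gives back, here is a small sketch on a hand-made table. The five rows and the name toy are made up for illustration (they are not the real penguins data), so you can check every count by eye.

```python
import pandas as pd

# A tiny made-up table (NOT the real penguins data) that is easy to count by eye.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream", "Biscoe"],
})

# value_counts returns a Series: each distinct value is a label, its count is the value.
species_counts = toy["species"].value_counts()
print(species_counts)

# Counts are sorted from most to least common, so the first label is the biggest group.
print(species_counts.index[0])   # Adelie
print(species_counts["Adelie"])  # 3
```

Because the result is sorted, the top row always answers "which value is most common?".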
Try it 🎯
Count how many penguins of each sex there are.
Average per group: groupby
Here’s the powerful one. groupby splits the data into groups, then computes something for each group. For example, it can answer “What’s the average body mass for each species?”
penguins.groupby("species")["body_mass_g"].mean()
Read it left to right: take penguins, group by species, look at the body_mass_g column, and find the mean (average) of each group. The result is one number per species, so you can instantly see which species is heaviest on average.
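If the one-liner feels like a lot at once, the same left-to-right reading can be sketched one step at a time. The table below is made up for illustration (not the real penguins data), and the variable names grouped, masses, and means are just labels chosen here. The last line uses idxmax, an extra pandas method that is not part of this lesson, to pick out the label with the highest value.

```python
import pandas as pd

# Made-up numbers for illustration only (NOT the real penguins data).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3000, 4000, 5000, 6000],
})

grouped = toy.groupby("species")  # 1. split the rows into one group per species
masses = grouped["body_mass_g"]   # 2. look only at the body_mass_g column
means = masses.mean()             # 3. compute the mean of each group
print(means)

# The chained one-liner gives exactly the same result.
one_liner = toy.groupby("species")["body_mass_g"].mean()
print(means.equals(one_liner))    # True

# idxmax returns the group label with the highest mean.
print(means.idxmax())             # Gentoo
```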
Try it 🎯
- Average flipper_length_mm per species.
- Average bill_length_mm per island.
Other things per group
You can ask for more than the average. Swap .mean() for:
- .max(): the biggest in each group
- .min(): the smallest
- .count(): how many in each group
penguins.groupby("species")["body_mass_g"].max()
penguins.groupby("island")["species"].count()
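Here is a small sketch of the three swaps side by side. The numbers are made up for illustration (not the real penguins data), and by_species is just a name chosen here so the grouping is only written once.

```python
import pandas as pd

# Made-up numbers for illustration only (NOT the real penguins data).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3000, 4000, 5000, 6000, 5500],
})

# Group once, then ask several different questions of the same groups.
by_species = toy.groupby("species")["body_mass_g"]

biggest = by_species.max()     # Adelie 4000, Gentoo 6000
smallest = by_species.min()    # Adelie 3000, Gentoo 5000
how_many = by_species.count()  # Adelie 2, Gentoo 3

print(biggest)
print(smallest)
print(how_many)
```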
Predict it 🔮
Gentoo penguins are the largest species. So in penguins.groupby("species")["body_mass_g"].mean(), which species do you expect to have the highest number? Run it and check. (Gentoo, by a lot. Their average body mass is well above the other two.)
Fix the bug 🐞
This is meant to find the average body mass per species, but in recent pandas versions (2.0 and later) it raises an error. The column to average is missing, so you have to say which one:
penguins.groupby("species").mean()
(Tell it which column to average: penguins.groupby("species")["body_mass_g"].mean(). Without picking a column, pandas tries to average every column, including text columns such as island and sex, and that is what fails.)
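As a sketch, here are two ways to avoid the problem on a small made-up table (not the real penguins data) that has a text column. The first is the fix from this lesson. The second uses numeric_only=True, an extra pandas option not covered in this lesson, which averages every number column and skips the text ones.

```python
import pandas as pd

# Made-up table for illustration only (NOT the real penguins data).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Dream", "Biscoe", "Biscoe", "Biscoe"],  # a text column
    "body_mass_g": [3000, 4000, 5000, 6000],
    "flipper_length_mm": [180, 190, 210, 220],
})

# Option 1 (the fix from the lesson): pick the column before averaging.
one_column = toy.groupby("species")["body_mass_g"].mean()
print(one_column)

# Option 2: average every numeric column and leave the text columns out.
all_numeric = toy.groupby("species").mean(numeric_only=True)
print(list(all_numeric.columns))  # ['body_mass_g', 'flipper_length_mm']
```

Option 1 is usually clearer, because the code says exactly which question you are asking.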
Your mission 🚀
Investigate the penguins. In separate cells: (1) count how many penguins live on each island, (2) find the average flipper length per species, and (3) find the heaviest penguin in each species using .max(). Then write a sentence (as a print) saying which species is the heaviest on average.
What you learned today
- .value_counts() counts how many of each value are in a column.
- groupby("col")["other"].mean() computes a value per group. It is the analysis workhorse.
- Swap .mean() for .max(), .min(), or .count() to ask different questions.
- These few lines answer questions that would take a long loop to do by hand.
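To get a feel for the loop that groupby saves you from, here is a rough sketch that computes the average per species both ways. The table is made up for illustration (not the real penguins data), and the names totals, counts, and by_hand are chosen here.

```python
import pandas as pd

# Made-up numbers for illustration only (NOT the real penguins data).
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Gentoo"],
    "body_mass_g": [3000, 5000, 4000, 6000],
})

# By hand: keep a running total and a row count for every species.
totals = {}
counts = {}
for species, mass in zip(toy["species"].tolist(), toy["body_mass_g"].tolist()):
    totals[species] = totals.get(species, 0) + mass
    counts[species] = counts.get(species, 0) + 1
by_hand = {species: totals[species] / counts[species] for species in totals}

# With pandas: one line does the splitting, adding, and dividing.
with_pandas = toy.groupby("species")["body_mass_g"].mean()

# Both approaches agree.
print(by_hand == with_pandas.to_dict())  # True
```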
You can find answers now. Next time, we start turning them into pictures — your first chart. 📈