AI Metrics, visually!

The metrics you meet when finetuning a model — grouped by when you use them, made playful and interactive.

🗺️ The map. Finetuning has three moments, each with its own metrics:
① While training → loss & perplexity (is it learning? is it overfitting?)
② Judging labels → accuracy, precision, recall, F1 (did it classify right?)
③ Judging generated text → ROUGE, BLEU, BERTScore (did it write well?)
The sections below follow that order. Everything ties back to the fishing net.

The one mental model

🎣 Your model is a fishing net

Every metric here is just an anxious question someone asks about that net. Hold this picture and the rest follows.

Precision

"Everything I caught — is it actually fish, or did I also pull up boots?"

Of what I flagged positive, how much was right? Boots in the net (🥾) = false positives.

Recall

"All the fish in the lake — did I actually catch them, or did some swim through?"

Of what was really there, how much did I get? Escaped fish = false negatives.

Why they pull against each other: want perfect recall? Cover the whole lake with your net — you'll get every fish, but also every boot (so precision drops). Want perfect precision? Take only the one fish you're 100% sure of — definitely a fish, but you missed 999 others (so recall drops). F1 exists to stop both shortcuts.

① While training · the overfitting alarm

📉 Loss, validation loss & perplexity

Before judging quality, you watch whether the model is learning at all. During finetuning your dashboard shows loss dropping over time. The single most important habit for a beginner: watch the validation curve, not the training curve.

Interactive chart: drag through training epochs to watch three readouts.
• Train loss: model fit to training data
• Val loss: fit to unseen data, the real signal
• Perplexity: exp(val loss)

What is perplexity?

perplexity = e^loss. Read it as: "how many equally-likely words is the model choosing between at each step?" Perplexity 1 = perfectly certain & correct. Perplexity 50 = as unsure as picking from 50 options. Lower is better.

How to spot overfitting

Train loss usually keeps dropping, because the model can memorize its training data. When val loss turns back upward while train loss is still falling, the model is memorizing, not learning. That widening gap is your signal to stop early.
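One way to turn that signal into a rule is a "patience" check on the validation curve. The sketch below is only an illustration: the function name `early_stop_epoch`, the `patience` parameter, and the loss values are all made up for this example, not taken from any particular training library.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) under a simple patience rule.

    best_epoch: index of the lowest validation loss seen so far.
    stop_epoch: epoch at which training would halt, or None if it never triggers.
    """
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < val_losses[best_epoch]:
            best_epoch = epoch  # new best: keep training
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, None


# Made-up curves: train loss keeps falling, val loss turns upward after epoch 3.
train_losses = [2.10, 1.60, 1.20, 0.90, 0.70, 0.50, 0.35]
val_losses = [2.20, 1.80, 1.50, 1.40, 1.45, 1.55, 1.70]

best, stop = early_stop_epoch(val_losses, patience=2)
print(f"best checkpoint: epoch {best}, stop training at epoch {stop}")
```

In practice you would keep the checkpoint from the best epoch rather than the last one, since the later epochs are the memorizing ones.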

① While training · the headline number for language models

🎲 Perplexity & cross-entropy, in plain terms

🌤️ Think of a weather forecaster. Every day they predict tomorrow. A language model does the same — but predicts the next word.

Perplexity = how many options is the forecaster still guessing among?
• "100% sure it'll rain" → and it rains → only 1 option in play. Perfect. Perplexity = 1.
• "Could be sunny, rainy, cloudy, or snowy — no idea" → 4 options in play. Perplexity = 4.
• Guessing blindly → thousands of options in play. Perplexity = huge.
Fewer options = more confident and correct, so lower is better. The purple "doors" further down literally draw these options.

😱 And cross-entropy loss? It's just a "surprise meter." The model bets a probability on each word; then the true word is revealed:
• Bet 90% on the right word → "I expected that" → tiny surprise.
• Bet 50% → "fair enough" → medium surprise.
• Bet 1% → "wait, WHAT?!" → huge surprise.
Cross-entropy loss = the model's average surprise across all the words. Low surprise = good. That's the entire idea — the rest is just the math for "surprise."
Connection: perplexity = e^(loss) — loss is the raw training signal; perplexity is that same thing translated into "number of choices" you can picture. Like °C vs an obscure unit: same temperature, one you can feel.
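The surprise meter fits in a few lines. This is a minimal sketch using the three bets from the list above as the probabilities given to the true word; the function names are illustrative.

```python
import math


def surprise(p_true_word):
    """Surprise in nats for one prediction: -ln(probability given to the true word)."""
    return -math.log(p_true_word)


def cross_entropy(probs):
    """Average surprise across a sequence of predictions."""
    return sum(surprise(p) for p in probs) / len(probs)


def perplexity(probs):
    """Perplexity is e raised to the average surprise."""
    return math.exp(cross_entropy(probs))


bets = [0.90, 0.50, 0.01]  # probability the model put on each true word
for p in bets:
    print(f"bet {p:.0%} on the true word -> surprise {surprise(p):.2f} nats")
print(f"cross-entropy: {cross_entropy(bets):.2f}  perplexity: {perplexity(bets):.2f}")
```

Notice how the single 1% bet dominates the average: one confident miss costs far more than several good guesses earn back.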

Now the same idea with real numbers. A language model's job: given the words so far, predict the next one — a probability for every word in its vocabulary. Perplexity asks:

"On average, how many words is the model unsure between at each step?"

the cat sat on the → next word is mat

The true next word is mat. Drag how much probability the model gave it, P("mat"). The remaining probability is spread over other words, and the widget draws the model's guess distribution alongside three readouts:
• Probability of correct word: p
• Surprise (loss) = −ln(p)
• Perplexity = e^surprise

Why exponentiate the loss?

Training shows cross-entropy loss = average surprise, in abstract units (nats). perplexity = e^loss converts that surprise back into a tangible count of choices. Loss 0 → ppl 1. Loss 2.3 → ppl ≈ 10. Loss 4.6 → ppl ≈ 100. Same info, friendlier units.

What's a "good" number?

It depends on the vocabulary, the tokenizer, and the task, so it's only meaningful relative to a baseline measured the same way. As a rough guide, a strong modern LLM on general English sits around perplexity 3–15. Uniform random guessing across a 50k vocab gives 50,000. You use it to compare: did finetuning lower my perplexity on my domain's text?
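A sketch of that comparison, assuming you can collect per-token losses (in nats) from both models on the same held-out domain text with the same tokenizer. The loss values below are invented purely for illustration.

```python
import math


def perplexity_from_token_losses(token_losses):
    """Perplexity from per-token cross-entropy losses measured in nats."""
    return math.exp(sum(token_losses) / len(token_losses))


def uniform_perplexity(vocab_size):
    """A model that guesses uniformly gives every token probability 1/vocab_size."""
    loss = -math.log(1.0 / vocab_size)
    return math.exp(loss)


# Hypothetical per-token losses on the same held-out text.
base_losses = [3.1, 2.7, 3.4, 2.9]
finetuned_losses = [2.2, 1.9, 2.6, 2.1]

print(f"uniform guess, 50k vocab: {uniform_perplexity(50_000):.0f}")
print(f"base model:               {perplexity_from_token_losses(base_losses):.1f}")
print(f"finetuned model:          {perplexity_from_token_losses(finetuned_losses):.1f}")
```

The absolute numbers matter less than the direction: the same text, the same tokenizer, and a lower number after finetuning.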

One important limitation: perplexity only measures next-word prediction on text you already have. It says nothing about whether answers are helpful, true, or well-formatted — that's what the §③ generation metrics and LLM-as-judge are for.

② Judging labels · build it by hand

The confusion matrix

Below are 12 items. The emoji is the truth: 🐟 is actually positive, 🥾 is actually negative. Click an item to toggle your model's prediction (a blue ring = "model predicts positive / caught in net"). Watch the matrix and metrics update.

Blue ring = predicted positive. Try to catch all the fish without grabbing boots.

              Predicted Positive               Predicted Negative
Actually 🐟   True Positive (caught fish)      False Negative (fish escaped)
Actually 🥾   False Positive (caught a boot)   True Negative (left boot alone)

All four counts start at 0 and update as you toggle items. Precision and recall are shown below the matrix.
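The same bookkeeping can be sketched in code. The 12 items and the model's picks below are a made-up arrangement, not the ones in the widget, and the function names are illustrative.

```python
def confusion_counts(is_fish, in_net):
    """Count (TP, FN, FP, TN) from two parallel lists of booleans."""
    tp = fn = fp = tn = 0
    for fish, caught in zip(is_fish, in_net):
        if fish and caught:
            tp += 1  # caught fish
        elif fish:
            fn += 1  # fish escaped
        elif caught:
            fp += 1  # caught a boot
        else:
            tn += 1  # left boot alone
    return tp, fn, fp, tn


def precision_recall(tp, fn, fp):
    """Return (precision, recall); None where the denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Hypothetical lake of 12 items and a hypothetical set of predictions.
items = ["fish", "fish", "boot", "fish", "boot", "boot",
         "fish", "boot", "fish", "boot", "boot", "fish"]
in_net = [True, True, True, False, False, False,
          True, False, False, True, False, True]

is_fish = [item == "fish" for item in items]
tp, fn, fp, tn = confusion_counts(is_fish, in_net)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print("precision, recall:", precision_recall(tp, fn, fp))
```

Returning None for an empty net mirrors the blank readout: with nothing predicted positive, precision is undefined rather than zero.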

② Judging labels · the trade-off

Precision vs Recall: the slider

Real models output a score (0–1), and you pick a threshold: score ≥ threshold → predict positive. Below, 14 items have fixed scores. Drag the threshold and watch precision and recall pull in opposite directions.

Decision threshold:

Low threshold = cast a wide net (high recall, low precision). High threshold = only the sure things (high precision, low recall).

Bold = actually positive (🐟). Blue outline = predicted positive at this threshold.

Readouts: precision and recall at the current threshold.
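A small sketch of the slider, assuming each item is a (score, is_positive) pair. The eight scored items are invented for illustration and are not the 14 used in the widget.

```python
def precision_recall_at(threshold, scored_items):
    """Predict positive when score >= threshold, then return (precision, recall)."""
    tp = sum(1 for score, pos in scored_items if score >= threshold and pos)
    fp = sum(1 for score, pos in scored_items if score >= threshold and not pos)
    fn = sum(1 for score, pos in scored_items if score < threshold and pos)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Hypothetical model scores; True marks an actual positive (a fish).
scored = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
          (0.60, True), (0.40, False), (0.30, True), (0.10, False)]

for threshold in (0.2, 0.5, 0.85):
    p, r = precision_recall_at(threshold, scored)
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
```

Raising the threshold trades recall for precision on this data: the wide net at 0.2 catches every fish plus two boots, while the narrow net at 0.85 catches no boots but only two of the five fish.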

② Judging labels · the classic beginner trap

⚠️ Why "accuracy" lies

Accuracy = "what fraction did I get right?" — the most intuitive metric, and the most dangerous on imbalanced data. Here's a fraud detector where only a few transactions are actually fraud. Drag how rare fraud is:

Fraud rate in the data:

Now meet the "lazy model" that just predicts "never fraud" for everything — it does zero real work:

Readouts for the lazy model: accuracy, and recall on fraud (it caught 0 of the fraud).

This is the whole reason precision, recall & F1 exist: none of them counts true negatives, so the easy majority class can't inflate the score. They ask "did you catch the thing that matters?" A model can have 98% accuracy and be useless.
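The lazy model is easy to reproduce. A minimal sketch, assuming 1,000 transactions and a fraud rate you choose; the function name is illustrative.

```python
def lazy_model_report(n_transactions, fraud_rate):
    """Score a model that predicts "never fraud" for every transaction."""
    n_fraud = round(n_transactions * fraud_rate)
    n_legit = n_transactions - n_fraud
    # Every legitimate transaction is a true negative, every fraud a false negative.
    tp, fp, fn, tn = 0, 0, n_fraud, n_legit
    accuracy = (tp + tn) / n_transactions
    recall = tp / (tp + fn) if tp + fn else None
    return accuracy, recall


for rate in (0.20, 0.05, 0.02):
    accuracy, recall = lazy_model_report(1000, rate)
    print(f"fraud rate {rate:.0%}: accuracy {accuracy:.0%}, recall on fraud {recall:.0%}")
```

The rarer the fraud, the better the lazy model looks on accuracy, while its recall on fraud stays at zero throughout.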

② Reference

The formulas, with their job

Precision

TP / (TP + FP)

"Of my catch, how much is fish?"

Use when false alarms are costly. Spam filter — blocking real email is worse than missing some spam.

Recall

TP / (TP + FN)

"Of all the fish, how many did I catch?"

Use when misses are costly. Cancer screening — a false alarm beats missing a sick patient.

F1

2·P·R / (P + R)

Harmonic mean — dies if either is low.

Use when you need balance and don't want a high score from maximizing just one side.

Why harmonic mean, not regular average? Regular average of P=1.0, R=0.0 is 0.5 (looks okay!). Harmonic mean is 0.0 — it refuses to reward a model that ignores half the problem.
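To see the difference on a few pairs, here is a minimal sketch comparing the two averages. Treating F1 as 0 when both precision and recall are 0 is a common convention, assumed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0 (assumed convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """The regular average, shown only for contrast."""
    return (precision + recall) / 2


for p, r in [(1.0, 0.0), (0.9, 0.5), (0.8, 0.8)]:
    print(f"P={p:.1f} R={r:.1f} -> regular average {arithmetic_mean(p, r):.2f}, "
          f"F1 {f1_score(p, r):.2f}")
```

The two averages agree only when precision and recall are equal; the more lopsided the pair, the further F1 falls below the regular average.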

③ Judging generated text · metrics for words

ROUGE = the same idea, for words

When a model generates text (summaries, translations), the "fish in the lake" become the words in the reference. ROUGE asks: how much of the reference did the generated text catch?

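Interactive widget: it takes a Reference (the ideal / "truth") and a Generated text (your model's output), and scores them with one of three variants:
• ROUGE-1 (words)
• ROUGE-2 (pairs)
• ROUGE-L (sequence)
Readouts: recall and precision.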

Why ROUGE-2 / ROUGE-L matter: try Reference "dog bites man" and Generated "man bites dog". ROUGE-1 says perfect (same 3 words!), but the meaning is reversed. ROUGE-2 (word pairs) finds no shared pair, and ROUGE-L (order) finds an in-order match only one word long. Together they catch what ROUGE-1 misses.
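The dog-and-man example can be checked with a bare-bones sketch. It assumes plain whitespace tokenization and a single reference; real ROUGE implementations typically add steps such as stemming, so treat these functions as an illustration of the counting, not a drop-in scorer.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(reference, generated, n):
    """Return (recall, precision) of n-gram overlap, counting each match once."""
    ref_counts = Counter(ngrams(reference.split(), n))
    gen_counts = Counter(ngrams(generated.split(), n))
    overlap = sum((ref_counts & gen_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(gen_counts.values()), 1)
    return recall, precision


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(reference, generated):
    """Return (recall, precision) based on the longest in-order match."""
    ref, gen = reference.split(), generated.split()
    lcs = lcs_length(ref, gen)
    return lcs / max(len(ref), 1), lcs / max(len(gen), 1)


reference, generated = "dog bites man", "man bites dog"
print("ROUGE-1:", rouge_n(reference, generated, 1))
print("ROUGE-2:", rouge_n(reference, generated, 2))
print("ROUGE-L:", rouge_l(reference, generated))
```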

③ Judging generated text · the family

ROUGE vs BLEU — two sides of the same idea

ROUGE and BLEU both count word overlap; they just lead with different anxieties from the fishing net:

ROUGE → recall-leaning

"Did I cover everything in the reference?"

Standard for summarization — a summary that drops key points is the failure you fear. Misses hurt.

BLEU → precision-leaning

"Is everything I generated actually correct?"

Standard for translation — inventing words that don't belong is the failure you fear. It adds a "brevity penalty" so you can't get a high score by outputting just one perfect word.
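Here is a simplified, sentence-level sketch of that idea: clipped n-gram precision multiplied by a brevity penalty. Full BLEU is usually computed over a whole corpus with up to 4-grams and often with smoothing, so `bleu_sketch` and its `max_n` parameter are illustrative simplifications rather than the standard metric.

```python
import math
from collections import Counter


def clipped_precision(ref_tokens, gen_tokens, n):
    """Share of generated n-grams found in the reference, each counted at most
    as often as it appears in the reference."""
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    gen = Counter(tuple(gen_tokens[i:i + n]) for i in range(len(gen_tokens) - n + 1))
    total = sum(gen.values())
    if total == 0:
        return 0.0
    return sum((ref & gen).values()) / total


def brevity_penalty(ref_len, gen_len):
    """1.0 when the output is at least as long as the reference, else exp(1 - r/c)."""
    if gen_len == 0:
        return 0.0
    if gen_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / gen_len)


def bleu_sketch(reference, generated, max_n=2):
    """Brevity penalty times the geometric mean of 1..max_n clipped precisions."""
    ref, gen = reference.split(), generated.split()
    precisions = [clipped_precision(ref, gen, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(len(ref), len(gen)) * geo_mean


reference = "the cat sat on the mat"
print(f"one perfect word: {bleu_sketch(reference, 'the', max_n=1):.4f}")
print(f"first half only:  {bleu_sketch(reference, 'the cat sat'):.4f}")
print(f"exact match:      {bleu_sketch(reference, reference):.4f}")
```

The one-word output has perfect precision, and the brevity penalty is the only thing keeping its score near zero.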

Same precision/recall trade-off you saw with the fishing net — just applied to word-chunks, with each field picking the side that matches its worst failure.

③ The limitation & the modern fix

The blind spot: none of these understand meaning

ROUGE and BLEU count surface overlap. To them, "the film was great" and "the movie was excellent" share only the filler words "the" and "was": half the single words match and no word pairs do, so the score is low despite identical meaning. For finetuning a chatbot, that makes them weak judges.

BERTScore

Compares embeddings, not exact words. "film/movie", "great/excellent" score as near-matches. The meaning-aware fix for ROUGE/BLEU's blindness.
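A toy sketch of the matching idea, under loud assumptions: the 4-dimensional vectors below are invented by hand so that "film/movie" and "great/excellent" point in nearly the same direction. Real BERTScore uses contextual embeddings from a pretrained model and also reports precision and F1. Only the recall side is shown here, and `soft_recall` is an illustrative name.

```python
import math

# Hand-made toy vectors, for illustration only (not real embeddings).
TOY_VECTORS = {
    "the": (1.0, 0.0, 0.0, 0.0),
    "was": (0.0, 1.0, 0.0, 0.0),
    "film": (0.0, 0.0, 1.0, 0.1),
    "movie": (0.0, 0.0, 1.0, 0.0),
    "great": (0.1, 0.0, 0.0, 1.0),
    "excellent": (0.0, 0.0, 0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_recall(reference, generated, vectors):
    """Average, over reference words, of the best cosine match in the generated text."""
    ref, gen = reference.split(), generated.split()
    best = [max(cosine(vectors[r], vectors[g]) for g in gen) for r in ref]
    return sum(best) / len(best)


def exact_recall(reference, generated):
    """Share of reference words that appear verbatim in the generated text."""
    ref, gen = reference.split(), set(generated.split())
    return sum(word in gen for word in ref) / len(ref)


reference, generated = "the film was great", "the movie was excellent"
print(f"exact-word recall:     {exact_recall(reference, generated):.2f}")
print(f"embedding-based recall: {soft_recall(reference, generated, TOY_VECTORS):.2f}")
```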

LLM-as-a-judge

Ask a strong model (e.g. Claude) to score your finetuned model's answers for helpfulness, correctness, and tone. A widely used method today for evaluating instruction-tuned models.

Win rate / human eval

"Is answer A better than B?" across many prompts. Pairwise preference is what RLHF and Chatbot-Arena rankings use. Humans remain the gold standard for subjective quality.

The finetuning takeaway: n-gram metrics (ROUGE/BLEU) are cheap and automatic, which makes them great for a fast signal and for tasks with one right answer. For open-ended chat quality, they undercount good paraphrases, so the field leans on LLM-as-judge and human preference. Match the metric to the failure you actually fear.