AI Metrics, visually!

The metrics you meet when finetuning a model, grouped by when you use them, made playful and interactive.

🗺️ The map. Finetuning has three moments, each with its own metrics:
① While training → loss & perplexity (is it learning? is it overfitting?)
② Judging labels → accuracy, precision, recall, F1 (did it classify right?)
③ Judging generated text → ROUGE, BLEU, BERTScore (did it write well?)
The sections below follow that order. Everything ties back to the fishing net.
The one mental model

🎣 Your model is a fishing net

Every metric here is just an anxious question someone asks about that net. Hold this picture and the rest follows.

🐟 🐟 🐟 🥾 🐟 🐟

Precision

"Everything I caught: is it actually fish, or did I also pull up boots?"

Of what I flagged positive, how much was right? Boots in the net (🥾) = false positives.

Recall

"All the fish in the lake: did I actually catch them, or did some swim through?"

Of what was really there, how much did I get? Escaped fish = false negatives.

Why they pull against each other: want perfect recall? Cover the whole lake with your net. You'll get every fish, but also every boot (so precision drops). Want perfect precision? Take only the one fish you're 100% sure of: definitely a fish, but you missed 999 others (so recall drops). F1 exists to stop both shortcuts.

① While training · the overfitting alarm

📉 Loss, validation loss & perplexity

Before judging quality, you watch whether the model is learning at all. During finetuning your dashboard shows loss dropping over time. The single most important habit for a beginner: watch the validation curve, not the training curve.

Drag through training epochs:
Train loss: model fit to training data
Val loss: fit to unseen data, the real signal
Perplexity: exp(val loss)

What is perplexity?

perplexity = e^loss. Read it as: "how many equally-likely words is the model choosing between at each step?" Perplexity 1 = perfectly certain & correct. Perplexity 50 = as unsure as picking from 50 options. Lower is better.

How to spot overfitting

Train loss usually keeps dropping, because the model can memorize its training set. When val loss turns back upward while train loss is still falling, the model is memorizing, not learning. That widening gap is your signal to stop early.
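The stop-early rule above can be sketched in a few lines. This is a minimal illustration, not any particular trainer's implementation: the function name `early_stopping_epoch`, the `patience` parameter, and both loss curves are made up for the example.

```python
import math


def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, scanning until validation loss
    has failed to improve for `patience` epochs in a row."""
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # val loss has turned upward: stop here
    return best_epoch


# Invented curves: train loss keeps falling, val loss turns back up.
train_loss = [2.9, 2.1, 1.6, 1.2, 0.9, 0.7]
val_loss = [3.0, 2.4, 2.1, 2.0, 2.2, 2.5]

best = early_stopping_epoch(val_loss, patience=2)
best_perplexity = math.exp(val_loss[best])
print(f"keep the checkpoint from epoch {best}, perplexity {best_perplexity:.2f}")
```

A larger `patience` tolerates noisier validation curves at the cost of a few wasted epochs.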

① While training · the headline number for language models

🎲 Perplexity & cross-entropy, in plain terms

🌀️ Think of a weather forecaster. Every day they predict tomorrow. A language model does the same, but predicts the next word.

Perplexity = how many options is the forecaster still guessing among?
• "100% sure it'll rain" → and it rains → only 1 option in play. Perfect. Perplexity = 1.
• "Could be sunny, rainy, cloudy, or snowy, no idea" → 4 equally likely options in play. Perplexity = 4.
• Guessing blindly → thousands of options in play. Perplexity = huge.
Fewer options = more confident and correct, so lower is better. The purple "doors" further down literally draw these options.
😱 And cross-entropy loss? It's just a "surprise meter." The model bets a probability on each word; then the true word is revealed:
• Bet 90% on the right word → "I expected that" → tiny surprise.
• Bet 50% → "fair enough" → medium surprise.
• Bet 1% → "wait, WHAT?!" → huge surprise.
Cross-entropy loss = the model's average surprise across all the words. Low surprise = good. That's the entire idea; the rest is just the math for "surprise."
Connection: perplexity = e^(loss). Loss is the raw training signal; perplexity is that same thing translated into a "number of choices" you can picture. Like °C vs an obscure unit: same temperature, one you can feel.
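The three bets above can be turned into numbers. This is a small sketch that treats the 90%, 50% and 1% bets as one three-word sequence; the helper name `surprise` is just for illustration.

```python
import math


def surprise(p_true):
    """Surprise in nats: the negative natural log of the probability
    the model gave to the word that actually came next."""
    return -math.log(p_true)


bets = [0.90, 0.50, 0.01]  # probability placed on the true word
surprises = [surprise(p) for p in bets]

# Cross-entropy loss is the average surprise; perplexity is e^loss.
cross_entropy = sum(surprises) / len(surprises)
perplexity = math.exp(cross_entropy)

for p, s in zip(bets, surprises):
    print(f"bet {p:.0%} -> surprise {s:.2f} nats")
print(f"cross-entropy {cross_entropy:.2f}, perplexity {perplexity:.2f}")
```

Note how the single 1% bet dominates the average: one confident miss costs more than many mild ones.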

Now the same idea with real numbers. A language model's job: given the words so far, predict the next one β€” a probability for every word in its vocabulary. Perplexity asks:

"On average, how many words is the model unsure between at each step?"

the cat sat on the → next word is: mat

The true next word is mat. Drag how much probability the model gave it:

P("mat") =

The remaining probability is spread over other words. Here's the model's guess distribution:

Probability of correct word: p
Surprise (loss) = −ln(p)
Perplexity = e^surprise

Why exponentiate the loss?

Training shows cross-entropy loss = average surprise, in abstract units (nats). perplexity = e^loss converts that surprise back into a tangible count of choices. Loss 0 → ppl 1. Loss 2.3 → ppl ≈ 10. Loss 4.6 → ppl ≈ 100. Same info, friendlier units.

What's a "good" number?

It depends on vocabulary, tokenizer & task, so it's only meaningful relative to a baseline. As a rough guide, a strong modern LLM on general English sits around perplexity 3–15. Random guessing across a 50k vocab ≈ 50,000. You use it to compare: did finetuning lower my perplexity on my domain's text?
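The random-guessing baseline and the loss-to-perplexity conversions quoted above are easy to check. A short sketch, assuming a uniform guess over a 50,000-word vocabulary; `perplexity_from_probs` is an illustrative helper, not a library function.

```python
import math


def perplexity_from_probs(p_true_per_token):
    """Perplexity = exp(mean negative log-probability of the true tokens)."""
    nll = [-math.log(p) for p in p_true_per_token]
    return math.exp(sum(nll) / len(nll))


# A model that spreads its bet evenly over the whole vocabulary gives every
# true token probability 1/V, so its perplexity comes out as V.
vocab_size = 50_000
uniform_perplexity = perplexity_from_probs([1 / vocab_size] * 20)
print(f"uniform guess over {vocab_size} words -> perplexity {uniform_perplexity:.0f}")

# The conversions quoted in the text: loss (nats) -> perplexity.
for loss in (0.0, 2.3, 4.6):
    print(f"loss {loss} -> perplexity {math.exp(loss):.1f}")
```

Any trained model should land far below the uniform baseline; the interesting comparison is the same model before and after finetuning on the same held-out text.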

One important limitation: perplexity only measures next-word prediction on text you already have. It says nothing about whether answers are helpful, true, or well-formatted. That's what the §③ generation metrics and LLM-as-judge are for.

② Judging labels · build it by hand

The confusion matrix

Below are 12 items. The emoji is the truth: 🐟 is actually positive, 🥾 is actually negative. Click an item to toggle your model's prediction (a blue ring = "model predicts positive / caught in net"). Watch the matrix and metrics update.

Blue ring = predicted positive. Try to catch all the fish without grabbing boots.

Confusion matrix (rows = truth, columns = prediction):
Actually 🐟 | Predicted Positive: True Positive (caught fish) | Predicted Negative: False Negative (fish escaped)
Actually 🥾 | Predicted Positive: False Positive (caught a boot) | Predicted Negative: True Negative (left boot alone)
Readouts: Precision, Recall, F1
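The matrix above can also be filled in by code. This sketch uses 12 invented items (six fish, six boots) and an invented set of predictions, since the page's actual items are not given here; the function names are illustrative.

```python
def confusion_counts(truth, predicted):
    """Count (TP, FP, FN, TN) for two equal-length lists of booleans.
    truth[i] is True for a fish; predicted[i] is True if it was netted."""
    tp = fp = fn = tn = 0
    for is_fish, caught in zip(truth, predicted):
        if caught and is_fish:
            tp += 1  # caught fish
        elif caught:
            fp += 1  # caught a boot
        elif is_fish:
            fn += 1  # fish escaped
        else:
            tn += 1  # left boot alone
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


T, F = True, False
truth = [T, T, T, T, T, T, F, F, F, F, F, F]      # 6 fish, 6 boots
predicted = [T, T, T, T, F, F, T, F, F, F, F, F]  # net holds 4 fish, 1 boot

tp, fp, fn, tn = confusion_counts(truth, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```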
② Judging labels · the trade-off

Precision vs Recall: the slider

Real models output a score (0–1), and you pick a threshold: score ≥ threshold → predict positive. Below, 14 items have fixed scores. Drag the threshold and watch precision and recall pull in opposite directions.

Decision threshold:

Low threshold = cast a wide net (high recall, low precision). High threshold = only the sure things (high precision, low recall).

Bold = actually positive (🐟). Blue outline = predicted positive at this threshold.

Readouts: Precision, Recall, F1
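Dragging the threshold can be imitated with a small sweep. The eight scores and labels below are invented for the sketch (the page's 14 items are not reproduced), and reporting precision as 0.0 when nothing is predicted positive is a convention chosen here, not a standard.

```python
def metrics_at(threshold, scores, labels):
    """Precision and recall when 'score >= threshold' means predict positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0  # convention for an empty net
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented model scores; label 1 = actually positive (a fish).
scores = [0.95, 0.90, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.2, 0.5, 0.85):
    p, r = metrics_at(threshold, scores, labels)
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
```

On this toy data, raising the threshold moves precision up and recall down, which is the wide-net versus sure-things trade-off described above.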
② Judging labels · the classic beginner trap

⚠️ Why "accuracy" lies

Accuracy = "what fraction did I get right?" It is the most intuitive metric, and the most dangerous on imbalanced data. Here's a fraud detector where only a few transactions are actually fraud. Drag how rare fraud is:

Fraud rate in the data:

Now meet the "lazy model" that just predicts "never fraud" for everything. It does zero real work:

Accuracy: 1 − fraud rate (it is right on every legitimate transaction)
Recall (on fraud): 0, caught 0 of the fraud

This is the whole reason precision, recall & F1 exist: none of them counts true negatives, so the easy majority class can't inflate the score. They ask "did you catch the thing that matters?" A model can have 98% accuracy and be useless.
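The 98% figure falls straight out of the arithmetic. A sketch assuming 1,000 transactions of which 20 (2%) are fraud; both numbers and the function name are made up for illustration.

```python
def lazy_model_report(n_total, n_fraud):
    """Score a model that predicts 'not fraud' for every transaction."""
    tp, fp = 0, 0                # it never flags anything
    fn = n_fraud                 # so every fraud case is missed
    tn = n_total - n_fraud       # and every legitimate case is 'right'
    accuracy = (tp + tn) / n_total
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall


accuracy, recall = lazy_model_report(n_total=1000, n_fraud=20)
print(f"accuracy {accuracy:.0%}, recall on fraud {recall:.0%}")
```

The rarer the fraud, the better the lazy model looks on accuracy, while its recall stays pinned at zero.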

② Reference

The formulas, with their job

Precision

TP / (TP + FP)

"Of my catch, how much is fish?"

Use when false alarms are costly. Spam filter: blocking real email is worse than missing some spam.

Recall

TP / (TP + FN)

"Of all the fish, how many did I catch?"

Use when misses are costly. Cancer screening: a false alarm beats missing a sick patient.

F1

2·P·R / (P + R)

Harmonic mean: dies if either is low.

Use when you need balance and don't want a high score from maximizing just one side.

Why harmonic mean, not regular average? Regular average of P=1.0, R=0.0 is 0.5 (looks okay!). Harmonic mean is 0.0: it refuses to reward a model that ignores half the problem.
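The comparison above, plus two more precision/recall pairs chosen for illustration, as a quick sketch:

```python
def arithmetic_mean(p, r):
    return (p + r) / 2


def f1_score(p, r):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if p + r else 0.0


for p, r in [(1.0, 0.0), (0.9, 0.1), (0.6, 0.6)]:
    print(f"P={p} R={r}: average {arithmetic_mean(p, r):.2f}, F1 {f1_score(p, r):.2f}")
```

The regular average cannot tell a lopsided 0.9/0.1 model from a useless 1.0/0.0 one (both give 0.5), while F1 only catches up with the average when precision and recall are balanced.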
③ Judging generated text · metrics for words

ROUGE = the same idea, for words

When a model generates text (summaries, translations), the "fish in the lake" become the words in the reference. ROUGE asks: how much of the reference did the generated text catch?

Inputs: Reference (the ideal / "truth") and Generated (your model's output)
Variants: ROUGE-1 (words), ROUGE-2 (pairs), ROUGE-L (sequence)
Readouts: Recall, Precision, F1
Why ROUGE-2 / ROUGE-L matter: try Reference "dog bites man" and Generated "man bites dog". ROUGE-1 says perfect (same 3 words!), but the meaning is reversed. ROUGE-2 (word pairs) and ROUGE-L (order) catch what ROUGE-1 misses.
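The dog/man example can be reproduced with a bare-bones ROUGE. This is a simplified sketch that splits on whitespace and reports recall only; real ROUGE tooling typically adds tokenization and stemming options, so treat the function names and details as illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(reference, generated, n):
    """Fraction of the reference's n-grams that the generated text caught."""
    ref, gen = ngrams(reference.split(), n), ngrams(generated.split(), n)
    overlap = sum((ref & gen).values())  # clipped overlap count
    return overlap / max(sum(ref.values()), 1)


def lcs_length(a, b):
    """Length of the longest common subsequence (same order, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_recall(reference, generated):
    ref, gen = reference.split(), generated.split()
    return lcs_length(ref, gen) / max(len(ref), 1)


reference, generated = "dog bites man", "man bites dog"
print("ROUGE-1 recall:", rouge_n_recall(reference, generated, 1))
print("ROUGE-2 recall:", rouge_n_recall(reference, generated, 2))
print("ROUGE-L recall:", round(rouge_l_recall(reference, generated), 2))
```

ROUGE-1 comes out perfect, ROUGE-2 finds no shared word pair, and ROUGE-L can only keep one word in order.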
③ Judging generated text · the family

ROUGE vs BLEU: two sides of the same idea

ROUGE and BLEU both count word overlap; they just lead with different anxieties from the fishing net:

ROUGE → recall-leaning

"Did I cover everything in the reference?"

Standard for summarization: a summary that drops key points is the failure you fear. Misses hurt.

BLEU → precision-leaning

"Is everything I generated actually correct?"

Standard for translation: inventing words that don't belong is the failure you fear. It adds a "brevity penalty" so you can't get a high score by outputting just one perfect word.

Same precision/recall trade-off you saw with the fishing net, just applied to word-chunks, with each field picking the side that matches its worst failure.
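To see the brevity penalty at work, here is a deliberately cut-down BLEU. It uses single words only, whereas full BLEU combines clipped precisions for 1- to 4-word chunks, so `toy_bleu1` is a teaching sketch and its scores are not comparable to real BLEU scores.

```python
import math
from collections import Counter


def clipped_unigram_precision(reference, candidate):
    """Share of candidate words found in the reference, where each word is
    credited at most as many times as it appears in the reference."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    matched = sum(min(count, ref[word]) for word, count in cand.items())
    return matched / max(sum(cand.values()), 1)


def brevity_penalty(ref_len, cand_len):
    """1.0 for candidates longer than the reference, exp(1 - r/c) otherwise."""
    if cand_len == 0:
        return 0.0
    return 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)


def toy_bleu1(reference, candidate):
    bp = brevity_penalty(len(reference.split()), len(candidate.split()))
    return bp * clipped_unigram_precision(reference, candidate)


reference = "the cat sat on the mat"
for candidate in ("the", "the the the the the the", reference):
    print(f"{candidate!r}: {toy_bleu1(reference, candidate):.3f}")
```

The one-word output has perfect precision but is crushed by the penalty, and repeating "the" six times is caught by the clipping.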
③ The limitation & the modern fix

The blind spot: none of these understand meaning

ROUGE and BLEU count surface overlap. To them, "the film was great" and "the movie was excellent" share only the filler words "the" and "was": no content word and no word pair matches, so ROUGE-2 and BLEU land near zero despite identical meaning. For finetuning a chatbot, that makes them weak judges.

BERTScore

Compares embeddings, not exact words. "film/movie", "great/excellent" score as near-matches. The meaning-aware fix for ROUGE/BLEU's blindness.

LLM-as-a-judge

Ask a strong model (e.g. Claude) to score your finetuned model's answers for helpfulness, correctness, tone. The dominant method today for instruction-tuned models.

Win rate / human eval

"Is answer A better than B?" across many prompts. Pairwise preference is what RLHF and Chatbot-Arena rankings use. Humans remain the gold standard for subjective quality.

The finetuning takeaway: n-gram metrics (ROUGE/BLEU) are cheap and automatic, great for a fast signal and for tasks with one right answer. For open-ended chat quality, they undercount good paraphrases, so the field leans on LLM-as-judge and human preference. Match the metric to the failure you actually fear.
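The pairwise "is A better than B?" preference mentioned above reduces to a simple tally. A sketch with invented verdicts; counting a tie as half a win is one common convention and is an assumption here, not something the text specifies.

```python
def win_rate(verdicts):
    """Share of comparisons won by model A, with each tie counted as half.
    `verdicts` holds one of 'A', 'B' or 'tie' per prompt."""
    if not verdicts:
        raise ValueError("need at least one comparison")
    wins = verdicts.count("A")
    ties = verdicts.count("tie")
    return (wins + 0.5 * ties) / len(verdicts)


# Invented judge verdicts for eight prompts (human or LLM-as-judge).
verdicts = ["A", "A", "B", "A", "tie", "B", "A", "A"]
rate = win_rate(verdicts)
print(f"model A win rate: {rate:.1%}")
```

With only a handful of prompts a win rate is noisy, so real comparisons run over many prompts before a winner is declared.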