unspurious.calculators

Confusion Matrix Calculator

Judge a binary classifier honestly. Enter the four counts (true positives, false positives, false negatives and true negatives) and read off accuracy, precision, recall, specificity, the F1 score, balanced accuracy and the Matthews correlation coefficient, with the 2×2 grid drawn out. The page also gives a clear warning about the accuracy paradox, where 99% accuracy can mean a useless model.

A positive is whatever you are trying to detect (spam, disease, fraud). TP = correctly flagged; FP = false alarm; FN = a miss; TN = correctly cleared.

In plain English

A confusion matrix is the scorecard of a yes/no classifier: of everything it labelled positive or negative, how many it got right and wrong. Every metric below is just a different slice of those four numbers, built to answer a different question — and no single one tells the whole story, which is exactly how a model can look good and be useless.

accuracy
The share of all predictions that were correct, (TP + TN) ∕ total. Intuitive, but dangerously flattering when one class is rare.
precision
Of the cases flagged positive, the share that really were, TP ∕ (TP + FP). “When it says yes, how often is it right?” Falls when false alarms pile up.
recall (sensitivity)
Of the real positives, the share caught, TP ∕ (TP + FN). “Of the things I should have found, how many did I?” Falls when misses pile up.
F1 score
The harmonic mean of precision and recall, 2 × precision × recall ∕ (precision + recall). A single number that is only high when both are, so a model cannot score well by sacrificing one for the other.
balanced accuracy & MCC
Summaries that stay honest under class imbalance: balanced accuracy averages recall and specificity (the share of real negatives correctly cleared, TN ∕ (TN + FP)); the Matthews correlation coefficient (−1 to +1) uses all four cells at once.
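
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: the function name `confusion_metrics`, the example counts, and the choice to return `None` when a denominator is zero are all assumptions made for this sketch.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Compute the metrics defined above from the four cell counts.

    Ratios with a zero denominator come back as None. That is a choice
    made for this sketch; other tools may report 0 or NaN instead.
    """
    def ratio(num, den):
        return num / den if den else None

    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    balanced = None
    if recall is not None and specificity is not None:
        balanced = (recall + specificity) / 2

    # MCC: (TP*TN - FP*FN) over the square root of the four marginal totals.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": recall,
        "specificity": specificity,
        # 2TP / (2TP + FP + FN) equals the harmonic mean of precision and recall.
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "balanced_accuracy": balanced,
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts, chosen only to exercise the function.
for name, value in confusion_metrics(tp=40, fp=10, fn=20, tn=930).items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts accuracy is 0.97 while recall is only about 0.67, which is the kind of gap the individual metrics are meant to expose.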

Frequently asked

What is the difference between precision and recall?

Precision asks: when the model predicts positive, how often is it correct? — TP ∕ (TP + FP). Recall asks: of all the actual positives, how many did the model catch? — TP ∕ (TP + FN). They trade off: lowering the decision threshold catches more real positives (higher recall) but raises false alarms (lower precision). The F1 score combines them into one figure.
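
The threshold trade-off can be seen on a toy example. The scores and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper, not part of any library.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Invented model scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.75, 0.35):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(f"threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}")
```

On this data the strict threshold of 0.75 flags three cases, all real positives, but misses one. Lowering it to 0.35 catches all four positives at the cost of two false alarms, so recall rises and precision falls.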

What is the accuracy paradox?

When one class is rare, a model can score very high accuracy by simply predicting the majority class every time. If 1% of emails are spam, a classifier that calls everything "not spam" is 99% accurate and catches zero spam. That is the accuracy paradox — and why precision, recall, F1, balanced accuracy or the MCC are the metrics to trust on imbalanced data.
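
The spam example works out as follows. The total of 10,000 emails is an assumption added to turn the 1% rate into concrete counts.

```python
# Assumed mailbox: 10,000 emails, of which 1% (100) are spam.
total_emails = 10_000
spam = 100

# A classifier that calls everything "not spam" never predicts positive.
tp, fp = 0, 0
fn, tn = spam, total_emails - spam

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy:          {accuracy:.2f}")
print(f"recall:            {recall:.2f}")
print(f"balanced accuracy: {balanced_accuracy:.2f}")
```

Accuracy comes out at 0.99, while recall is 0 and balanced accuracy is 0.5, no better than a coin flip. Precision is undefined here because the model makes no positive predictions at all.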

Does precision depend on how common the positive class is?

Yes. Recall and specificity are properties of the classifier, but precision (like a medical test's positive predictive value) also depends on the base rate. The same model applied where positives are rarer will throw a higher share of false alarms among its positive predictions, so its precision drops — even though nothing about the model changed. This is the base-rate fallacy in classifier form.
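
A short sketch makes the dependence on the base rate concrete. The recall of 0.90 and specificity of 0.95 are assumed values for a hypothetical model, and `precision_at_prevalence` is a name invented here.

```python
def precision_at_prevalence(recall, specificity, prevalence):
    """Expected precision (positive predictive value) for a given base rate."""
    true_pos = recall * prevalence                      # share of all cases that are TP
    false_pos = (1 - specificity) * (1 - prevalence)    # share of all cases that are FP
    return true_pos / (true_pos + false_pos)


# Same hypothetical model, applied where positives are common and where they are rare.
for prevalence in (0.50, 0.10, 0.01):
    ppv = precision_at_prevalence(recall=0.90, specificity=0.95, prevalence=prevalence)
    print(f"prevalence {prevalence:.2f}: precision {ppv:.3f}")
```

With these assumed figures, precision is about 0.95 when half the cases are positive and about 0.15 when only 1% are, although recall and specificity never changed.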

What is a good F1 score?

There is no universal cut-off — it depends on the costs of false positives versus false negatives and on how balanced the classes are. F1 runs from 0 to 1, higher is better, and it is only high when precision and recall are both decent. Judge it against a sensible baseline (a trivial always-predict-the-majority model, or your previous model) rather than an absolute target, and always report precision and recall alongside it — the same F1 can come from very different trade-offs.
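
To see why F1 should not be reported alone, consider three hypothetical models with very different precision and recall. The pairs below are made up for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Three invented (precision, recall) pairs.
models = {
    "cautious (few flags, all correct)": (1.0, 0.5),
    "eager (catches everything, many false alarms)": (0.5, 1.0),
    "even": (2 / 3, 2 / 3),
}

for name, (precision, recall) in models.items():
    print(f"{name}: F1 = {f1(precision, recall):.3f}")
```

All three land on an F1 of about 0.667, yet one misses half the positives and another is wrong on half its alarms. Which is acceptable depends on the relative cost of a miss and a false alarm.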