unspurious.calculators

Cohen's Kappa Calculator

How much do two raters really agree beyond the agreement you'd get by luck alone? Enter the 2×2 table of their verdicts to get Cohen's kappa with a confidence interval, the observed and chance-expected agreement, and a reliability band. A headline like "they agreed 90% of the time" can hide a kappa that says the agreement was mostly coincidence.

Two raters each judge the same N items as positive or negative, so N = a + b + c + d. a and d are the agreements (both said the same); b and c are the disagreements.

In plain English

If two people label things independently, they will sometimes agree purely by chance: flip two fair coins and they match half the time. Cohen's kappa asks how much of the agreement you actually saw goes beyond what coincidence alone would deliver. It rescales the raw agreement so that 0 means "no better than chance" and 1 means "perfect", which is why 90% agreement that mostly reflects everyone saying "no" can still earn a mediocre kappa.

observed agreement (Pₒ)
The plain proportion of items the two raters labelled the same way: (a + d) ∕ N.
expected agreement (Pₑ)
How often they would agree just by chance, given how often each uses each label: for every label, multiply the two raters' proportions for that label (taken from the row and column totals), then add the products up.
kappa (κ)
(Pₒ − Pₑ) ∕ (1 − Pₑ): the share of the possible beyond-chance agreement that was actually achieved. 1 is perfect, 0 is chance level, negative is worse than chance.
reliability bands
A rough Landis & Koch guide: ≤0 poor, .01–.20 slight, .21–.40 fair, .41–.60 moderate, .61–.80 substantial, .81–1 almost perfect.
the kappa paradox
When one category is rare, kappa can be low even with very high agreement, because chance agreement is already high. Always read kappa next to the raw agreement and the marginals.
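
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: the function names are made up for the example, and it assumes a counts items both raters called positive and d items both called negative (for a 2×2 table, swapping b and c does not change kappa). The band cutoffs are the rough Landis & Koch guide quoted above.

```python
def cohens_kappa(a, b, c, d):
    """Return (p_o, p_e, kappa) for a 2x2 agreement table.

    Assumed layout: a = both positive, d = both negative (agreements);
    b = rater 1 positive / rater 2 negative, c = the reverse.
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("the table is empty")

    # Observed agreement: proportion of items labelled the same way.
    p_o = (a + d) / n

    # Each rater's share of "positive" verdicts (row and column totals).
    r1_pos = (a + b) / n
    r2_pos = (a + c) / n

    # Expected agreement: chance of both saying positive
    # plus chance of both saying negative.
    p_e = r1_pos * r2_pos + (1 - r1_pos) * (1 - r2_pos)
    if p_e == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")

    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, p_e, kappa


def reliability_band(kappa):
    """Map kappa to the rough Landis & Koch label used on this page."""
    if kappa <= 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


p_o, p_e, kappa = cohens_kappa(a=40, b=10, c=5, d=45)
print(f"Po = {p_o:.2f}, Pe = {p_e:.2f}, kappa = {kappa:.2f} "
      f"({reliability_band(kappa)})")
# Po = 0.85, Pe = 0.50, kappa = 0.70 (substantial)
```

Here the raters agree on 85 of 100 items, chance alone would produce 50% agreement, and kappa credits them with 70% of the remaining room.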

Frequently asked

What is a good Cohen's kappa value?

By the common Landis & Koch rule of thumb, kappa above 0.80 is "almost perfect", 0.61–0.80 "substantial", 0.41–0.60 "moderate", 0.21–0.40 "fair", and 0.20 or below "slight" to "poor". These labels are only a guide, not law — the kappa you should demand depends on the stakes and how easy the judgement is. And always check the confidence interval: a point estimate of 0.7 from few items can be compatible with much weaker agreement.
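
To see how much the interval depends on sample size, here is a small sketch that puts a 95% interval around kappa. It assumes the simple large-sample standard error, sqrt(Pₒ(1 − Pₒ) ∕ (N(1 − Pₑ)²)), which is one common approximation; this page does not say which method the calculator uses, so treat the numbers as illustrative. The function name and the example tables are invented for the demonstration.

```python
import math


def kappa_with_ci(a, b, c, d, z=1.96):
    """Return (kappa, lower, upper) using a simple large-sample interval.

    Assumption: SE = sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2)),
    a common approximation, not necessarily this calculator's method.
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("the table is empty")

    p_o = (a + d) / n
    # Chance agreement from the row and column totals.
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    if p_e == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")

    kappa = (p_o - p_e) / (1 - p_e)
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))

    # Kappa cannot exceed 1 or fall below -1, so clip the interval.
    lower = max(-1.0, kappa - z * se)
    upper = min(1.0, kappa + z * se)
    return kappa, lower, upper


# Same proportions, ten times the items: kappa is 0.70 in both cases.
for cells in [(8, 2, 1, 9), (80, 20, 10, 90)]:
    kappa, lower, upper = kappa_with_ci(*cells)
    print(f"N = {sum(cells):3d}: kappa = {kappa:.2f}, "
          f"95% CI {lower:.2f} to {upper:.2f}")
```

Under this approximation, a kappa of 0.70 from 20 items has a lower bound near 0.39, which falls in the "fair" band, while the same kappa from 200 items stays at roughly 0.60 or above.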

Why is kappa lower than the percent agreement?

Because kappa subtracts the agreement expected by chance. If two raters agree 90% of the time but, given how often each says "yes", they would have agreed 80% of the time just by guessing, then only half of the available beyond-chance agreement was achieved: kappa = (0.90 − 0.80) ∕ (1 − 0.80) = 0.50. Raw percent agreement flatters reliability precisely because it ignores this baseline.

What is the kappa paradox?

It is the unsettling fact that kappa can be low even when raters agree on almost everything, if one category is very common. When 95% of items are "negative", chance agreement is already huge, so there is little room for kappa to reward the raters — a tiny number of disagreements can crater it. The lesson is not to distrust kappa but to report it together with the raw agreement and the prevalence, so the picture is honest.
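
The paradox is easiest to see side by side. The sketch below compares two made-up tables with the same 96% raw agreement and the same four disagreements, one with positives and negatives balanced and one where positives are rare. The tables and the helper name are invented for illustration.

```python
def kappa_2x2(a, b, c, d):
    """Return (p_o, p_e, kappa) for a 2x2 table of counts."""
    n = a + b + c + d
    p_o = (a + d) / n
    # Chance agreement from the row and column totals.
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Both tables: 100 items, 96 agreements, 4 disagreements.
tables = {
    "balanced (50% positive)": (48, 2, 2, 48),
    "skewed (3% positive)": (1, 2, 2, 95),
}

for label, cells in tables.items():
    p_o, p_e, kappa = kappa_2x2(*cells)
    print(f"{label}: Po = {p_o:.2f}, Pe = {p_e:.2f}, kappa = {kappa:.2f}")
# balanced (50% positive): Po = 0.96, Pe = 0.50, kappa = 0.92
# skewed (3% positive): Po = 0.96, Pe = 0.94, kappa = 0.31
```

In the skewed table chance alone already delivers about 94% agreement, so the same four disagreements use up most of the small margin that is left.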

How do I measure agreement with more than two raters or categories?

For more than two categories, Cohen's kappa still works on the square agreement table; use weighted kappa if the categories are ordered, so near-misses count as partial agreement. For more than two raters, switch to Fleiss' kappa, which extends the idea to any fixed number of raters per item. The chance-correction logic is the same in each case, though Fleiss' kappa estimates chance agreement from category proportions pooled across all raters rather than from each rater's own totals; otherwise only the bookkeeping changes.
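
For the multi-rater case, a minimal sketch of Fleiss' kappa might look like the following. It is not part of this calculator: the function name and input layout are assumptions for the example, with counts[i][j] holding how many raters put item i in category j and every item rated by the same number of raters.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a list of per-item category counts.

    counts[i][j] = number of raters who assigned item i to category j.
    Every row must sum to the same number of raters (at least 2).
    """
    if not counts:
        raise ValueError("no items given")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Per-item agreement: share of rater pairs that chose the same category.
    pairs = n_raters * (n_raters - 1)
    p_items = [(sum(x * x for x in row) - n_raters) / pairs for row in counts]
    p_bar = sum(p_items) / n_items

    # Chance agreement from category proportions pooled over all raters.
    total = n_items * n_raters
    n_categories = len(counts[0])
    p_cat = [sum(row[j] for row in counts) / total
             for j in range(n_categories)]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")

    return (p_bar - p_e) / (1 - p_e)


# Three raters, four items, two categories.
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
print(f"unanimous: {fleiss_kappa(unanimous):.2f}")  # unanimous: 1.00
print(f"split:     {fleiss_kappa(split):.2f}")      # split:     -0.33
```

The final step is the same (P − Pₑ) ∕ (1 − Pₑ) rescaling as in the two-rater case; only the way observed and chance agreement are tallied differs.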