Engineering for Annotation in the ML Pipeline, Part 2: Creating Consensus and Quality Control
Introduction
If we rely on annotations to represent the truth, how can we test them?
A standard machine learning workflow assumes that labels are mostly correct, with only limited and measurable noise. In production annotation pipelines, that assumption is often too optimistic.
Real datasets are usually labeled by multiple annotators who may apply guidelines differently. To make the dataset usable for learning, we need two outputs:
- A robust consensus label for each case.
- A reliability estimate for each annotator.
If one of these is already known, estimating the other is straightforward. When neither is known, we get a circular dependency: quality needs ground truth, but ground truth depends on quality.
In this part, we use Expectation-Maximization (EM) to break that loop by refining consensus and annotator quality together.
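To see why knowing one side makes the other easy, here is a minimal sketch with made-up numbers (the accuracies, votes, and labels below are hypothetical, not from the dataset in this article). Given annotator accuracies, a likelihood-weighted vote under a symmetric noise model recovers the label; given true labels, accuracy is just an agreement rate:

```python
import numpy as np

# Hypothetical per-annotator accuracies, assumed known for this sketch
acc = np.array([0.95, 0.90, 0.60])
votes = np.array([1, 1, 0])  # three annotators vote on one binary case

# Direction 1: known reliabilities -> weighted vote for the latent label.
# log P(votes | true = t) under a symmetric noise model.
log_lik = np.array([
    np.sum(np.where(votes == t, np.log(acc), np.log(1 - acc)))
    for t in (0, 1)
])
print("most likely label:", int(np.argmax(log_lik)))  # -> 1

# Direction 2: known true labels -> reliability is just the agreement rate.
y_true = np.array([1, 0, 1, 1])
one_annotator = np.array([1, 0, 0, 1])
print("accuracy:", (one_annotator == y_true).mean())  # -> 0.75
```

EM exists precisely because neither side is given: it alternates between these two easy sub-problems.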
Why This Matters
Producing reliable labels when disagreements are common.
Many annotation tasks do not have perfectly crisp boundaries. They depend on judgment and protocol interpretation. Pneumonia labeling from chest X-rays is a practical example where disagreement appears even in routine cases.
Suppose we want to build a robust dataset for pneumonia classification. We start from a raw annotation matrix like the one below:
| Case ID | Annotator A | Annotator B | Annotator C | Annotator D | Majority Vote | Initial Consensus |
|---|---|---|---|---|---|---|
| 001 | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Strong |
| 002 | No Pneumonia | No Pneumonia | No Pneumonia | No Pneumonia | No Pneumonia | Strong |
| 003 | Pneumonia | Pneumonia | No Pneumonia | Pneumonia | Pneumonia | Moderate |
| 004 | No Pneumonia | Pneumonia | No Pneumonia | No Pneumonia | No Pneumonia | Moderate |
| 005 | Pneumonia | No Pneumonia | Pneumonia | No Pneumonia | Tie | Weak |
| 006 | No Pneumonia | No Pneumonia | Pneumonia | No Pneumonia | No Pneumonia | Moderate |
| 007 | Pneumonia | Pneumonia | Pneumonia | No Pneumonia | Pneumonia | Moderate |
| 008 | No Pneumonia | Pneumonia | No Pneumonia | Pneumonia | Tie | Weak |
| 009 | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Strong |
| 010 | No Pneumonia | No Pneumonia | No Pneumonia | Pneumonia | No Pneumonia | Moderate |
Some rows have clear agreement, while others are borderline and noisy. The goal is to recover both a stable case-level consensus and a useful estimate of annotator quality.
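As a sanity check, the Majority Vote column can be reproduced directly from the annotation matrix. A small sketch, encoding Pneumonia as 1 and No Pneumonia as 0 and reporting ties explicitly:

```python
import numpy as np

# Annotation matrix from the table above: rows = cases 001-010,
# columns = annotators A-D; 1 = Pneumonia, 0 = No Pneumonia
X = np.array([
    [1, 1, 1, 1],  # 001
    [0, 0, 0, 0],  # 002
    [1, 1, 0, 1],  # 003
    [0, 1, 0, 0],  # 004
    [1, 0, 1, 0],  # 005
    [0, 0, 1, 0],  # 006
    [1, 1, 1, 0],  # 007
    [0, 1, 0, 1],  # 008
    [1, 1, 1, 1],  # 009
    [0, 0, 0, 1],  # 010
])

verdicts = []
for row in X:
    ones = row.sum()
    if 2 * ones == row.size:
        verdicts.append("Tie")
    elif 2 * ones > row.size:
        verdicts.append("Pneumonia")
    else:
        verdicts.append("No Pneumonia")

for i, v in enumerate(verdicts, start=1):
    print(f"Case {i:03d}: {v}")
```

Cases 005 and 008 come out as ties, exactly the rows the table marks as Weak consensus.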
How the EM Loop Works
Consensus and quality refine each other.
A practical initialization is majority vote, with a fixed rule for breaking ties (here, ties are assigned to No Pneumonia, which is the rule the tables below reflect). This gives a temporary estimate of the latent case labels. Using that temporary consensus, we compute an initial confusion matrix for each annotator.
| Annotator A | Consensus: Pneumonia | Consensus: No Pneumonia |
|---|---|---|
| A Labeled Pneumonia | 1.00 | 0.17 |
| A Labeled No Pneumonia | 0.00 | 0.83 |

| Annotator B | Consensus: Pneumonia | Consensus: No Pneumonia |
|---|---|---|
| B Labeled Pneumonia | 1.00 | 0.33 |
| B Labeled No Pneumonia | 0.00 | 0.67 |

| Annotator C | Consensus: Pneumonia | Consensus: No Pneumonia |
|---|---|---|
| C Labeled Pneumonia | 0.75 | 0.33 |
| C Labeled No Pneumonia | 0.25 | 0.67 |

| Annotator D | Consensus: Pneumonia | Consensus: No Pneumonia |
|---|---|---|
| D Labeled Pneumonia | 0.75 | 0.33 |
| D Labeled No Pneumonia | 0.25 | 0.67 |
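These confusion matrices can be recomputed in a few lines. A sketch using the same 0/1 encoding as before, with ties in cases 005 and 008 broken toward No Pneumonia (the tie-break that matches the numbers above):

```python
import numpy as np

# Same annotation matrix as before (1 = Pneumonia, 0 = No Pneumonia)
X = np.array([
    [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 1], [0, 1, 0, 0], [1, 0, 1, 0],
    [0, 0, 1, 0], [1, 1, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1],
])

# Bootstrap consensus: majority vote, ties broken toward No Pneumonia (0)
consensus = (2 * X.sum(axis=1) > X.shape[1]).astype(int)

conf = {}
for r, name in enumerate("ABCD"):
    print(f"Annotator {name}")
    for t, t_name in [(1, "Pneumonia"), (0, "No Pneumonia")]:
        # P(annotator labels Pneumonia | consensus is t_name)
        p = (X[consensus == t, r] == 1).mean()
        conf[(name, t)] = p
        print(f"  P(labels Pneumonia | consensus {t_name}) = {p:.2f}")
```

For Annotator A this yields 1.00 against a Pneumonia consensus and 0.17 against a No Pneumonia consensus, matching the first table.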
These estimates are still rough because they depend on an initial majority vote. EM improves both parts iteratively:
- Given current annotator reliabilities, what is the most likely true label for each case?
- Given updated case labels, what does each annotator’s reliability look like now?
One EM iteration:
| Step | What happens |
|---|---|
| 1 | Start from current consensus labels |
| 2 | Treat them as temporary latent-label estimates |
| 3 | Update each annotator’s confusion matrix |
| 4 | Re-score each case using those annotator profiles |
| 5 | Assign refined consensus labels |
| 6 | Repeat until labels and reliability estimates stabilize |
Minimal working example (synthetic data + EM, SciPy)
The same labeling problem, but in a setting where the hidden truth is known.
The labels remain Pneumonia and No Pneumonia. In the synthetic setup, No Pneumonia is implemented as samples drawn from a standard normal distribution, while Pneumonia is implemented as samples drawn from a skew-normal distribution that is close to, but not exactly, normal.
The experiment uses four test-based annotators, four random annotators, and two bad annotators that usually vote against the strongest signal. This creates a noisy pool where majority vote is noticeably biased, but EM can still recover a reliable consensus.
```python
import numpy as np
from scipy import stats

# ------------------------------------------------------------
# Synthetic annotation task
# 0 = No Pneumonia -> sample from a standard normal
# 1 = Pneumonia    -> sample from a close but non-normal distribution
# ------------------------------------------------------------
rng = np.random.default_rng(7)

def generate_patient(label, n=100):
    if label == 0:
        return rng.normal(0, 1, size=n)
    return stats.skewnorm.rvs(a=14, size=n, random_state=rng)

def annotate(sample):
    mu = sample.mean()
    sigma = sample.std(ddof=1) + 1e-12
    z = (sample - mu) / sigma
    # Original test-based annotators
    shapiro = int(stats.shapiro(sample).pvalue < 0.05)
    dagostino = int(stats.normaltest(sample).pvalue < 0.05)
    jarque_bera = int(stats.jarque_bera(sample).pvalue < 0.05)
    ks = int(stats.kstest(z, "norm", method="asymp").pvalue < 0.05)
    labels = [shapiro, dagostino, jarque_bera, ks]
    # Four random annotators
    labels += [int(rng.integers(0, 2)) for _ in range(4)]
    # Two bad annotators: usually flip the strongest signal
    for _ in range(2):
        labels.append(1 - shapiro if rng.random() < 0.8 else int(rng.integers(0, 2)))
    return labels

def build_dataset(n_cases=200):
    y_true = rng.integers(0, 2, size=n_cases)
    X = np.array([annotate(generate_patient(y)) for y in y_true])
    return X, y_true

# ------------------------------------------------------------
# Dawid-Skene style EM
# theta[r, t, l] = P(annotator r outputs label l | true label t)
# ------------------------------------------------------------
def run_em(X, n_classes=2, max_iter=12, min_iter=8, alpha=0.5, tol=1e-6):
    n_items, n_annotators = X.shape
    # Initialize the posterior with a one-hot majority vote
    counts = np.apply_along_axis(lambda row: np.bincount(row, minlength=n_classes), 1, X)
    y_init = np.argmax(counts, axis=1)
    post = np.eye(n_classes)[y_init]
    history = []
    for it in range(max_iter):
        # M-step: class priors and smoothed per-annotator confusion matrices
        pi = post.mean(axis=0)
        theta = np.zeros((n_annotators, n_classes, n_classes))
        for r in range(n_annotators):
            mask0 = (X[:, r] == 0)
            mask1 = ~mask0
            for t in range(n_classes):
                weights = post[:, t]
                denom = weights.sum() + alpha * n_classes
                theta[r, t, 0] = ((weights * mask0).sum() + alpha) / denom
                theta[r, t, 1] = ((weights * mask1).sum() + alpha) / denom
        history.append((post.argmax(axis=1).copy(), theta.copy(), post.copy()))
        # E-step: posterior over the latent label for each case
        log_post = np.zeros((n_items, n_classes))
        for t in range(n_classes):
            log_post[:, t] = np.log(pi[t] + 1e-12)
            for r in range(n_annotators):
                log_post[:, t] += np.log(theta[r, t, X[:, r]] + 1e-12)
        log_post -= log_post.max(axis=1, keepdims=True)
        new_post = np.exp(log_post)
        new_post /= new_post.sum(axis=1, keepdims=True)
        if it + 1 >= min_iter and np.max(np.abs(new_post - post)) < tol:
            post = new_post
            break
        post = new_post
    return post.argmax(axis=1), post, history, theta

annotator_names = [
    "Shapiro", "D'Agostino", "Jarque-Bera", "KS",
    "Random 1", "Random 2", "Random 3", "Random 4",
    "Bad 1", "Bad 2"
]

X, y_true = build_dataset(n_cases=200)
y_em, post, history, theta_final = run_em(X)

# Majority-vote baseline for comparison
counts = np.apply_along_axis(lambda row: np.bincount(row, minlength=2), 1, X)
y_mv = np.argmax(counts, axis=1)
print("Majority-vote accuracy:", (y_mv == y_true).mean())
print("EM consensus accuracy :", (y_em == y_true).mean())

def summarize_labels(y_pred):
    counts = np.bincount(y_pred, minlength=2)
    return counts[0], counts[1]

print("Hidden truth counts:")
n0, n1 = summarize_labels(y_true)
print(f"  No Pneumonia: {n0}")
print(f"  Pneumonia   : {n1}")

print("\nConsensus summary by iteration:")
for idx in [1, 2, 3, 6]:
    y_iter = history[idx - 1][0]
    acc = (y_iter == y_true).mean()
    c0, c1 = summarize_labels(y_iter)
    print(f"  Iteration {idx}: accuracy={acc:.3f}, No Pneumonia={c0}, Pneumonia={c1}")

print("\nFinal consensus:")
c0, c1 = summarize_labels(y_em)
print(f"  accuracy={((y_em == y_true).mean()):.3f}, No Pneumonia={c0}, Pneumonia={c1}")

print("\nAnnotator quality: first vs final iteration")
theta_first = history[0][1]
for i, name in enumerate(annotator_names):
    annot_acc = (X[:, i] == y_true).mean()
    print(f"\n{name}")
    print(f"  Accuracy vs hidden truth = {annot_acc:.3f}")
    print(f"  First iter: P(Pneumonia|Pneumonia) = {theta_first[i,1,1]:.3f}")
    print(f"  First iter: P(No Pneumonia|No Pneumonia) = {theta_first[i,0,0]:.3f}")
    print(f"  Final iter: P(Pneumonia|Pneumonia) = {theta_final[i,1,1]:.3f}")
    print(f"  Final iter: P(No Pneumonia|No Pneumonia) = {theta_final[i,0,0]:.3f}")
```
How to read the output
Majority vote counts annotators. EM estimates annotator quality.
In this run, the bootstrap consensus starts at 0.675. After the first EM update it reaches 0.900, then 0.975, and after additional posterior refinement it reaches 0.990. The gain is large because the annotation pool contains both weak annotators and actively bad ones.
| Stage | Iteration | Consensus accuracy | Predicted No Pneumonia | Predicted Pneumonia |
|---|---|---|---|---|
| Bootstrap majority vote | 1 | 0.675 | 148 | 52 |
| After first EM update | 2 | 0.900 | 115 | 85 |
| After second EM update | 3 | 0.975 | 98 | 102 |
| After posterior refinement | 6 | 0.990 | 95 | 105 |
| Hidden truth | — | — | 95 | 105 |
Majority vote is pulled toward No Pneumonia because the noisy annotators dominate the count. EM reduces that bias quickly, and by iteration 6 the consensus exactly matches the hidden class balance. Across the full run, EM flips 65 case labels relative to the bootstrap consensus.
The annotator profiles explain why the consensus improves. In the first iteration, even random annotators still inherit moderate-looking scores because the bootstrap consensus is noisy. By the final iteration, the strong annotators remain strong, the random annotators collapse toward chance, and the bad annotators become clearly anti-reliable.
| Annotator | Accuracy vs hidden truth | First iter: P(Pneumonia \| Pneumonia) | First iter: P(No Pneumonia \| No Pneumonia) | Final iter: P(Pneumonia \| Pneumonia) | Final iter: P(No Pneumonia \| No Pneumonia) |
|---|---|---|---|---|---|
| Shapiro | 0.990 | 0.877 | 0.587 | 0.994 | 0.972 |
| D'Agostino | 0.950 | 0.915 | 0.614 | 0.959 | 0.954 |
| Jarque-Bera | 0.960 | 0.915 | 0.628 | 0.950 | 0.965 |
| KS | 0.565 | 0.330 | 0.990 | 0.175 | 0.995 |
| Random 1 | 0.495 | 0.689 | 0.581 | 0.487 | 0.507 |
| Random 2 | 0.505 | 0.708 | 0.654 | 0.457 | 0.577 |
| Random 3 | 0.515 | 0.708 | 0.601 | 0.494 | 0.535 |
| Random 4 | 0.500 | 0.632 | 0.567 | 0.485 | 0.515 |
| Bad 1 | 0.085 | 0.217 | 0.440 | 0.081 | 0.101 |
| Bad 2 | 0.110 | 0.292 | 0.419 | 0.130 | 0.082 |
The final pattern is clear. Shapiro, D'Agostino, and Jarque-Bera emerge as highly reliable. KS remains weak and asymmetric. The random annotators settle near chance behavior. Bad 1 and Bad 2 end with extremely low reliability.
The practical point is simple. Majority vote counts annotators, while EM estimates annotator quality. When the pool contains weak or bad annotators, that difference is decisive.