Engineering for Annotation in the ML Pipeline, Part 2: Creating Consensus and Quality Control
Introduction
If we rely on annotations to represent the truth, how can we test them?
A standard machine learning workflow assumes that labels are mostly correct, with only limited and measurable noise. In production annotation pipelines, that assumption is often too optimistic.
Real datasets are usually labeled by multiple annotators who may apply guidelines differently. To make the dataset usable for learning, we need two outputs:
- A robust consensus label for each case.
- A reliability estimate for each annotator.
If one of these is already known, estimating the other is straightforward. When neither is known, we get a circular dependency: quality needs ground truth, but ground truth depends on quality.
In this part, we use Expectation-Maximization (EM) to break that loop by refining consensus and annotator quality together.
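To see why knowing one side makes the other easy, here is a minimal sketch with made-up numbers (the accuracies, votes, and labels below are hypothetical, not from the dataset in this article). Given annotator accuracies, a likelihood-weighted vote under a symmetric noise model recovers the label; given true labels, accuracy is just an agreement rate:

```python
import numpy as np

# Hypothetical per-annotator accuracies, assumed known for this sketch
acc = np.array([0.95, 0.90, 0.60])
votes = np.array([1, 1, 0])  # three annotators vote on one binary case

# Direction 1: known reliabilities -> weighted vote for the latent label.
# log P(votes | true = t) under a symmetric noise model.
log_lik = np.array([
    np.sum(np.where(votes == t, np.log(acc), np.log(1 - acc)))
    for t in (0, 1)
])
print("most likely label:", int(np.argmax(log_lik)))  # -> 1

# Direction 2: known true labels -> reliability is just the agreement rate.
y_true = np.array([1, 0, 1, 1])
one_annotator = np.array([1, 0, 0, 1])
print("accuracy:", (one_annotator == y_true).mean())  # -> 0.75
```

EM exists precisely because neither side is given: it alternates between these two easy sub-problems.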
Why This Matters
Producing reliable labels when disagreements are common.
Many annotation tasks do not have perfectly crisp boundaries. They depend on judgment and protocol interpretation. Pneumonia labeling from chest X-rays is a practical example where disagreement appears even in routine cases.
Suppose we want to build a robust dataset for pneumonia classification. We start from a raw annotation matrix like the one below:
| Case ID | Annotator A | Annotator B | Annotator C | Annotator D | Majority Vote | Initial Consensus |
|---|---|---|---|---|---|---|
| 001 | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Strong |
| 002 | No Pneumonia | No Pneumonia | No Pneumonia | No Pneumonia | No Pneumonia | Strong |
| 003 | Pneumonia | Pneumonia | No Pneumonia | Pneumonia | Pneumonia | Moderate |
| 004 | No Pneumonia | Pneumonia | No Pneumonia | No Pneumonia | No Pneumonia | Moderate |
| 005 | Pneumonia | No Pneumonia | Pneumonia | No Pneumonia | Tie | Weak |
| 006 | No Pneumonia | No Pneumonia | Pneumonia | No Pneumonia | No Pneumonia | Moderate |
| 007 | Pneumonia | Pneumonia | Pneumonia | No Pneumonia | Pneumonia | Moderate |
| 008 | No Pneumonia | Pneumonia | No Pneumonia | Pneumonia | Tie | Weak |
| 009 | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Strong |
| 010 | No Pneumonia | No Pneumonia | No Pneumonia | Pneumonia | No Pneumonia | Moderate |
Some rows have clear agreement, while others are borderline and noisy. The goal is to recover both a stable case-level consensus and a useful estimate of annotator quality.
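As a sanity check, the Majority Vote column can be reproduced directly from the annotation matrix. A small sketch, encoding Pneumonia as 1 and No Pneumonia as 0 and reporting ties explicitly:

```python
import numpy as np

# Annotation matrix from the table above: rows = cases 001-010,
# columns = annotators A-D; 1 = Pneumonia, 0 = No Pneumonia
X = np.array([
    [1, 1, 1, 1],  # 001
    [0, 0, 0, 0],  # 002
    [1, 1, 0, 1],  # 003
    [0, 1, 0, 0],  # 004
    [1, 0, 1, 0],  # 005
    [0, 0, 1, 0],  # 006
    [1, 1, 1, 0],  # 007
    [0, 1, 0, 1],  # 008
    [1, 1, 1, 1],  # 009
    [0, 0, 0, 1],  # 010
])

verdicts = []
for row in X:
    ones = row.sum()
    if 2 * ones == row.size:
        verdicts.append("Tie")
    elif 2 * ones > row.size:
        verdicts.append("Pneumonia")
    else:
        verdicts.append("No Pneumonia")

for i, v in enumerate(verdicts, start=1):
    print(f"Case {i:03d}: {v}")
```

Cases 005 and 008 come out as ties, exactly the rows the table marks as Weak consensus.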
How the EM Loop Works
Consensus and quality refine each other.
A practical initialization is majority vote, with a fixed rule for breaking ties (here, ties are assigned to No Pneumonia, which is the rule the tables below reflect). This gives a temporary estimate of the latent case labels. Using that temporary consensus, we compute an initial confusion matrix for each annotator.
| Annotator A | Consensus: Pneumonia | Consensus: No Pneumonia |
|---|---|---|
| A Labeled Pneumonia | 1.00 | 0.17 |
| A Labeled No Pneumonia | 0.00 | 0.83 |

| Annotator B | Consensus: Pneumonia | Consensus: No Pneumonia |
|---|---|---|
| B Labeled Pneumonia | 1.00 | 0.33 |
| B Labeled No Pneumonia | 0.00 | 0.67 |

| Annotator C | Consensus: Pneumonia | Consensus: No Pneumonia |
|---|---|---|
| C Labeled Pneumonia | 0.75 | 0.33 |
| C Labeled No Pneumonia | 0.25 | 0.67 |

| Annotator D | Consensus: Pneumonia | Consensus: No Pneumonia |
|---|---|---|
| D Labeled Pneumonia | 0.75 | 0.33 |
| D Labeled No Pneumonia | 0.25 | 0.67 |
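These confusion matrices can be recomputed in a few lines. A sketch using the same 0/1 encoding as before, with ties in cases 005 and 008 broken toward No Pneumonia (the tie-break that matches the numbers above):

```python
import numpy as np

# Same annotation matrix as before (1 = Pneumonia, 0 = No Pneumonia)
X = np.array([
    [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 1], [0, 1, 0, 0], [1, 0, 1, 0],
    [0, 0, 1, 0], [1, 1, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1],
])

# Bootstrap consensus: majority vote, ties broken toward No Pneumonia (0)
consensus = (2 * X.sum(axis=1) > X.shape[1]).astype(int)

conf = {}
for r, name in enumerate("ABCD"):
    print(f"Annotator {name}")
    for t, t_name in [(1, "Pneumonia"), (0, "No Pneumonia")]:
        # P(annotator labels Pneumonia | consensus is t_name)
        p = (X[consensus == t, r] == 1).mean()
        conf[(name, t)] = p
        print(f"  P(labels Pneumonia | consensus {t_name}) = {p:.2f}")
```

For Annotator A this yields 1.00 against a Pneumonia consensus and 0.17 against a No Pneumonia consensus, matching the first table.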
These estimates are still rough because they depend on an initial majority vote. EM improves both parts iteratively:
- Given current annotator reliabilities, what is the most likely true label for each case?
- Given updated case labels, what does each annotator’s reliability look like now?
One EM iteration:
| Step | What happens |
|---|---|
| 1 | Start from current consensus labels |
| 2 | Treat them as temporary latent-label estimates |
| 3 | Update each annotator’s confusion matrix |
| 4 | Re-score each case using those annotator profiles |
| 5 | Assign refined consensus labels |
| 6 | Repeat until labels and reliability estimates stabilize |
Minimal working example (synthetic data + EM, SciPy)
The same labeling problem, but in a setting where the hidden truth is known.
The labels remain Pneumonia and No Pneumonia. In the synthetic setup, No Pneumonia is implemented as samples drawn from a standard normal distribution, while Pneumonia is implemented as samples drawn from a skew-normal distribution that is close to, but not exactly, normal.
The experiment uses four test-based annotators, four random annotators, and two bad annotators that usually vote against the strongest signal. This creates a noisy pool where majority vote is noticeably biased, but EM can still recover a reliable consensus.
```python
import numpy as np
from scipy import stats

# ------------------------------------------------------------
# Synthetic annotation task
# 0 = No Pneumonia -> sample from a standard normal
# 1 = Pneumonia    -> sample from a close but non-normal distribution
# ------------------------------------------------------------
rng = np.random.default_rng(7)

def generate_patient(label, n=100):
    if label == 0:
        return rng.normal(0, 1, size=n)
    return stats.skewnorm.rvs(a=14, size=n, random_state=rng)

def annotate(sample):
    mu = sample.mean()
    sigma = sample.std(ddof=1) + 1e-12
    z = (sample - mu) / sigma
    # Original test-based annotators
    shapiro = int(stats.shapiro(sample).pvalue < 0.05)
    dagostino = int(stats.normaltest(sample).pvalue < 0.05)
    jarque_bera = int(stats.jarque_bera(sample).pvalue < 0.05)
    ks = int(stats.kstest(z, "norm", method="asymp").pvalue < 0.05)
    labels = [shapiro, dagostino, jarque_bera, ks]
    # Four random annotators
    labels += [int(rng.integers(0, 2)) for _ in range(4)]
    # Two bad annotators: usually flip the strongest signal
    for _ in range(2):
        labels.append(1 - shapiro if rng.random() < 0.8 else int(rng.integers(0, 2)))
    return labels

def build_dataset(n_cases=200):
    y_true = rng.integers(0, 2, size=n_cases)
    X = np.array([annotate(generate_patient(y)) for y in y_true])
    return X, y_true

# ------------------------------------------------------------
# Dawid-Skene style EM
# theta[r, t, l] = P(annotator r outputs label l | true label t)
# ------------------------------------------------------------
def run_em(X, n_classes=2, max_iter=12, min_iter=8, alpha=0.5, tol=1e-6):
    n_items, n_annotators = X.shape
    # Initialize the posterior with a one-hot majority vote
    counts = np.apply_along_axis(lambda row: np.bincount(row, minlength=n_classes), 1, X)
    y_init = np.argmax(counts, axis=1)
    post = np.eye(n_classes)[y_init]
    history = []
    for it in range(max_iter):
        # M-step: class priors and smoothed per-annotator confusion matrices
        pi = post.mean(axis=0)
        theta = np.zeros((n_annotators, n_classes, n_classes))
        for r in range(n_annotators):
            mask0 = (X[:, r] == 0)
            mask1 = ~mask0
            for t in range(n_classes):
                weights = post[:, t]
                denom = weights.sum() + alpha * n_classes
                theta[r, t, 0] = ((weights * mask0).sum() + alpha) / denom
                theta[r, t, 1] = ((weights * mask1).sum() + alpha) / denom
        history.append((post.argmax(axis=1).copy(), theta.copy(), post.copy()))
        # E-step: posterior over the latent label for each case
        log_post = np.zeros((n_items, n_classes))
        for t in range(n_classes):
            log_post[:, t] = np.log(pi[t] + 1e-12)
            for r in range(n_annotators):
                log_post[:, t] += np.log(theta[r, t, X[:, r]] + 1e-12)
        log_post -= log_post.max(axis=1, keepdims=True)
        new_post = np.exp(log_post)
        new_post /= new_post.sum(axis=1, keepdims=True)
        if it + 1 >= min_iter and np.max(np.abs(new_post - post)) < tol:
            post = new_post
            break
        post = new_post
    return post.argmax(axis=1), post, history, theta

annotator_names = [
    "Shapiro", "D'Agostino", "Jarque-Bera", "KS",
    "Random 1", "Random 2", "Random 3", "Random 4",
    "Bad 1", "Bad 2"
]

X, y_true = build_dataset(n_cases=200)
y_em, post, history, theta_final = run_em(X)

# Majority-vote baseline for comparison
counts = np.apply_along_axis(lambda row: np.bincount(row, minlength=2), 1, X)
y_mv = np.argmax(counts, axis=1)
print("Majority-vote accuracy:", (y_mv == y_true).mean())
print("EM consensus accuracy :", (y_em == y_true).mean())

def summarize_labels(y_pred):
    counts = np.bincount(y_pred, minlength=2)
    return counts[0], counts[1]

print("Hidden truth counts:")
n0, n1 = summarize_labels(y_true)
print(f"  No Pneumonia: {n0}")
print(f"  Pneumonia   : {n1}")

print("\nConsensus summary by iteration:")
for idx in [1, 2, 3, 6]:
    y_iter = history[idx - 1][0]
    acc = (y_iter == y_true).mean()
    c0, c1 = summarize_labels(y_iter)
    print(f"  Iteration {idx}: accuracy={acc:.3f}, No Pneumonia={c0}, Pneumonia={c1}")

print("\nFinal consensus:")
c0, c1 = summarize_labels(y_em)
print(f"  accuracy={((y_em == y_true).mean()):.3f}, No Pneumonia={c0}, Pneumonia={c1}")

print("\nAnnotator quality: first vs final iteration")
theta_first = history[0][1]
for i, name in enumerate(annotator_names):
    annot_acc = (X[:, i] == y_true).mean()
    print(f"\n{name}")
    print(f"  Accuracy vs hidden truth = {annot_acc:.3f}")
    print(f"  First iter: P(Pneumonia|Pneumonia) = {theta_first[i,1,1]:.3f}")
    print(f"  First iter: P(No Pneumonia|No Pneumonia) = {theta_first[i,0,0]:.3f}")
    print(f"  Final iter: P(Pneumonia|Pneumonia) = {theta_final[i,1,1]:.3f}")
    print(f"  Final iter: P(No Pneumonia|No Pneumonia) = {theta_final[i,0,0]:.3f}")
```
How to read the output
Majority vote counts annotators. EM estimates annotator quality.
In this run, the bootstrap consensus starts at 0.675. After the first EM update it reaches 0.900, then 0.975, and after additional posterior refinement it reaches 0.990. The gain is large because the annotation pool contains both weak annotators and actively bad ones.
| Stage | Iteration | Consensus accuracy | Predicted No Pneumonia | Predicted Pneumonia |
|---|---|---|---|---|
| Bootstrap majority vote | 1 | 0.675 | 148 | 52 |
| After first EM update | 2 | 0.900 | 115 | 85 |
| After second EM update | 3 | 0.975 | 98 | 102 |
| After posterior refinement | 6 | 0.990 | 95 | 105 |
| Hidden truth | — | — | 95 | 105 |
Majority vote is pulled toward No Pneumonia because the noisy annotators dominate the count. EM reduces that bias quickly, and by iteration 6 the consensus exactly matches the hidden class balance. Across the full run, EM flips 65 case labels relative to the bootstrap consensus.
The annotator profiles explain why the consensus improves. In the first iteration, even random annotators still inherit moderate-looking scores because the bootstrap consensus is noisy. By the final iteration, the strong annotators remain strong, the random annotators collapse toward chance, and the bad annotators become clearly anti-reliable.
| Annotator | Accuracy vs hidden truth | First iter: P(Pneumonia \| Pneumonia) | First iter: P(No Pneumonia \| No Pneumonia) | Final iter: P(Pneumonia \| Pneumonia) | Final iter: P(No Pneumonia \| No Pneumonia) |
|---|---|---|---|---|---|
| Shapiro | 0.990 | 0.877 | 0.587 | 0.994 | 0.972 |
| D'Agostino | 0.950 | 0.915 | 0.614 | 0.959 | 0.954 |
| Jarque-Bera | 0.960 | 0.915 | 0.628 | 0.950 | 0.965 |
| KS | 0.565 | 0.330 | 0.990 | 0.175 | 0.995 |
| Random 1 | 0.495 | 0.689 | 0.581 | 0.487 | 0.507 |
| Random 2 | 0.505 | 0.708 | 0.654 | 0.457 | 0.577 |
| Random 3 | 0.515 | 0.708 | 0.601 | 0.494 | 0.535 |
| Random 4 | 0.500 | 0.632 | 0.567 | 0.485 | 0.515 |
| Bad 1 | 0.085 | 0.217 | 0.440 | 0.081 | 0.101 |
| Bad 2 | 0.110 | 0.292 | 0.419 | 0.130 | 0.082 |
The final pattern is clear. Shapiro, D'Agostino, and Jarque-Bera emerge as highly reliable. KS remains weak and asymmetric. The random annotators settle near chance behavior. Bad 1 and Bad 2 end with extremely low reliability.
The practical point is simple. Majority vote counts annotators, while EM estimates annotator quality. When the pool contains weak or bad annotators, that difference is decisive.