Engineering for Annotation in the ML Pipeline, Part 2: Creating Consensus and Quality Control

Author: Eytan Slotnik
Date: March 11, 2026

Introduction

If we rely on annotations to represent the truth, how can we test them?

A standard machine learning workflow assumes that labels are mostly correct, with only limited and measurable noise. In production annotation pipelines, that assumption is often too optimistic.

Real datasets are usually labeled by multiple annotators who may apply guidelines differently. To make the dataset usable for learning, we need two outputs:

  1. A robust consensus label for each case.
  2. A reliability estimate for each annotator.

If one of these is already known, estimating the other is straightforward. When neither is known, we get a circular dependency: quality needs ground truth, but ground truth depends on quality.

In this part, we use Expectation-Maximization (EM) to break that loop by refining consensus and annotator quality together.

Why This Matters

Producing reliable labels when disagreements are common.

Many annotation tasks do not have perfectly crisp boundaries. They depend on judgment and protocol interpretation. Pneumonia labeling from chest X-rays is a practical example where disagreement appears even in routine cases.

Suppose we want to build a robust dataset for pneumonia classification. We start from a raw annotation matrix like the one below:

Case ID | Annotator A | Annotator B | Annotator C | Annotator D | Majority Vote | Initial Consensus
001 | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Strong
002 | No Pneumonia | No Pneumonia | No Pneumonia | No Pneumonia | No Pneumonia | Strong
003 | Pneumonia | Pneumonia | No Pneumonia | Pneumonia | Pneumonia | Moderate
004 | No Pneumonia | Pneumonia | No Pneumonia | No Pneumonia | No Pneumonia | Moderate
005 | Pneumonia | No Pneumonia | Pneumonia | No Pneumonia | Tie | Weak
006 | No Pneumonia | No Pneumonia | Pneumonia | No Pneumonia | No Pneumonia | Moderate
007 | Pneumonia | Pneumonia | Pneumonia | No Pneumonia | Pneumonia | Moderate
008 | No Pneumonia | Pneumonia | No Pneumonia | Pneumonia | Tie | Weak
009 | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Pneumonia | Strong
010 | No Pneumonia | No Pneumonia | No Pneumonia | Pneumonia | No Pneumonia | Moderate

Some rows have clear agreement, while others are borderline and noisy. The goal is to recover both a stable case-level consensus and a useful estimate of annotator quality.
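As a minimal sketch of that bootstrap step (assuming labels are encoded as 0 = No Pneumonia and 1 = Pneumonia; the agreement-strength cutoffs are illustrative, mirroring the Strong/Moderate/Weak column above):

```python
import numpy as np

# Raw annotation matrix: one row per case, one column per annotator.
# 0 = No Pneumonia, 1 = Pneumonia (encoding assumed for illustration).
X = np.array([
    [1, 1, 1, 1],  # unanimous Pneumonia
    [0, 0, 0, 0],  # unanimous No Pneumonia
    [1, 1, 0, 1],  # one dissenter
    [1, 0, 1, 0],  # even split
])

votes = X.sum(axis=1)                   # "Pneumonia" votes per case
n = X.shape[1]
majority = (votes * 2 > n).astype(int)  # 1 when Pneumonia wins the vote
tie = (votes * 2 == n)                  # even split: no majority

# Agreement strength: unanimous -> Strong, even split -> Weak, else Moderate
strength = np.where(tie, "Weak",
           np.where((votes == 0) | (votes == n), "Strong", "Moderate"))
```

On the four toy rows this reproduces the pattern in the table: unanimous rows come out Strong, the even split is a Weak tie, and everything else is Moderate.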

How the EM Loop Works

Consensus and quality refine each other.

A practical initialization is majority vote, with a fixed rule for breaking ties. This gives a temporary estimate of the latent case labels. Using that temporary consensus, we compute an initial confusion matrix for each annotator.

Annotator A | Consensus: Pneumonia | Consensus: No Pneumonia
A Labeled Pneumonia | 1.00 | 0.17
A Labeled No Pneumonia | 0.00 | 0.83

Annotator B | Consensus: Pneumonia | Consensus: No Pneumonia
B Labeled Pneumonia | 1.00 | 0.33
B Labeled No Pneumonia | 0.00 | 0.67

Annotator C | Consensus: Pneumonia | Consensus: No Pneumonia
C Labeled Pneumonia | 0.75 | 0.33
C Labeled No Pneumonia | 0.25 | 0.67

Annotator D | Consensus: Pneumonia | Consensus: No Pneumonia
D Labeled Pneumonia | 0.75 | 0.33
D Labeled No Pneumonia | 0.25 | 0.67
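A confusion matrix like the ones above can be estimated by conditioning each annotator's labels on the current consensus. A minimal sketch (toy data, with the same assumed 0/1 encoding; the function name is illustrative):

```python
import numpy as np

def confusion_matrix(annotator_labels, consensus, n_classes=2):
    """M[t, l] = P(annotator outputs label l | consensus label is t)."""
    M = np.zeros((n_classes, n_classes))
    for t in range(n_classes):
        mask = (consensus == t)
        if mask.any():
            for l in range(n_classes):
                M[t, l] = (annotator_labels[mask] == l).mean()
    return M

# Toy data: the annotator agrees with the consensus on 3 of 4 cases,
# with the one error being a false "Pneumonia".
consensus = np.array([1, 0, 1, 0])
annotator = np.array([1, 0, 1, 1])

M = confusion_matrix(annotator, consensus)
# M[1, 1] = P(labeled Pneumonia | consensus Pneumonia)       -> 1.0
# M[0, 0] = P(labeled No Pneumonia | consensus No Pneumonia) -> 0.5
```

Each row of the result is a conditional distribution, so the rows sum to one, which is the same normalization the EM code below maintains (there with additive smoothing).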

These estimates are still rough because they depend on an initial majority vote. EM improves both parts iteratively:

  1. Given current annotator reliabilities, what is the most likely true label for each case?
  2. Given updated case labels, what does each annotator’s reliability look like now?
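In Dawid-Skene terms, question 1 is the E-step: given class priors $\pi$ and per-annotator confusion matrices $\theta$, the posterior over the latent label of case $i$ is

```latex
P(y_i = t \mid \ell_{i1}, \dots, \ell_{iR}) \;\propto\; \pi_t \prod_{r=1}^{R} \theta_r(t, \ell_{ir})
```

where $\ell_{ir}$ is annotator $r$'s label for case $i$ and $\theta_r(t, \ell) = P(\text{annotator } r \text{ outputs } \ell \mid \text{true label } t)$. Question 2 is the M-step, which re-estimates $\pi$ and $\theta$ from those posteriors; this is the same $\theta[r, t, l]$ parameterization used in the code below.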

One EM iteration:

Step | What happens
1 | Start from current consensus labels
2 | Treat them as temporary latent-label estimates
3 | Update each annotator's confusion matrix
4 | Re-score each case using those annotator profiles
5 | Assign refined consensus labels
6 | Repeat until labels and reliability estimates stabilize
[Diagram: the EM loop — use the current case labels as a temporary estimate of the truth; estimate confusion matrices from that consensus; use those profiles to refine the most likely label for each case; repeat until labels and reliability estimates stabilize.]
EM alternates between estimating who is reliable and using that estimate to refine the case-level consensus.

Minimal working example (synthetic data + EM, SciPy)

The same labeling problem, but in a setting where the hidden truth is known.

The labels remain Pneumonia and No Pneumonia. In the synthetic setup, No Pneumonia is implemented as samples drawn from a standard normal distribution, while Pneumonia is implemented as samples drawn from a nearby skew-normal distribution: close enough to normal that individual annotators disagree on borderline cases.

The experiment uses four test-based annotators, four random annotators, and two bad annotators that usually vote against the strongest signal. This creates a noisy pool where majority vote is noticeably biased, but EM can still recover a reliable consensus.

import numpy as np
from scipy import stats

# ------------------------------------------------------------
# Synthetic annotation task
# 0 = No Pneumonia  -> sample from a standard normal
# 1 = Pneumonia     -> sample from a close but non-normal distribution
# ------------------------------------------------------------

rng = np.random.default_rng(7)

def generate_patient(label, n=100):
    if label == 0:
        return rng.normal(0, 1, size=n)
    return stats.skewnorm.rvs(a=14, size=n, random_state=rng)

def annotate(sample):
    mu = sample.mean()
    sigma = sample.std(ddof=1) + 1e-12
    z = (sample - mu) / sigma

    # Original test-based annotators
    shapiro = int(stats.shapiro(sample).pvalue < 0.05)
    dagostino = int(stats.normaltest(sample).pvalue < 0.05)
    jarque_bera = int(stats.jarque_bera(sample).pvalue < 0.05)
    ks = int(stats.kstest(z, "norm", method="asymp").pvalue < 0.05)

    labels = [shapiro, dagostino, jarque_bera, ks]

    # Four random annotators
    labels += [int(rng.integers(0, 2)) for _ in range(4)]

    # Two bad annotators: usually flip the strongest signal
    for _ in range(2):
        labels.append(1 - shapiro if rng.random() < 0.8 else int(rng.integers(0, 2)))

    return labels

def build_dataset(n_cases=200):
    y_true = rng.integers(0, 2, size=n_cases)
    X = np.array([annotate(generate_patient(y)) for y in y_true])
    return X, y_true

# ------------------------------------------------------------
# Dawid-Skene style EM
# theta[r, t, l] = P(annotator r outputs label l | true label t)
# ------------------------------------------------------------

def run_em(X, n_classes=2, max_iter=12, min_iter=8, alpha=0.5, tol=1e-6):
    n_items, n_annotators = X.shape

    counts = np.apply_along_axis(lambda row: np.bincount(row, minlength=n_classes), 1, X)
    y_init = np.argmax(counts, axis=1)
    post = np.eye(n_classes)[y_init]
    history = []

    for it in range(max_iter):
        pi = post.mean(axis=0)
        theta = np.zeros((n_annotators, n_classes, n_classes))

        for r in range(n_annotators):
            mask0 = (X[:, r] == 0)
            mask1 = ~mask0
            for t in range(n_classes):
                weights = post[:, t]
                denom = weights.sum() + alpha * n_classes
                theta[r, t, 0] = ((weights * mask0).sum() + alpha) / denom
                theta[r, t, 1] = ((weights * mask1).sum() + alpha) / denom

        history.append((post.argmax(axis=1).copy(), theta.copy(), post.copy()))

        log_post = np.zeros((n_items, n_classes))
        for t in range(n_classes):
            log_post[:, t] = np.log(pi[t] + 1e-12)
            for r in range(n_annotators):
                log_post[:, t] += np.log(theta[r, t, X[:, r]] + 1e-12)

        log_post -= log_post.max(axis=1, keepdims=True)
        new_post = np.exp(log_post)
        new_post /= new_post.sum(axis=1, keepdims=True)

        if it + 1 >= min_iter and np.max(np.abs(new_post - post)) < tol:
            post = new_post
            break

        post = new_post

    return post.argmax(axis=1), post, history, theta

annotator_names = [
    "Shapiro", "D'Agostino", "Jarque-Bera", "KS",
    "Random 1", "Random 2", "Random 3", "Random 4",
    "Bad 1", "Bad 2"
]

X, y_true = build_dataset(n_cases=200)
y_em, post, history, theta_final = run_em(X)

counts = np.apply_along_axis(lambda row: np.bincount(row, minlength=2), 1, X)
y_mv = np.argmax(counts, axis=1)

print("Majority-vote accuracy:", (y_mv == y_true).mean())
print("EM consensus accuracy :", (y_em == y_true).mean())

def summarize_labels(y_pred):
    counts = np.bincount(y_pred, minlength=2)
    return counts[0], counts[1]

print("Hidden truth counts:")
n0, n1 = summarize_labels(y_true)
print(f"  No Pneumonia: {n0}")
print(f"  Pneumonia   : {n1}")

print("\nConsensus summary by iteration:")
for idx in [1, 2, 3, 6]:
    y_iter = history[idx - 1][0]
    acc = (y_iter == y_true).mean()
    c0, c1 = summarize_labels(y_iter)
    print(f"  Iteration {idx}: accuracy={acc:.3f}, No Pneumonia={c0}, Pneumonia={c1}")

print("\nFinal consensus:")
c0, c1 = summarize_labels(y_em)
print(f"  accuracy={((y_em == y_true).mean()):.3f}, No Pneumonia={c0}, Pneumonia={c1}")

print("\nAnnotator quality: first vs final iteration")
theta_first = history[0][1]
for i, name in enumerate(annotator_names):
    annot_acc = (X[:, i] == y_true).mean()
    print(f"\n{name}")
    print(f"  Accuracy vs hidden truth           = {annot_acc:.3f}")
    print(f"  First iter: P(Pneumonia|Pneumonia) = {theta_first[i,1,1]:.3f}")
    print(f"  First iter: P(No Pneumonia|No Pneumonia) = {theta_first[i,0,0]:.3f}")
    print(f"  Final iter: P(Pneumonia|Pneumonia) = {theta_final[i,1,1]:.3f}")
    print(f"  Final iter: P(No Pneumonia|No Pneumonia) = {theta_final[i,0,0]:.3f}")

How to read the output

Majority vote counts annotators. EM estimates annotator quality.

In this run, the bootstrap consensus starts at 0.675. After the first EM update it reaches 0.900, then 0.975, and after additional posterior refinement it reaches 0.990. The gain is large because the annotation pool contains both weak annotators and actively bad ones.

Stage | Iteration | Consensus accuracy | Predicted No Pneumonia | Predicted Pneumonia
Bootstrap majority vote | 1 | 0.675 | 148 | 52
After first EM update | 2 | 0.900 | 115 | 85
After second EM update | 3 | 0.975 | 98 | 102
After posterior refinement | 6 | 0.990 | 95 | 105
Hidden truth | — | — | 95 | 105

Majority vote is pulled toward No Pneumonia because the noisy annotators dominate the count. EM reduces that bias quickly, and by iteration 6 the consensus exactly matches the hidden class balance. Across the full run, EM flips 65 case labels relative to the bootstrap consensus.

[Chart: consensus accuracy over EM iterations 1–8: 0.675, 0.900, 0.975, 0.975, 0.975, 0.990, 0.990, 0.990]
The bootstrap consensus is weak, but the EM updates quickly separate informative evidence from noisy evidence.

The annotator profiles explain why the consensus improves. In the first iteration, even random annotators still inherit moderate-looking scores because the bootstrap consensus is noisy. By the final iteration, the strong annotators remain strong, the random annotators collapse toward chance, and the bad annotators become clearly anti-reliable.

Annotator | Accuracy vs hidden truth | First iter P(Pneumonia|Pneumonia) | First iter P(No Pneumonia|No Pneumonia) | Final iter P(Pneumonia|Pneumonia) | Final iter P(No Pneumonia|No Pneumonia)
Shapiro | 0.990 | 0.877 | 0.587 | 0.994 | 0.972
D'Agostino | 0.950 | 0.915 | 0.614 | 0.959 | 0.954
Jarque-Bera | 0.960 | 0.915 | 0.628 | 0.950 | 0.965
KS | 0.565 | 0.330 | 0.990 | 0.175 | 0.995
Random 1 | 0.495 | 0.689 | 0.581 | 0.487 | 0.507
Random 2 | 0.505 | 0.708 | 0.654 | 0.457 | 0.577
Random 3 | 0.515 | 0.708 | 0.601 | 0.494 | 0.535
Random 4 | 0.500 | 0.632 | 0.567 | 0.485 | 0.515
Bad 1 | 0.085 | 0.217 | 0.440 | 0.081 | 0.101
Bad 2 | 0.110 | 0.292 | 0.419 | 0.130 | 0.082

The final pattern is clear. Shapiro, D'Agostino, and Jarque-Bera emerge as highly reliable. KS remains weak and asymmetric. The random annotators settle near chance behavior. Bad 1 and Bad 2 end with extremely low reliability.

[Chart: final annotator accuracy against the hidden truth — Shapiro 0.990, D'Agostino 0.950, Jarque-Bera 0.960, KS 0.565, Random 1 0.495, Random 2 0.505, Random 3 0.515, Random 4 0.500, Bad 1 0.085, Bad 2 0.110]
The informative annotators remain strong, the random annotators stay near chance, and the bad annotators are exposed as misleading.

The practical point is simple. Majority vote counts annotators, while EM estimates annotator quality. When the pool contains weak or bad annotators, that difference is decisive.
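To see why that difference is decisive, here is a contrived minimal sketch (all reliability numbers are made up for illustration): three annotators with 95% estimated accuracy vote Pneumonia, four coin-flip annotators vote No Pneumonia. Counting heads picks the wrong label; weighting votes by estimated reliability picks the right one.

```python
import numpy as np

# One case, labels from 7 annotators: three reliable ones say 1 (Pneumonia),
# four random ones say 0 (No Pneumonia).
labels = np.array([1, 1, 1, 0, 0, 0, 0])

# Hypothetical reliability estimates, as symmetric confusion matrices:
# theta[r, t, l] = P(annotator r outputs l | true label t)
acc = np.array([0.95, 0.95, 0.95, 0.5, 0.5, 0.5, 0.5])
theta = np.stack([np.array([[a, 1 - a], [1 - a, a]]) for a in acc])

# Majority vote: just counts annotators -> the four coin-flippers win.
majority = int(labels.sum() * 2 > len(labels))

# Reliability-weighted vote: log-posterior under a uniform class prior.
# Random annotators contribute log(0.5) to both classes, so they cancel out;
# only the informative annotators move the decision.
log_post = np.zeros(2)
for t in range(2):
    log_post[t] = sum(np.log(theta[r, t, labels[r]]) for r in range(len(labels)))
weighted = int(np.argmax(log_post))

print(majority, weighted)  # majority picks 0, the weighted vote picks 1
```

This is exactly the E-step scoring from the EM example above, reduced to a single case with fixed reliabilities.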
