Part 3: Prepare
The Step Where “Model Bugs” Are Usually Born
Once scans are organized and annotations exist, the temptation is to treat “prepare” as a quick preprocessing script. This is where teams get hurt because prepare rarely fails loudly.
It produces training data that is:
- Plausible
- Consistent-looking
- But wrong
Wrong training data is worse than broken training code: broken code crashes and gets fixed, while wrong data quietly trains a model that looks reasonable and sends you debugging the wrong thing for weeks.
What Prepare Actually Does
Prepare takes your organized data and turns it into training artifacts:
- Precomputed crops / patches
- Resized volumes
- Resampled labels
- Split manifests (train/val/test lists; see the sketch after this list)
- Cached augmentations
- Derived targets
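To make one of those artifacts concrete: a split manifest is just a file that pins down which patients belong to which split. The sketch below is one minimal way to produce it, assuming a flat directory of NIfTI files named by patient ID; the file layout, the ratios, and the `prepared` path are illustrative assumptions, not a prescription.

```python
import json
import random
from pathlib import Path

# Hypothetical layout: one image/label pair per patient, e.g. <patient_id>_image.nii.gz
DATA_ROOT = Path("prepared")          # illustrative path
SPLIT_RATIOS = (0.7, 0.15, 0.15)      # train / val / test, an assumed choice

def make_split_manifest(data_root: Path, seed: int = 17) -> dict:
    """Split at the patient level so no patient ends up in two splits."""
    patient_ids = sorted({p.name.split("_")[0] for p in data_root.glob("*_image.nii.gz")})
    rng = random.Random(seed)          # fixed seed -> reproducible splits
    rng.shuffle(patient_ids)

    n = len(patient_ids)
    n_train = int(n * SPLIT_RATIOS[0])
    n_val = int(n * SPLIT_RATIOS[1])
    return {
        "train": patient_ids[:n_train],
        "val": patient_ids[n_train:n_train + n_val],
        "test": patient_ids[n_train + n_val:],
    }

if __name__ == "__main__":
    manifest = make_split_manifest(DATA_ROOT)
    Path("splits.json").write_text(json.dumps(manifest, indent=2))
```

The point of writing the manifest to disk is that every later stage reads the same file instead of re-rolling its own split.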
It also often includes decisions like:
- What spacing to resample to
- How to crop around anatomy
- What augmentations are allowed
- What “empty” masks mean
- How to handle partial labels
It’s production code — and should be treated that way.
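Treating it as production code mostly means writing those decisions down as explicit, reviewable parameters instead of burying them in a notebook. Here is a minimal sketch of what that can look like, assuming volumes arrive as NumPy arrays with known voxel spacing; the target spacing, crop margin, and function names are illustrative, not a prescribed implementation.

```python
from dataclasses import dataclass

import numpy as np
from scipy import ndimage

@dataclass(frozen=True)
class PrepareConfig:
    # Every "silent" decision from the list above lives here, in one reviewable place.
    target_spacing: tuple = (1.0, 1.0, 1.0)   # mm, assumed isotropic target
    crop_margin_vox: int = 16                  # margin around the labeled anatomy
    allow_empty_masks: bool = False            # what an all-zero mask means for this task

def resample_pair(image: np.ndarray, label: np.ndarray,
                  spacing: tuple, cfg: PrepareConfig):
    """Resample image (linear) and label (nearest) with the SAME zoom factors."""
    zoom = [s / t for s, t in zip(spacing, cfg.target_spacing)]
    image_rs = ndimage.zoom(image, zoom, order=1)   # trilinear for intensities
    label_rs = ndimage.zoom(label, zoom, order=0)   # nearest keeps label values intact
    assert image_rs.shape == label_rs.shape, "image/label shape mismatch after resampling"
    return image_rs, label_rs

def crop_around_label(image: np.ndarray, label: np.ndarray, cfg: PrepareConfig):
    """Crop a box around the foreground, padded by a fixed margin."""
    fg = np.argwhere(label > 0)
    if fg.size == 0:
        if cfg.allow_empty_masks:
            return image, label        # decision: keep empty cases whole
        raise ValueError("empty mask where foreground was expected")
    lo = np.maximum(fg.min(axis=0) - cfg.crop_margin_vox, 0)
    hi = np.minimum(fg.max(axis=0) + cfg.crop_margin_vox + 1, label.shape)
    sl = tuple(slice(a, b) for a, b in zip(lo, hi))
    return image[sl], label[sl]
```

Note the different interpolation orders: linear for intensities, nearest for labels. Interpolating labels like intensities is one of the most common ways masks get silently degraded.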
Validation in Prepare: Your Lie Detector
We validate prepare using two pillars:
1) Visual tests (random cases, always)
For random cases:
- Display image + annotation overlay
- Display training crop + annotation overlay
- Display augmentations (rotate/flip/intensity)
- Ensure the anatomy stays inside the crop
- Ensure labels aren’t shifted, degraded, or resampled incorrectly
If you do nothing else: do this.
Visual inspection catches the “silent failure” class of bugs better than anything.
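A minimal version of this is a script that dumps overlay PNGs for a few random cases so a human can flip through them. The sketch below assumes 3D NumPy volumes and uses matplotlib; the middle-slice choice, the flip augmentation, and the tuple layout of `cases` are all illustrative assumptions.

```python
import random
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

def save_overlay(image: np.ndarray, label: np.ndarray, out_path: Path, title: str):
    """Save the middle axial slice with the label contoured on top."""
    z = image.shape[0] // 2
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.imshow(image[z], cmap="gray")
    if label[z].any():
        ax.contour(label[z], levels=[0.5], colors="r", linewidths=1.0)
    ax.set_title(title)
    ax.axis("off")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)

def visual_spot_check(cases: list, out_dir: Path, n: int = 8, seed: int = 0):
    """cases: list of (case_id, image, label, crop_image, crop_label) tuples."""
    out_dir.mkdir(parents=True, exist_ok=True)
    rng = random.Random(seed)
    for case_id, image, label, crop_img, crop_lbl in rng.sample(cases, min(n, len(cases))):
        save_overlay(image, label, out_dir / f"{case_id}_full.png", f"{case_id} full")
        save_overlay(crop_img, crop_lbl, out_dir / f"{case_id}_crop.png", f"{case_id} crop")
        # The same overlay on a flipped copy makes obviously wrong augmentations jump out.
        save_overlay(np.flip(crop_img, axis=-1), np.flip(crop_lbl, axis=-1),
                     out_dir / f"{case_id}_flip.png", f"{case_id} flip")
```

Ten minutes of scrolling through these images will catch shifted masks, cropped-out anatomy, and broken augmentations faster than any metric will.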
2) Dataset unit tests (cheap, powerful)
These are simple checks that save weeks:
- Label coverage (% positive voxels)
- Non-empty masks where expected
- Bounding box distributions
- Image/label shape match
- Spacing consistency between image and label
- Patient IDs don’t overlap across splits
- Intensity histogram comparisons per dataset
- Class balance across splits
These tests don’t need to be fancy; they just need to exist. A dataset pipeline without unit tests is an accident waiting to happen.
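A handful of pytest-style assertions covers most of the list above. The sketch below assumes prepared cases are stored one `.npz` per case with `image`, `label`, and `spacing` arrays, and that the split manifest is a JSON file; those storage details, and the coverage bounds, are placeholders for whatever your pipeline actually produces.

```python
import json
from pathlib import Path

import numpy as np
import pytest

def load_prepared_cases(root: Path = Path("prepared")):
    """Hypothetical storage: one .npz per case with image, label, spacing arrays."""
    for path in sorted(root.glob("*.npz")):
        with np.load(path) as data:
            yield path.stem, data["image"], data["label"], tuple(data["spacing"])

def load_split_manifest(path: str = "splits.json") -> dict:
    return json.loads(Path(path).read_text())

@pytest.fixture(scope="session")
def cases():
    return list(load_prepared_cases())

def test_shapes_and_spacing_match(cases):
    for case_id, image, label, spacing in cases:
        assert image.shape == label.shape, f"{case_id}: image/label shape mismatch"
        assert len(spacing) == image.ndim, f"{case_id}: spacing rank mismatch"

def test_masks_not_empty(cases):
    for case_id, _, label, _ in cases:
        assert label.any(), f"{case_id}: unexpectedly empty mask"

def test_label_coverage_is_sane(cases):
    # Coverage bounds are task-specific; these numbers are placeholders.
    for case_id, _, label, _ in cases:
        coverage = float(np.count_nonzero(label)) / label.size
        assert 1e-5 < coverage < 0.5, f"{case_id}: suspicious foreground fraction {coverage:.4f}"

def test_no_patient_leaks_across_splits():
    manifest = load_split_manifest()
    train, val, test = map(set, (manifest["train"], manifest["val"], manifest["test"]))
    assert not (train & val) and not (train & test) and not (val & test)
```

Because they run against the prepared artifacts, they can run on every change to the prepare code, not just once.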
Next: Part 4 — Validation + Test: Keeping Your Metric Real and Your Test Set Meaningful