Part 2: Organize
Convert Every Dataset Into One Uniform “Source of Truth”
The “organize” step is where you take raw datasets from hospitals, scanners, and protocols, all with different file structures and quirks, and convert them into a single standardized representation.
This representation must be:
- Easy for algorithms to consume
- Annotation-friendly
- Easy to validate with tools
Everything downstream (annotation, training, evaluation, debugging) assumes this format is correct. If “organize” is inconsistent, every later stage becomes fragile.
The Pattern That Scales: One Adapter per Dataset
One of the biggest mistakes teams make is trying to build “one big converter” that handles every dataset. It feels efficient, but it never is.
Every dataset has its own weirdness:
- Series naming differences
- Missing metadata
- Odd slice order
- Multiple reconstructions
- Broken tags
- Inconsistently compressed images
- Artifacts that only appear in one vendor’s scanner
So instead, we use a simple pattern:
Each dataset gets its own adapter
Each adapter converts raw data into a shared Uniform Data Spec, and nothing is allowed downstream unless it conforms to that spec.
This gives you:
- Modularity (new dataset doesn’t break old datasets)
- Traceability (you know which rules applied to which cases)
- Simpler debugging
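As a minimal sketch of this pattern (the names DatasetAdapter, UniformCase, and Hospital001Adapter are hypothetical, not from any particular framework):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator

# Hypothetical shared output record: every adapter must emit this shape and nothing else.
@dataclass
class UniformCase:
    patient_id: str
    series_id: str
    image_path: Path   # converted volume, e.g. image.nii.gz
    metadata: dict     # standardized fields, later written to metadata.json

class DatasetAdapter(ABC):
    """One subclass per raw dataset; every quirk lives inside its own subclass."""

    @abstractmethod
    def cases(self, raw_root: Path) -> Iterator[UniformCase]:
        """Yield one UniformCase per usable case found under raw_root."""

class Hospital001Adapter(DatasetAdapter):
    def cases(self, raw_root: Path) -> Iterator[UniformCase]:
        for study_dir in sorted(p for p in raw_root.iterdir() if p.is_dir()):
            # Dataset-specific series selection, tag fixes, and conversion go here.
            yield UniformCase(
                patient_id=study_dir.name,
                series_id="series_001",
                image_path=study_dir / "image.nii.gz",
                metadata={"source_dataset": "hospital_001"},
            )
```

Adding a new dataset means adding a new subclass; everything downstream only ever sees UniformCase objects.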
The Uniform Data Spec (What You Want as Output)
The exact format depends on your toolchain, but a good uniform output usually contains:
- Standardized image data (e.g. NIfTI/H5/NPZ)
- Standardized metadata (JSON/YAML)
- Stable file naming
- Stable identifiers (patient/study/series mapping)
- Consistent coordinate conventions
- Compatibility with annotations
Example structure:
/hospital_001/patient_001/series_name/
    image.nii.gz
    metadata.json
    ids.json
- This becomes the data everyone trusts
- This is what annotators label
- This is what training consumes
- This is what QA tools validate
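To make the spec concrete, here is one possible shape for metadata.json and ids.json, written as a small Python sketch. The exact field names are an assumption; what matters is that every adapter writes the same fields in the same units.

```python
import json
from pathlib import Path

# Illustrative field set for metadata.json / ids.json; adapt the keys to your
# own toolchain and modalities, but keep them identical across all datasets.
metadata = {
    "spacing_mm": [0.7, 0.7, 1.0],   # (x, y, z) voxel spacing in millimetres
    "shape": [512, 512, 320],        # array shape after conversion
    "orientation": "RAS",            # the coordinate convention everything downstream assumes
    "modality": "CT",
    "intensity_units": "HU",
}
ids = {
    "patient_id": "patient_001",
    "study_uid": "<original DICOM StudyInstanceUID>",
    "series_uid": "<original DICOM SeriesInstanceUID>",
    "source_dataset": "hospital_001",
}

case_dir = Path("hospital_001/patient_001/series_name")
case_dir.mkdir(parents=True, exist_ok=True)
(case_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
(case_dir / "ids.json").write_text(json.dumps(ids, indent=2))
```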
Welcome to the DICOM Trap
If your input starts as DICOM (and it usually does), then “organize” is where the pain begins.
And here’s the important part:
Many DICOM failures don’t crash your pipeline. They quietly corrupt your dataset.
The three most common disasters:
1) Series Selection Mistakes
One study may contain, for example, multiple reconstructions, different kernels, or partial acquisitions.
Your pipeline may pick “the first series in the folder” and you won’t notice until later.
So series selection must be explicit, repeatable, and audited. Start with rejection rules: scans that are too small, implausible spacing, keywords in the series name (e.g. *mip*, *oblique*, etc.). These rules should remove most series, and what is left should be examined visually on a large enough sample of cases. That review is what drives the selection policy: is picking randomly from what is left good enough? Is manual picking needed for every case? Can more automatic rejection rules be added? A minimal sketch of such rejection rules follows.
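Here the keywords and thresholds are illustrative and need tuning for your own protocols:

```python
import re
from typing import Optional

# Illustrative rejection keywords; extend per project and vendor.
REJECT_KEYWORDS = re.compile(r"mip|oblique|scout|localizer|dose", re.IGNORECASE)

def reject_series(description: str, num_slices: int, slice_spacing_mm: float) -> Optional[str]:
    """Return a rejection reason, or None if the series survives to visual review."""
    if REJECT_KEYWORDS.search(description or ""):
        return f"keyword in series description: {description!r}"
    if num_slices < 30:
        return f"too few slices: {num_slices}"
    if not 0.3 <= slice_spacing_mm <= 5.0:
        return f"implausible slice spacing: {slice_spacing_mm} mm"
    return None  # survived automatic rejection; goes to visual review
```

Logging the returned reason for every rejected series is what makes the selection auditable later.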
2) Orientation Flips and Axis Confusion
Everything looks fine… until you overlay a mask and realize:
- Left-right is mirrored
- Superior-inferior is swapped
- Slice order is inverted
This gets especially nasty when mixing DICOM-to-NIfTI converters and annotation tools that each interpret orientation slightly differently.
The fix isn’t clever code. The fix is validation.
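As a sketch of what that validation can look like, assuming NIfTI outputs and the nibabel library (RAS is only an example convention; the point is to check every case against whatever your spec declares):

```python
import nibabel as nib

def check_orientation(nifti_path: str, expected=("R", "A", "S")) -> None:
    """Fail loudly if a converted volume does not match the declared axis convention."""
    img = nib.load(nifti_path)
    axcodes = nib.aff2axcodes(img.affine)  # e.g. ('R', 'A', 'S') or ('L', 'P', 'S')
    if axcodes != expected:
        raise ValueError(f"{nifti_path}: orientation {axcodes}, expected {expected}")
```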
3) Spacing Inconsistencies
Sometimes DICOM spacing tags:
- Disagree with slice positions
- Are missing
- Are wrong (yes, wrong)
Your model might still train, but measurements and derived features become meaningless.
And in medical AI, meaningless measurements = meaningless model.
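A hedged sketch of one such consistency check, assuming pydicom and an axial series: derive the spacing from ImagePositionPatient and compare it against what the tags claim.

```python
import numpy as np
import pydicom

def check_slice_spacing(dicom_files: list, tolerance_mm: float = 0.1) -> float:
    """Return spacing derived from slice positions; warn if tags disagree with geometry."""
    slices = [pydicom.dcmread(f, stop_before_pixels=True) for f in dicom_files]
    slices.sort(key=lambda ds: float(ds.ImagePositionPatient[2]))
    z = np.array([float(ds.ImagePositionPatient[2]) for ds in slices])
    gaps = np.diff(z)
    derived = float(np.median(gaps))
    if np.ptp(gaps) > tolerance_mm:
        print(f"WARNING: uneven slice gaps (min {gaps.min():.3f}, max {gaps.max():.3f} mm)")
    tagged = float(getattr(slices[0], "SpacingBetweenSlices", derived))
    if abs(tagged - derived) > tolerance_mm:
        print(f"WARNING: tag says {tagged:.3f} mm, geometry says {derived:.3f} mm")
    return derived
```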
How to Validate Organize (Without Going Insane)
You validate it in layers.
Layer 1: Automated sanity checks
Per-case:
- Correct shape/dimensions
- Spacing is positive and plausible
- Orientation matches standard
- Intensity range is plausible
- No duplicate slices
- Required metadata exists
Per-dataset:
- Distributions of spacing, shapes, intensity stats
- Fraction of cases with missing tags
- Corrupted/skipped case count
- Duplicates by hash
- ID collisions
This catches “hard failures.”
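A minimal per-case checker along these lines, assuming NIfTI outputs plus the metadata.json convention sketched earlier (thresholds and field names are illustrative):

```python
import json
from pathlib import Path

import nibabel as nib
import numpy as np

def sanity_check_case(case_dir: Path) -> list:
    """Return a list of human-readable problems; an empty list means the case passes."""
    problems = []
    img = nib.load(case_dir / "image.nii.gz")
    data = np.asanyarray(img.dataobj)
    meta = json.loads((case_dir / "metadata.json").read_text())

    if data.ndim != 3:
        problems.append(f"expected 3D volume, got shape {data.shape}")
    spacing = img.header.get_zooms()[:3]
    if any(s <= 0 or s > 10 for s in spacing):
        problems.append(f"implausible spacing {spacing}")
    if nib.aff2axcodes(img.affine) != ("R", "A", "S"):
        problems.append(f"orientation {nib.aff2axcodes(img.affine)} is not RAS")
    if meta.get("modality") == "CT" and not (-2000 < float(data.min()) <= float(data.max()) < 10000):
        problems.append(f"implausible CT intensity range [{data.min()}, {data.max()}]")
    # Bit-identical adjacent slices are almost always a conversion bug, not anatomy.
    if any(np.array_equal(data[..., i], data[..., i + 1]) for i in range(data.shape[-1] - 1)):
        problems.append("duplicate adjacent slices")
    for key in ("spacing_mm", "orientation", "modality"):
        if key not in meta:
            problems.append(f"missing metadata field {key!r}")
    return problems
```

Per-dataset checks then become aggregations of these per-case results: distributions of spacing and shape, counts of failures by reason, duplicates by content hash.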
Layer 2: Visual inspection workflow (random sampling)
Automation doesn’t catch everything.
So you do the human thing:
- Sample random cases
- Open in viewer
- Scroll
- Verify anatomy direction
- Compare against metadata
This is how you catch “it looks plausible but is wrong.”
Trust the automation – but verify the visuals.
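One way to drive that review, assuming the directory layout from earlier: sample cases at random and export a middle slice per case as a PNG, so a human can scroll through previews quickly (matplotlib is used here only to write the images).

```python
import random
from pathlib import Path

import matplotlib.pyplot as plt
import nibabel as nib
import numpy as np

def export_review_previews(organized_root: Path, out_dir: Path,
                           n_cases: int = 20, seed: int = 0) -> None:
    """Write one mid-slice PNG per randomly sampled case for quick human review."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cases = sorted(organized_root.glob("*/*/*/image.nii.gz"))  # hospital/patient/series/image
    random.Random(seed).shuffle(cases)
    for image_path in cases[:n_cases]:
        data = np.asanyarray(nib.load(image_path).dataobj)
        mid = data[..., data.shape[-1] // 2]
        name = "_".join(image_path.parts[-4:-1]) + ".png"  # hospital_patient_series.png
        plt.imsave(out_dir / name, mid.T, cmap="gray", origin="lower")
```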
Layer 3: Traceability is validation
Every organized output should record:
- Where it came from
- What rules were applied
- What was skipped and why
A dataset pipeline without traceability is like a lab notebook without dates.
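A minimal sketch of such a trace record, written next to each organized case; the field names are an assumption, and the git call assumes the pipeline runs from a checkout of its own repository.

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def write_trace(case_dir: Path, source_path: str, rules_applied: list, skipped: dict) -> None:
    """Record where a case came from, which rules touched it, and what was dropped."""
    trace = {
        "source_path": source_path,        # raw DICOM folder this case came from
        "rules_applied": rules_applied,    # e.g. ["reject:mip", "resample:1mm"]
        "skipped_series": skipped,         # {series_uid: reason} for everything dropped
        "converted_at": datetime.now(timezone.utc).isoformat(),
        # Assumes the pipeline code lives in a git repository; empty string otherwise.
        "pipeline_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
    }
    (case_dir / "trace.json").write_text(json.dumps(trace, indent=2))
```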
Next: Part 3 — Prepare: Turning Organized Data into Training Artifacts Without Creating Silent Bugs





