From DICOM Chaos to Training-Ready Data: Our Dataset Pipeline for Medical AI – Part 2

Part 2: Organize

Convert Every Dataset Into One Uniform “Source of Truth”

The “organize” step is where you take raw datasets from hospitals, scanners, and protocols, all with different file structures and quirks, and convert them into a single standardized representation.

This representation must be:

  • Easy for algorithms to consume
  • Annotation-friendly
  • Easy to validate with tools

Everything downstream (annotation, training, evaluation, debugging) assumes this format is correct. If “organize” is inconsistent, every later stage becomes fragile.

The Pattern That Scales: One Adapter per Dataset

One of the biggest mistakes teams make is trying to build “one big converter” that handles every dataset. It feels efficient, but it never is.

Every dataset has its own weirdness:

  • Series naming differences
  • Missing metadata
  • Odd slice order
  • Multiple reconstructions 
  • Broken tags
  • Inconsistently compressed images
  • Artifacts that only appear in one vendor’s scanner

So instead, we use a simple pattern:

Each dataset gets its own adapter

Each adapter converts raw data into a shared Uniform Data Spec, and nothing is allowed downstream unless it conforms to that spec.

This gives you:

  • Modularity (a new dataset doesn’t break existing ones)
  • Traceability (you know which rules applied to which cases)
  • Simpler debugging
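The adapter pattern above can be sketched in a few lines. This is a minimal illustration, not the actual RSIP Vision implementation: `UniformCase`, `DatasetAdapter`, and `HospitalAAdapter` are hypothetical names, and the fields are a deliberately small subset of a real uniform spec.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class UniformCase:
    """One case in the shared Uniform Data Spec (hypothetical, minimal fields)."""
    patient_id: str
    series_id: str
    image_path: str
    spacing: tuple  # (x, y, z) in mm

class DatasetAdapter(ABC):
    """One adapter per dataset: converts raw records into UniformCase objects."""
    @abstractmethod
    def convert(self, raw_case: dict) -> UniformCase: ...

class HospitalAAdapter(DatasetAdapter):
    """Hypothetical adapter: all of Hospital A's quirks live here, and only here."""
    def convert(self, raw_case: dict) -> UniformCase:
        return UniformCase(
            # this dataset pads patient IDs with whitespace -- normalize once, here
            patient_id=raw_case["PatientID"].strip().lower(),
            series_id=raw_case["SeriesInstanceUID"],
            image_path=raw_case["path"],
            # in-plane spacing plus slice thickness, assembled into one 3-tuple
            spacing=tuple(raw_case["PixelSpacing"]) + (raw_case["SliceThickness"],),
        )
```

Because nothing downstream sees the raw record, a new hospital means a new adapter class and nothing else changes.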

The Uniform Data Spec (What You Want as Output)

The exact format depends on your toolchain, but a good uniform output usually contains:

  • Standardized image data (e.g. NIfTI/H5/NPZ)
  • Standardized metadata (JSON/YAML)
  • Stable file naming
  • Stable identifiers (patient/study/series mapping)
  • Consistent coordinate conventions
  • Compatibility with annotations

Example structure:

/hospital_001/patient_001/series_name/
    image.nii.gz
    metadata.json
    ids.json

  • This becomes the data everyone trusts
  • This is what annotators label
  • This is what training consumes
  • This is what QA tools validate

Welcome to the DICOM Trap

If your input starts as DICOM (and it usually does), then “organize” is where the pain begins.

And here’s the important part:

Many DICOM failures don’t crash your pipeline. They quietly corrupt your dataset.

The three most common disasters:

1) Series Selection Mistakes

A single study may contain multiple reconstructions, different kernels, or partial acquisitions.

Your pipeline may pick “the first series in the folder” and you won’t notice until later.

So series selection must be explicit, repeatable, and audited. Start with rejection rules: scans that are too small, implausible spacing, keywords in the series name (e.g. *mip*, *oblique*). These rules should remove most series; what remains should be inspected visually on a large enough sample of cases. That inspection informs the selection policy: is randomly picking from the survivors good enough? Is manual picking needed for every case? Can more automatic rejection rules be added?
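Those rejection rules are simple to encode. A sketch, with made-up thresholds and field names; the real values depend on your protocol:

```python
import fnmatch

# Hypothetical thresholds -- tune per protocol, and record the values used.
REJECT_NAME_PATTERNS = ["*mip*", "*oblique*", "*scout*"]
MIN_SLICES = 20
MAX_SLICE_SPACING_MM = 3.0

def reject_reason(series: dict):
    """Return why a series is rejected, or None if it survives the filters."""
    name = series["description"].lower()
    if any(fnmatch.fnmatch(name, p) for p in REJECT_NAME_PATTERNS):
        return "name pattern"
    if series["num_slices"] < MIN_SLICES:
        return "too few slices"
    if series["slice_spacing_mm"] > MAX_SLICE_SPACING_MM:
        return "spacing too coarse"
    return None  # survivor: goes on to visual review
```

Returning the *reason* rather than a boolean matters: it is what makes the selection auditable later.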

2) Orientation Flips and Axis Confusion

Everything looks fine… until you overlay a mask and realize:

  • Left-right is mirrored
  • Superior-inferior is swapped
  • Slice order is inverted

This gets especially nasty when mixing DICOM-to-NIfTI converters and annotation tools that each interpret orientation slightly differently.

The fix isn’t clever code. The fix is validation.
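One concrete validation is to derive anatomical axis codes from the image’s direction matrix and assert they match your chosen standard. The sketch below mirrors the convention of nibabel’s `aff2axcodes`, but is written from scratch for illustration; in practice you would likely use nibabel directly:

```python
def axis_codes(direction):
    """Derive anatomical axis codes, e.g. ('R', 'A', 'S'), from the 3x3
    rotation part of the image-to-world transform (nested lists).

    Each image axis is labeled by the world axis it mostly points along:
    world x is L/R, y is P/A, z is I/S.
    """
    labels = (("L", "R"), ("P", "A"), ("I", "S"))
    codes = []
    for col in range(3):
        column = [direction[row][col] for row in range(3)]
        dominant = max(range(3), key=lambda r: abs(column[r]))
        # positive component -> second label (R/A/S), negative -> first (L/P/I)
        codes.append(labels[dominant][column[dominant] > 0])
    return tuple(codes)

def check_orientation(direction, expected=("R", "A", "S")):
    """Flag any case whose axes deviate from the dataset-wide standard."""
    return axis_codes(direction) == expected
```

Run this per case; any mismatch is either reoriented explicitly or rejected, never silently passed through.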

3) Spacing Inconsistencies

Sometimes DICOM spacing tags:

  • Disagree with slice positions
  • Are missing
  • Are wrong (yes, wrong)

Your model might still train, but measurements and derived features become meaningless.

And in medical AI, meaningless measurements = meaningless model.
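A cheap defense is to cross-check the tagged spacing against the spacing implied by the slice positions themselves. A minimal sketch, assuming you have already extracted the through-plane coordinates of consecutive slices; the tolerance is an illustrative value:

```python
def spacing_mismatch(slice_positions, tagged_spacing, tol_mm=0.1):
    """Compare spacing implied by slice positions with the DICOM tag value.

    slice_positions: through-plane coordinates of consecutive slices, in mm
    (e.g. projected from ImagePositionPatient).
    Returns (measured_spacing, is_mismatch).
    """
    gaps = [b - a for a, b in zip(slice_positions, slice_positions[1:])]
    measured = sum(gaps) / len(gaps)
    return measured, abs(measured - tagged_spacing) > tol_mm
```

When the two disagree, trust neither: flag the case for inspection and record which value you ultimately used.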

How to Validate Organize (Without Going Insane)

You validate it in layers.

Layer 1: Automated sanity checks

Per-case:

  • Correct shape/dimensions
  • Spacing is positive and plausible
  • Orientation matches standard
  • Intensity range is plausible
  • No duplicate slices
  • Required metadata exists

Per-dataset:

  • Distributions of spacing, shapes, intensity stats
  • Fraction of missing tags
  • Corrupted/skipped case count
  • Duplicates by hash
  • ID collisions

This catches “hard failures.”
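The per-case checks above translate almost one-to-one into code. A sketch, assuming the fields have already been extracted into a dict; the field names and the CT-style intensity bounds are illustrative assumptions:

```python
def sanity_check_case(case: dict) -> list:
    """Run per-case sanity checks; return a list of failures (empty = pass)."""
    failures = []
    if len(case["shape"]) != 3:
        failures.append("expected a 3D volume")
    # spacing must be positive and within a plausible range (values in mm)
    if any(s <= 0 or s > 10.0 for s in case["spacing_mm"]):
        failures.append("spacing not positive/plausible")
    if case["orientation"] != "RAS":  # assuming RAS as the dataset standard
        failures.append("orientation not canonical")
    lo, hi = case["intensity_range"]
    if not (-2000 <= lo <= hi <= 10000):  # plausible bounds for CT in HU
        failures.append("implausible intensity range")
    # fewer unique per-slice hashes than slices means duplicated slices
    if case["num_unique_slice_hashes"] < case["shape"][2]:
        failures.append("duplicate slices detected")
    return failures
```

Every failure string ends up in the dataset-level report, which is where the per-dataset distributions come from.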

Layer 2: Visual inspection workflow (random sampling)

Automation doesn’t catch everything.

So you do the human thing:

  • Sample random cases
  • Open in viewer
  • Scroll
  • Verify anatomy direction
  • Compare against metadata

This is how you catch “it looks plausible but is wrong.”

Trust the automation – but verify the visuals.
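The sampling itself should be reproducible, so that two reviewers (or the same reviewer next month) can pull up the exact same cases. A small sketch using a fixed seed:

```python
import random

def sample_for_review(case_ids, k=20, seed=0):
    """Draw a reproducible random sample of cases for manual viewing.

    Sorting first, then sampling with a fixed seed, makes the review set
    independent of filesystem ordering and repeatable across machines.
    """
    rng = random.Random(seed)
    ids = sorted(case_ids)
    return rng.sample(ids, min(k, len(ids)))
```

Log the seed and sample alongside the review notes; “which cases did a human actually look at” is part of the dataset’s audit trail.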

Layer 3: Traceability is validation

Every organized output should record:

  • Where it came from
  • What rules were applied
  • What was skipped and why

A dataset pipeline without traceability is like a lab notebook without dates.
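In practice this means writing a small provenance sidecar next to every organized case. A sketch of what such a record could contain; the field names are illustrative, not a fixed schema:

```python
import hashlib
import json
import time

def provenance_record(source_path, adapter_name, rules_applied,
                      skipped_reason=None):
    """Build the traceability sidecar written next to each organized case."""
    record = {
        "source_path": source_path,        # where it came from
        "adapter": adapter_name,           # which rules were applied
        "rules_applied": rules_applied,    # e.g. rejection rules that fired
        "skipped_reason": skipped_reason,  # what was skipped and why
        "organized_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    # hash over the canonical JSON so tampering or drift is detectable
    payload = json.dumps(record, sort_keys=True).encode()
    record["record_hash"] = hashlib.sha256(payload).hexdigest()
    return record
```

With these sidecars in place, “why is this case in the training set?” always has a concrete, machine-readable answer.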

Next: Part 3 — Prepare: Turning Organized Data into Training Artifacts Without Creating Silent Bugs
