Itai Weiss
Continuous Integration (CI) is the practice of automatically testing your code every time you make a change. In traditional software this works well because the same input always produces the same output. AI systems break that assumption: adding more training data, without touching a single line of code, can cause your tests to fail. After months of dealing with random test failures, most engineers quietly adopt the same survival tactic: they ignore CI.
That’s dangerous: a CI pipeline nobody trusts is worse than having no CI at all. This post presents a practical framework for testing an AI codebase in a way that is stable, fast, and relevant.
What Are We Testing?
Your AI codebase isn’t just a model—it’s an entire system. It includes data readers, preprocessors, training logic, inference wrappers, post-processing rules, and production APIs. This ecosystem needs software-grade testing, even if the model itself isn’t deterministic.
The 3-Layer Test Framework
1. Unit Tests
These are your anchors—the most boring, stable, predictable tests in your system. They cover things like shape checks, tokenizers, normalization operations, custom loss functions, and metric implementations. Unit tests should never depend on data randomness, model weights, or training. They run on every commit and should rarely need updates.
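Here is a minimal sketch of what such anchors might look like with pytest. The `normalize_image` and `dice_score` imports are hypothetical stand-ins for your own preprocessing and metric utilities; the structure is what matters.

```python
import numpy as np
import pytest

# Hypothetical project utilities -- substitute your own implementations.
from mypackage.preprocessing import normalize_image
from mypackage.metrics import dice_score


def test_normalize_image_preserves_shape_and_range():
    img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
    out = normalize_image(img)
    assert out.shape == img.shape                  # shape check
    assert out.dtype == np.float32                 # type check
    assert 0.0 <= out.min() and out.max() <= 1.0   # values scaled into [0, 1]


def test_dice_score_on_known_inputs():
    pred = np.array([[1, 1], [0, 0]])
    target = np.array([[1, 0], [0, 0]])
    # Dice = 2 * |A ∩ B| / (|A| + |B|) = 2 * 1 / (2 + 1)
    assert dice_score(pred, target) == pytest.approx(2 / 3)
```

Note that nothing here loads weights, trains, or touches real data: every assertion is fully deterministic, which is exactly why these tests can run on every commit without flaking.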
2. Smoke Tests
Smoke tests answer one question: does the entire pipeline still run without crashing? This includes loading a tiny dataset, running a couple of training steps, performing inference, and returning a sane output. We’re not validating accuracy—we’re validating survival. If your smoke test fails, someone broke a core flow. These catch most “oops I refactored preprocessing” bugs. They’re cheap enough to run on every merge request.
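A sketch of such a smoke test, assuming hypothetical `load_dataset`, `build_model`, `train`, and `predict` entry points for your pipeline; the only goal is to exercise every stage end to end on a tiny fixture, never to judge quality.

```python
import numpy as np

# Hypothetical pipeline entry points -- replace with your own.
from mypackage.data import load_dataset
from mypackage.training import build_model, train
from mypackage.inference import predict


def test_pipeline_smoke(tmp_path):
    # Tiny fixture dataset checked into the repo: a handful of samples.
    ds = load_dataset("tests/fixtures/tiny_dataset", limit=8)

    # A couple of training steps -- we only care that nothing crashes.
    model = build_model(small=True)
    train(model, ds, steps=2, output_dir=tmp_path)

    # Inference must return one output per sample, with no NaNs or infs.
    preds = predict(model, ds)
    assert len(preds) == len(ds)
    assert np.isfinite(np.asarray(preds, dtype=float)).all()
```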
3. Golden Set Tests
This is the key layer. The idea: a tiny, fixed, hand-curated dataset combined with behavioral invariants that must always hold. This dataset doesn’t change—it’s your ground truth sanity anchor.
The invariants come from two sources. First, universal AI rules: loss should decrease when training on a small subset, the model should be able to overfit tiny data, and outputs must be sane (no NaNs, predictions aren’t constant). Second, domain-specific rules: in segmentation, golden images have objects of known size, so predictions must preserve those size constraints. In classification, the model must beat random chance.
These invariants survive metric changes, post-processing changes, and even architecture changes—because they reflect the physics of your domain, not the specifics of a model.
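To make this concrete, here is a hedged sketch of two golden-set tests for a classification setting. The `load_golden_set`, `build_model`, `train_and_record_loss`, and `predict_classes` helpers, along with `NUM_CLASSES`, are illustrative assumptions, not part of any particular library.

```python
import numpy as np

# Hypothetical helpers -- adapt to your own training loop and golden-set loader.
from mypackage.data import load_golden_set
from mypackage.training import build_model, train_and_record_loss
from mypackage.inference import predict_classes

NUM_CLASSES = 10  # assumed for illustration


def test_loss_decreases_on_golden_set():
    ds = load_golden_set()
    model = build_model(small=True)
    losses = train_and_record_loss(model, ds, steps=50)
    assert np.isfinite(losses).all()   # no NaNs during training
    assert losses[-1] < losses[0]      # universal rule: loss goes down


def test_model_overfits_golden_set_and_beats_chance():
    ds = load_golden_set()
    model = build_model(small=True)
    train_and_record_loss(model, ds, steps=200)  # enough to overfit tiny data
    preds = predict_classes(model, ds)

    # Outputs must be sane: not one constant prediction for every sample.
    assert len(set(preds)) > 1

    # Domain rule for classification: accuracy above random chance.
    accuracy = np.mean(np.asarray(preds) == np.asarray(ds.labels))
    assert accuracy > 1.0 / NUM_CLASSES
```

The thresholds are deliberately loose (better than chance, loss lower than at step zero), which is what keeps these tests stable across retraining runs and architecture swaps.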
Conclusion
With this framework, CI becomes meaningful again. Unit tests catch code bugs, smoke tests catch integration breaks, and golden tests catch AI-specific regressions—all without the false alarms that made you stop trusting CI in the first place.
In Part 2, we’ll show how to implement this using GitLab.