Continuous Integration for AI Projects

Itai Weiss

Continuous Integration (CI) is the practice of automatically testing your code every time you make a change. In traditional software this works well, because the same input always produces the same output. AI systems break that assumption: even adding more training data, without touching a line of code, can make your tests fail. After months of dealing with random test failures, most engineers quietly adopt the same survival tactic: they ignore CI.

That’s dangerous. A CI system that engineers no longer trust is worse than no CI at all. This post presents a practical framework for testing an AI codebase in a way that is stable, fast, and relevant.

What Are We Testing?

Your AI codebase isn’t just a model—it’s an entire system. It includes data readers, preprocessors, training logic, inference wrappers, post-processing rules, and production APIs. This ecosystem needs software-grade testing, even if the model itself isn’t deterministic.

The 3-Layer Test Framework

1. Unit Tests

These are your anchors—the most boring, stable, predictable tests in your system. They cover things like shape checks, tokenizers, normalization operations, custom loss functions, and metric implementations. Unit tests should never depend on data randomness, model weights, or training. They run on every commit and should rarely need updates.
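
To make this concrete, here is a sketch of two such tests in pytest style. The normalize_image and dice_score helpers are hypothetical stand-ins for your own preprocessing and metric code:

```python
# test_units.py -- deterministic unit tests: no data files, no weights, no training.
# normalize_image and dice_score are hypothetical stand-ins for your own helpers.
import numpy as np
import pytest

def normalize_image(img: np.ndarray) -> np.ndarray:
    """Scale an image to zero mean and unit variance."""
    return (img - img.mean()) / (img.std() + 1e-8)

def dice_score(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice overlap between two binary masks."""
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum() + 1e-8)

def test_normalize_shape_and_stats():
    img = np.random.default_rng(0).uniform(0, 255, size=(64, 64))
    out = normalize_image(img)
    assert out.shape == img.shape       # shape is preserved
    assert abs(out.mean()) < 1e-6       # zero mean
    assert abs(out.std() - 1.0) < 1e-3  # unit variance

def test_dice_bounds():
    mask = np.ones((8, 8), dtype=bool)
    assert dice_score(mask, mask) == pytest.approx(1.0)   # perfect overlap
    assert dice_score(mask, ~mask) == pytest.approx(0.0)  # disjoint masks
```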

2. Smoke Tests

Smoke tests answer one question: does the entire pipeline still run without crashing? This includes loading a tiny dataset, running a couple of training steps, performing inference, and returning a sane output. We’re not validating accuracy—we’re validating survival. If your smoke test fails, someone broke a core flow. These catch most “oops I refactored preprocessing” bugs. They’re cheap enough to run on every merge request.
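
A minimal sketch of such a test follows, using a toy PyTorch model in place of your real pipeline. The point is exercising the train-then-infer path end to end, not the model:

```python
# test_smoke.py -- does the whole pipeline still run? A toy model stands in
# for the real one; swap in your own data loading, training, and inference.
import torch
import torch.nn as nn

def test_pipeline_survives():
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    # Tiny synthetic dataset: 4 samples, 16 features each.
    x = torch.randn(4, 16)
    y = torch.tensor([0, 1, 0, 1])

    # A couple of training steps: we are validating survival, not accuracy.
    for _ in range(2):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

    # The inference path runs and returns a sane output.
    with torch.no_grad():
        out = model(x)
    assert out.shape == (4, 2)
    assert torch.isfinite(out).all()
```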

3. Golden Set Tests

This is the key layer. The idea: a tiny, fixed, hand-curated dataset combined with behavioral invariants that must always hold. This dataset doesn’t change—it’s your ground truth sanity anchor.

The invariants come from two sources. First, universal AI rules: loss should decrease when training on a small subset, the model should be able to overfit tiny data, and outputs must be sane (no NaNs, predictions aren’t constant). Second, domain-specific rules: in segmentation, golden images have objects of known size, so predictions must preserve those size constraints. In classification, the model must beat random chance.

These invariants survive metric changes, post-processing changes, and even architecture changes—because they reflect the physics of your domain, not the specifics of a model.
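
Here is a sketch of what a golden-set test can look like. The fixed tensors and the toy model are hypothetical stand-ins for your curated data and real training code; the assertions encode the invariants described above:

```python
# test_golden.py -- behavioral invariants over a fixed golden set. The data
# and model below are toy stand-ins; the assertions are the part that matters.
import torch
import torch.nn as nn

torch.manual_seed(0)
GOLDEN_X = torch.randn(8, 16)   # stands in for the hand-curated golden inputs
GOLDEN_Y = torch.arange(8) % 2  # stands in for the fixed golden labels

def train(model, steps=300, lr=0.05):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(GOLDEN_X), GOLDEN_Y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

def test_golden_invariants():
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    losses = train(model)

    # Universal rules.
    assert losses[-1] < losses[0]      # loss decreases on a small subset
    out = model(GOLDEN_X)
    assert torch.isfinite(out).all()   # outputs are sane: no NaNs or Infs
    preds = out.argmax(dim=1)
    assert preds.unique().numel() > 1  # predictions aren't constant

    # Domain rule (classification): beat random chance on the golden set.
    accuracy = (preds == GOLDEN_Y).float().mean().item()
    assert accuracy > 0.5
```

Because the assertions encode behavior rather than exact outputs, a test like this keeps passing when you swap optimizers, change post-processing, or retrain on more data.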

Conclusion

With this framework, CI becomes meaningful again. Unit tests catch code bugs, smoke tests catch integration breaks, and golden tests catch AI-specific regressions—all without the false alarms that made you stop trusting CI in the first place.

In Part 2, we’ll show how to implement this using GitLab.
