We learned about EndoVis at MICCAI 2019 and in a Computer Vision News medical imaging review with Stefanie Speidel. Led by Stefanie herself together with Lena Maier-Hein and Danail Stoyanov, EndoVis has organized a computer vision challenge for Computer-Assisted Interventions (CAI) to evaluate the current state of the art, gather researchers in the field and provide high-quality data for validating endoscopic vision algorithms. We asked Daniel Tomer, AI team leader at RSIP Vision, to comment on the EndoVis challenge. Daniel has extensive practical experience in algorithm development for endoscopy, having worked with many clients and developed practical applications in this field.

by Daniel Tomer

EndoVis comprises three AI sub-challenges. Solving these problems will help dramatically improve the efficiency and accuracy of such surgeries. The three sub-challenges are:
  • Robust medical instrument segmentation
  • Depth estimation from stereo camera pair
  • Surgical workflow and skill analysis

You can read about each of these three sub-challenges below.

AI solutions for Endoscopy

Robust medical instrument segmentation


One of the difficulties in the development of robot-assisted intervention systems for laparoscopic surgeries is detecting and segmenting the surgical instruments present in the frame. In this challenge, teams try to develop robust and generalizable algorithmic models for solving that task.

The challenge is divided into three tasks:

  • Binary segmentation – Classifying each pixel as either background or tool.
  • Detection – Producing a unique bounding box for each tool present in the frame.
  • Instance segmentation – Similar to binary segmentation, but with a different label for each tool in the frame.

Technical Challenges

Tool detection and segmentation models face two main challenges: 1) making them robust to poor-quality frames (due to smoke, motion blur, blood, etc.); 2) creating a model that produces accurate predictions on data collected from a type of surgery and a setting different from those used to train it. These challenges apply to almost all AI computer vision tasks, but in most cases each domain has its own unique solution.
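One common way to address the first challenge is to train on synthetically degraded copies of the frames, so the model sees blur and similar artifacts during training. The sketch below simulates horizontal motion blur with a simple averaging kernel; it is an illustrative augmentation choice, not the challenge teams' actual recipe.

```python
import numpy as np

def motion_blur(frame, length=5):
    """Simulate horizontal motion blur by averaging each pixel with its
    `length` horizontal neighbours. Training on such degraded copies is
    one common way to make a segmentation model robust to blurry frames.

    frame: 2-D grayscale image as a numpy array.
    """
    kernel = np.ones(length) / length
    # Convolve each row independently, keeping the original width.
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1,
        frame.astype(np.float32))
```

Analogous transforms (overexposure, simulated smoke) can be stacked the same way to cover the other degradations mentioned above.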


All state-of-the-art methods for tool segmentation involve deep neural networks. The best result for off-line (non-real-time) prediction achieves an average Dice score of ~95%. It does so with a U-Net model combined with attention blocks at the decoding stage, which allow the model to learn which areas of the image it can ignore. Transfer learning (pre-training on ImageNet) and a custom loss function (a combination of cross-entropy and Dice loss) are also used to achieve this result.
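The combined loss mentioned above can be sketched as follows. The exact weighting used by the winning entry is not public, so equal weights between the two terms are assumed here.

```python
import numpy as np

def dice_ce_loss(pred, target, eps=1e-7):
    """Combined Dice + binary cross-entropy loss on probability maps.

    pred, target: float arrays in [0, 1] of identical shape (H, W).
    Equal weighting of the two terms is an assumption for illustration.
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    # Pixel-wise binary cross-entropy.
    ce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    # Soft Dice coefficient: overlap between prediction and target.
    intersection = np.sum(pred * target)
    dice = (2 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
    return ce + (1.0 - dice)
```

The Dice term directly rewards overlap with the tool mask (useful when tools occupy few pixels), while the cross-entropy term keeps per-pixel gradients well behaved.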


Depth estimation from stereo camera pair


Depth estimation is an important component of many endoscopic navigation and augmented reality guidance systems. Using a stereo camera pair (two cameras capturing the scene from two different angles), one can in principle obtain a depth image, i.e., the depth coordinate of each pixel in the frame.
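The underlying principle is stereo triangulation: once the horizontal offset (disparity) of a point between the two views is known, depth follows from the focal length and the camera baseline. A minimal sketch, with illustrative numbers rather than challenge values:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_mm):
    """Pinhole-stereo triangulation: depth = f * B / d.

    disparity_px: horizontal pixel offset of a point between the two views.
    focal_px:     focal length in pixels.
    baseline_mm:  distance between the two camera centers.
    Returns depth in mm.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_mm / disparity_px
```

The hard part, and the subject of this sub-challenge, is estimating the disparity reliably for every pixel.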

Technical Challenges

Classical methods, which were developed to work on natural image data, do not perform well on endoscopic data because of the directional light, non-planar surfaces and subtle texture present in such data, which do not characterize natural images.


To overcome these challenges, several approaches can be taken. The more direct approach is to pre-process the data so that it more closely resembles natural images. This can be done with contrast enhancement methods, to make the texture less subtle, and advanced filtering methods to remove directional light. After the pre-processing, the frames are fed into one of the classical depth estimation methods. This approach yields good results, but it is not the state of the art.
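As a toy stand-in for the contrast enhancement step, the sketch below applies a simple global contrast stretch; real pipelines typically use local methods such as CLAHE, but the idea of amplifying subtle texture before classical stereo matching is the same.

```python
import numpy as np

def stretch_contrast(frame):
    """Global contrast stretch: map the frame's intensity range to [0, 255].

    frame: uint8 grayscale image as a numpy array.
    This is a didactic stand-in for the (unspecified) enhancement methods
    used by the challenge teams.
    """
    lo, hi = frame.min(), frame.max()
    if hi == lo:
        return np.zeros_like(frame)
    out = (frame.astype(np.float32) - lo) * 255.0 / (hi - lo)
    return out.astype(np.uint8)
```

The enhanced frames would then be passed to a classical stereo matcher (e.g. semi-global block matching) to produce the disparity map.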

To achieve state-of-the-art results, researchers use an end-to-end deep learning approach. With this method, a deep convolutional neural network is built with an architecture specifically designed to analyze a stereo pair (PSMNet, for example). The model is trained on simulated data, such that when it is later fed two images, it accurately predicts their corresponding depth map.


In this sub-challenge, the submitted algorithmic models were evaluated on a test set consisting of frames taken from an endoscopic camera with associated ground-truth depth (obtained using a structured light pattern). The evaluation metric was the per-pixel mean squared error between the ground-truth depth image and the predicted one.
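The evaluation metric is straightforward to compute; a sketch, with an optional validity mask since structured-light ground truth typically leaves some pixels unmeasured (the mask handling is an assumption, not stated in the challenge description):

```python
import numpy as np

def depth_mse(pred_mm, gt_mm, mask=None):
    """Per-pixel mean squared error between predicted and ground-truth
    depth maps (both in mm), optionally restricted to valid pixels.

    mask: optional boolean array marking pixels with usable ground truth.
    """
    diff = pred_mm - gt_mm
    if mask is not None:
        diff = diff[mask]
    return float(np.mean(diff ** 2))
```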

First place (Trevor Zeffiro – RediMinds Inc., USA) reached an average error of ~3 mm using the deep learning approach. Second place (Jean-Claude Rosenthal – Fraunhofer Heinrich Hertz Institute, Germany) reached an average error of ~3.2 mm using the pre-processing approach.


Surgical Workflow Analysis and Skill Assessment      


Analyzing the surgical workflow is an essential part of many applications in Computer-Assisted Surgery (CAS), such as predicting which tool the surgeon is most likely to need next, providing navigation information, or estimating the remaining duration of surgery.

This challenge focuses on real-time workflow analysis of laparoscopic surgeries. Participants were challenged to develop models that are able to temporally segment the surgical videos into surgical phases, to recognize surgical actions and instrument presence, and to classify surgical skill based on video data.

Technical Challenges

As in all endoscopy-related tasks, the main challenges arise from the fact that endoscopic data is very different from natural images. Together with the fact that endoscopic datasets are usually small compared to natural image datasets, this poses a difficulty in producing robust models.

A challenge more specific to this task is the importance of the temporal dimension: a single frame is not enough, so video analysis is required.

Many applications require the algorithm to run in real time in order to assist the surgeon during the procedure. Furthermore, many frames will be of low quality due to occasional motion blur and overexposure; a robust model must handle such cases without degrading performance.


All current state-of-the-art methods use deep learning models for this challenge. The general approach is to use a convolutional neural network (CNN), such as ResNet-50 or VGG16, to extract features from each frame. The features are then used in different ways for the different tasks at hand. For tool detection, the features are fed into an additional CNN that outputs the location and type of tool. For action and phase recognition, the features are used as input to a Long Short-Term Memory (LSTM) model, the most commonly used recurrent neural network. The LSTM receives as input (besides the features of the current frame) its output for the previous frame, which allows it to exploit the temporal dimension of the video.
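The recurrence that carries information from frame to frame can be made concrete with a single LSTM step over a frame-feature vector. This is a didactic numpy sketch of the standard LSTM cell equations; real systems use framework LSTM layers, and the weight shapes here are an illustrative convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step over a frame-feature vector x.

    W, U, b stack the input, forget, cell and output gate parameters
    (shapes (4H, D), (4H, H) and (4H,), where D is the feature size
    and H the hidden size). h_prev, c_prev carry state from the
    previous frame; returning (h, c) threads it to the next one.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:H])           # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    g = np.tanh(z[2 * H:3 * H])  # candidate cell state
    o = sigmoid(z[3 * H:])       # output gate
    c = f * c_prev + i * g       # blend old memory with new evidence
    h = o * np.tanh(c)           # exposed hidden state (per-frame output)
    return h, c
```

Running this step over the CNN features of consecutive frames is what lets the model accumulate temporal context for phase and action recognition.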

All of the highly ranked teams also used models that perform more than one task simultaneously. The rationale is that when two tasks are related, they can share a feature extractor, enabling it to be trained on more data.

Specifically for the phase segmentation task, teams also exploited prior knowledge about the possible order of phases (certain phases cannot occur before others) to boost their models' performance. One team fed the elapsed time of each frame directly into the LSTM, giving it the ability to learn this prior by itself. Another team used a hidden Markov model, with hard-coded transition probabilities between phases, to impose the prior directly.
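A minimal version of the hard-coded prior, in the spirit of the Markov-model approach described above, masks the per-frame phase probabilities with an allowed-transition matrix (the matrix and re-normalization scheme here are hypothetical, for illustration only):

```python
import numpy as np

def apply_phase_prior(probs, prev_phase, allowed):
    """Impose a known phase ordering on per-frame phase probabilities.

    probs:      model's softmax output over phases for the current frame.
    prev_phase: index of the phase predicted at the previous frame.
    allowed:    boolean matrix; allowed[i, j] is True if phase j may
                follow phase i (hypothetical hard-coded prior).
    """
    masked = probs * allowed[prev_phase]
    s = masked.sum()
    # Fall back to the raw probabilities if the mask removes everything.
    return masked / s if s > 0 else probs
```

This zeroes out impossible transitions, so the model cannot jump to a phase that is known to come earlier in the workflow.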


In this challenge, the submitted algorithmic models were evaluated on a test set of nine annotated laparoscopic surgery videos. The videos were annotated by one or more experts, depending on the difficulty of the task.

Models were evaluated using the F1 metric. For the tool presence detection task, the best model achieved an average F1 of 64%; for phase segmentation, the best model achieved an average F1 of 65%.
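For reference, F1 is the harmonic mean of precision and recall, computed from true-positive, false-positive and false-negative counts:

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Averaging this score per class or per video (the exact averaging scheme used by the challenge is not detailed here) yields the ~64% and ~65% figures quoted above.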

Do you have a project in AI for endoscopy? Talk to the experts first!
