Endoscopy has revolutionized the field of surgery by allowing many procedures to be performed in a minimally invasive manner. However, the success of the surgery depends heavily on the endoscopist’s skill and experience, as tighter spatial constraints on both the camera and the surgical instruments mean that precise manual control is required.
AI and computer vision have the potential to aid the surgeon and improve the quality of the procedure. Some examples of their use are in surgical workflow analysis and surgical instrument detection. These applications bring many benefits and lay the groundwork for what will one day be the holy grail of computer-aided surgery – fully robotic surgery.
With surgical workflow analysis, a computer is able to scan a video of a surgery, either offline after it has already been performed or online during the surgery itself, and automatically identify what stage the surgery is at. This is an important safety guard for the surgeon. If the surgery is at a specific stage and the surgeon tries to use a tool which is not suitable at that point, the computer can alert them to that fact.
The computer can also suggest to the surgeon what the next step should be. It could suggest they use a specific tool or display pertinent information at any given time. For example, if the operation is to remove a growth and the computer identifies that the growth is visible, then it can display that information together with any other relevant information for the surgeon.
Another interesting application of workflow analysis is for assessing the remaining time of surgery. For example, in a hospital setting where there are many operating rooms and patients awaiting surgery, it is useful to have an estimate of how much more time remains for each procedure. This is something that the computer can do based on the analysis of the video being streamed.
Offline video analysis can be useful for building a database of video segments, for training and for quality assessment of the surgeon.
Initial attempts at surgical workflow analysis were based on visual cues: it is known that at a certain point in a procedure a specific instrument will be used, or a particular structure will come into view. These cues were programmed into an algorithm so that it could assign each frame to its corresponding stage.
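A rule-based cue system of this kind can be sketched very simply. The cue names, stage names, and the idea of a separate cue detector per instrument below are all hypothetical, for illustration only; in a real system each detector would be a hand-crafted image-analysis routine tuned to one procedure.

```python
# Hypothetical mapping from a detected visual cue to a surgical stage.
# Both the cue names and the stage names are made up for this sketch.
CUE_TO_STAGE = {
    "grasper_visible": "preparation",
    "clip_applier_visible": "clipping",
    "scissors_visible": "cutting",
}

def classify_frame(detected_cues):
    """Assign a frame to a stage based on the first recognized cue.

    `detected_cues` stands in for the output of hand-crafted detectors
    run on the frame; frames with no known cue stay unclassified.
    """
    for cue in detected_cues:
        if cue in CUE_TO_STAGE:
            return CUE_TO_STAGE[cue]
    return "unknown"

print(classify_frame(["clip_applier_visible"]))  # clipping
print(classify_frame([]))                        # unknown
```

The brittleness is visible in the table itself: every procedure needs its own hand-written cue list, which is exactly the limitation the next paragraph describes.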
This process had to be specifically tailored to each procedure. However, it is much better to have a more generalized approach that can be applied to different types of procedure. That is why deep learning is the way forward.
Work on deep learning was originally frame-based: the computer looked at an individual frame without the context of the full video. Convolutional neural networks (CNNs) were used to analyze the frame and identify which stage it belonged to. This had limited success, because each frame is part of a sequence; if you are able to analyze the sequence as a whole, you have a better chance of knowing which stage you are looking at.
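The weakness of per-frame prediction, and a common partial remedy, can be shown without any network at all. In the sketch below the per-frame labels are made up, standing in for the output of a frame-based CNN; a majority vote over a small temporal window suppresses isolated misclassifications, which is a crude way of borrowing the sequence context the per-frame model lacks.

```python
from collections import Counter

def smooth_predictions(frame_labels, window=3):
    """Majority-vote each frame's label over a sliding temporal window.

    Isolated single-frame errors, typical of a CNN that sees frames
    in isolation, are outvoted by their temporal neighbors.
    """
    smoothed = []
    for i in range(len(frame_labels)):
        lo = max(0, i - window // 2)
        hi = min(len(frame_labels), i + window // 2 + 1)
        votes = Counter(frame_labels[lo:hi])
        smoothed.append(votes.most_common(1)[0][0])
    return smoothed

# Hypothetical noisy per-frame output: one frame mislabeled mid-phase.
raw = ["prep", "prep", "cut", "prep", "prep"]
print(smooth_predictions(raw))  # ['prep', 'prep', 'prep', 'prep', 'prep']
```

Smoothing only patches the symptom; the approaches below instead model the temporal structure directly.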
Another approach is to compare pairs of frames. A siamese neural network can compare two frames and decide which one comes first, and this pairwise judgment can be used to place the frames in the correct temporal order.
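Once a pairwise before/after judgment is available, recovering a global ordering is just a sort with that judgment as the comparator. In this sketch the comparator is a stand-in for a trained siamese network; each "frame" is a plain dict with a hidden true timestamp, purely for illustration.

```python
from functools import cmp_to_key

def comes_before(frame_a, frame_b):
    """Stand-in for a siamese network's pairwise decision.

    Returns -1 if frame_a precedes frame_b, 1 otherwise. A real model
    would compare the two images' learned embeddings instead of
    reading a ground-truth timestamp.
    """
    return -1 if frame_a["t"] < frame_b["t"] else 1

shuffled = [{"id": "c", "t": 2}, {"id": "a", "t": 0}, {"id": "b", "t": 1}]
ordered = sorted(shuffled, key=cmp_to_key(comes_before))
print([f["id"] for f in ordered])  # ['a', 'b', 'c']
```

In practice the network's pairwise answers are noisy, so the sort yields an approximate rather than exact ordering, but the pipeline shape is the same.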
A more complex approach is to use recurrent neural networks, or more specifically, Long Short-Term Memory (LSTM) networks, which analyze a temporal sequence of frames and look for cues across the whole sequence. One of the more advanced works uses a 3D CNN to analyze the video as a whole and find temporal correlations; an LSTM network then makes the final decision about which stage the video is at.
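To make the LSTM's temporal modeling concrete, here is a single LSTM cell written out in plain Python. This is a deliberate simplification: the weights are arbitrary scalars shared across all gates, and the inputs are toy numbers, whereas a real workflow model would use a deep-learning framework and feed the cell feature vectors from a CNN.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w=0.5, u=0.5, b=0.0):
    """One LSTM update with shared scalar weights (a simplification).

    The input gate i, forget gate f, and output gate o control how the
    cell state c accumulates evidence over the sequence and how much
    of it is exposed in the hidden state h.
    """
    i = sigmoid(w * x + u * h + b)    # input gate
    f = sigmoid(w * x + u * h + b)    # forget gate
    o = sigmoid(w * x + u * h + b)    # output gate
    g = math.tanh(w * x + u * h + b)  # candidate cell state
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c

h, c = 0.0, 0.0
for x in [0.1, 0.4, 0.9]:  # toy stand-in for a sequence of frame features
    h, c = lstm_step(x, h, c)
print(h)
```

The key point is that the cell state persists between steps, so the stage decision at the last frame is informed by everything seen before it.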
“LSTMs are tricky to train and use,” comments Daniel Tomer, Algorithm Team Leader at RSIP Vision. “The recurrent nature of the model makes it harder to visualize and, as opposed to feed-forward CNNs, there is more than one way to use the model for inference. You can feed the model batches of frames and get a prediction for all of them at once, or choose to use the prediction of only the last frame of each batch; you can also feed frames one at a time and continuously update the inner state of the model throughout the entire process. From our experience at RSIP Vision, we found that there is no better or worse method. Each task has its unique challenges that can be solved by a different inference procedure.”
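The difference between the batch and streaming inference modes described in the quote can be shown with a toy recurrent model: a single tanh unit with arbitrary weights standing in for the LSTM. Feeding frames in independent batches resets the hidden state at each batch boundary, while streaming one frame at a time carries the state across the whole video, so the two modes can end up with different final states even on identical input.

```python
import math

def step(x, h, w=1.0, u=0.9):
    """One recurrent update: mixes the current frame feature x with
    the previous hidden state h (toy stand-in for an LSTM cell)."""
    return math.tanh(w * x + u * h)

frames = [0.2, 0.2, 0.2, 0.2]  # toy frame features

# Batch inference: the hidden state is reset at every batch of two frames.
batch_states = []
for start in range(0, len(frames), 2):
    h = 0.0
    for x in frames[start:start + 2]:
        h = step(x, h)
    batch_states.append(h)

# Streaming inference: one frame at a time, state carried throughout.
h_stream = 0.0
for x in frames:
    h_stream = step(x, h_stream)

# The final states differ: streaming retains context from earlier frames.
print(batch_states[-1], h_stream)
```

Neither result is wrong in itself; as the quote notes, which mode suits a task depends on whether long-range context or bounded latency matters more.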