Computer Vision News - June 2021
Surgical Robotics Research

Multimodal and Self-Supervised Representation Learning for Automatic Gesture Recognition in Surgical Robotics

Every month, Computer Vision News selects a research paper to review. This month we review “Multimodal and Self-Supervised Representation Learning for Automatic Gesture Recognition in Surgical Robotics”. We are indebted to the authors (Aniruddha Tamhane, Jie Ying Wu, Mathias Unberath) for allowing us to use their images to illustrate this review. You can find their paper at this link.

by Marica Muffoletto

Today we’ll examine a very recent work by a trio of researchers from Johns Hopkins University in the USA who are looking at the exciting world of surgical robotics. Their paper focuses on developing a method to distinguish the steps of the surgical process. They worked on a self-supervised, multimodal representation learning algorithm, trained on a combination of videos of surgeries and kinematics data. The model learns task-agnostic surgical gesture representations from both sources that generalise well across multiple tasks (a toy sketch of this setup appears at the end of this review).

The idea of this work comes from a well-posed analysis of the state-of-the-art methods, with their drawbacks and strengths. The majority of the state-of-the-art methods considered analyse surgical data through task-specific, supervised learning from a single modality. Because they are trained on specific tasks, they learn only a narrow view of surgical processes. The choice of a supervised method, of course, enforces dependence on expert annotations, which can be tedious to obtain and defeats the purpose of having an automatic tool. Finally, most of them ignore multiple modalities of information, which prevents them from learning generalizable, feature-rich representations.

From these weaknesses, the main objectives of this paper are defined: the development of a DL architecture which is self-supervised and effectively learns gesture representations, the quantitative demonstration of state-of-the-art accuracy in gesture/skill recognition, and the visualisation of these learnt representations and the formation of semantically meaningful clusters.
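To make the general idea concrete, here is a minimal sketch (not the authors’ code or architecture) of self-supervised multimodal representation learning: per-modality encoders project video features and kinematics into a shared latent space, and the training signal comes from reconstructing each modality rather than from expert labels. All layer sizes, dimensions, and names below are illustrative assumptions.

import torch
import torch.nn as nn

class MultimodalGestureEncoder(nn.Module):
    def __init__(self, video_dim=512, kin_dim=76, latent_dim=128):
        super().__init__()
        # Per-modality encoders map video features and kinematics
        # (e.g. tool poses/velocities) into one shared latent space.
        self.video_enc = nn.Sequential(nn.Linear(video_dim, 256), nn.ReLU(),
                                       nn.Linear(256, latent_dim))
        self.kin_enc = nn.Sequential(nn.Linear(kin_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Decoders reconstruct each modality from the fused representation;
        # the reconstruction error is the self-supervised training signal.
        self.video_dec = nn.Linear(latent_dim, video_dim)
        self.kin_dec = nn.Linear(latent_dim, kin_dim)

    def forward(self, video_feat, kinematics):
        # Simple average fusion of the two modality embeddings.
        z = 0.5 * (self.video_enc(video_feat) + self.kin_enc(kinematics))
        return z, self.video_dec(z), self.kin_dec(z)

# Toy training step on random tensors standing in for per-frame video
# features and kinematics streams from the surgical robot.
model = MultimodalGestureEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
video_feat = torch.randn(32, 512)   # e.g. CNN features of video frames
kinematics = torch.randn(32, 76)    # e.g. joint/tool motion channels
z, video_rec, kin_rec = model(video_feat, kinematics)
loss = nn.functional.mse_loss(video_rec, video_feat) + \
       nn.functional.mse_loss(kin_rec, kinematics)
loss.backward()
opt.step()
# The learnt latent z can then be clustered or probed for gesture/skill
# recognition without task-specific labels having been used in training.

Because no gesture labels enter the loss, the representation stays task-agnostic; downstream recognition tasks only need a lightweight classifier or clustering step on top of the latent space.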