Computer Vision News - December 2020

Best of MICCAI 2020 – Paper Presentation
Self-supervised Contrastive Video-Speech Representation Learning for Ultrasound
Jianbo Jiao

Jianbo Jiao is a postdoctoral researcher at the University of Oxford, advised by Professor Alison Noble and Professor Andrew Zisserman. His work concerns self-supervised representation learning with multimodal ultrasound data, and he spoke to us ahead of his oral presentation at MICCAI. The work focuses on both ultrasound video data and speech audio data from the sonographer, and the method was validated on a large-scale clinical ultrasound dataset called PULSE (short for Perception Ultrasound by Learning Sonographic Experience).

Currently, almost all research areas use deep learning tools, and most of them rely on human annotations to train their models. However, with medical images, which require specific expertise, these human annotations are not always easy, or even feasible, to acquire. This motivated Jianbo and his team to tackle self-supervised learning: learning meaningful representations, or knowledge, from the data itself, without any manual annotations. The challenge is to define a self-supervision signal that supervises the model so it can learn useful representations.

The work starts from a basic model that builds correlations between the video and the speech audio, but proposes new techniques to address the specific challenges of medical images. Instead of simply using positive and negative pairs for training, it proposes hard-positive and hard-negative pairs to force the model to “learn harder” and learn stronger representations.

“For natural image video and its corresponding audio, there are very dense correlations,” Jianbo explains. “For example, with someone playing the piano, whenever the actions appear, the sound will appear. However, for medical
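As an illustration of the kind of contrastive objective described above, the sketch below treats each video embedding and its co-occurring audio embedding as a positive pair, and keeps only the most similar non-matching audios as hard negatives. This is a minimal simplification, not the paper's implementation: the function name `info_nce_hard`, the temperature value, and the top-k hard-negative selection are assumptions made for this sketch.

```python
import numpy as np

def info_nce_hard(video_emb, audio_emb, temperature=0.1, k_hard=2):
    """Illustrative InfoNCE-style video-audio contrastive loss.

    For each video clip i, audio i is the positive; only the k_hard most
    similar NON-matching audios are kept as (hard) negatives.
    NOTE: hypothetical simplification, not the method from the paper.
    """
    # L2-normalise so dot products are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    sim = v @ a.T / temperature          # (N, N) similarity matrix
    n = sim.shape[0]
    losses = []
    for i in range(n):
        pos = sim[i, i]                  # matching audio
        neg = np.delete(sim[i], i)       # similarities to non-matching audio
        hard = np.sort(neg)[-k_hard:]    # keep only the hardest negatives
        logits = np.concatenate(([pos], hard))
        # cross-entropy with the positive in slot 0
        losses.append(-pos + np.log(np.exp(logits).sum()))
    return float(np.mean(losses))
```

Keeping only the hardest negatives means the loss is dominated by the pairs the model currently confuses, which is the "learn harder" pressure the interview mentions; a well-aligned embedding space (positives far more similar than any negative) drives this loss toward zero.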