Computer Vision News

Surgical Robotics Research

These data are paired to train an encoder-decoder model. This architecture was chosen after a review of RNNs, CNNs and other architectures used in similar applications. It is also motivated by the consideration that an alignment-based task, in which the model learns the one-to-one mapping from video to kinematics, might be more complex to train than an encoder-decoder task whose objective is to extract the corresponding kinematics vectors from the optical flow. Hence, the final training objective in this deep learning algorithm is to minimise the information loss (chosen as the L2 norm of the difference) between the decoded representations and the kinematics. This is expressed in the function below.

$$\min_{\theta,\phi} \; \frac{1}{n} \sum_{i=1}^{n} \left\| D\big(r(T(V_i);\theta);\phi\big) - K_i \right\|_2^2$$

Here, V_i and K_i are the two sources of data that are input to the model, namely videos and kinematics, and n is the number of paired samples. T is a transformation on V that extracts the optical flow, r is the encoder function, parametrized by θ, and D is the decoder function, parametrized by ϕ. After the model is trained, the encoded representations r(T(V_i); θ) should retain all the critical information, such as the exact surgical gesture, the identity of the surgeon and the skill with which the segment of surgery was performed.

The model is trained with the encoder-decoder architecture shown above, which includes: 1) an optical-flow extraction step that filters out domain-specific information such as video quality, contrast and details about the surgical instruments; 2) an encoder made of 2D CNNs that encodes information from the videos (optical flow) into the representation and passes it to 3) a decoder built as a simple FCN with ReLU activations, which is kept shallow to maximize information retention in the representations yielded by the encoder network. The decoder outputs the kinematics, which are compared through an MSE loss with the ground-truth vectors provided in the JIGSAWS dataset.
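The training objective described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the encoder here is a single linear map standing in for the 2D CNNs, the optical flow is random data standing in for T(V_i), and all names and sizes (`encoder`, `decoder`, `objective`, `N`, `LATENT`, `HIDDEN`, `K_DIM`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen arbitrarily for illustration (not taken from the article).
N, FLOW_SHAPE, LATENT, HIDDEN, K_DIM = 4, (2, 8, 8), 16, 32, 6


def encoder(flow, theta):
    """Stand-in for r(T(V_i); theta).

    The article uses 2D CNNs; a linear map on the flattened
    optical flow is used here only to keep the sketch short.
    """
    return flow.reshape(flow.shape[0], -1) @ theta


def decoder(z, phi):
    """Shallow fully connected decoder D(.; phi) with one ReLU layer."""
    w1, b1, w2, b2 = phi
    hidden = np.maximum(z @ w1 + b1, 0.0)  # ReLU activation
    return hidden @ w2 + b2


def objective(flow, kinematics, theta, phi):
    """Mean over samples of ||D(r(T(V_i); theta); phi) - K_i||_2^2."""
    residual = decoder(encoder(flow, theta), phi) - kinematics
    return np.mean(np.sum(residual ** 2, axis=1))


# Random placeholders for optical flow T(V_i) and kinematics K_i.
flow = rng.normal(size=(N, *FLOW_SHAPE))
kinematics = rng.normal(size=(N, K_DIM))

# Randomly initialised parameters theta (encoder) and phi (decoder).
theta = rng.normal(scale=0.1, size=(int(np.prod(FLOW_SHAPE)), LATENT))
phi = (
    rng.normal(scale=0.1, size=(LATENT, HIDDEN)),
    np.zeros(HIDDEN),
    rng.normal(scale=0.1, size=(HIDDEN, K_DIM)),
    np.zeros(K_DIM),
)

loss = objective(flow, kinematics, theta, phi)
print(f"objective on random data: {loss:.4f}")
```

Note that this sums the squared error over the kinematics dimensions before averaging over samples, as in the formula. A standard MSE loss also averages over those dimensions, so the two differ only by the constant factor `K_DIM` and share the same minimiser.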

Computer Vision News - June 2021