Computer Vision News - October 2021

Medical Imaging Technology - Best of MICCAI 2021

model, CoTr and UNETR utilize volumetric inputs and hence can benefit from the spatial context of the data. UNETR and TransUNet both use the transformer layers of the ViT model, whereas CoTr leverages a deformable transformer layer that narrows self-attention down to a small set of key positions in the input sequence.

In addition, each of these models uses the transformer layers differently in its architecture. TransUNet places them in the bottleneck of a UNet, while CoTr places them between the CNN encoder and decoder, connecting them at different scales via skip connections. UNETR, on the other hand, uses the transformer layers as the encoder of a U-shaped architecture and generates its input sequences directly from the tokenized patches. The transformer layers of UNETR are connected to a CNN decoder via skip connections at multiple scales.

Conclusion

Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation so far. However, Transformers, with their innate self-attention mechanisms, have the potential to bring a fundamental paradigm shift and to serve as strong encoders for medical image segmentation tasks. Their pre-trained embeddings can then be adapted for various downstream tasks (for example, segmentation, classification, and detection). In the years to come, we will see new breakthroughs powered by Transformers for medical imaging. The future is exciting, so we should brace ourselves.

Overview of CoTr architecture. It consists of CNN and DeTrans encoders as well as a decoder. Multi-scale features are extracted from the CNN encoder, projected to embeddings, and processed in the DeTrans encoder to capture long-range dependencies. The decoder processes features from the DeTrans encoder to compute the final segmentation output.
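
To make the idea of "tokenized patches" mentioned for UNETR a little more concrete, here is a minimal sketch of how a 3D volume could be cut into a sequence of flattened, non-overlapping patches. It is an illustration only, not the UNETR implementation: the function name `tokenize_volume`, the toy 4 x 4 x 4 volume, and the patch size of 2 are all assumptions made for this example, and the embedding and transformer layers that would consume the tokens are left out.

```python
from itertools import product


def tokenize_volume(volume, patch):
    """Split a 3D volume (nested lists, D x H x W) into flattened,
    non-overlapping cubic patches of side `patch` (hypothetical helper)."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    if depth % patch or height % patch or width % patch:
        raise ValueError("each dimension must be divisible by the patch size")

    tokens = []
    # Visit the corner of every patch, then flatten its voxels into one token.
    for z0, y0, x0 in product(
        range(0, depth, patch), range(0, height, patch), range(0, width, patch)
    ):
        token = [
            volume[z][y][x]
            for z in range(z0, z0 + patch)
            for y in range(y0, y0 + patch)
            for x in range(x0, x0 + patch)
        ]
        tokens.append(token)
    return tokens


# Toy 4 x 4 x 4 volume whose voxel value encodes its position (assumed data).
volume = [[[z * 16 + y * 4 + x for x in range(4)] for y in range(4)] for z in range(4)]
tokens = tokenize_volume(volume, patch=2)
print(len(tokens), len(tokens[0]))  # number of tokens, voxels per token
```

With these toy sizes the volume yields 8 tokens of 8 voxels each. In a model of this kind, such a sequence would be what the transformer encoder receives after a learned projection.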