Computer Vision News - November 2020

Best of MICCAI 2020
Pramit Saha

…given the continuously varying tongue positions and shapes. The predicted formants are fed into the synthesizer to generate speech sounds. Ultrasound-to-formant mapping helps a user gain control over speech in a way that would not be possible through more direct methods. “Saliency maps from the penultimate layer of the network reveal that it can automatically identify the tongue contour as an internal representation without any explicit annotation,” he tells us. “The loss function that we use is the mean absolute error in a formant frequency space and not the image space. This helps us to get rid of any manual contour annotation and supports real-time implementation.” (A rough sketch of this formant-space loss appears below.)

The work uses 3D convolutions, which are well suited to extracting speech information from video. It develops a novel spatio-temporal feature extraction strategy that continuously extracts tongue movements from the ultrasound videos, using spatial information to identify the tongue contour while also tracking that contour from frame to frame. Both tasks need to be handled by the neural network, and a 3D CNN is generally a good choice for this. However, 3D CNNs struggle to model the temporal dimension well and are very computationally expensive.

“To solve this, we partitioned the 3D CNN into three different parts,” Pramit explains. “In one part, we kept the 3D CNN as it is. In the other two parts, we removed the temporal kernels and the spatial kernels respectively: we split the 3D CNN kernel orthogonally into parallel branches – a spatial branch with 1×3×3 kernels and a temporal branch with 3×1×1 kernels. This meant we could constrain the network to figure out which temporal features and which spatial features to learn in particular blocks, and it also cuts down a lot of the computational complexity. Later, we combined all those features together with the help of shuffling, which facilitates cross-group feature and information exchange.”
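The formant-space loss Pramit describes is simple to state in code. The sketch below is a hypothetical illustration rather than the authors' implementation: the tensor names, shapes, and the choice of three formants per frame are assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, frames, formants) trajectories of
# formant frequencies predicted by the network vs. ground truth.
pred_formants = torch.randn(4, 16, 3)
true_formants = torch.randn(4, 16, 3)

# Mean absolute error taken directly in formant-frequency space,
# so no image-space tongue-contour annotation is required.
loss = F.l1_loss(pred_formants, true_formants)
```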
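The three-way partition of the 3D CNN could look roughly like the following. This is a minimal PyTorch sketch of the idea as described in the interview, assuming one full 3×3×3 branch, a 1×3×3 spatial branch, and a 3×1×1 temporal branch whose outputs are concatenated and channel-shuffled; the module and variable names are hypothetical, and details such as channel widths, normalization, and the number of blocks are not given in the article.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # Interleave channels from the parallel branches so later layers
    # see mixed spatio-temporal features (cross-group exchange).
    n, c, t, h, w = x.size()
    x = x.view(n, groups, c // groups, t, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, t, h, w)

class FactorizedSTBlock(nn.Module):
    """Hypothetical sketch of the three-way split: a full 3D branch,
    a spatial-only branch (1x3x3), and a temporal-only branch (3x1x1),
    concatenated and then channel-shuffled."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.full3d = nn.Conv3d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.spatial = nn.Conv3d(in_ch, branch_ch,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(in_ch, branch_ch,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, frames, H, W)
        out = torch.cat([self.full3d(x),
                         self.spatial(x),
                         self.temporal(x)], dim=1)
        return self.relu(channel_shuffle(out, groups=3))

# Example: a 16-frame ultrasound clip of 64x64 images, 1 input channel.
clip = torch.randn(2, 1, 16, 64, 64)
block = FactorizedSTBlock(in_ch=1, branch_ch=8)
print(block(clip).shape)  # torch.Size([2, 24, 16, 64, 64])
```

The computational saving comes from the factorization itself: a 1×3×3 or 3×1×1 kernel carries far fewer weights than a full 3×3×3 kernel, while the shuffle keeps the branch-specific features from staying isolated in their own channel groups.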
