Computer Vision News - June 2020

Object-specific approaches to image animation depend on prior information about the object and on costly ground-truth data annotations. Hence the need for a method able to animate any object while discarding the dependence on prior information or object-specific training procedures. "First Order Motion Model for Image Animation" aims to fill this gap.

Until this paper, the two main state-of-the-art methods for model-free image animation were: 1) X2Face, which uses a dense motion field to generate the output video via image warping, and 2) Monkey-Net, which uses sparse keypoint trajectories.

Building on the latter of the two, the work proposed by the authors brings the following fundamental points of novelty. First, it does not require an explicit reference pose for a canonical representation of the object, which results in images that look far more realistic than those produced by X2Face. Second, motion is described not only as a set of keypoint displacements but also as local affine transformations, which gives higher performance under large object pose changes; this guarantees more variability in the range of movements that can be reproduced, instantly scoring a point in the comparison with Monkey-Net. Moreover, the authors introduce an occlusion-aware generator, employed to estimate object parts that are not visible in the source image and should therefore be inferred from the context. And, last but not least, they release a new high-resolution dataset, Tai-Chi-HD, suitable for evaluating frameworks for image animation and video generation.

The whole pipeline works as follows: the method takes a source image S and a frame of a driving video D. It consists of a self-supervised strategy that learns a representation of motion as a combination of sparse keypoints and local affine transformations with respect to a reference frame R. It builds upon two modules:

- The motion estimation module, which first locates the keypoints in D and S using an encoder-decoder network and models the motion in the neighbourhood of the keypoints with local affine transformations represented by a first-order Taylor expansion. The resulting mappings between the reference frame and D and S are fed to the dense motion network, which combines them to produce a dense motion field (a backward optical flow) and an occlusion mask indicating the parts of the source image that are not visible in the driving frame (see the first sketch after this list).

- The image generation module, consisting of a generator network G that uses the motion field and the occlusion mask to render the target output (see the second sketch below).
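To make the first-order motion representation more concrete, here is a minimal PyTorch-style sketch. It is an illustration under assumed tensor shapes and names (the function name, the softmax over the predicted masks and the background handling are our own simplifications, not the authors' released code): each keypoint contributes a candidate flow T_S←D(z) ≈ p_src + J (z − p_drv) valid in its neighbourhood, and the masks predicted by the dense motion network decide which candidate dominates at each pixel.

import torch

def dense_flow_from_keypoints(kp_src, kp_drv, jac_src, jac_drv, masks, h, w):
    # kp_src, kp_drv: (B, K, 2)     keypoint locations in source / driving frame, in [-1, 1]
    # jac_src, jac_drv: (B, K, 2, 2) local Jacobians of the keypoint transformations
    # masks: (B, K + 1, H, W)        soft assignment maps from the dense motion network
    #                                (channel 0 is the background / identity motion)
    # returns a backward optical flow field of shape (B, H, W, 2)
    b, k = kp_src.shape[:2]
    # Regular grid of driving-frame coordinates z in [-1, 1]
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1, 1).expand(1, 1, h, w, 1)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w, 1).expand(1, 1, h, w, 1)
    grid = torch.cat([xs, ys], dim=-1)                                 # (1, 1, H, W, 2)

    # First-order approximation around each keypoint:
    # T_{S<-D}(z) ~= p_src + J (z - p_drv), with J = J_src @ J_drv^{-1}
    jac = jac_src @ torch.inverse(jac_drv)                             # (B, K, 2, 2)
    offset = grid - kp_drv.view(b, k, 1, 1, 2)                         # (B, K, H, W, 2)
    warped = kp_src.view(b, k, 1, 1, 2) + torch.einsum(
        "bkij,bkhwj->bkhwi", jac, offset)                              # (B, K, H, W, 2)

    # Identity flow for the background, then a mask-weighted sum of all candidate flows
    flows = torch.cat([grid.expand(b, 1, h, w, 2), warped], dim=1)     # (B, K+1, H, W, 2)
    return (masks.softmax(dim=1).unsqueeze(-1) * flows).sum(dim=1)     # (B, H, W, 2)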

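The occlusion-aware generation step can be sketched in the same spirit (again an illustrative simplification with assumed shapes and layer names, not the released implementation): the encoder features of the source image are warped backward with the dense flow and multiplied by the occlusion mask, so the decoder has to inpaint the masked regions from context.

import torch
import torch.nn.functional as F

def warp_and_mask(src_features, flow, occlusion):
    # src_features: (B, C, H, W)  encoder features of the source image
    # flow:         (B, H, W, 2)  backward optical flow from T_{S<-D}, in [-1, 1] coords
    # occlusion:    (B, 1, H, W)  occlusion mask in [0, 1] (1 = visible in the source)
    warped = F.grid_sample(src_features, flow, align_corners=True)  # backward warping
    return warped * occlusion                                       # zero out occluded parts

# Hypothetical use inside the generator's forward pass:
# feats = encoder(source_image)
# feats = warp_and_mask(feats, dense_flow, occlusion_mask)
# output = decoder(feats)   # the decoder inpaints the masked regions from context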