Computer Vision News - May 2022

(Jacob) Zhiyuan Fang

…how to effectively mine the hidden visual-textual associations at scale for representation learning. My work in two other papers studies the knowledge distillation (KD) technique for generic vision and vision-language (VL) representation learning, which proves to bring substantial performance gains over the regular representation learning schema. Empirical studies show that our KD-assisted representation learning method is more data-efficient and achieves better performance (a minimal sketch of the distillation idea follows below).

3) Architectural Efficiency. Deploying VL models on edge devices is notoriously challenging due to their cumbersome architectures. Most existing VL models are large models that suffer from high latency and large memory footprints at inference time, as well as an inconvenient two-stage design that requires an object detector as the visual feature extractor. These all limit their deployment to resource-constrained edge devices in real-world applications. I study how to train small and efficient VL models from the perspective of model compression. To further extend these advancements to the real world, my recent work designs a novel one-stage VL architecture that tackles the inference bottleneck and the inconvenient two-stage training, bringing great training flexibility (see the one-stage sketch below). Extensive discussions are conducted on several critical aspects that prominently influence the performance of compact VL models.

Figure 1: VL tasks like referring expression and video moment localization require explicit annotations (i.e., bounding boxes of target objects or temporal boundaries for the moments), which demand human labeling effort.
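To make the KD-assisted representation learning idea above concrete, here is a minimal PyTorch sketch, not the exact method from the thesis papers: it assumes a frozen, large teacher encoder whose embeddings a compact student learns to match through a hypothetical linear projector; the module names and the cosine loss choice are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureKD(nn.Module):
    """Sketch of feature-level knowledge distillation: a compact
    student is trained to reproduce a frozen teacher's embeddings.
    Names (student, teacher, projector) are hypothetical."""

    def __init__(self, student, teacher, student_dim, teacher_dim):
        super().__init__()
        self.student = student
        self.teacher = teacher.eval()            # teacher stays frozen
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        # projector aligns the student's feature space with the teacher's
        self.projector = nn.Linear(student_dim, teacher_dim)

    def forward(self, images):
        with torch.no_grad():
            t_feat = self.teacher(images)        # distillation targets
        s_feat = self.projector(self.student(images))
        # cosine distillation loss: match embedding directions
        return (1.0 - F.cosine_similarity(s_feat, t_feat, dim=-1)).mean()

In practice this distillation term would be added to the regular representation learning objective (e.g., a contrastive image-text loss), which is where the reported data efficiency would plausibly come from.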
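The contrast between the detector-based two-stage pipeline and the one-stage design can also be sketched; this illustrates the general pattern rather than the specific architecture in the thesis, and PatchBackbone and the fusion encoder are hypothetical stand-ins.

import torch
import torch.nn as nn

class PatchBackbone(nn.Module):
    # One-stage visual front end: maps an image to patch embeddings
    # in a single forward pass, with no object detector involved.
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, image):
        x = self.proj(image)                   # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)

# A two-stage model would run an expensive detection pass here to
# produce region features; the one-stage variant fuses patch features
# with text directly, so vision and text train end to end.
backbone = PatchBackbone()
fusion = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2)

image = torch.randn(1, 3, 224, 224)
text = torch.randn(1, 12, 256)                 # pre-embedded text tokens
joint = fusion(torch.cat([backbone(image), text], dim=1))
print(joint.shape)                             # torch.Size([1, 208, 256])

Removing the detector from the critical path is what cuts inference latency and makes single-stage training possible.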
