Computer Vision News - November 2022
Metrics Reloaded

They also suggest that the computer vision community has, over the years, developed certain practices that may be disconnected from application. There are many great algorithms and research successes, but not so many end up in a clinical setting. Part of the reason for this gap might be that algorithms are not directly evaluated for clinical translation.

“We don’t have specific examples of where it’s gone wrong in practice because it’s still in the research bubble,” Paul points out. “We don’t get to see many models being tried and failing, so we can’t provide empirical evidence that a model failed because its validation was wrong. More abstractly, there are two big bottlenecks caused by inappropriate metrics. One is the general translation into the clinic, because the numbers just don’t add up. Everyone does segmentation, but we talk to clinicians who don’t want it. Also, research progress itself is impeded. If competitions use inappropriate metrics, then winning models are selected that aren’t the best for the job. That could spark research in the wrong direction. If we can’t select the best models, we can’t use them again as baselines to develop new methods. The whole of scientific progress is affected.”

Lena and Annika Reinke did some analysis and found that 70% of competitions rely on semantic segmentation, where each pixel is classified individually. “From a clinician’s perspective, when they diagnose something in the clinic, they don’t base their diagnosis on pixel values,” Paul continues. “They would not say to a patient – hey, you have a cancerous pixel here! At MICCAI, 80% of papers are new U-Net variants, and they all validate with the Dice score segmentation metric. But most often, the clinical need is not on the pixel scale; rather, decisions are made on objects or entire structures.”

Work has been ongoing in this direction for several years. Annika Reinke and a group of eminent co-signers made an impression with their position paper at MICCAI 2018, showing how “security holes” related to challenge design and organization could be used to manipulate rankings. They proposed best-practice recommendations to remove opportunities for cheating. Once the problem was identified, a large international consortium of image analysis experts came together. Using a multi-stage Delphi process, with questionnaires and expert group meetings for consensus building, they set out to find the best approach for fostering the convergence of validation methodology and changing common practice. Metrics Reloaded, a comprehensive framework guiding researchers toward choosing metrics in a problem-aware manner, grew out of all this work.

“We want to make sure that people who use our recommendation framework find the best metrics for them,” Minu states. “They should be made aware of the pitfalls and educated transparently about what they’re doing, rather than just putting something in and getting something out of a black box without understanding why these metrics would be adequate. It takes time to change common practice and ideals – for example, using two metrics with different names but the same mathematical formula while believing that they are two separate metrics with two separate statements.”
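To make that last pitfall concrete: the best-known case of two names for one formula is the Dice similarity coefficient and the F1-score, which are algebraically identical on binary masks (Dice = 2|A∩B| / (|A|+|B|) = 2TP / (2TP+FP+FN) = F1). The short Python sketch below is our own illustration, not part of the Metrics Reloaded framework; the random toy masks are made up purely to show the identity numerically.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice similarity coefficient: 2*|A ∩ B| / (|A| + |B|) on boolean masks."""
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum())

def f1_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """F1-score: 2*TP / (2*TP + FP + FN) on boolean masks."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return 2.0 * tp / (2.0 * tp + fp + fn)

# Toy masks with made-up values, only to show the two numbers coincide.
rng = np.random.default_rng(seed=0)
gt = rng.random((64, 64)) > 0.7
pred = rng.random((64, 64)) > 0.7

print(f"Dice = {dice(pred, gt):.6f}, F1 = {f1_score(pred, gt):.6f}")  # identical
```

Reporting both numbers side by side adds no information; it only gives the appearance of a broader validation.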
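And to illustrate Paul’s earlier point that pixel-level scores can hide exactly the failures a clinician cares about, here is a second sketch of our own, with hypothetical lesion sizes rather than data from the article: a prediction that segments a large lesion perfectly but misses a small lesion entirely still reaches a pixel-level Dice of about 0.997, while a simple object-level check, counting connected lesions hit by the prediction, reports that only one of the two lesions was found. The connected-component labelling uses scipy.ndimage, which we assume is available.

```python
import numpy as np
from scipy import ndimage  # assumed available; used only for connected components

# Hypothetical scene: one large lesion (50x50 px) and one small lesion (4x4 px).
gt = np.zeros((128, 128), dtype=bool)
gt[10:60, 10:60] = True
gt[100:104, 100:104] = True

# Hypothetical prediction: the large lesion is segmented perfectly,
# the small lesion is missed entirely.
pred = np.zeros((128, 128), dtype=bool)
pred[10:60, 10:60] = True

# Pixel-level view: overall Dice is still ~0.997.
dice = 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())
print(f"pixel-level Dice: {dice:.3f}")

# Object-level view: label each connected lesion and count how many are hit
# by at least one predicted pixel -- only 1 of the 2 lesions is detected.
labels, n_lesions = ndimage.label(gt)
detected = sum(np.logical_and(pred, labels == i).any()
               for i in range(1, n_lesions + 1))
print(f"lesions detected: {detected}/{n_lesions}")
```

It is this kind of mismatch between the scale of the metric and the scale of the clinical decision that the Metrics Reloaded framework is meant to surface.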