Oral Presentation

You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

Logan Lawrence is a second-year PhD student at UMass Amherst. His paper examines how multimodal large language models perform on fine-grained visual recognition tasks such as species identification. Logan speaks to us ahead of his oral and poster presentations this afternoon.

Evaluating large language models is harder than it first appears, because their answers are generated as freeform text rather than selected from a fixed set of labels. The challenge becomes even greater in highly specialized visual domains where models must distinguish between extremely similar categories. “Whether it’s birds or flowers or insects, evaluating these niche domains is really, really hard,” Logan confirms. “The choice count is in the hundreds of thousands.”

In this work, Logan explores what happens when multimodal large language models are applied to these fine-grained classification problems. One of the motivations came from his advisor, Grant Van Horn, who is widely known for the iNaturalist challenges and for his work connecting machine learning research with real-world ecological applications. “He works with ecologists, scientists, and the Cornell Lab of Ornithology to deploy AI and ML to natural world recognition systems,” Logan reveals. “Ecologists and scientists have a real need for integrating AI into their products.”

He traces the origins of the project back to earlier work on vision-language models such as CLIP. As increasingly powerful multimodal systems began attracting attention and investment, he wanted to understand how useful they are in these specialized settings. “Everyone’s putting tons of money into these models,” Logan says. “But how well do they perform on these tasks?”
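The “answer extraction” in the paper’s title points at this gap between freeform generation and a fixed label space. As a rough illustration of the general idea only, not the paper’s actual method, here is a minimal Python sketch that maps a freeform answer onto a small, hypothetical label list using exact and fuzzy string matching; at the scale Logan describes, with hundreds of thousands of candidate categories, a real system would need something far more robust.

```python
import difflib

# Hypothetical label set for illustration; fine-grained benchmarks
# can have hundreds of thousands of candidate species.
LABELS = [
    "Northern Cardinal",
    "Scarlet Tanager",
    "Summer Tanager",
    "Vermilion Flycatcher",
]

def extract_label(freeform_answer, labels=LABELS):
    """Map a freeform model answer onto the closest known label, if any."""
    answer = freeform_answer.strip().lower()
    lowered = [label.lower() for label in labels]
    # Exact substring match first: freeform answers often contain the label verbatim.
    for label, low in zip(labels, lowered):
        if low in answer:
            return label
    # Fallback: fuzzy string similarity against the label set.
    matches = difflib.get_close_matches(answer, lowered, n=1, cutoff=0.6)
    return labels[lowered.index(matches[0])] if matches else None

print(extract_label("This bird appears to be a Scarlet Tanager, given its red plumage."))
# -> Scarlet Tanager
```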