DAILY WACV Sunday · Oral Presentation

Through a series of experiments, the team narrowed down several contributing factors. One issue stems from the scale of the classification space: when a model must choose from hundreds of species, the prompt context grows very large, making it harder for the model to keep track of which option it is selecting. Logan describes this as a situation in which the model struggles to maintain the connection among the question, the choices, and the final answer.

Perhaps the most surprising finding was a form of brittleness in the models' responses. "If you're an ornithologist and you're using an LLM to ask, 'What is the species of this bird?'," he says, "very small changes in that input – for instance, you say, 'Identify the species of bird in this image' – can result in very different predictions." He gives an example on the gull subset of the CUB200 dataset, where slight variations in phrasing can shift the model's prediction from one species to another and affect overall benchmark performance (see image).

Another challenge was computational. Evaluating models across very large sets of candidate labels requires an enormous number of forward passes through the model, which quickly becomes impractical for researchers without the resources of large technology companies. Logan and his collaborators addressed this in two ways. First, they ran their experiments on their own research cluster rather than relying on commercial APIs. Second, instead of computing full probability sequences for every possible class name, they truncated the process once a class could be uniquely distinguished by its early tokens. By stopping the calculation as soon as a species name becomes uniquely identifiable, the number of required computations drops dramatically.

"CUB200 has 200 bird species, so if you wanted to get the probability that an LLM would say a specific species out of those 200, it would require around 687 forward passes, which is not tenable for us," he explains.
“But if you just need to tell which is the most likely, it's much,
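The early-stopping idea can be illustrated with a minimal sketch. This is not the team's code: it uses whole words in place of real LLM tokens and made-up gull names from the general CUB200 domain, but it shows why stopping at the shortest uniquely identifying token prefix cuts the number of per-token forward passes when all you need is the most likely class, not its exact probability.

```python
def unique_prefix_lengths(class_names):
    """For each class name (given as a list of tokens), return how many
    leading tokens are needed before no other candidate shares that
    prefix. Scoring can stop there: the remaining tokens cannot change
    which class an argmax selects."""
    lengths = {}
    for name in class_names:
        toks = tuple(name)
        k = len(toks)  # fallback: the full name is needed
        for i in range(1, len(toks) + 1):
            prefix = toks[:i]
            # count candidates whose names share this token prefix
            sharers = sum(1 for other in class_names
                          if tuple(other)[:i] == prefix)
            if sharers == 1:
                k = i
                break
        lengths[" ".join(name)] = k
    return lengths

# Illustrative candidate labels (tokenized by word for the sketch).
species = [
    ["California", "Gull"],
    ["Glaucous", "winged", "Gull"],
    ["Heermann", "Gull"],
    ["Herring", "Gull"],
]

full_cost = sum(len(s) for s in species)        # score every token
needed = unique_prefix_lengths(species)
truncated_cost = sum(needed.values())            # stop at unique prefix
```

Here each name is already unique after its first token, so the truncated cost is one scoring step per candidate instead of one per token; with a real tokenizer the savings depend on how early the vocabulary disambiguates the 200 species names.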