PixTalk tackles a simple but powerful question: how can we make text-based image editing fast and accessible, without complex diffusion models or powerful GPUs? What are the core things a photographer wants to do to a photo? Marcos and his co-authors made that list and set out to design a neural network able to do all of those things, guided by your text instructions.

The key technology behind PixTalk, instead of the complex diffusion-based models that require very expensive GPUs, is a model that tackles these particular photography operations in real time. Even on a classical Google Colab GPU, it can process images of up to 24 megapixels, which is more than 4K resolution, and everything runs in real time, provided there are no delays and the GPU is working properly. Among these operations, the paper shows, you can control the colors, the illumination, presets, and color grading: anything that is important from the photographic point of view, and even from the cinematic point of view, like post-production.

“The original idea behind this work,” Marcos explains, “was to say, OK, we have Adobe, but Adobe is quite difficult to use for the regular user. You have all these sliders, all these buttons and options. What if we could make a neural network that can do all of this, and you control it with language, with text? That's all! For this particular set of operations, more than 40, we are basically like Adobe, but accessible to everyone!”

Marcos took his inspiration from Adobe Lightroom, the main tool for photography, at least for professional photographers. Usually, you can edit the white balance of the images, the exposure, the illumination. You can

15 DAILY ICCV Wednesday Marcos Conde
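To make the idea concrete, here is a minimal, hypothetical sketch of text-conditioned photo editing in the spirit described above. It is not the authors' code: a toy hash-based embedding stands in for a learned text encoder, and a small frozen "hypernetwork" maps the instruction embedding to a per-pixel affine color transform. Pointwise operations like this are why such a model can scale to 24-megapixel images: the cost grows linearly with the number of pixels.

```python
import hashlib

import numpy as np


def toy_text_embedding(instruction: str, dim: int = 16) -> np.ndarray:
    """Stand-in for a learned text encoder (e.g. a CLIP-style model).
    Hashes the instruction to seed a deterministic pseudo-embedding."""
    seed = int.from_bytes(hashlib.sha256(instruction.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)


def predict_color_params(embed: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Toy 'hypernetwork': maps the text embedding to a 3x3 color matrix
    and a per-channel bias, initialized near the identity transform.
    In a trained system these weights would be learned, not random."""
    rng = np.random.default_rng(0)  # frozen random weights for the sketch
    W = rng.standard_normal((12, embed.size)) * 0.05
    theta = W @ embed
    M = np.eye(3) + theta[:9].reshape(3, 3)  # near-identity color matrix
    b = theta[9:]                            # per-channel bias
    return M, b


def edit(image: np.ndarray, instruction: str) -> np.ndarray:
    """Apply a text-conditioned pointwise color transform to an RGB
    image with values in [0, 1]. Cost is O(pixels), so resolution
    only affects runtime linearly."""
    M, b = predict_color_params(toy_text_embedding(instruction))
    out = image.reshape(-1, 3) @ M.T + b
    return np.clip(out, 0.0, 1.0).reshape(image.shape)


img = np.random.default_rng(1).random((64, 64, 3))  # dummy RGB image
warm = edit(img, "make the photo warmer")
```

The same `edit` call works for any instruction string; different instructions yield different color parameters, which is the core idea of replacing Lightroom's sliders with language.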