ICCV Daily 2023 - Thursday

… a system that can predict what’s going to happen in the video, then you can use that system for planning. I’ve been playing with this idea for almost 10 years. We started working on video prediction at FAIR in 2014/15. We had some papers on this. Then, we weren’t moving very fast. We had Mikael Henaff and Alfredo Canziani working on a model of this type that could help plan a trajectory for self-driving cars, which was somewhat successful. But then, we made progress. We realized that predicting everything in a video was not just useless but probably impossible and even hurtful.

I came up with this new idea derived from experimental results. The results are such that if you want to use self-supervised learning from images to train a system to learn good representations of images, the generative methods don’t work. Those methods are based on essentially corrupting an image and then training a neural network to recover the original image. Large language models are trained this way: you take a text, corrupt it, and then train a system to reconstruct it. When you do this with images, it doesn’t work very well. There are a number of techniques to do this, but they don’t work very well. The most successful is probably MAE, which stands for masked autoencoder. Some of my colleagues at Meta did that.

What really works are those joint embedding architectures. You take an image and a corrupted version of the image, run them through encoders, and train the encoders so that the representation produced from the corrupted image is identical to the one produced from the uncorrupted image. In the case of a video, you take a segment of video and the following segment, run them through encoders, and you want to predict the representation of the following segment from the representation of the previous segment. It’s no longer a generative model because you’re not predicting all the missing pixels; you’re predicting a representation of them. The trick is, how do you train something like this while preventing it from collapsing? It’s easy for this system to collapse, ignore the input, and always predict the same thing. That’s the question.

So, we did not get to solve the exact problem we wanted?

It was the wrong problem to solve. The real problem is to learn how the world works from video. The original approach was a generative model that predicts the next video frames. We couldn’t get this to work. Then, we discovered a bunch of methods that allow one of those joint embedding systems to learn without collapsing. There are a number …

Yann LeCun: “This is, I think, the future of AI systems. Computer vision has a very important role to play there!”
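The joint-embedding idea described above (encode a clean view and a corrupted view, then train a predictor so that the representation computed from the corrupted view matches the one computed from the clean view) can be sketched in a few lines. The snippet below is a minimal illustration, not FAIR's actual method: it assumes PyTorch, uses a toy encoder, and only hints at collapse prevention by putting a stop-gradient on the target branch. The same structure carries over to the video case by replacing the two image views with two consecutive video segments.

# Minimal joint-embedding sketch (hypothetical, PyTorch-style).
# Two views of the same image are encoded; a predictor maps the representation
# of the corrupted view toward the representation of the clean view.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy convolutional encoder producing a flat representation."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

def joint_embedding_loss(encoder, predictor, clean, corrupted):
    """Predict the clean-view representation from the corrupted view.

    Detaching the target branch is only a crude stand-in for real
    anti-collapse mechanisms (contrastive, regularized, or
    distillation-based objectives).
    """
    with torch.no_grad():
        target = encoder(clean)              # representation of the uncorrupted image
    pred = predictor(encoder(corrupted))     # predicted representation from the corrupted image
    return F.mse_loss(pred, target)

# Example usage with random tensors standing in for an image batch.
encoder = Encoder()
predictor = nn.Linear(128, 128)
clean = torch.randn(8, 3, 64, 64)
corrupted = clean + 0.3 * torch.randn_like(clean)  # crude stand-in for masking or cropping
loss = joint_embedding_loss(encoder, predictor, clean, corrupted)
loss.backward()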

RkJQdWJsaXNoZXIy NTc3NzU=