Anna Rohrbach did her PhD at the Max Planck Institute for Informatics, and the focus of her research is video description with natural language.
While prior work focused on describing a given video by producing a sentence, in Anna’s current work they are trying to extend this to richer and more interesting predictions. They want to answer questions like: What are the people wearing in the scene? What are their genders? Have we seen them previously? Where exactly are they? In other words, they want to localise objects and resolve visual co-references.
“We are tackling a very complex problem”, Anna says, “we describe the video with richer person-specific labels like gender, and we also localise them”.
The advantage of this is that it allows them, on the one hand, to get a visualisation of what the model is doing, and on the other to inspect the errors the model makes and understand what is going on in the video. The architecture they used for the model is very complex and includes many steps. They first need to detect people in movies with different viewing angles and conditions, which is already quite challenging on its own. They also track people in the video and, on top of this, they have to learn to associate names with the visual appearances.
“And finally, we come to the actual problem we are trying to address”, Anna explained, “where we have to do this description along with all this meta-information.”
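To make the flow of these steps easier to follow, here is a minimal, purely illustrative Python sketch of such a pipeline. Every name in it (PersonTrack, detect_people, track_people, link_names_to_tracks, generate_grounded_description) is a placeholder invented for this example, not taken from the authors’ code, and the stubs only pass toy data through the four stages described above: detection, tracking, name association, and grounded description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Illustrative sketch only: every function and class below is a hypothetical
# placeholder standing in for a full model, not the authors' implementation.

Box = Tuple[int, float, float, float, float]   # (frame_idx, x, y, w, h)

@dataclass
class PersonTrack:
    track_id: int
    boxes: List[Box]                 # localisation of the person over time
    name: str = "UNKNOWN"            # character identity, filled in by linking
    gender: str = "UNKNOWN"          # person-specific label

def detect_people(frames: List) -> List[Box]:
    """Step 1: detect people in every frame (toy stub)."""
    return [(i, 0.1, 0.1, 0.3, 0.6) for i, _ in enumerate(frames)]

def track_people(detections: List[Box]) -> List[PersonTrack]:
    """Step 2: link per-frame detections into tracks (toy stub: one track)."""
    return [PersonTrack(track_id=0, boxes=detections)]

def link_names_to_tracks(tracks: List[PersonTrack],
                         cast: Dict[str, str]) -> List[PersonTrack]:
    """Step 3: associate names (and labels such as gender) with tracks (toy stub)."""
    for track, (name, gender) in zip(tracks, cast.items()):
        track.name, track.gender = name, gender
    return tracks

def generate_grounded_description(tracks: List[PersonTrack]) -> str:
    """Step 4: describe the video with grounded, co-referenced person labels."""
    parts = [f"{t.name} ({t.gender}) appears in {len(t.boxes)} frames, "
             f"first at box {t.boxes[0][1:]}" for t in tracks]
    return " ".join(parts)

# Usage: a toy three-frame "video" and a one-person cast list.
frames = [object(), object(), object()]
tracks = link_names_to_tracks(track_people(detect_people(frames)),
                              cast={"ANNA": "female"})
print(generate_grounded_description(tracks))
```

In the real system each stub corresponds to a substantial component, but the sketch shows why the steps have to happen in this order: the description stage can only attach person-specific labels and locations once detections have been turned into tracks and tracks have been linked to identities.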
The most challenging thing for her and