Anna Rohrbach did her PhD at the Max Planck Institute for Informatics, and the focus of her research is video description with natural language.
While prior work focused on describing a given video by producing a sentence, in Anna’s current work they are trying to extend this to richer and more interesting predictions. They want to answer questions like: What are the people wearing in the scene? What are their genders? Have we seen them previously? Where exactly are they? In other words, they want to localise objects and resolve visual co-references.
“We are tackling a very complex problem”, Anna says, “we describe the video with richer person-specific labels like gender, and we also localise them”.
The advantage of this is that it allows them, on the one hand, to get a visualisation of what the model is doing, and on the other to inspect the errors the model makes and understand what is going on in the video. The architecture they used for the model is very complex and includes many steps. They first need to detect people in movies with different viewing angles and conditions, which is already quite challenging on its own. They also track people in the video and, on top of this, they have to learn to associate names with the visual appearances.
“And finally, we come to the actual problem we are trying to address”, Anna explained, “where we have to do this description along with all this meta-information.”
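To make the flow of these steps easier to follow, here is a minimal, purely illustrative Python sketch of such a pipeline. Every name in it (PersonTrack, detect_people, track_people, link_names_to_tracks, generate_grounded_description) is a placeholder invented for this example, not taken from the authors’ code, and the stubs only pass toy data through the four stages described above: detection, tracking, name association, and grounded description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Illustrative sketch only: every function and class below is a hypothetical
# placeholder standing in for a full model, not the authors' implementation.

Box = Tuple[int, float, float, float, float]   # (frame_idx, x, y, w, h)

@dataclass
class PersonTrack:
    track_id: int
    boxes: List[Box]                 # localisation of the person over time
    name: str = "UNKNOWN"            # character identity, filled in by linking
    gender: str = "UNKNOWN"          # person-specific label

def detect_people(frames: List) -> List[Box]:
    """Step 1: detect people in every frame (toy stub)."""
    return [(i, 0.1, 0.1, 0.3, 0.6) for i, _ in enumerate(frames)]

def track_people(detections: List[Box]) -> List[PersonTrack]:
    """Step 2: link per-frame detections into tracks (toy stub: one track)."""
    return [PersonTrack(track_id=0, boxes=detections)]

def link_names_to_tracks(tracks: List[PersonTrack],
                         cast: Dict[str, str]) -> List[PersonTrack]:
    """Step 3: associate names (and labels such as gender) with tracks (toy stub)."""
    for track, (name, gender) in zip(tracks, cast.items()):
        track.name, track.gender = name, gender
    return tracks

def generate_grounded_description(tracks: List[PersonTrack]) -> str:
    """Step 4: describe the video with grounded, co-referenced person labels."""
    parts = [f"{t.name} ({t.gender}) appears in {len(t.boxes)} frames, "
             f"first at box {t.boxes[0][1:]}" for t in tracks]
    return " ".join(parts)

# Usage: a toy three-frame "video" and a one-person cast list.
frames = [object(), object(), object()]
tracks = link_names_to_tracks(track_people(detect_people(frames)),
                              cast={"ANNA": "female"})
print(generate_grounded_description(tracks))
```

In the real system each stub corresponds to a substantial component, but the sketch shows why the steps have to happen in this order: the description stage can only attach person-specific labels and locations once detections have been turned into tracks and tracks have been linked to identities.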
The most challenging thing for her and