Angela Dai, a fourth-year PhD student at Stanford University, will give a spotlight talk today on ScanNet, a large-scale RGB-D dataset of richly annotated 3D reconstructions of indoor scenes. Angela told us that “the idea is that we want to power data-hungry machine learning algorithms like deep learning on 3D data”. In 3D there is more information than in 2D, because you have scale and you know how far away everything is from everything else. However, since 3D data is more difficult to capture than images, and also more difficult to annotate, not much of it exists.
With ScanNet, they first wanted to build a scalable data-acquisition framework: collect the 3D reconstructions, then annotate them efficiently, so that the pipeline can scale to thousands of scans. The current version contains about 1,500 scans (RGB-D video sequences), collected by users equipped with an iPad app and a depth sensor attached to the device. Once collected, the videos are uploaded to servers, where they are automatically reconstructed. The reconstructions are then pushed to an Amazon Mechanical Turk interface, which Angela and her team used to crowdsource the labelling of semantic segmentations. The task, usually done by non-experts, is to take a 3D mesh of a scene and paint over it to label object instances, e.g. painting over a chair, a table or a computer to tell what the objects are and where they are in space.
space. Angela told us that it usually
takes about five non-experts to get a
good annotation per image. This can
then be used for ground truth
information when training for tasks like
object classification: you cut out one of
the objects that was labelled, and you
try to train an algorithm in order to
determine what object it is.
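
To make the classification setup concrete, here is a minimal Python sketch of that “cut out a labelled object” step. The array names, random data and grid size are illustrative assumptions, not ScanNet’s actual file format: the idea is simply to select the vertices of one painted instance and voxelise them into an occupancy grid that a classifier could consume.

import numpy as np

# Hypothetical per-vertex annotation for one reconstructed scan
# (made-up names and random data; ScanNet's real format differs).
rng = np.random.default_rng(0)
vertices = rng.uniform(0.0, 5.0, size=(10_000, 3))  # x, y, z in metres
instance_ids = rng.integers(0, 20, size=10_000)     # which painted object
labels = rng.integers(0, 4, size=10_000)            # class per vertex

def crop_instance(vertices, instance_ids, target_id):
    """Cut out the points belonging to one labelled object."""
    return vertices[instance_ids == target_id]

def voxelize(points, grid=32):
    """Turn an object's points into a binary occupancy grid."""
    mins, maxs = points.min(axis=0), points.max(axis=0)
    scale = (grid - 1) / np.maximum(maxs - mins, 1e-6)
    idx = np.clip(((points - mins) * scale).astype(int), 0, grid - 1)
    vol = np.zeros((grid, grid, grid), dtype=np.float32)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vol

obj_points = crop_instance(vertices, instance_ids, target_id=3)
occupancy = voxelize(obj_points)
# Majority vote over the instance's vertex labels gives its ground truth.
gt_class = np.bincount(labels[instance_ids == 3]).argmax()
print(occupancy.shape, int(occupancy.sum()), int(gt_class))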
The idea of the ScanNet dataset is, further, to enable training algorithms directly on the 3D representation. For example, a robot going around a room should recognise the objects around it: you want it to be able to identify not only that there is something three metres away, but also that this object is a chair.
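
As a rough illustration of what training directly on the 3D representation can look like, here is a small PyTorch sketch: a 3D convolutional network that classifies voxel occupancy grids like the one built above. The layer sizes and the four-class setup are assumptions made for the example, not the models ScanNet’s benchmarks actually use.

import torch
import torch.nn as nn

class VoxelClassifier(nn.Module):
    """Tiny 3D CNN over a 32^3 occupancy grid (sizes are illustrative)."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),  # 32^3 -> 16^3
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),  # 16^3 -> 8^3
        )
        self.classifier = nn.Linear(32 * 8 * 8 * 8, num_classes)

    def forward(self, vol):            # vol: (batch, 1, 32, 32, 32)
        x = self.features(vol)
        return self.classifier(x.flatten(1))

model = VoxelClassifier()
batch = torch.rand(2, 1, 32, 32, 32).round()  # two fake occupancy grids
logits = model(batch)
print(logits.argmax(dim=1))  # predicted object class per grid

Trained on crops like these, such a network answers exactly the question above: not just that something is three metres away, but that it is a chair.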
Angela and her team also have several scene understanding benchmark tasks.