BEST OF CVPR – Sunday

ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes

Angela Dai, a fourth-year PhD student at Stanford University, will give a spotlight talk today on ScanNet, a large-scale RGB-D dataset of richly annotated 3D reconstructions of indoor scenes. Angela told us that “the idea is that we want to power data-hungry machine learning algorithms like deep learning on 3D data”. In 3D there is more information than in 2D, because you have scale and you know how far away everything is from everything else. However, since 3D data is more difficult to capture than images, and also more difficult to annotate, relatively little 3D data exists.

With ScanNet, they first wanted to build a scalable data-acquisition framework: collect the 3D reconstructions and then annotate them efficiently, so that more than thousands of these scans can be gathered. The current version contains about 1,500 scans (RGB-D video sequences), collected by users equipped with an iPad app and a depth sensor attached to the device. After the videos are collected, they are uploaded to servers, where they are automatically reconstructed. The reconstructions are then pushed to an Amazon Mechanical Turk interface, which Angela and her team used to crowdsource the labelling of semantic segmentations. The task, usually done by non-experts, is: given a 3D mesh of a scene, paint over the mesh to label object instances, e.g. paint over a chair, a table, or a computer to indicate what the objects are and where they are in space. Angela told us that it usually takes about five non-experts to get a good annotation per image.

The merged annotations can then be used as ground-truth information when training for tasks like object classification: you cut out one of the labelled objects and train an algorithm to determine what object it is.

A further aim of the ScanNet dataset is to enable training algorithms directly on the 3D representation. For example, a robot moving around a room should recognise which objects surround it: you want it to identify not only that there is something three metres away, but also that this object is a chair.

Angela and her team also provide several scene understanding benchmark tasks on the dataset.
