Generating Labeled Data at Home

When you work on ML projects by yourself, a big question is how to get interesting labeled datasets. For many projects good standard datasets already exist (e.g., ImageNet, CIFAR-100). However, for projects like robotics, where your robot sees a very specific view of the world, you sometimes need to label data that you’ve recorded yourself.

Or say your robot has multiple sensor inputs and you want to use them all: you’ll need ground truth that includes those sensor values.

Acquiring Labeled Data Using a Labeler Workforce

There are many third-party services you can use to label your data.

DIY

Mechanical Turk
oDesk (now Upwork)

With these services you have to build the tools that labelers use to label your data, the pipeline that feeds raw data in and gets labeled output back, and a quality control stage. None of that plumbing comes for free. If you want labelers to annotate 3D laser points, you’ll have to build a web-based 3D laser labeling tool for them yourself. Sadface.
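As an illustration of what a quality control stage could look like, here is a minimal sketch that consolidates labels from several labelers by majority vote and flags items without enough agreement for manual review. The function name `consolidate_labels`, the `min_agreement` threshold, and the example file names and labels are all assumptions made up for this sketch; none of them come from a particular labeling service.

```python
from collections import Counter


def consolidate_labels(votes, min_agreement=0.6):
    """Consolidate per-item labels by majority vote.

    votes: dict mapping an item id to the list of labels that
           different labelers assigned to it (hypothetical layout).
    Returns (accepted, needs_review): accepted maps item id to the
    winning label; needs_review lists items with too little agreement.
    """
    accepted = {}
    needs_review = []
    for item_id, labels in votes.items():
        if not labels:
            # Nobody labeled this item, so a human has to look at it.
            needs_review.append(item_id)
            continue
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            accepted[item_id] = label
        else:
            needs_review.append(item_id)
    return accepted, needs_review


# Made-up example: three labelers per frame.
votes = {
    "frame_0001.jpg": ["cone", "cone", "cone"],
    "frame_0002.jpg": ["cone", "wall", "track"],
}
accepted, needs_review = consolidate_labels(votes)
```

Majority voting is only one possible choice here; seeding the queue with items whose answer is already known and scoring labelers against them is another common option.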

Boutique

Scale (https://www.scaleapi.com), used by Voyage, Cruise, Drive.ai, and Uber
Hive (https://thehive.ai/data)

These services provide much of the plumbing you need for common labeling tasks. For example, Scale has a 3D laser labeling tool. They also handle things like quality control.

Labeling DonkeyCar Tubs

My personal use case for labeling data is my DonkeyCar.

DonkeyCar is an autonomous RC vehicle platform, mostly used to build autonomous toy race cars. Its data format is custom (called a tub) and consists of image frames, throttle / steering information, and optionally input from an IMU (though most people who build these cars don’t use one). Currently the process for training a DonkeyCar is as follows:

Calibrate the car.
Drive the car on a track you want to learn, perfectly, a number of times.
Edit your footage/data to remove driving mistakes.
Train a net using the ground truth driving.
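The editing step above can be sketched as a simple filter over recorded frames. This is only an illustration: the record layout (`index`, `image`, `steering`, `throttle`) and the function `drop_bad_segments` are hypothetical and do not reflect the actual tub file format or any DonkeyCar API.

```python
def drop_bad_segments(records, bad_ranges):
    """Return the records whose index is outside every bad range.

    records:    list of dicts with an "index" key (hypothetical layout).
    bad_ranges: list of (start, end) frame indices, inclusive, that
                were marked as mistake driving during review.
    """
    def is_bad(index):
        return any(start <= index <= end for start, end in bad_ranges)

    return [record for record in records if not is_bad(record["index"])]


# Made-up recording of ten frames with constant controls.
records = [
    {"index": i, "image": f"{i}.jpg", "steering": 0.0, "throttle": 0.3}
    for i in range(10)
]

# Suppose frames 2-4 and 8-9 show the car leaving the track.
clean = drop_bad_segments(records, bad_ranges=[(2, 4), (8, 9)])
kept_indices = [record["index"] for record in clean]
```

Keeping the bad ranges as data rather than deleting frames by hand means the same cleanup can be re-run if the recording is re-exported.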

I’m interested in:

Not having to drive on an actual track to train the car (use a simulator instead).
Quality control (editing bad footage / training data).
Training in enough environments to make the driving general.
Utilizing the lidar I got (laser labeling).