This paper was announced yesterday by Amazon's Alexa research team. I took some notes because it's related to some things I'm working on.

Audio classification is generally performed in this fashion:

  1. Convert the raw audio into spectrograms
  2. Classify the spectrograms in chunks using a temporal model (e.g. an LSTM) if long-term dependencies matter (e.g. speech recognition), or a simple convnet if they don't (e.g. detecting dog barking, a baby crying, gunshots). A minimal sketch of this pipeline follows the list.
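To make that concrete, here's a minimal sketch of the second case (no long-term dependencies): log-mel spectrograms via librosa fed into a tiny convnet. The library choices and every hyperparameter here (16 kHz, 64 mel bands, the layer sizes) are my own assumptions for illustration, not anything from the paper.

```python
# Generic pipeline sketch: raw audio -> log-mel spectrogram -> small convnet.
# librosa/PyTorch and all hyperparameters are assumptions, not the paper's setup.
import librosa
import numpy as np
import torch
import torch.nn as nn

def audio_to_logmel(path, sr=16000, n_mels=64):
    """Load audio and convert it to a log-mel spectrogram of shape (n_mels, time)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

class SmallConvNet(nn.Module):
    """Tiny convnet for short, local sound events (no temporal model needed)."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # global pool over frequency and time
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):  # x: (batch, 1, n_mels, time)
        h = self.features(x).flatten(1)
        return self.classifier(h)

# Usage sketch:
# spec = audio_to_logmel("clip.wav")
# logits = SmallConvNet(n_classes=3)(torch.from_numpy(spec)[None, None])
```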

A common problem in audio classification is that the sounds you actually want to detect are rare. If you record life continuously while walking around, most of the time the sound is "background noise" (though it is very rarely completely silent).

So the Alexa team faced the common ML problem of having lots of examples of background noise and very few examples of the classes they wanted to detect. What to do?

The classic ML response would be "oversample the underrepresented classes." This can be as simple as stuffing copies of the underrepresented examples back in during training so that the model sees more of them. You can get fancier: instead of explicitly copying examples, use a weighted loss function that weights the underrepresented classes more heavily.
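Here's roughly what those two tricks look like in PyTorch. This is a sketch with made-up class counts, not anything from the paper: duplication-style oversampling via a weighted sampler, and a class-weighted cross-entropy loss.

```python
# Two standard imbalance tricks in PyTorch; the class counts below are invented for illustration.
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# Suppose labels is a 1-D LongTensor of class ids, with class 0 = background noise.
labels = torch.tensor([0] * 950 + [1] * 30 + [2] * 20)

# Option 1: oversample the rare classes -- draw each example with probability inversely
# proportional to its class frequency, so the model sees rare classes more often.
class_counts = torch.bincount(labels).float()
example_weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(example_weights, num_samples=len(labels), replacement=True)
# loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# Option 2: keep the data as-is but weight the loss, so mistakes on rare classes
# cost more than mistakes on background noise.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)
```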

You can do the opposite too (undersample the background-noise class). In practice you play around with these things; there's no easy answer to the question "what should my positive-to-negative ratio be?"
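And the opposite direction, undersampling the background class down to some target positive-to-negative ratio. The 1:5 below is arbitrary; as I said, you end up tuning it.

```python
# Sketch of undersampling the background class; the ratio and helper name are mine.
import torch

def undersample_background(labels, background_class=0, neg_per_pos=5, seed=0):
    """Return indices that keep all positives plus a capped number of background examples."""
    g = torch.Generator().manual_seed(seed)
    pos_idx = (labels != background_class).nonzero(as_tuple=True)[0]
    neg_idx = (labels == background_class).nonzero(as_tuple=True)[0]
    keep = neg_idx[torch.randperm(len(neg_idx), generator=g)[: neg_per_pos * len(pos_idx)]]
    return torch.cat([pos_idx, keep])

# idx = undersample_background(labels)
# then iterate over torch.utils.data.Subset(dataset, idx.tolist())
```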

The folks at Amazon decided they wanted to train their net to distinguish more strongly between classes; that is, their goal was to maximize the distance between the embeddings of different classes. Their approach:

  1. Pretrain the net up to the embedding layer using a setup and loss function that encourages the embeddings to cluster by class.
  2. Then use the pretrained net and train just the logits on top of the clustered embeddings (using weighted cross entropy); they also train another net end-to-end and use whichever one is better. (Skeleton sketch below.)
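My notes don't capture the exact clustering loss they use in step 1, so take this as a generic skeleton of the two-stage idea rather than their recipe: I've substituted a triplet margin loss as a stand-in for "make same-class embeddings cluster together," and step 2 freezes the embedding and trains only the logits with weighted cross entropy. The architecture, sizes, and weights are all placeholders.

```python
# Two-stage skeleton: (1) metric-style pretraining of the embedding, (2) weighted-CE logits.
# The triplet loss is a stand-in; the paper's actual clustering loss isn't reproduced here.
import torch
import torch.nn as nn

# Placeholder embedder: assumes fixed-size 64x64 spectrogram patches as input.
embedder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(), nn.Linear(128, 32))
head = nn.Linear(32, 3)  # logits over 3 hypothetical classes

# Stage 1: pretrain the embedding so same-class examples end up close together.
triplet = nn.TripletMarginLoss(margin=1.0)
opt1 = torch.optim.Adam(embedder.parameters(), lr=1e-3)

def pretrain_step(anchor, positive, negative):
    loss = triplet(embedder(anchor), embedder(positive), embedder(negative))
    opt1.zero_grad()
    loss.backward()
    opt1.step()
    return loss.item()

# Stage 2: freeze the embedding and train only the logits with weighted cross entropy.
for p in embedder.parameters():
    p.requires_grad_(False)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.1, 1.0, 1.0]))  # down-weight background
opt2 = torch.optim.Adam(head.parameters(), lr=1e-3)

def finetune_step(x, y):
    loss = criterion(head(embedder(x)), y)
    opt2.zero_grad()
    loss.backward()
    opt2.step()
    return loss.item()
```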

They found that their technique's advantage grows with the data imbalance (e.g. if you have a ton of negative examples, they say this will work better).

I’d be curious to see how their clustering method actually compares to other ways of producing clustered embeddings. For example, is it roughly equivalent to / better than / worse than pretraining embeddings using a siamese network?
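For reference, the siamese alternative I have in mind would look something like this: one shared encoder applied to pairs of examples, trained with a pairwise contrastive loss (pull same-class pairs together, push different-class pairs apart beyond a margin). Again, just a sketch for comparison, not anything from the paper.

```python
# Classic pairwise contrastive loss over a shared encoder (siamese-style pretraining sketch).
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """same_class is 1.0 for same-class pairs, 0.0 for different-class pairs."""
    d = F.pairwise_distance(emb_a, emb_b)
    return (same_class * d.pow(2) +
            (1 - same_class) * F.relu(margin - d).pow(2)).mean()

# loss = contrastive_loss(embedder(x1), embedder(x2), same_class)  # reusing the embedder above
```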