Some notes from reading this paper from last August. It’s a well-written paper that covers a lot of different things (information retrieval, cross-modal learning, self-supervision, shortcut prevention, etc.).

What was least intuitive to me on first read was how the localization part works in an unsupervised manner and without any temporal information. At each training step, the audio embedding is compared against each spatial feature vector of the image via a dot product, and the maximum of those similarities is taken as the correspondence score. So during the backward pass, only one region is encouraged to respond highly (which is what enables localization). Do this enough times and the image features that co-occur with certain audio embeddings get pulled closer to them, so you get high activations wherever visual features corresponding to the audio are observed.
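Roughly what that mechanism looks like in code. This is a minimal PyTorch-style sketch; the names, shapes, and normalization choices are my assumptions for illustration, not the paper’s exact architecture.

```python
import torch
import torch.nn.functional as F

def correspondence_score(visual_feats, audio_emb):
    """Max-similarity correspondence score.

    visual_feats: (B, C, H, W) spatial grid of feature vectors from the image tower.
    audio_emb:    (B, C) single embedding from the audio tower.
    Returns a per-example score (B,) and the per-region similarity map (B, H, W).
    """
    B, C, H, W = visual_feats.shape
    # L2-normalize both modalities so the dot product behaves like cosine similarity.
    v = F.normalize(visual_feats.view(B, C, H * W), dim=1)   # (B, C, H*W)
    a = F.normalize(audio_emb, dim=1).unsqueeze(1)            # (B, 1, C)
    sim_map = torch.bmm(a, v).view(B, H, W)                   # similarity per spatial region
    # The score is the *max* over regions, so gradients only flow through the
    # single best-matching region -- this is where localization emerges from.
    score = sim_map.view(B, -1).max(dim=1).values
    return score, sim_map
```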

The network learns which visual features correspond to which sounds as a side effect of learning whether an audio clip and an image frame correspond. Positive examples are pairs where the 1-second audio clip and the image frame overlap in time; negatives are pairs where they don’t. A sketch of that objective is below. Of course, this assumes the training data doesn’t contain weird examples and has sufficient examples of silent background objects. (You can think a bit about what degeneracies you might face if you didn’t provide enough data, or how latent biases in the dataset could affect performance.)
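A sketch of the self-supervised correspondence objective, reusing the `correspondence_score` helper above. The way negatives are built here (rolling the batch) and the use of the raw max-similarity as a logit are simplifications I’m assuming for illustration, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def correspondence_loss(visual_feats, audio_emb):
    """visual_feats and audio_emb are aligned positive pairs of batch size B."""
    pos_scores, _ = correspondence_score(visual_feats, audio_emb)
    # Build negatives by rolling the audio batch so each frame is paired with
    # audio from a different clip -- a simple way to get mismatched pairs.
    neg_scores, _ = correspondence_score(visual_feats, torch.roll(audio_emb, 1, dims=0))
    scores = torch.cat([pos_scores, neg_scores])
    labels = torch.cat([torch.ones_like(pos_scores), torch.zeros_like(neg_scores)])
    # "Does this audio match this frame?" as binary classification on the
    # max-similarity score; no labels beyond the pairing itself are needed.
    return F.binary_cross_entropy_with_logits(scores, labels)
```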

Notice that these are globally learned correspondences, and the net you train can be fooled, since it’s not learning from stereo audio or anything else that would give it actual localization of sound in 3D space. In their examples, when a drummer is drumming but not hitting the hi-hat, the net still localizes the sound as coming from the hi-hat, since it’s only learning which visual features tend to co-occur with a given sound, not a hard correspondence between motion and sound. When using this for, say, robot localization, it’s interesting to think about how it compares to using multiple microphones plus cross-channel correlation to determine where a sound is coming from. It’s sort of like lidar vs. vision: if you encounter things you haven’t seen before, lidar will still reliably say “there’s something there,” whereas vision may choke. Similarly, relying on an understanding of the underlying geometry, rather than on a recognition approach that may have densely sampled the wrong feature space, will give more overall reliability here.

Of course, they also tried using more temporal information: it helps for determining whether there is audio-visual correspondence, but it doesn’t help with cross-modal retrieval (which I find slightly puzzling). They tried a 25-frame multi-frame input as well as a single frame plus 10 frames of optical flow (dual-stream).