The original paper is from ICCV 2017 and very simple. In a traditional convnet there are lots of conv layers with batchnorm layers between them, and each batchnorm layer has a per-channel scale parameter (gamma). If you add an L1 loss term on those scale parameters, you ask the net to make the scales sparse. Then after you’ve trained your net, you can see how many channels “survived” (have scale values above some small threshold, rather than being driven to zero). At that point you prune the dead channels and fine-tune the smaller model, which gives you faster runtime when deployed.
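A minimal sketch of the two ingredients in plain Python, just to make the mechanics concrete. The lambda and threshold values here are made-up; in practice the L1 term is added to the training loss in your framework of choice and applied to the batchnorm gamma tensors.

```python
def l1_penalty(gammas, lam):
    """Sparsity term added to the training loss: lam * sum(|gamma|).
    Gradient descent on this term pushes unimportant gammas toward zero."""
    return lam * sum(abs(g) for g in gammas)

def surviving_channels(gammas, threshold=1e-3):
    """Indices of channels whose scale stayed above the threshold.
    Everything else gets pruned before fine-tuning."""
    return [i for i, g in enumerate(gammas) if abs(g) > threshold]

# Toy example: after sparsity training, many gammas end up near zero.
gammas = [0.9, 0.0004, 0.35, 0.00001, 0.12]
penalty = l1_penalty(gammas, lam=1e-4)
kept = surviving_channels(gammas)  # channels 0, 2, 4 survive
```

Here you'd rebuild the layer with only the `kept` channels and fine-tune.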

This kind of network pruning is like very light architecture search. You only need to do it once, you pay a small performance penalty, and it’s easy to implement. That’s basically why you’d use it. If you have the resources to do full NAS you can do that instead and you’ll get a performance boost, and NAS can also tune other things like network depth and what kind of layers to use (e.g. conv vs. separable conv). But NAS is expensive.