In this paper, we show that standard deep CNN models pretrained on ImageNet can serve as strong baseline networks for audio classification. Even though audio spectrograms differ significantly from standard ImageNet image samples, the transfer learning assumptions still hold firmly. Performance nevertheless varies across multiple runs of the same model; this variance is due to the random initialization of the linear classification layer and the random mini-batch ordering in each run. It brings significant diversity, which can be used to build stronger ensemble models with an overall improvement in accuracy. An ensemble of ImageNet-pretrained DenseNet models achieves 92.89% validation accuracy on the ESC-50 dataset and 87.42% on the UrbanSound8K dataset, which is the current state of the art on both datasets.
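The abstract does not say how the ensemble combines the individual runs. A minimal sketch of one common choice, assuming the class probabilities of each run are averaged before taking the most likely class, might look like the following. The function names and the logits are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert one row of raw class scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(logits_per_run):
    """Average class probabilities over runs; return the argmax class per sample.

    logits_per_run[r][i] holds the class scores that run r assigns to sample i.
    Probability averaging is an assumption here, not the paper's stated method.
    """
    n_runs = len(logits_per_run)
    n_samples = len(logits_per_run[0])
    predictions = []
    for i in range(n_samples):
        probs = [softmax(run[i]) for run in logits_per_run]
        n_classes = len(probs[0])
        mean = [sum(p[c] for p in probs) / n_runs for c in range(n_classes)]
        predictions.append(max(range(n_classes), key=mean.__getitem__))
    return predictions


# Hypothetical logits from three runs of the same model (different seeds),
# for two samples and three classes.
runs = [
    [[2.0, 1.0, 0.0], [0.0, 0.5, 0.4]],
    [[1.5, 1.6, 0.0], [0.0, 0.2, 1.0]],
    [[2.2, 0.3, 0.1], [0.1, 0.0, 0.9]],
]
print(ensemble_predict(runs))
```

In this toy input, the second run alone would pick class 1 for the first sample and the first run alone would pick class 1 for the second sample, while the averaged prediction follows the other two runs. This is the kind of run-to-run diversity the abstract says an ensemble can exploit.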
