Modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet to remove some of these biases. This dataset expands upon ObjectNet, which is a bias-controlled image dataset. We detail our data collection pipeline, which features several methods to improve caption quality, including automated language model checks. Lastly, we show baseline results on image retrieval and audio retrieval tasks. These results show that models trained on other datasets and then evaluated on Spoken ObjectNet tend to perform poorly due to biases in other datasets that the models have learned. We also show evidence that the performance decrease is due to the dataset controls, and not the transfer setting.

Author(s) : Ian Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass



Keywords : objectnet - dataset - show - datasets - biases
