We propose a method to measure the degree of non-linearity of different elements of transformers. We focus our investigation on the feed-forward networks (FFN) inside transformers, which contain two-thirds of the model parameters and have so far received little attention. We find that FFNs are an inefficient yet important architectural element and that they cannot simply be replaced by attention blocks without a degradation in performance. We study the interactions between layers in BERT and show that, while the layers exhibit some hierarchical structure, they extract features in a fuzzy manner. Our results suggest that BERT has an inductive bias towards layer commutativity, which we find is mainly due to the skip connections.
Author(s) : Sumu Zhao, Damian Pascual, Gino Brunner, Roger Wattenhofer
Links : PDF - Abstract
Keywords : bert - commutativity - non-linearity - layers
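The layer-commutativity claim lends itself to a quick empirical check. Below is a minimal sketch, not the authors' exact protocol: it swaps two adjacent encoder layers in a pretrained BERT (via the HuggingFace transformers library) and measures how similar the final representations remain. The layer index and the input sentence are arbitrary illustrative choices.

```python
# Minimal sketch (assumed setup, not the paper's exact method): probe layer
# commutativity in BERT by swapping two adjacent encoder layers and comparing
# the final hidden states against the unmodified model.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer(
    "The quick brown fox jumps over the lazy dog.", return_tensors="pt"
)

# Baseline: hidden states from the original layer ordering.
with torch.no_grad():
    baseline = model(**inputs).last_hidden_state

# Swap encoder layers i and i+1 in place (i = 5 is an arbitrary choice).
i = 5
layers = model.encoder.layer
layers[i], layers[i + 1] = layers[i + 1], layers[i]

with torch.no_grad():
    swapped = model(**inputs).last_hidden_state

# Token-wise cosine similarity between the two runs, averaged over tokens;
# a value near 1 means the swap barely changed the output representations.
cos = torch.nn.functional.cosine_similarity(baseline, swapped, dim=-1).mean()
print(f"mean cosine similarity after swapping layers {i} and {i + 1}: {cos:.4f}")
```

A similarity close to 1 would be consistent with the commutativity the abstract describes; repeating the swap over all adjacent layer pairs, and contrasting models with and without skip connections, is how one would probe the role the paper attributes to those connections.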