We propose a method to measure the degree of non-linearity of different elements of transformers. We focus our investigation on the feed-forward networks (FFN) inside transformers, which contain 2/3 of the model parameters and have so far not received much attention. We find that FFNs are an inefficient yet important architectural element and that they cannot simply be replaced by attention blocks without a degradation in performance. We study the interactions between layers in BERT and show that, while the layers exhibit some hierarchical structure, they extract features in a fuzzy manner. Our results suggest that BERT has an inductive bias towards layer commutativity, which we find is mainly due to the skip connections.

Author(s) : Sumu Zhao, Damian Pascual, Gino Brunner, Roger Wattenhofer

Keywords : bert - commutativity - linearity - layers - find
