On Position Embeddings in BERT
Research output: Contribution to conference › Paper › Research
Standard
On Position Embeddings in BERT. / Wang, Benyou; Shan, Lifeng; Lioma, Christina; Jiang, Xin; Yang, Hao; Liu, Qun; Simonsen, Jakob Grue.
2021. 1-21. Paper presented at 9th International Conference on Learning Representations - ICLR 2021, Virtual.
RIS
TY - CONF
T1 - On Position Embeddings in BERT
AU - Wang, Benyou
AU - Shan, Lifeng
AU - Lioma, Christina
AU - Jiang, Xin
AU - Yang, Hao
AU - Liu, Qun
AU - Simonsen, Jakob Grue
PY - 2021
Y1 - 2021
AB - Various Position Embeddings (PEs) have been proposed in Transformer-based architectures (e.g., BERT) to model word order. These are empirically driven and perform well, but no formal framework exists to systematically study them. To address this, we present three properties of PEs that capture word distance in vector space: translation invariance, monotonicity, and symmetry. These properties formally capture the behaviour of PEs and allow us to reinterpret sinusoidal PEs in a principled way. Moreover, we propose a new probing test (called 'identical word probing') and mathematical indicators to quantitatively detect the general attention patterns with respect to the above properties. An empirical evaluation of seven PEs (and their combinations) for classification (GLUE) and span prediction (SQuAD) shows that: (1) both classification and span prediction benefit from translation invariance and local monotonicity, while symmetry slightly decreases performance; (2) the fully-learnable absolute PE performs better in classification, while relative PEs perform better in span prediction. We contribute the first formal and quantitative analysis of desiderata for PEs, and a principled discussion about their correlation to the performance of typical downstream tasks.
M3 - Paper
SP - 1
EP - 21
T2 - 9th International Conference on Learning Representations - ICLR 2021
Y2 - 3 May 2021 through 7 May 2021
ER -
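
For readers who want to see the abstract's three properties concretely, the sketch below builds the classic sinusoidal PEs (the Vaswani et al., 2017 formulation that the paper reinterprets) and checks translation invariance, symmetry, and local monotonicity via inner products between position vectors. This is an illustrative sketch, not code from the paper; the function name sinusoidal_pe and the dot-product checks are assumptions made here for demonstration.

import numpy as np

def sinusoidal_pe(max_len, d_model):
    # Standard sinusoidal position embeddings (Vaswani et al., 2017):
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]          # shape (max_len, 1)
    even = np.arange(0, d_model, 2)[None, :]   # shape (1, d_model // 2)
    angles = pos / np.power(10000.0, even / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)
dots = pe @ pe.T  # inner products between position vectors

# Translation invariance: <PE_p, PE_q> depends only on the offset q - p
# (the sin/cos cross terms collapse to sum_i cos((q - p) * omega_i)).
print(np.allclose(dots[10, 20], dots[50, 60]))  # True: both offsets are +10

# Symmetry: cosine is even, so offsets +k and -k give the same inner product.
print(np.allclose(dots[50, 60], dots[50, 40]))  # True

# Local monotonicity: the inner product peaks at offset 0 (value d_model / 2)
# and decays for small |offset|.
print(dots[50, 50:54])  # 32.0 followed by decaying values

Note that the checks confirm sinusoidal PEs satisfy all three properties; the paper's empirical finding is that translation invariance and local monotonicity help downstream tasks, while symmetry slightly hurts.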